RandomCodeSpace · aksOps · Apr 28, 2026 · Apr 28, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -413,6 +413,65 @@ for that specific tag for the per-commit details.
     exists, format-conforms to `<64-hex>  <path>`, and excludes itself.
     Full suite: 3672 tests / 0 failures / 0 errors.
 
+- **Production-readiness PR 4 of 5 — observability.** Closes the missing-MDC,
+  hot-path-health-probe, MCP-error-leak, and structured-logging gaps.
+  - **`RequestIdFilter` (new).** Populates SLF4J `MDC.request_id` FIRST in
+    the security chain so every downstream filter, controller, MCP tool,
+    and exception handler sees the same correlation ID. Strict allow-list
+    on inbound `X-Request-Id` (8–64 hex/dash/underscore chars) prevents
+    log-forging; bad inputs are replaced with a generated UUID. Echoes
+    the ID back to the client in the `X-Request-Id` response header. MDC
+    is cleared in `finally` to prevent leak across pooled threads (both
+    Tomcat platform and virtual-thread carriers). Pre-PR-4 every
+    `MDC.get("request_id")` call returned null; the four downstream
+    consumers (BearerAuthFilter, RateLimitFilter, GraphController,
+    GlobalExceptionHandler) all generated synthetic UUIDs that never
+    correlated.
+  - **JSON-structured logging** (`logback-spring.xml`). Serving profile
+    switches the encoder from `%msg%n` plaintext to LogstashEncoder
+    (`logstash-logback-encoder` 9.0 — MIT). One JSON event per log line
+    with `ts`, `level`, `logger`, `thread`, `msg`, `stack`, all MDC
+    entries (`request_id`), and a static `application: codeiq` field for
+    multi-pod ingestion. Indexing/CLI profiles keep plaintext to avoid
+    JSON noise leaking into `codeiq index` output.
+  - **`GraphHealthIndicator` 30s TTL cache.** Pre-PR-4 every readiness
+    probe (k8s default ~1Hz) ran a `MATCH (n) RETURN count(n)` Cypher
+    query — wakes the page cache, competes with API traffic.
+    `AtomicReference<CachedHealth>` lock-free cache absorbs the flood;
+    one underlying probe per 30s regardless of caller concurrency.
+    Error response sanitized too: pre-PR-4 the `error` detail
+    surfaced `e.getMessage()` (CodeQL `java/error-message-exposure`,
+    permitAll endpoint = anonymous probers). Now only an `error_class`
+    indicator; full stack is logged at WARN.
+  - **Liveness/readiness groups** (`application.yml`). Pre-PR-4
+    `GraphHealthIndicator` contributed to BOTH probes — a graph-down
+    event would flap the pod (k8s killing it) instead of just routing
+    away. Pinned to readiness only:
+    `liveness: livenessState`, `readiness: readinessState,
+    graphHealthIndicator`.
+  - **Prometheus metrics** (`/actuator/prometheus`). Added
+    `micrometer-registry-prometheus` dep. Exposed under the bearer-
+    authenticated `/actuator/**` rule (NOT permitAll — full metrics
+    tree is reconnaissance data). Application tag `codeiq` for
+    multi-pod scraping. Step interval 10s.
+  - **Structured MCP error envelope.** Pre-PR-4 every MCP tool catch
+    block returned `toJson(Map.of("error", e.getMessage()))` — flat
+    string, no correlation. Refactored to a centralized
+    `errorEnvelope(code, e)` helper that returns
+    `{code, message, request_id, error}` (legacy `error` field
+    preserved for backwards-compat). Codes assigned per failure
+    category: `INTERNAL_ERROR`, `INVALID_INPUT`, `FILE_READ_FAILED`,
+    `SERIALIZATION_FAILED`. Full exception logged server-side with
+    request_id; only sanitized envelope reaches the client. `readFile`
+    no longer concatenates `e.getMessage()` into a string (CWE-209).
+  - **Tests:** new `RequestIdFilterTest` (7 cases — UUID generation,
+    header pass-through, control-char rejection, length bounds, MDC
+    clear-in-finally including throw path). `GraphHealthIndicatorTest`
+    extended with cache-hit assertion (3 calls → 1 underlying
+    `count()`) and updated for sanitized error fields.
+    `McpToolsTest#readFileShouldHandleMissingFile` updated for new
+    envelope contract. Full suite: 3680 tests / 0 failures / 0 errors.
+
 ## [0.1.0] - 2026-03-28
 
 First general-availability cut. See the

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -452,6 +452,13 @@ bean for code paths that haven't been ported yet.
 - **`serve.sh` and `serve.bat` MUST NOT contain network calls.** Audit RAN-46 §3 — air-gapped deploy model. Pre-PR-3 these scripts had `curl -fL https://repo1.maven.org/...` to download the CLI JAR on first run; that's gone. Receivers must `--include-jar` when bundling or stage the JAR from an internal mirror. There's a regression test in `BundleCommandTest#bundleCreatesZipWithCorrectStructure` that asserts `serve.sh` contains neither `curl` nor `maven.org` — keep that test green.
 - **`.dockerignore` does NOT inherit `.gitignore`.** Docker resolves COPY against the build context, which includes uncommitted/untracked working-tree files. `.gitignore` only stops things being staged; it has no effect on what `docker build` sees. Mirror the secret-pattern globs explicitly in `.dockerignore` (`.env*`, `*.jks`, `id_rsa`, `credentials.{json,yaml}`, etc.). Pre-PR-3 the `.dockerignore` was 9 lines and would have shipped a `.env.prod` straight into a published image.
 - **Semgrep is pinned to `semgrep==1.161.0`** in `.github/workflows/security.yml`. Bumps go through Dependabot's pip ecosystem on a documented cadence — `pip install --upgrade semgrep` (floating) was previously flagged by Scorecard `Pinned-Dependencies`. Don't unpin to "always get latest"; a CI-time auto-bump on a security-scanner can break the build silently when the new release adds rules.
+- **`RequestIdFilter` MUST stay outermost in the security chain.** Filter chain order (set in `SecurityConfig#servingFilterChain`): RequestIdFilter → SecurityHeadersFilter → RateLimitFilter → BearerAuthFilter. If you reorder and put RequestIdFilter inside, the rate-limit reject and auth-reject log lines will fire BEFORE MDC is populated, and clients won't get a correlated `X-Request-Id` response header on 401/429. The filter is registered LAST in the `addFilterBefore(..., UPAFilter.class)` sequence because each `addFilterBefore` pushes the previously-inserted filter further from the target — so the LAST registration is the OUTERMOST in the actual chain.
+- **Inbound `X-Request-Id` is allow-list validated, NOT escaped.** `RequestIdFilter` accepts only `[A-Za-z0-9_-]{8,64}`; anything else is replaced with a generated UUID. Don't try to "be helpful" by lowercasing/trimming — the allow-list is the defense, not the sanitizer. CWE-117 log-forging via `X-Request-Id: \nINFO: granted access` is impossible because the value never reaches a logger before validation.
+- **MDC must be cleared in finally — even on throw.** `RequestIdFilter.doFilterInternal` puts MDC.request_id, runs the chain in a try-block, and clears in finally. There's a regression test (`RequestIdFilterTest#clearsMdcEvenWhenChainThrows`) that asserts the clear runs when the chain throws. Tomcat's NIO worker pool and virtual-thread carriers both reuse threads; a leaked MDC entry from request N is visible to request N+1 if you skip the finally.
+- **`GraphHealthIndicator` is on the readiness probe ONLY.** Configured via `management.endpoint.health.group.{liveness,readiness}` in `application.yml`. Liveness = "JVM up; restart on fail". Readiness = "graph loaded; route traffic only when up". Putting `graphHealthIndicator` on the liveness probe means a graph-down event flaps the pod (k8s killing it) instead of just routing away. If you add another HealthIndicator that's about app health (not infra dependencies), it goes on liveness; if it's about a specific dependency (DB, message bus), it goes on readiness.
+- **Logback `serving` profile uses JSON, others use plaintext.** `logback-spring.xml` defines a `JSON` appender (LogstashEncoder) used only by the `<springProfile name="serving">` block. CLI / indexing / default profiles use the `CONSOLE` (plaintext `%msg%n`) appender so that `codeiq index` doesn't dump JSON-shaped log lines into the user's terminal. Don't switch the global default appender to JSON — the CLI is interactive and the project deliberately uses `System.out` for status messages alongside the logger.
+- **MCP errors return a structured envelope `{code, message, request_id, error}`.** `McpTools#errorEnvelope(code, exception)` is the canonical wrapper. Codes: `INTERNAL_ERROR` (catch-all), `INVALID_INPUT` (IllegalArgumentException), `FILE_READ_FAILED` (file IO), `SERIALIZATION_FAILED` (Jackson). The legacy `error` field is preserved for backwards-compat with older MCP clients reading `error` directly — DON'T remove it without grepping for clients in this org first. Full exception always logged server-side at WARN with request_id; only sanitized envelope reaches the client.
+- **`/actuator/prometheus` is bearer-authenticated, NOT permitAll.** It matches the `/actuator/**` rule in `SecurityConfig`. Don't add `prometheus` to the permitAll list "for the scraper" — Prometheus's `bearer_token_file` config field exists for exactly this case. Anonymous metric scraping is reconnaissance data (request rates, error counts, JVM internals). The application tag `codeiq` is set via `management.metrics.tags.application` so multi-pod ingestion is filterable.
 
 ## Supply-chain observability (OpenSSF)
 

diff --git a/pom.xml b/pom.xml
@@ -170,6 +170,24 @@
             <version>8.18.0</version>
         </dependency>
 
+        <!-- Logstash JSON encoder (MIT). Drops a structured JSON line per log
+             event with timestamp, level, logger, thread, message, MDC entries
+             (request_id, etc.), and optional stack trace. Used in the serving
+             profile only — the indexing/CLI profiles keep the human-readable
+             %msg%n encoder. ~250 KB. -->
+        <dependency>
+            <groupId>net.logstash.logback</groupId>
+            <artifactId>logstash-logback-encoder</artifactId>
+            <version>9.0</version>
+        </dependency>
+
+        <!-- Micrometer Prometheus registry. Exposed at /actuator/prometheus
+             behind bearer auth. Spring Boot manages the version via its BOM. -->
+        <dependency>
+            <groupId>io.micrometer</groupId>
+            <artifactId>micrometer-registry-prometheus</artifactId>
+        </dependency>
+
         <!-- Neo4j Embedded (Community Edition) -->
         <dependency>
             <groupId>org.neo4j</groupId>

diff --git a/src/main/java/io/github/randomcodespace/iq/config/security/RequestIdFilter.java b/src/main/java/io/github/randomcodespace/iq/config/security/RequestIdFilter.java
@@ -0,0 +1,81 @@
+package io.github.randomcodespace.iq.config.security;
+
+import jakarta.servlet.FilterChain;
+import jakarta.servlet.ServletException;
+import jakarta.servlet.http.HttpServletRequest;
+import jakarta.servlet.http.HttpServletResponse;
+import org.slf4j.MDC;
+import org.springframework.context.annotation.Profile;
+import org.springframework.stereotype.Component;
+import org.springframework.web.filter.OncePerRequestFilter;
+
+import java.io.IOException;
+import java.util.UUID;
+import java.util.regex.Pattern;
+
+/**
+ * First filter in the {@code serving} chain — populates {@code request_id} in
+ * the SLF4J MDC for every request so downstream code (filters, controllers,
+ * MCP tools, exception handlers) can read it via {@link MDC#get(String)} and
+ * the JSON log encoder includes it on every emitted line.
+ *
+ * <p><b>ID source priority:</b>
+ * <ol>
+ *   <li>Inbound {@code X-Request-Id} header, if it matches a strict UUID-or-
+ *       hex/dash pattern (defense against header injection — see CWE-117).
+ *       This lets upstream load balancers / API gateways propagate trace IDs.</li>
+ *   <li>Generated {@link UUID#randomUUID()} otherwise.</li>
+ * </ol>
+ *
+ * <p>The same value is echoed back to the client in the {@code X-Request-Id}
+ * response header so a caller seeing a 401/429/500 can quote it in support
+ * channels and operators can grep their JSON logs for it.
+ *
+ * <p><b>MDC discipline.</b> The MDC is request-scoped via SLF4J's
+ * {@code ThreadLocal} — the {@code finally} block clears it so the next
+ * request on the same (virtual or platform) thread doesn't inherit a stale
+ * ID. Without this, a thread-pool reuse leaks IDs across requests.
+ *
+ * <p><b>Filter chain position.</b> Registered FIRST in
+ * {@code SecurityConfig#servingFilterChain} (before SecurityHeadersFilter,
+ * RateLimitFilter, BearerAuthFilter) so the rate-limiter and auth-rejection
+ * log lines all include the same {@code request_id} that the client receives
+ * back in the response. If you reorder filters, this MUST stay outermost.
+ */
+@Component
+@Profile("serving")
+public class RequestIdFilter extends OncePerRequestFilter {
+
+    static final String HEADER = "X-Request-Id";
+    static final String MDC_KEY = "request_id";
+
+    /**
+     * Strict allow-list for inbound {@code X-Request-Id} header values.
+     * Hex digits, dashes, underscores, and length 8–64 — accommodates UUIDs
+     * (with or without dashes), Stripe-style {@code req_*} (after stripping
+     * the prefix in upstream gateways), and short hex correlation IDs.
+     * Anything else gets replaced with a generated UUID — log forging via
+     * {@code X-Request-Id: \nINFO: granted access} is impossible.
+     */
+    private static final Pattern VALID_REQUEST_ID = Pattern.compile("^[A-Za-z0-9_-]{8,64}$");
+
+    @Override
+    protected void doFilterInternal(HttpServletRequest request,
+                                    HttpServletResponse response,
+                                    FilterChain chain) throws ServletException, IOException {
+        String inbound = request.getHeader(HEADER);
+        String requestId = (inbound != null && VALID_REQUEST_ID.matcher(inbound).matches())
+                ? inbound
+                : UUID.randomUUID().toString();
+        MDC.put(MDC_KEY, requestId);
+        response.setHeader(HEADER, requestId);
+        try {
+            chain.doFilter(request, response);
+        } finally {
+            // Always clear — virtual-thread carriers and Tomcat platform
+            // threads both pool, so a leaked MDC entry from request N is
+            // visible to request N+1 if we skip this.
+            MDC.remove(MDC_KEY);
+        }
+    }
+}
diff --git a/src/main/java/io/github/randomcodespace/iq/config/security/SecurityConfig.java b/src/main/java/io/github/randomcodespace/iq/config/security/SecurityConfig.java
@@ -40,7 +40,8 @@ public SecurityFilterChain servingFilterChain(
             HttpSecurity http,
             BearerAuthFilter bearerAuthFilter,
             SecurityHeadersFilter securityHeadersFilter,
-            RateLimitFilter rateLimitFilter) throws Exception {
+            RateLimitFilter rateLimitFilter,
+            RequestIdFilter requestIdFilter) throws Exception {
         http
                 // CSRF is suppressed for ALL paths via ignoringRequestMatchers("/**")
                 // (functionally equivalent to .csrf().disable() but avoids the literal
@@ -69,16 +70,22 @@ public SecurityFilterChain servingFilterChain(
                         .requestMatchers("/api/**", "/mcp/**", "/actuator/**").authenticated()
                         .anyRequest().denyAll())
                 // Filter chain order (outermost → innermost):
-                //   1. SecurityHeadersFilter — adds defensive response headers always.
-                //   2. RateLimitFilter      — 429 before any auth or DB work; throttles
+                //   1. RequestIdFilter      — populates MDC.request_id FIRST so every
+                //                             downstream log line (rate-limit reject,
+                //                             auth-reject, error envelope) carries the
+                //                             same ID and the client gets it back in
+                //                             X-Request-Id.
+                //   2. SecurityHeadersFilter — adds defensive response headers always.
+                //   3. RateLimitFilter      — 429 before any auth or DB work; throttles
                 //                             unauthenticated brute-force too.
-                //   3. BearerAuthFilter     — token validation; 401 if missing/wrong.
+                //   4. BearerAuthFilter     — token validation; 401 if missing/wrong.
                 // Each addFilterBefore(X, UsernamePasswordAuthenticationFilter.class) inserts
                 // X immediately before UPAFilter, pushing the previously-inserted filter farther
                 // from the target — so the registration order here IS the chain order.
                 .addFilterBefore(securityHeadersFilter, UsernamePasswordAuthenticationFilter.class)
                 .addFilterBefore(rateLimitFilter, UsernamePasswordAuthenticationFilter.class)
                 .addFilterBefore(bearerAuthFilter, UsernamePasswordAuthenticationFilter.class)
+                .addFilterBefore(requestIdFilter, UsernamePasswordAuthenticationFilter.class)
                 .formLogin(AbstractHttpConfigurer::disable)
                 .httpBasic(AbstractHttpConfigurer::disable)
                 .anonymous(AbstractHttpConfigurer::disable);