feat: prod-readiness PR 4 of 5 — observability (request tracing + JSON logs + structured errors) #109

Merged
aksOps merged 1 commit into main from feat/prod-ready-4-observability
Apr 28, 2026

@aksOps aksOps commented Apr 28, 2026

Summary

Fourth of 5 production-readiness PRs. Closes the missing-MDC, hot-path health probe, MCP error leak, and structured-logging gaps.

Why

Before this PR, every MDC.get("request_id") call returned null, so the four consumers (BearerAuthFilter, RateLimitFilter, GraphController, GlobalExceptionHandler) each generated a synthetic UUID and the IDs never correlated to the same request. The /actuator/health/readiness probe ran a Cypher count() on every k8s probe (~1Hz). MCP tools returned flat {error: "..."} strings with no correlation field. Logs were plaintext and unparseable by Loki/Splunk.

Changes

  • RequestIdFilter (new): outermost in the security chain. Populates MDC.request_id, echoes it in the X-Request-Id response header, validates inbound IDs against an allow-list ([A-Za-z0-9_-]{8,64}), and clears MDC in finally.
  • JSON-structured logging in the serving profile via logstash-logback-encoder 9.0 (MIT). Indexing/CLI profiles keep plaintext.
  • GraphHealthIndicator 30s TTL cache via AtomicReference<CachedHealth> (lock-free). One underlying count() per 30s regardless of probe rate. Error response sanitized (CWE-209).
  • Liveness/readiness groups: graphHealthIndicator on readiness only.
  • /actuator/prometheus: micrometer-registry-prometheus added; exposed under bearer auth (NOT permitAll). Application tag codeiq.
  • Structured MCP error envelope: errorEnvelope(code, e) returns {code, message, request_id, error}. Codes: INTERNAL_ERROR, INVALID_INPUT, FILE_READ_FAILED, SERIALIZATION_FAILED. Legacy error field preserved for backwards compatibility.
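
The inbound allow-list check described for RequestIdFilter can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `RequestIds` and method `resolve` are hypothetical, and the servlet and MDC wiring of the real filter is left out because it is not shown in this PR description.

```java
import java.util.UUID;
import java.util.regex.Pattern;

/** Hypothetical helper; the PR's actual RequestIdFilter code is not shown here. */
final class RequestIds {
    // Allow-list quoted in the PR description: 8 to 64 characters drawn from
    // letters, digits, underscore and hyphen. Anything else is rejected.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9_-]{8,64}");

    private RequestIds() {}

    /**
     * Returns the inbound header value when it passes the allow-list,
     * otherwise a freshly generated UUID (assumed fallback behaviour).
     */
    static String resolve(String inbound) {
        if (inbound != null && ALLOWED.matcher(inbound).matches()) {
            return inbound;
        }
        return UUID.randomUUID().toString();
    }
}
```

A generated UUID string is 36 characters of hex digits and hyphens, so it satisfies the same allow-list and can be passed through unchanged by a downstream service that applies the same rule.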

Test plan

  • New RequestIdFilterTest (7 cases: UUID generation, header pass-through, control-char rejection, length bounds, MDC clear-in-finally including throw path)
  • GraphHealthIndicatorTest cache-hit assertion + sanitized error fields
  • McpToolsTest#readFileShouldHandleMissingFile updated for new envelope contract
  • Full suite: 3680 tests / 0 failures / 0 errors / 32 skipped
  • CI green (build + 6 OSS-CLI security jobs + CodeQL + Socket)
  • Verify auto-merge unblocks once CodeQL completes

🤖 Generated with Claude Code

…ON logs + structured errors

Fourth of 5 production-readiness PRs. Closes the missing-MDC, hot-path
health probe, MCP error leak, and structured-logging gaps.

Why
---
Pre-PR-4 every `MDC.get("request_id")` call across BearerAuthFilter,
RateLimitFilter, GraphController, and GlobalExceptionHandler returned
null — the four consumers all generated synthetic UUIDs that never
correlated to the same request. The /actuator/health/readiness probe
ran a Cypher count() on every probe (k8s default ~1Hz). MCP tools
returned flat `{error: "..."}` strings with no correlation field. Logs
were plaintext `%msg%n` — unparseable by Loki/Splunk.

Changes
-------
* **`RequestIdFilter` (new)** — outermost in the security chain.
  Populates `MDC.request_id` per request, echoes back in
  `X-Request-Id` response header, allow-list validates inbound
  ([A-Za-z0-9_-]{8,64}), clears MDC in finally to prevent leak
  across pooled threads (Tomcat platform + virtual-thread carriers).

* **JSON-structured logging** in serving profile via
  `logstash-logback-encoder` 9.0 (MIT). One JSON event per log line
  with ts/level/logger/thread/msg/stack + all MDC entries +
  `application: codeiq` tag. Indexing/CLI profiles keep plaintext.

* **`GraphHealthIndicator` 30s TTL cache** via
  `AtomicReference<CachedHealth>` (lock-free). One underlying
  count() per 30s regardless of probe rate. Error response sanitized
  — `e.getMessage()` no longer surfaces to the permitAll endpoint
  (CodeQL CWE-209 again); only `error_class` + log line.

* **Liveness/readiness groups** — `graphHealthIndicator` on
  readiness only. Before this PR a graph outage made k8s kill and
  restart the pod (flapping) instead of just routing traffic away.

* **`/actuator/prometheus`** — `micrometer-registry-prometheus`
  added; exposed under bearer auth (NOT permitAll — full metrics
  tree is reconnaissance). Application tag `codeiq` for multi-pod
  scraping. Step 10s.

* **Structured MCP error envelope** — `errorEnvelope(code, e)`
  helper returns `{code, message, request_id, error}` (legacy
  `error` preserved for backwards-compat). Codes: INTERNAL_ERROR,
  INVALID_INPUT, FILE_READ_FAILED, SERIALIZATION_FAILED. Full
  exception logged server-side; sanitized envelope to client.
  `readFile` no longer concatenates `e.getMessage()` (CWE-209).
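
A minimal sketch of the error envelope described in the last bullet, assuming a shape that is not confirmed by this commit message: the class name `McpErrors`, the explicit `requestId` parameter (the real helper presumably reads it from MDC), and the generic message texts are all hypothetical. The point it illustrates is that the exception text goes to the server log only, while the client sees a fixed message per code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; not the PR's actual helper. */
final class McpErrors {
    /** Error codes listed in the commit message. */
    enum Code { INTERNAL_ERROR, INVALID_INPUT, FILE_READ_FAILED, SERIALIZATION_FAILED }

    private static final Logger LOG = Logger.getLogger(McpErrors.class.getName());

    private McpErrors() {}

    /** Builds the sanitized client envelope and logs the full exception server-side. */
    static Map<String, Object> errorEnvelope(Code code, Exception e, String requestId) {
        // Full exception (message and stack trace) stays in the server log.
        LOG.log(Level.WARNING, "MCP tool failed [" + code + "] request_id=" + requestId, e);

        // Assumption: the client-facing message is a fixed text per code,
        // never e.getMessage(), so no internal detail leaks (CWE-209).
        String message = genericMessage(code);

        Map<String, Object> envelope = new LinkedHashMap<>();
        envelope.put("code", code.name());
        envelope.put("message", message);
        envelope.put("request_id", requestId);
        envelope.put("error", message); // legacy field kept for older clients
        return envelope;
    }

    private static String genericMessage(Code code) {
        switch (code) {
            case INVALID_INPUT: return "Invalid input";
            case FILE_READ_FAILED: return "File could not be read";
            case SERIALIZATION_FAILED: return "Response could not be serialized";
            default: return "Internal error";
        }
    }
}
```

With the `request_id` in both the envelope and the log line, a client-reported failure can be matched to its server-side stack trace without exposing that trace to the caller.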

Test coverage
-------------
* New `RequestIdFilterTest` — 7 cases (UUID generation, header
  pass-through, control-char rejection, length bounds, MDC
  clear-in-finally including throw path).
* `GraphHealthIndicatorTest` — added cache-hit assertion (3 calls
  → 1 underlying `count()`); updated for sanitized error fields.
* `McpToolsTest#readFileShouldHandleMissingFile` — updated for new
  envelope contract (asserts `code: FILE_READ_FAILED`).
* Full suite: 3680 / 0F / 0E / 32S.
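
The cache-hit assertion above (3 calls, 1 underlying `count()`) can be illustrated with a small TTL cache built on `AtomicReference`. This is a sketch under assumptions, not the PR's `GraphHealthIndicator`: the class `TtlCachedCount`, its constructor, and the injectable nanosecond clock are hypothetical names chosen so the behaviour can be checked without waiting 30 seconds.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

/** Hypothetical lock-free TTL cache around an expensive count query. */
final class TtlCachedCount {
    /** Immutable snapshot: the cached value and when it was loaded. */
    private static final class Cached {
        final long value;
        final long loadedAtNanos;

        Cached(long value, long loadedAtNanos) {
            this.value = value;
            this.loadedAtNanos = loadedAtNanos;
        }
    }

    private final AtomicReference<Cached> ref = new AtomicReference<>();
    private final LongSupplier loader;     // stands in for the Cypher count()
    private final LongSupplier nanoClock;  // e.g. System::nanoTime in production
    private final long ttlNanos;

    TtlCachedCount(LongSupplier loader, Duration ttl, LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    /** Returns the cached value while it is fresh, otherwise reloads it. */
    long get() {
        long now = nanoClock.getAsLong();
        Cached current = ref.get();
        if (current != null && now - current.loadedAtNanos < ttlNanos) {
            return current.value; // cache hit: no query issued
        }
        // No lock is taken, so concurrent callers that miss at the same
        // moment may each run the loader once; the last write wins.
        Cached fresh = new Cached(loader.getAsLong(), now);
        ref.set(fresh);
        return fresh.value;
    }
}
```

A caller might construct it as `new TtlCachedCount(graph::count, Duration.ofSeconds(30), System::nanoTime)`, where `graph::count` is a placeholder for whatever issues the real query.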

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aksOps aksOps enabled auto-merge (squash) April 28, 2026 09:31
@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | maven/net.logstash.logback/logstash-logback-encoder@9.0 | 36 | 100 | 90 | 100 | 100 |
| Added | maven/io.micrometer/micrometer-registry-prometheus@1.16.5 | 100 | 100 | 90 | 100 | 100 |

@aksOps aksOps merged commit 5f3021a into main Apr 28, 2026
13 checks passed
@aksOps aksOps deleted the feat/prod-ready-4-observability branch April 28, 2026 09:34