feat: prod-readiness PR 4 of 5 - observability (request tracing + JSON logs + structured errors) #109
Merged
Fourth of 5 production-readiness PRs. Closes the missing-MDC, hot-path
health probe, MCP error leak, and structured-logging gaps.
Why
---
Pre-PR-4 every `MDC.get("request_id")` call across BearerAuthFilter,
RateLimitFilter, GraphController, and GlobalExceptionHandler returned
null — the four consumers all generated synthetic UUIDs that never
correlated to the same request. The /actuator/health/readiness probe
ran a Cypher count() on every probe (k8s default ~1Hz). MCP tools
returned flat `{error: "..."}` strings with no correlation field. Logs
were plaintext `%msg%n` — unparseable by Loki/Splunk.
Changes
-------
* **`RequestIdFilter` (new)** — outermost in the security chain.
Populates `MDC.request_id` per request, echoes it back in the
`X-Request-Id` response header, validates inbound IDs against an
allow-list (`[A-Za-z0-9_-]{8,64}`), and clears MDC in `finally` to
prevent leaks across pooled threads (Tomcat platform + virtual-thread carriers).
* **JSON-structured logging** in serving profile via
`logstash-logback-encoder` 9.0 (MIT). One JSON event per log line
with ts/level/logger/thread/msg/stack + all MDC entries +
`application: codeiq` tag. Indexing/CLI profiles keep plaintext.
* **`GraphHealthIndicator` 30s TTL cache** via
`AtomicReference<CachedHealth>` (lock-free). One underlying
count() per 30s regardless of probe rate. Error response sanitized
— `e.getMessage()` no longer surfaces to the permitAll endpoint
(CodeQL CWE-209 again); only `error_class` + log line.
* **Liveness/readiness groups** — `graphHealthIndicator` on
readiness only. Pre-PR-4 a graph outage also failed liveness, so k8s
restarted the pod instead of just routing traffic away.
* **`/actuator/prometheus`** — `micrometer-registry-prometheus`
added; exposed under bearer auth (NOT permitAll — full metrics
tree is reconnaissance). Application tag `codeiq` for multi-pod
scraping. Step 10s.
* **Structured MCP error envelope** — `errorEnvelope(code, e)`
helper returns `{code, message, request_id, error}` (legacy
`error` preserved for backwards-compat). Codes: INTERNAL_ERROR,
INVALID_INPUT, FILE_READ_FAILED, SERIALIZATION_FAILED. Full
exception logged server-side; sanitized envelope to client.
`readFile` no longer concatenates `e.getMessage()` (CWE-209).
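The request-ID handling in the first bullet can be sketched without the servlet API. This is a minimal, stdlib-only illustration, not the PR's `RequestIdFilter`: the class `RequestIdSketch`, its methods, and the `CONTEXT` thread-local (a stand-in for SLF4J's MDC) are hypothetical names, and replacing an invalid inbound ID with a fresh UUID is an assumption about how rejection is handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.regex.Pattern;

/** Hypothetical sketch of allow-list validation plus clear-in-finally. */
final class RequestIdSketch {

    // Allow-list pattern quoted in the change description.
    private static final Pattern VALID = Pattern.compile("[A-Za-z0-9_-]{8,64}");

    // Stand-in for SLF4J's MDC: one mutable map per thread (assumption).
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    /** Returns the inbound ID if it matches the allow-list, else a fresh UUID. */
    static String resolve(String inbound) {
        if (inbound != null && VALID.matcher(inbound).matches()) {
            return inbound;
        }
        return UUID.randomUUID().toString();
    }

    /** Runs downstream work with request_id set; clears it even on the throw path. */
    static String handle(String inboundHeader, Runnable downstream) {
        String id = resolve(inboundHeader);
        CONTEXT.get().put("request_id", id);
        try {
            downstream.run();
            return id; // a real filter would echo this in X-Request-Id
        } finally {
            // Pooled threads are reused, so a stale ID must not survive the request.
            CONTEXT.get().remove("request_id");
        }
    }
}
```

Because `matches()` requires the whole string to fit the pattern, control characters and out-of-bounds lengths both fall through to the generated UUID in this sketch.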
Test coverage
-------------
* New `RequestIdFilterTest` — 7 cases (UUID generation, header
pass-through, control-char rejection, length bounds, MDC
clear-in-finally including throw path).
* `GraphHealthIndicatorTest` — added cache-hit assertion (3 calls
→ 1 underlying `count()`); updated for sanitized error fields.
* `McpToolsTest#readFileShouldHandleMissingFile` — updated for new
envelope contract (asserts `code: FILE_READ_FAILED`).
* Full suite: 3680 tests, 0 failures, 0 errors, 32 skipped.
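The cache-hit assertion (3 calls, 1 underlying `count()`) can be illustrated with a small TTL cache built on `AtomicReference`. This is a hedged sketch rather than the PR's `GraphHealthIndicator`: `TtlCachedCount`, its injectable clock, and the `Cached` record are hypothetical, and the real `CachedHealth` type presumably carries a health status rather than a bare number.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

/** Hypothetical lock-free TTL cache around an expensive count query. */
final class TtlCachedCount {

    // Immutable snapshot: the value plus the instant it stops being valid.
    private record Cached(long value, long expiresAtNanos) {}

    private final AtomicReference<Cached> cache = new AtomicReference<>();
    private final LongSupplier source;  // the expensive call, e.g. a graph count()
    private final LongSupplier clock;   // nanosecond clock, injectable for tests
    private final long ttlNanos;

    TtlCachedCount(LongSupplier source, LongSupplier clock, long ttlNanos) {
        this.source = source;
        this.clock = clock;
        this.ttlNanos = ttlNanos;
    }

    long get() {
        long now = clock.getAsLong();
        Cached current = cache.get();
        // Overflow-safe comparison: still valid while now is before expiry.
        if (current != null && now - current.expiresAtNanos() < 0) {
            return current.value();
        }
        // No lock: threads racing at expiry may each refresh once, which
        // trades a few duplicate queries for never blocking a probe.
        Cached fresh = new Cached(source.getAsLong(), now + ttlNanos);
        cache.set(fresh);
        return fresh.value();
    }
}
```

In production the clock would be `System::nanoTime` and the TTL 30 seconds; injecting the clock lets a test advance time without sleeping.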
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fourth of 5 production-readiness PRs. Closes the missing-MDC, hot-path health probe, MCP error leak, and structured-logging gaps.
Why
Pre-PR-4 every `MDC.get("request_id")` call returned null, so the four consumers (BearerAuthFilter, RateLimitFilter, GraphController, GlobalExceptionHandler) all generated synthetic UUIDs that never correlated. The /actuator/health/readiness probe ran a Cypher count() on every k8s probe (~1Hz). MCP tools returned flat `{error: "..."}` strings with no correlation. Logs were plaintext, unparseable by Loki/Splunk.
Changes
* `RequestIdFilter` (new): outermost in security chain. Populates MDC.request_id, echoes in `X-Request-Id` response header, allow-list validates inbound ([A-Za-z0-9_-]{8,64}), clears MDC in finally.
* JSON-structured logging via `logstash-logback-encoder` 9.0 (MIT). Indexing/CLI profiles keep plaintext.
* `GraphHealthIndicator` 30s TTL cache via `AtomicReference<CachedHealth>` (lock-free). One underlying `count()` per 30s regardless of probe rate. Error response sanitized (CWE-209).
* Liveness/readiness groups: `graphHealthIndicator` on readiness only.
* `/actuator/prometheus`: `micrometer-registry-prometheus` added; exposed under bearer auth (NOT permitAll). Application tag `codeiq`.
* Structured MCP error envelope: `errorEnvelope(code, e)` returns `{code, message, request_id, error}`. Codes: INTERNAL_ERROR, INVALID_INPUT, FILE_READ_FAILED, SERIALIZATION_FAILED. Legacy `error` preserved.
Test plan
* `RequestIdFilterTest` (7 cases: UUID generation, header pass-through, control-char rejection, length bounds, MDC clear-in-finally including throw path)
* `GraphHealthIndicatorTest` cache-hit assertion + sanitized error fields
* `McpToolsTest#readFileShouldHandleMissingFile` updated for new envelope contract
🤖 Generated with Claude Code
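The envelope contract that `McpToolsTest#readFileShouldHandleMissingFile` asserts might look roughly like the sketch below. Only the four field names and the four codes come from the description; the class `ErrorEnvelopeSketch`, the fixed client messages, passing the request ID as a parameter (the real helper presumably reads it from MDC), and having the legacy `error` field mirror `message` are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a sanitized MCP error envelope. */
final class ErrorEnvelopeSketch {

    enum Code { INTERNAL_ERROR, INVALID_INPUT, FILE_READ_FAILED, SERIALIZATION_FAILED }

    private static final System.Logger LOG = System.getLogger("mcp.errors");

    /** Logs the full exception server-side and returns a client-safe envelope. */
    static Map<String, String> errorEnvelope(Code code, Exception e, String requestId) {
        // The exception, including its message and stack, stays in the server log.
        LOG.log(System.Logger.Level.ERROR, "MCP tool failed, request_id=" + requestId, e);

        String message = genericMessage(code);
        Map<String, String> envelope = new LinkedHashMap<>();
        envelope.put("code", code.name());
        envelope.put("message", message);
        envelope.put("request_id", requestId);
        envelope.put("error", message); // legacy field kept for older clients (assumed content)
        return envelope;
    }

    // Fixed text per code, so e.getMessage() never reaches the client (CWE-209).
    private static String genericMessage(Code code) {
        return switch (code) {
            case INVALID_INPUT -> "Invalid input.";
            case FILE_READ_FAILED -> "Could not read the requested file.";
            case SERIALIZATION_FAILED -> "Could not serialize the response.";
            case INTERNAL_ERROR -> "Internal error.";
        };
    }
}
```

A client that receives the envelope can quote `request_id` in a support ticket, and the operator can find the matching server-side log line with the full exception.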