feat(entity-caching-3): feature flag percentage based rollout router changes#2829
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
Codecov Report❌ Patch coverage is
❌ Your project check has failed because the head coverage (10.77%) is below the target coverage (30.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## milinda/entity-caching-2-control-plane-ff-rollout #2829 +/- ##
=====================================================================================
+ Coverage 9.59% 10.77% +1.18%
=====================================================================================
Files 445 468 +23
Lines 56997 59045 +2048
Branches 905 905
=====================================================================================
+ Hits 5468 6362 +894
- Misses 51122 52215 +1093
- Partials 407 468 +61
🚀 New features to boost your workflow:
|
3294faa to
1af9b5f
Compare
The loop at feature_flag_rollout.go:53-58 read `!= nil` where it should
have been `== nil` — every flag that *carried* a traffic_percentage was
skipped, and only preview-only flags were appended. Net effect:
- Real rollout flags never registered → still client-pinnable via
X-Feature-Flag / cookie. The exact "header pin bypass" the design
promises is defeated; a client who knows the flag name escapes the
percentage gate entirely.
- Preview flags ended up in `rolloutFlags` because GetTrafficPercentage()
returns 0 when unset → isRolloutFlag(name)==true → graph_server.go's
pin-strip block fires → existing preview clients lose pin steering.
The doc-comment two lines up ("every flag whose execution config carries
a traffic_percentage participates") had the right intent; the code just
inverted it.
This single-line fix restores the security contract of the feature.
Enabling rollouts flips the routing semantics for every flag whose
execution config carries a traffic_percentage — header/cookie pins start
being ignored. That is a behavior change that should not happen on a
silent router upgrade.
Per router/CLAUDE.md ("Boolean defaults should be `false` when possible"),
flip envDefault to false, the schema's documented default to false, and
update config_defaults.json to match. fixtures/full.yaml keeps
`enabled: true` — it is a feature-on example, not a defaults snapshot.
… whole graph The previous logic returned an error and a nil selector when cumulative percentage exceeded 100. graph_server.go logs the error and proceeds with selector==nil — but selector==nil silently restores client-pinnability for every other rollout flag on the graph, because isRolloutFlag returns false for all of them. One operator typo (110% instead of 10%) blew up every sibling rollout's invariant. New semantics: - A flag whose own pct > 100 is always a typo — drop it, log Error, keep going. (Pre-fix this returned an error.) - Otherwise iterate the alphabetically-sorted flags; if the next would push cumulative past 100, drop *that* flag (log Error) and keep filling with the remaining ones. The dropped flag's traffic falls through to base. - Selector still returns nil only when nothing usable remains. This computes the budget gate inline as we build, so partial state is never exposed (pre-fix: rules were appended before the post-loop sum check, which would have leaked partial state on any future refactor that returned a half-built selector). Tests TestNewRolloutSelector_FailsClosedOnOverflow and TestNewRolloutSelector_RejectsPercentageAbove100 are renamed to …DropsOverflowingFlagButKeepsSiblings and …DropsAbove100PercentFlag and updated to assert the new drop-but-preserve-siblings behavior.
crypto/rand.Read is a getrandom(2) syscall on Linux — at 50k req/s that is 50k syscalls/s on the hot path purely for non-security bucketing. Bucketing has no security requirement: the bucket value is never visible to the client, and is not derived from request content under the current no-stickiness design (a determined client can already retry-grind via the X-Feature-Flag echo, which is the actual surface to address — separate fix). math/rand/v2.Uint32N is lock-free per goroutine, ~5ns, and rejection- samples to eliminate the modulo bias the previous `% rolloutBucketScale` introduced (≈1 in 430k buckets over-represented). Drops the crypto/rand fallback path (no longer needed) and the now-unused encoding/binary import.
… picks
The selector returns ("flag", "random", true) for traffic landing in a
rollout bucket; the request handler then ran ServeHTTP and unconditionally
set X-Feature-Flag: <picked> on the response. A determined client can
retry until that header reveals the variant, then pin via X-Feature-Flag
header / cookie — defeating the percentage gate the rollout selector
exists to enforce. Echo is preserved for explicit (preview-flag) pins
because the client already chose the flag.
Also drops the dead `_ string // versionSeed: unused` parameter from
newRolloutSelector and the corresponding routerConfig.GetVersion()
threading from graph_server.go. The parameter hinted at hash-based
stickiness that never landed; deleting it removes the cargo-culted
plumbing. If stickiness is wanted later, the right place is `pick`'s
already-existing *http.Request parameter.
Tests updated for the new signature; behavior unchanged where pickedSource
is exercised.
3ba5c01 to
04c9de5
Compare
Router-nonroot image scan passed✅ No security vulnerabilities found in image: |
… picks A previous fix in this branch stripped the X-Feature-Flag response echo when the selector picked the variant via random bucketing, on the theory that the echo lets a client retry until they grind onto the variant. In practice the variant must have an observably different response shape — otherwise the rollout has no purpose. The shape difference itself is a side-channel a grinder can use, so the strip doesn't actually prevent the attack it was meant to address. Meanwhile it: - Hides a useful debug signal from honest clients. - Breaks downstream caches that key on the flag. - Complicates integration tests that need to know which variant served (router-tests/operations/feature_flag_rollouts_test.go was relying on the header, and now correctly does so again). Reinstates the unconditional w.Header().Set(featureFlagHeader, ff) and drops the now-unused `source` capture from selector.pick. The Debug log for ignored pins (added in a prior commit) stays — that observability fix is unrelated and still useful.
res.Body is the raw response payload — a JSON string in which the inner double-quotes around field names are escaped as \". The discriminator constant was searching for `Cannot query field "productCount"` (plain quotes), which never matches the on-the-wire `Cannot query field \"productCount\"`. Every test that asserted the base graph rejected `productCount` (header- or cookie-pin-ignored variants, 0% rollout, >100% fail-closed) was silently broken — the require.Contains assertion passed only when the body genuinely lacked the substring, which is the opposite of the intended check. Now they pass when the base graph errors as expected.
Touches a path under proto/ to retrigger the Proto CI workflow on this branch. The previous Proto CI run on 3ba5c01 left a stale "uncommitted changes detected" sticky comment that never cleared because subsequent commits on this branch shared SHAs with branch 2 (where Proto CI ran once and succeeded). Locally `make generate-go` is clean against this proto state, so this push should produce a green Proto CI and clear the sticky.
…-feature-flag-rollout-router
…-feature-flag-rollout-router
…tion feature_flags.id has ON DELETE SET NULL into federated_graphs_to_feature_flag_schema_versions, so a flag composition's fgffsv row survives tear-down with feature_flag_id = NULL. The previous GetCacheCohortConfigVersions handler filtered the main cohort with feature_flag_id IS NULL, leaking those orphans into Before. Switch to graph_compositions.is_feature_flag_composition — set at composition time and never overwritten — and extend the test to seed an orphan row that the main cohort must exclude. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ffsv federated_graphs_to_feature_flag_schema_versions only stores feature-flag compositions. Main compositions never have a row there; they live in graph_compositions (joined to federated_graphs via schema_versions.target_id). The previous handler queried fgffsv for both cohorts and returned an empty array for main, which silently blanked the Rollout tab's Before card. Split the handler into two paths: - flag cohort: keep the fgffsv path, filtered by feature_flag_id and the is_feature_flag_composition mirror. - main cohort: query graph_compositions JOIN schema_versions JOIN federated_graphs with is_feature_flag_composition=false. Update the test fixture to mint main and flag compositions through the correct tables and clear pre-existing graph_compositions rows for the test graphs so strict equality assertions hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…D_HASH events
The cache event builder hardcoded an empty string for the subgraph
parameter and dropped FieldPath when translating EntityFieldHash into
the proto event. This left the SubgraphId and FieldPath columns blank
in gql_cache_events_raw for every field_hash row, making downstream
analytics (heatmap, hub recommendations) unable to slice value-hash
stability or change rates by subgraph.
The upstream EntityFieldHash struct in graphql-go-tools already carries
both fields — DataSource is stamped per-entity-scope by the resolver
via EntityDataSource() with a plan-time SourceName fallback, and
FieldPath is the schema-name chain from the enclosing entity down to
the hashed leaf. Both have been available since the engine bump that
landed alongside the cache-analytics rework; the stale comment in this
file ("once those fields land upstream") was a TODO from before that.
Update the existing FIELD_HASH test to assert both columns propagate
correctly. The test entry now carries DataSource: "accounts" and
FieldPath: ["address"] (matching the docstring example of
User.address.street); the assertion block confirms ev.SubgraphId and
ev.FieldPath on the produced proto event.
No protocol change. The proto fields already existed (positions 14 and
93 on CacheEvent); the writer column already existed (columns 14 and
18 of gql_cache_events_raw); only the in-router translation was
dropping them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
When events export is enabled, the router writes one FIELD_SELECTION row per Object/Array accessor selected inside an entity scope, in addition to the existing per-leaf FIELD_HASH rows. Emission rides on EventsExport.Enabled — no separate sub-flag. Downstream consumers filter via the EventType column if they only want field_hash. The accessor row gives downstream coverage analytics (hub) the signal to count an accessor as observed even though cosmo's router flattens nested reads onto the parent entity and never emits a row keyed by the accessor name on its own. Wire change: - proto: new EventType.FIELD_SELECTION = 13, new string child_type_name field at slot 94 (reserved 95 for a future concrete_type_name). - engine flag plumbed through CachingOptions.EmitFieldSelections; router sets it to true whenever EventsExport.Enabled is true. - router builder mirrors the FIELD_HASH PII guard (drop on KeyHash=0). - writer appends ChildTypeName as the last column; new ClickHouse migration adds the column (additive, no row rewrite). - end-to-end router-test asserts FIELD_SELECTION for Warehouse.location emits with ChildTypeName=Location, non-zero KeyHash, zero FieldHash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
controlplane - uncommitted changes detectedSeems like you forgot to commit some code. Possible causes:
Dirty files
|
@coderabbitai summary
Checklist
Open Source AI Manifesto
This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.