Skip to content

feat(entity-caching-3): feature flag percentage based rollout router changes#2829

Draft
SkArchon wants to merge 23 commits into
milinda/entity-caching-2-control-plane-ff-rolloutfrom
milinda/entity-caching-3-feature-flag-rollout-router
Draft

feat(entity-caching-3): feature flag percentage based rollout router changes#2829
SkArchon wants to merge 23 commits into
milinda/entity-caching-2-control-plane-ff-rolloutfrom
milinda/entity-caching-3-feature-flag-rollout-router

Conversation

@SkArchon
Copy link
Copy Markdown
Contributor

@SkArchon SkArchon commented May 6, 2026

@coderabbitai summary

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/docs-website.
  • I have read the Contributors Guide.

Open Source AI Manifesto

This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: df0b061e-2895-4920-af74-3b65b084c404

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the router label May 6, 2026
@SkArchon SkArchon changed the title feat: ff percentage based rollout router feat(entity-caching-3): feature flag percentage based rollout router changes May 6, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 10.77%. Comparing base (5e1b5d7) to head (cbd541c).

Files with missing lines Patch % Lines
...en/proto/wg/cosmo/cacheevents/v1/cacheevents.pb.go 0.00% 4 Missing ⚠️
graphqlmetrics/cacheevents/schema.go 0.00% 2 Missing ⚠️

❌ Your project check has failed because the head coverage (10.77%) is below the target coverage (30.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@                                  Coverage Diff                                  @@
##           milinda/entity-caching-2-control-plane-ff-rollout    #2829      +/-   ##
=====================================================================================
+ Coverage                                               9.59%   10.77%   +1.18%     
=====================================================================================
  Files                                                    445      468      +23     
  Lines                                                  56997    59045    +2048     
  Branches                                                 905      905              
=====================================================================================
+ Hits                                                    5468     6362     +894     
- Misses                                                 51122    52215    +1093     
- Partials                                                 407      468      +61     
Files with missing lines Coverage Δ
graphqlmetrics/cacheevents/writer.go 91.00% <100.00%> (ø)
graphqlmetrics/cacheevents/schema.go 97.05% <0.00%> (ø)
...en/proto/wg/cosmo/cacheevents/v1/cacheevents.pb.go 5.93% <0.00%> (ø)

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@SkArchon SkArchon force-pushed the milinda/entity-caching-2-control-plane-ff-rollout branch from 3294faa to 1af9b5f Compare May 6, 2026 15:20
SkArchon added 7 commits May 6, 2026 22:54
The loop at feature_flag_rollout.go:53-58 read `!= nil` where it should
have been `== nil` — every flag that *carried* a traffic_percentage was
skipped, and only preview-only flags were appended. Net effect:

- Real rollout flags never registered → still client-pinnable via
  X-Feature-Flag / cookie. The exact "header pin bypass" the design
  promises is defeated; a client who knows the flag name escapes the
  percentage gate entirely.
- Preview flags ended up in `rolloutFlags` because GetTrafficPercentage()
  returns 0 when unset → isRolloutFlag(name)==true → graph_server.go's
  pin-strip block fires → existing preview clients lose pin steering.

The doc-comment two lines up ("every flag whose execution config carries
a traffic_percentage participates") had the right intent; the code just
inverted it.

This single-line fix restores the security contract of the feature.
Enabling rollouts flips the routing semantics for every flag whose
execution config carries a traffic_percentage — header/cookie pins start
being ignored. That is a behavior change that should not happen on a
silent router upgrade.

Per router/CLAUDE.md ("Boolean defaults should be `false` when possible"),
flip envDefault to false, the schema's documented default to false, and
update config_defaults.json to match. fixtures/full.yaml keeps
`enabled: true` — it is a feature-on example, not a defaults snapshot.
… whole graph

The previous logic returned an error and a nil selector when cumulative
percentage exceeded 100. graph_server.go logs the error and proceeds with
selector==nil — but selector==nil silently restores client-pinnability for
every other rollout flag on the graph, because isRolloutFlag returns false
for all of them. One operator typo (110% instead of 10%) blew up every
sibling rollout's invariant.

New semantics:
- A flag whose own pct > 100 is always a typo — drop it, log Error, keep
  going. (Pre-fix this returned an error.)
- Otherwise iterate the alphabetically-sorted flags; if the next would push
  cumulative past 100, drop *that* flag (log Error) and keep filling with
  the remaining ones. The dropped flag's traffic falls through to base.
- Selector still returns nil only when nothing usable remains.

This computes the budget gate inline as we build, so partial state is never
exposed (pre-fix: rules were appended before the post-loop sum check, which
would have leaked partial state on any future refactor that returned a
half-built selector).

Tests TestNewRolloutSelector_FailsClosedOnOverflow and
TestNewRolloutSelector_RejectsPercentageAbove100 are renamed to
…DropsOverflowingFlagButKeepsSiblings and …DropsAbove100PercentFlag and
updated to assert the new drop-but-preserve-siblings behavior.
crypto/rand.Read is a getrandom(2) syscall on Linux — at 50k req/s that
is 50k syscalls/s on the hot path purely for non-security bucketing.
Bucketing has no security requirement: the bucket value is never visible
to the client, and is not derived from request content under the current
no-stickiness design (a determined client can already retry-grind via the
X-Feature-Flag echo, which is the actual surface to address — separate
fix).

math/rand/v2.Uint32N is lock-free per goroutine, ~5ns, and rejection-
samples to eliminate the modulo bias the previous `% rolloutBucketScale`
introduced (≈1 in 430k buckets over-represented).

Drops the crypto/rand fallback path (no longer needed) and the now-unused
encoding/binary import.
… picks

The selector returns ("flag", "random", true) for traffic landing in a
rollout bucket; the request handler then ran ServeHTTP and unconditionally
set X-Feature-Flag: <picked> on the response. A determined client can
retry until that header reveals the variant, then pin via X-Feature-Flag
header / cookie — defeating the percentage gate the rollout selector
exists to enforce. Echo is preserved for explicit (preview-flag) pins
because the client already chose the flag.

Also drops the dead `_ string // versionSeed: unused` parameter from
newRolloutSelector and the corresponding routerConfig.GetVersion()
threading from graph_server.go. The parameter hinted at hash-based
stickiness that never landed; deleting it removes the cargo-culted
plumbing. If stickiness is wanted later, the right place is `pick`'s
already-existing *http.Request parameter.

Tests updated for the new signature; behavior unchanged where pickedSource
is exercised.
@SkArchon SkArchon closed this May 6, 2026
@SkArchon SkArchon force-pushed the milinda/entity-caching-3-feature-flag-rollout-router branch from 3ba5c01 to 04c9de5 Compare May 6, 2026 17:50
@SkArchon SkArchon reopened this May 6, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-10af5c602db85b6585758340bc97514087e57a26-nonroot

SkArchon added 3 commits May 6, 2026 23:36
… picks

A previous fix in this branch stripped the X-Feature-Flag response echo
when the selector picked the variant via random bucketing, on the theory
that the echo lets a client retry until they grind onto the variant.

In practice the variant must have an observably different response shape
— otherwise the rollout has no purpose. The shape difference itself is a
side-channel a grinder can use, so the strip doesn't actually prevent the
attack it was meant to address. Meanwhile it:

- Hides a useful debug signal from honest clients.
- Breaks downstream caches that key on the flag.
- Complicates integration tests that need to know which variant served
  (router-tests/operations/feature_flag_rollouts_test.go was relying on
  the header, and now correctly does so again).

Reinstates the unconditional w.Header().Set(featureFlagHeader, ff) and
drops the now-unused `source` capture from selector.pick. The Debug log
for ignored pins (added in a prior commit) stays — that observability
fix is unrelated and still useful.
res.Body is the raw response payload — a JSON string in which the inner
double-quotes around field names are escaped as \". The discriminator
constant was searching for `Cannot query field "productCount"` (plain
quotes), which never matches the on-the-wire `Cannot query field \"productCount\"`.

Every test that asserted the base graph rejected `productCount` (header-
or cookie-pin-ignored variants, 0% rollout, >100% fail-closed) was
silently broken — the require.Contains assertion passed only when the
body genuinely lacked the substring, which is the opposite of the
intended check. Now they pass when the base graph errors as expected.
Touches a path under proto/ to retrigger the Proto CI workflow on this
branch. The previous Proto CI run on 3ba5c01 left a stale "uncommitted
changes detected" sticky comment that never cleared because subsequent
commits on this branch shared SHAs with branch 2 (where Proto CI ran
once and succeeded). Locally `make generate-go` is clean against this
proto state, so this push should produce a green Proto CI and clear
the sticky.
SkArchon and others added 10 commits May 7, 2026 00:10
…tion

feature_flags.id has ON DELETE SET NULL into
federated_graphs_to_feature_flag_schema_versions, so a flag composition's
fgffsv row survives tear-down with feature_flag_id = NULL. The previous
GetCacheCohortConfigVersions handler filtered the main cohort with
feature_flag_id IS NULL, leaking those orphans into Before. Switch to
graph_compositions.is_feature_flag_composition — set at composition time and
never overwritten — and extend the test to seed an orphan row that the main
cohort must exclude.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ffsv

federated_graphs_to_feature_flag_schema_versions only stores feature-flag
compositions. Main compositions never have a row there; they live in
graph_compositions (joined to federated_graphs via schema_versions.target_id).
The previous handler queried fgffsv for both cohorts and returned an empty
array for main, which silently blanked the Rollout tab's Before card.

Split the handler into two paths:
- flag cohort: keep the fgffsv path, filtered by feature_flag_id and the
  is_feature_flag_composition mirror.
- main cohort: query graph_compositions JOIN schema_versions JOIN
  federated_graphs with is_feature_flag_composition=false.

Update the test fixture to mint main and flag compositions through the
correct tables and clear pre-existing graph_compositions rows for the test
graphs so strict equality assertions hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

…D_HASH events

The cache event builder hardcoded an empty string for the subgraph
parameter and dropped FieldPath when translating EntityFieldHash into
the proto event. This left the SubgraphId and FieldPath columns blank
in gql_cache_events_raw for every field_hash row, making downstream
analytics (heatmap, hub recommendations) unable to slice value-hash
stability or change rates by subgraph.

The upstream EntityFieldHash struct in graphql-go-tools already carries
both fields — DataSource is stamped per-entity-scope by the resolver
via EntityDataSource() with a plan-time SourceName fallback, and
FieldPath is the schema-name chain from the enclosing entity down to
the hashed leaf. Both have been available since the engine bump that
landed alongside the cache-analytics rework; the stale comment in this
file ("once those fields land upstream") was a TODO from before that.

Update the existing FIELD_HASH test to assert both columns propagate
correctly. The test entry now carries DataSource: "accounts" and
FieldPath: ["address"] (matching the docstring example of
User.address.street); the assertion block confirms ev.SubgraphId and
ev.FieldPath on the produced proto event.

No protocol change. The proto fields already existed (positions 14 and
93 on CacheEvent); the writer column already existed (columns 14 and
18 of gql_cache_events_raw); only the in-router translation was
dropping them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

When events export is enabled, the router writes one FIELD_SELECTION
row per Object/Array accessor selected inside an entity scope, in
addition to the existing per-leaf FIELD_HASH rows. Emission rides on
EventsExport.Enabled — no separate sub-flag. Downstream consumers
filter via the EventType column if they only want field_hash.

The accessor row gives downstream coverage analytics (hub) the signal
to count an accessor as observed even though cosmo's router flattens
nested reads onto the parent entity and never emits a row keyed by the
accessor name on its own.

Wire change:
- proto: new EventType.FIELD_SELECTION = 13, new string child_type_name
  field at slot 94 (reserved 95 for a future concrete_type_name).
- engine flag plumbed through CachingOptions.EmitFieldSelections;
  router sets it to true whenever EventsExport.Enabled is true.
- router builder mirrors the FIELD_HASH PII guard (drop on KeyHash=0).
- writer appends ChildTypeName as the last column; new ClickHouse
  migration adds the column (additive, no row rewrite).
- end-to-end router-test asserts FIELD_SELECTION for Warehouse.location
  emits with ChildTypeName=Location, non-zero KeyHash, zero FieldHash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

controlplane - uncommitted changes detected

Seems like you forgot to commit some code. Possible causes:

  • Generated code not part of the PR, fix with: make generate and commit the changes
  • Dependency mismatch for tools (protoc, etc). Ensure your local machine has same versions of tools as CI does
  • Formatting drift, fix with make format controlplane / pnpm format controlplane

Dirty files
  • controlplane/src/core/bufservices/feature-flag/getCacheCohortConfigVersions.ts
  • controlplane/test/feature-flag/get-cache-cohort-config-versions.test.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant