Skip to content

feat(encryption): Stage 6B-2 — KEK plumbing + --encryption-enabled flag + mutator-RPC gate#776

Open
bootjp wants to merge 4 commits into
mainfrom
feat/encryption-stage6b2
Open

feat(encryption): Stage 6B-2 — KEK plumbing + --encryption-enabled flag + mutator-RPC gate#776
bootjp wants to merge 4 commits into
mainfrom
feat/encryption-stage6b2

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 16, 2026

Summary

Stage 6B-2 per the PR #762 plan. Completes Stage 6B with the operator-facing surface: --kekFile / --encryption-enabled flags, KEK loader at startup, applier threading, and the re-enabled mutator-RPC wiring in registerEncryptionAdminServer gated on (--encryption-enabled AND --kekFile non-empty).

Builds on Stage 6B-1 (PR #768) which landed the real ApplyBootstrap / ApplyRotation paths. This PR wires those paths into production by threading KEK + Keystore + sidecar path through the FSM construction and re-enabling the EncryptionAdmin mutator RPCs under a double gate.

New flags

Flag Default Purpose
--kekFile "" §5.1 KEK file path (32 raw bytes, owner-only mode). Constructs kek.FileWrapper at startup; threaded into the §6.3 EncryptionApplier. --kekUri deferred to Stage 9 (KMS providers).
--encryption-enabled false §6.5 opt-in to the mutating EncryptionAdmin RPCs. Required (alongside --kekFile) for BootstrapEncryption/RotateDEK/RegisterEncryptionWriter to wire.

Double-gate

Both flags are independently necessary for the mutator wiring:

func encryptionMutatorsEnabled() bool {
    return *encryptionEnabled && *kekFile != ""
}
  • --encryption-enabled alone → no KEK loaded → mutators wired but ApplyBootstrap would HaltApply with ErrKEKNotConfigured on every replica. The RPC-layer gate refuses BEFORE the proposal commits.
  • --kekFile alone → no operator opt-in → mutators stay refused at the RPC boundary.
  • BOTH set → WithEncryptionAdminProposer + WithEncryptionAdminLeaderView installed and mutators reach the §6.3 applier through the shard's engine.

Caller audit (semantic changes — signatures)

  • buildShardGroups gains 3 new tail params (kekWrapper *kek.FileWrapper, keystore *encryption.Keystore, sidecarPath string). 1 production caller (main.go run()), 3 test callers (multiraft_runtime_test.go ×2, main_bootstrap_e2e_test.go ×1) — all updated to pass nil, nil, "" for the no-encryption posture.

  • registerEncryptionAdminServer gains 2 new tail params (enableMutators bool, engine encryptionAdminEngine). 1 production caller (main.go startRaftServers), 2 test callers (main_encryption_admin_test.go) — all updated.

  • The mutator gate is fail-CLOSED by default at every site that does not explicitly opt in.

Test surface

Test Pins
TestEncryptionAdmin_MutatingRPCRefusedWhenGateOff (3 sub-cases) All 3 corners of (enableMutators, engine) with at least one false → FailedPrecondition
TestEncryptionAdmin_MutatingRPCEnabledWhenGateOn The (true, non-nil) corner — asserts the FailedPrecondition gate is NO LONGER firing

A stubEncryptionAdminEngine satisfies the encryptionAdminEngine interface for the gate-on test (no actual Raft proposal — the stub Propose() returns success).

Stage 5D safety boundary preserved

The Stage 5D regression — operators who set --encryptionSidecarPath alone must not see mutators wired — is preserved by the double gate. sidecarPath alone is strictly the capability surface; the mutator wiring is gated on the AND of two SEPARATE flags.

Five-lens self-review

  1. Data loss — no data path touched. The applier remains fail-closed under HaltApply for any ErrEncryptionApply return. Pebble Sync semantics unchanged.
  2. Concurrency / distributed failures — shared Keystore internally locked; KEK wrapper documented safe for concurrent use; per-shard applier reads from shared instances.
  3. Performance — no hot-path change. KEK loading is a single file read at startup. Per-shard applier construction adds one slice allocation per group (0–3 element slice).
  4. Data consistency — mutator gate is the load-bearing safety boundary. With both gate inputs true, the applier still has Stage 6B-1's input-validation guards (bootstrap idempotency, foreign-DEK batch rejection, proposer-DEK mismatch, etc.).
  5. Test coverage — 4 gate corners + the existing applier tests from 6B-1 + the buildShardGroups callers.

Test plan

  • go test -race -timeout=120s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer|TestApply|TestNewApplier|TestWriterRegistry' ./internal/encryption/... ./store/... ./kv/... . — PASS
  • go build ./... — PASS
  • golangci-lint run ./... — 0 issues on touched files (cyclop / mnd under budget via factored loadKEKWrapperFromFlag + applierOptionsFor + encryptionMutatorsEnabled helpers)
  • Full Jepsen suite — not run in this PR (no envelope cutover happens until Stage 6D; the mutator path is reachable but the operator must hand-craft a BootstrapEncryption RPC, which is not part of the Jepsen workload set)

Plan

Stage 6B is now complete. The next stages per PR #762:

  • 6C — §9.1 startup refusal guards
  • 6Denable-storage-envelope admin RPC + §7.1 Phase-1 cutover
  • 6Eenable-raft-envelope admin RPC + §7.1 Phase-2 cutover
  • 6F--encryption-rotate-on-startup ergonomics

Summary by CodeRabbit

Release Notes

  • New Features

    • Added encryption opt-in configuration with new command-line flags for enabling encryption support.
    • Encryption admin mutations are now conditionally gated and fail safely when encryption is not properly configured.
  • Tests

    • Updated test coverage for encryption admin server gating behavior.

Review Change Stack

…ed flag + mutator-RPC gate

Stage 6B-2 per the PR #762 plan. Completes Stage 6B by wiring
the operator-facing surface: --kekFile / --encryption-enabled
flags, KEK loader at startup, applier threading, and the
re-enabled mutator-RPC wiring in registerEncryptionAdminServer
gated on (--encryption-enabled AND --kekFile non-empty).

## What this commit ships

### New flags (main.go)

- --kekFile (default "") — §5.1 KEK file path (32 raw bytes,
  owner-only mode). When set, the file-backed kek.Wrapper is
  constructed at startup and threaded into the §6.3
  EncryptionApplier so ApplyBootstrap / ApplyRotation can
  KEK-unwrap. --kekUri is reserved for KMS providers in
  Stage 9; only file backing ships in 6B-2.

- --encryption-enabled (default false) — §6.5 opt-in to the
  mutating EncryptionAdmin RPCs. Operator-explicit gate: an
  unset flag means the cluster has not committed to the §7.1
  rollout, so mutators MUST refuse even on a fully-keyed
  binary.

### Startup wiring (main.go run())

- loadKEKWrapperFromFlag() constructs the file-backed wrapper
  from --kekFile (or returns nil if empty). Extracted to keep
  run() under the cyclop budget.

- A single shared *encryption.Keystore is created once per
  process. Both Stage 6A (applier) and Stage 6D (storage
  cipher; not in this PR) read from the same instance so
  post-bootstrap DEKs are visible cluster-wide.

- buildShardGroups gains three new tail parameters:
  (kekWrapper, keystore, sidecarPath). These thread through
  to the per-shard FSM construction.

### Per-shard applier construction (main.go buildShardGroups)

applierOptionsFor() helper assembles the variadic
ApplierOption slice based on which Stage 6B-2 dependencies
the operator wired:

  - kekWrapper != nil → WithKEK(kekWrapper)
  - keystore != nil   → WithKeystore(keystore)
  - sidecarPath != "" → WithSidecarPath(sidecarPath)

Without any of them, the applier stays in the Stage 6A
posture (ApplyBootstrap / ApplyRotation return
ErrKEKNotConfigured at apply time). This is the desired
fail-closed behaviour for clusters that have not opted in.

### Mutator gate (main_encryption_admin.go)

registerEncryptionAdminServer regains its (engine, enableMutators)
parameters that Stage 5D removed. When enableMutators=true
AND engine!=nil, WithEncryptionAdminProposer +
WithEncryptionAdminLeaderView are wired and mutators reach
Raft. When either condition is false, both options stay off
and mutators continue to refuse with FailedPrecondition at
the RPC boundary.

encryptionMutatorsEnabled() is the readback: returns true
iff *encryptionEnabled && *kekFile != "". Both conditions
are independently necessary:

  - --encryption-enabled is the explicit operator opt-in.
  - --kekFile being non-empty means a KEK source is loaded;
    without it, a mutator that committed would land in the
    applier with no KEK and HaltApply on every replica. The
    RPC-layer gate keeps that halt unreachable.

The encryptionAdminEngine local interface is re-introduced
(raftengine.Proposer ∩ raftengine.LeaderView) so the
registration helper does not pull in the full engine type.

### Tests (main_encryption_admin_test.go)

Existing TestEncryptionAdmin_MutatingRPCRefusedUntilStage6
replaced with two complementary tests covering all 4 corners
of the (enableMutators, engine) gate matrix:

- TestEncryptionAdmin_MutatingRPCRefusedWhenGateOff: 3
  sub-cases (flag_off_engine_nil, flag_off_engine_set,
  flag_on_engine_nil) — all return FailedPrecondition.

- TestEncryptionAdmin_MutatingRPCEnabledWhenGateOn: the
  remaining (flag_on_engine_set) corner — asserts that the
  FailedPrecondition gate is NO LONGER firing (deeper status
  codes from the empty-payload validation are fine; what
  matters is the gate boundary).

A stubEncryptionAdminEngine fake satisfies the
encryptionAdminEngine interface for the gate-on test.

### Tests updated for buildShardGroups signature

multiraft_runtime_test.go (2 sites) and
main_bootstrap_e2e_test.go (1 site) updated to pass nil,
nil, "" for the 3 new tail parameters. These tests do not
exercise encryption paths; the no-options posture leaves
the FSM in Stage 6A behavior.

## Caller audit (semantic changes — signatures)

- buildShardGroups: 3 new tail parameters. 1 production
  caller (main.go run() at line 337), 3 test callers
  updated. All updated together.

- registerEncryptionAdminServer: 2 new tail parameters
  (enableMutators bool, engine encryptionAdminEngine).
  1 production caller (main.go startRaftServers at line
  ~1390), 2 test callers updated.

- The mutator gate is fail-CLOSED by default: enableMutators
  defaults to false at every test site (no fake engine
  supplied), so the operator-surface posture is identical
  to Stage 5D / 6A in the absence of explicit opt-in.

- main.go run() now constructs a process-wide *Keystore and
  optionally a *kek.FileWrapper. Both live for the process
  lifetime and are threaded to every shard's applier.

## Stage 5D safety boundary

The Stage 5D regression — operators who set
--encryptionSidecarPath alone must not see mutators wired —
is preserved by the double gate. sidecarPath alone is
strictly the capability surface; the mutator wiring is gated
on the AND of two SEPARATE flags.

## Five-lens self-review

1. Data loss — no data path touched. The applier remains
   fail-closed under HaltApply for any ErrEncryptionApply
   return. Pebble Sync semantics unchanged.

2. Concurrency / distributed failures — the shared Keystore
   is internally locked (sync.RWMutex per encryption.Keystore).
   The KEK wrapper is documented as safe for concurrent use.
   Per-shard applier reads from the shared instances.

3. Performance — no hot-path change. KEK loading is a single
   file read at startup. Per-shard applier construction adds
   one slice allocation per group (applierOptionsFor returns
   a 0-3 element slice).

4. Data consistency — the mutator gate is the load-bearing
   safety boundary. With either gate input false, the RPC
   layer refuses; with both true, the applier still has the
   §5.6 / §5.2 input-validation guards from Stage 6B-1
   (bootstrap idempotency, foreign-DEK batch rejection,
   proposer-DEK mismatch, etc.).

5. Test coverage — 4 gate corners + the existing applier
   tests + the buildShardGroups callers. The gate matrix is
   exhaustively tested.

## Verification

- go test -race -timeout=120s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer|TestApply|TestNewApplier|TestWriterRegistry' ./internal/encryption/... ./store/... ./kv/... . — PASS
- go build ./... — PASS
- golangci-lint run ./... — 0 issues on touched files (cyclop / mnd under budget via factored helpers)

## Plan

Stage 6B is now complete (6B-1 applier internals + 6B-2 main.go
wiring + flag + gate). The next stages per PR #762:

- 6C: §9.1 startup refusal guards (sidecar present without
  flag, KEK mismatch, sidecar/raft index gap, fsync support,
  local_epoch rollback / exhaustion, node_id collision).
- 6D: §6.6 enable-storage-envelope admin RPC + §7.1 Phase-1
  storage cutover.
- 6E: §6.6 enable-raft-envelope admin RPC + §7.1 Phase-2
  raft cutover + ErrRaftUnwrapFailed.
- 6F: §6.5 --encryption-rotate-on-startup ergonomics.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 37 minutes and 35 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3903ccc-6d12-4a4a-b2d7-63e6e6021918

📥 Commits

Reviewing files that changed from the base of the PR and between 1e15568 and aad7325.

📒 Files selected for processing (3)
  • main.go
  • main_encryption_admin.go
  • main_encryption_admin_test.go
📝 Walkthrough

Walkthrough

Adds encryption opt-in wiring for Stage 6B-2: new CLI flags (--encryption-enabled, --kekFile) define mutator capability gating; startup loads KEK wrapper and keystore and threads them into buildShardGroups, where shard appliers receive conditional Stage 6B-2 options; registerEncryptionAdminServer now gates mutating RPCs behind encryptionMutatorsEnabled() and passes the enablement flag alongside the engine; tests verify gRPC boundary behavior with new double-gate tests.

Changes

Stage 6B-2 Encryption Opt-In and Admin Gating

Layer / File(s) Summary
Encryption configuration flags and startup initialization
main.go
Introduces CLI flags --encryption-enabled, --kekFile, and updates --encryptionSidecarPath semantics; imports KEK wrapper type; loads KEK wrapper and shared keystore at startup; provides helper functions for nil-safe KEK loading and conditional applier option assembly.
Shard group encryption applier wiring with Stage 6B-2 options
main.go
Updates buildShardGroups function signature to accept KEK wrapper, keystore, and sidecar path; threads these into the call site; constructs shard encryption applier with conditional WithKEK, WithKeystore, and WithSidecarPath options only when supplied, leaving applier in fail-closed Stage 6A posture otherwise.
EncryptionAdmin mutator gating and server registration
main_encryption_admin.go, main.go
Adds encryptionAdminEngine interface and encryptionMutatorsEnabled() helper to gate mutator wiring; updates registerEncryptionAdminServer to accept enablement flag and engine, conditionally wiring WithEncryptionAdminProposer and WithEncryptionAdminLeaderView only when gating is on; updates startup wiring in startRaftServers to compute and pass mutator enablement alongside sidecar path and engine.
Test infrastructure and gating validation
main_encryption_admin_test.go, main_bootstrap_e2e_test.go, multiraft_runtime_test.go
Adds stubEncryptionAdminEngine stub for testing; introduces TestEncryptionAdmin_MutatingRPCRefusedWhenGateOff and TestEncryptionAdmin_MutatingRPCEnabledWhenGateOn to verify gRPC BootstrapEncryption fails with FailedPrecondition when mutators are disabled or engine is nil, and succeeds when gate is on; adds callBootstrapAgainstNewServer helper; updates test call sites to use updated buildShardGroups signature with new encryption dependency arguments.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • bootjp/elastickv#768: Wires encryption applier using WithKEK, WithKeystore, and WithSidecarPath, which depends on Stage 6B ApplyBootstrap/ApplyRotation implementation and KEK/sidecar option plumbing from this PR's dependency.
  • bootjp/elastickv#760: Introduces encryption admin server contract and mutator wiring; this PR builds on it by gating registerEncryptionAdminServer mutators with enableMutators flag based on new --encryption-enabled/--kekFile semantics.
  • bootjp/elastickv#759: Modifies EncryptionAdmin BootstrapEncryption mutating RPC adapter-side validation; this PR updates the server-side gating logic and test expectations around FailedPrecondition behavior when mutators are gated off.

Poem

🐰 With flags now set and wrappers bound,
The KEK awakens with not a sound.
Stage 6B-2 takes its rightful place—
Encryption gated with style and grace.
The tests confirm the gate holds true,
When options flow, the appliers brew! 🔐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: Stage 6B-2 encryption implementation with KEK plumbing, a new encryption-enabled flag, and mutator-RPC gating logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-stage6b2

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Stage 6B-2 — completes Stage 6B with the operator-facing surface (--kekFile + --encryption-enabled flags + KEK loader + applier wiring + mutator-RPC gate).

Looking for review on:

  1. Double-gate semanticsencryptionMutatorsEnabled() = *encryptionEnabled && *kekFile != "". Both conditions argued independently necessary: flag-only → ApplyBootstrap halts on KEK; KEK-only → no operator opt-in. Sound?

  2. Test matrix coverage — 4 corners of (enableMutators, engine) exhaustively pinned: 3 corners refuse with FailedPrecondition, 1 corner (gate-on) asserts the gate is no longer firing. Anything missing?

  3. Shared Keystore lifetime — single process-wide instance, threaded to every shard's applier. Stage 6D will plumb the same instance into the storage cipher (§6.2). Lifetime + concurrency reasoning sound?

  4. buildShardGroups signature change — added 3 new tail params (kek, keystore, sidecarPath). 1 prod + 3 test callers all updated. Test callers pass nil, nil, "" preserving the no-encryption posture. Caller audit complete.

  5. registerEncryptionAdminServer re-signature(engine encryptionAdminEngine, enableMutators bool) re-added. Stage 5D's encryptionAdminEngine interface re-introduced. Wiring gated; tests cover both gate sides.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 3m 36s —— View job


Code Review: Stage 6B-2 — KEK plumbing + --encryption-enabled flag + mutator-RPC gate

Todo:

  • Set up review task list
  • Read main.go changes
  • Read main_encryption_admin.go changes
  • Read main_encryption_admin_test.go changes
  • Read main_bootstrap_e2e_test.go and multiraft_runtime_test.go
  • Analyze double-gate semantics
  • Analyze test matrix coverage
  • Analyze shared Keystore lifetime / concurrency
  • Analyze buildShardGroups signature change
  • Analyze registerEncryptionAdminServer re-signature
  • Post review findings

Finding 1 — Missing third gate condition: --encryptionSidecarPath (Medium)

encryptionMutatorsEnabled() (main_encryption_admin.go) checks *encryptionEnabled && *kekFile != "" but does not check *encryptionSidecarPath != "". The applier's bootstrapAndRotationConfigured() (applier.go:178) requires all three: kek != nil && keystore != nil && sidecarPath != "".

A misconfigured operator who sets --encryption-enabled --kekFile=/k but omits --encryptionSidecarPath will:

  1. Pass the RPC gate (enableMutators=true, engine non-nil) ✓
  2. Commit a BootstrapEncryption Raft entry ✓
  3. Hit the applier with sidecarPath == ""bootstrapAndRotationConfigured() false → return ErrKEKNotConfigured wrapped with ErrEncryptionApply
  4. Trigger HaltApply on every replica — cluster's apply loop stops

This is exactly the halting-scenario the double-gate was designed to prevent, but it leaks through when --encryptionSidecarPath is omitted. Either:

  • Option A (preferred now): make encryptionMutatorsEnabled() a triple-gate:
    func encryptionMutatorsEnabled() bool {
        return *encryptionEnabled && *kekFile != "" && *encryptionSidecarPath != ""
    }
  • Option B (acceptable if Stage 6C is close): add an explicit startup validation that refuses to start if --encryption-enabled && --kekFile != "" && --encryptionSidecarPath == "", with a clear error message. Fix this →

Finding 2 — applierOptionsFor: keystore != nil guard is asymmetric with production behavior (Low, style)

In main.go:830:

if keystore != nil {
    opts = append(opts, encryption.WithKeystore(keystore))
}

In production, keystore is unconditionally encryption.NewKeystore() (line ~348 in run()), so this guard is dead code in production and exists solely to keep the 3-test-caller nil, nil, "" pattern consistent. The nil branch is only reachable from tests.

This isn't a bug, but it creates a latent divergence: the production applier always has a Keystore wired, the test posture doesn't, and bootstrapAndRotationConfigured() observes the difference. A future caller who passes nil for keystore thinking it's "the no-encryption posture" would actually get a partially-wired applier (KEK set, keystore nil) that behaves identically to the no-KEK posture only by accident. A code comment documenting this asymmetry would prevent the confusion.


Finding 3 — Gate-on test: assertion is a negative property, not a positive one (Low, nit)

TestEncryptionAdmin_MutatingRPCEnabledWhenGateOn (main_encryption_admin_test.go:~113) asserts:

if got := status.Code(err); got == codes.FailedPrecondition {
    t.Errorf(...)
}

This is a negative assertion ("the gate did not fire") rather than a positive one ("the call reached the Raft propose path"). The stub Propose() returns &raftengine.ProposalResult{}, nil, so the RPC does proceed past the gate — but the test does not confirm what happens next (codes.OK, InvalidArgument, etc.). This is acceptable since the PR's stated goal is pinning the gate boundary and deeper behavior is covered by applier tests. But it's worth a comment noting this is intentional: the test pin is "FailedPrecondition MUST NOT appear", not "codes.OK MUST appear".

Also: the test only exercises BootstrapEncryption, not RotateDEK or RegisterEncryptionWriter. Since all three go through the same Proposer gate, this is fine as-is, but a one-liner comment would prevent a future reviewer from asking "why not the other two?".


Finding 4 — Keystore not serialized in FSM snapshots (Note, pre-existing, but relevant to Stage 6D planning)

The in-memory Keystore is populated by replaying Raft log entries (ApplyBootstrap / ApplyRotation). After a snapshot, the log before the snapshot boundary is truncated. On restart, the FSM restores from snapshot — but the Keystore is not part of the snapshot. If the bootstrap entry is before the snapshot cutoff, it will not be replayed on restart, leaving the Keystore empty.

Stage 6D's storage cipher will require a populated Keystore to decrypt data. This pre-existing gap needs to be closed — likely by either:

  1. Serializing the Keystore (wrapped DEKs from the sidecar) into the FSM snapshot, or
  2. Populating the Keystore from the sidecar at startup before apply begins

Not introduced by this PR, but worth calling out explicitly in the Stage 6D design doc so it's not discovered during implementation.


Questions on the five review dimensions

1. Double-gate semantics — Mostly sound, but Finding 1 is a real gap. The two-condition gate is argued correctly for the described cases (flag-only, KEK-only). The third necessary condition (sidecarPath) is missing from the gate expression.

2. Test matrix coverage — The (enableMutators, engine) matrix is exhaustive for the gate dimensions. One note: the test helper callBootstrapAgainstNewServer always passes sidecarPath: "", so the gate-on test exercises the gate layer but not the applier layer. This is intentional (per the PR description) and appropriate scope for this PR.

3. Shared Keystore lifetime — The atomic.Pointer[map[uint32]*keyEntry] copy-on-write design in keystore.go is correct. The CAS loop in Set makes concurrent writes safe. Reads via AEAD() take a single atomic load on an immutable snapshot — correct and minimal contention. Lifetime spans run() → shard construction → process exit; the pointer is not retained beyond that scope. Sound.

4. buildShardGroups signature change — Correct. Nil-guard in applierOptionsFor ensures the no-encryption posture is preserved for test callers passing nil, nil, "". All callers updated. The only nit is Finding 2 (keystore nil asymmetry).

5. registerEncryptionAdminServer re-signature — Clean design. Local encryptionAdminEngine interface (Proposer ∩ LeaderView) correctly decouples the helper from the concrete engine. The && gate guarantees both options wired together or both absent, satisfying Validate()'s invariant by construction. enableMutators computed once before the per-shard loop is correct — process-level gate, per-shard engine (each shard proposes through its own Raft group). Sound.


Overall assessment

The core wiring logic is sound and the test matrix correctly pins the gate boundary. Finding 1 is the only blocking concern: the two-condition gate has a real cluster-halting gap when --encryptionSidecarPath is omitted. Findings 2–4 are low-severity notes or pre-existing concerns. Recommend resolving Finding 1 before merge, either by extending the gate expression or adding an explicit startup validation.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e15568f3f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main_encryption_admin.go Outdated
// is colocated with the registerEncryptionAdminServer helper
// that consumes it.
func encryptionMutatorsEnabled() bool {
return *encryptionEnabled && *kekFile != ""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include sidecar-path in mutator gate

This gate enables mutating EncryptionAdmin RPCs when --encryption-enabled and --kekFile are set, but it ignores whether --encryptionSidecarPath is configured. BootstrapEncryption/RotateDEK can then propose entries that reach apply with only partial applier wiring; internal/encryption/applier.go requires KEK+keystore+sidecar together (bootstrapAndRotationConfigured) and otherwise returns ErrKEKNotConfigured, which is a halt-apply error path. In practice, a node started with encryption enabled + KEK file but empty sidecar path can accept mutator RPCs and trigger cluster apply halt once those entries commit.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
main_encryption_admin_test.go (1)

144-149: ⚡ Quick win

Use a bounded context for the helper RPC call.

The helper uses a background context; if the request path stalls, this can hang the test. Use context.WithTimeout for deterministic failure.

Suggested timeout guard
-	_, err = client.BootstrapEncryption(context.Background(), &pb.BootstrapEncryptionRequest{
+	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
+	defer cancel()
+	_, err = client.BootstrapEncryption(ctx, &pb.BootstrapEncryptionRequest{
 		StorageDekId:      1,
 		RaftDekId:         2,
 		WrappedStorageDek: []byte("w"),
 		WrappedRaftDek:    []byte("w"),
 	})
 import (
 	"context"
 	"net"
 	"testing"
+	"time"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@main_encryption_admin_test.go` around lines 144 - 149, The test calls
client.BootstrapEncryption using context.Background(), which can hang; change
the call to use a bounded context via context.WithTimeout (e.g., short timeout
like a few seconds), defer the cancel, and pass that context into
client.BootstrapEncryption so the helper RPC fails deterministically on stalls;
update the invocation site where client.BootstrapEncryption is called in
main_encryption_admin_test.go accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@main_encryption_admin_test.go`:
- Around line 118-122: The test currently only fails when status.Code(err) ==
codes.FailedPrecondition which lets unrelated transport/setup errors pass;
update the assertion around the BootstrapEncryption result so that when err !=
nil you check the returned status code (via status.Code(err)) and fail if it is
codes.FailedPrecondition, and otherwise require the code to be one of the
expected infra statuses (e.g., codes.Unavailable or codes.DeadlineExceeded);
implement this by replacing the current if-block with logic that computes code
:= status.Code(err) and calls t.Fatalf/t.Errorf if code ==
codes.FailedPrecondition or if code is not in the allowed set
(codes.Unavailable, codes.DeadlineExceeded), referencing BootstrapEncryption,
status.Code, and the codes constants.

---

Nitpick comments:
In `@main_encryption_admin_test.go`:
- Around line 144-149: The test calls client.BootstrapEncryption using
context.Background(), which can hang; change the call to use a bounded context
via context.WithTimeout (e.g., short timeout like a few seconds), defer the
cancel, and pass that context into client.BootstrapEncryption so the helper RPC
fails deterministically on stalls; update the invocation site where
client.BootstrapEncryption is called in main_encryption_admin_test.go
accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bfd57482-669a-4167-ae9c-82f2d1cfa13e

📥 Commits

Reviewing files that changed from the base of the PR and between cae5bf8 and 1e15568.

📒 Files selected for processing (5)
  • main.go
  • main_bootstrap_e2e_test.go
  • main_encryption_admin.go
  • main_encryption_admin_test.go
  • multiraft_runtime_test.go

Comment thread main_encryption_admin_test.go
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a double-gate mechanism for encryption mutators, requiring both an explicit --encryption-enabled flag and a valid --kekFile path to enable mutating RPCs. The changes include refactoring the startup sequence to initialize a shared KEK wrapper and keystore, updating the shard group construction to thread these dependencies, and enhancing the gRPC server registration to enforce the new gating logic. Feedback focuses on improving maintainability by using interfaces instead of concrete types for the KEK wrapper in helper functions, which will facilitate future KMS integrations.

Comment thread main.go Outdated
Comment on lines +807 to +815
func loadKEKWrapperFromFlag() (*kek.FileWrapper, error) {
if *kekFile == "" {
return nil, nil
}
w, err := kek.NewFileWrapper(*kekFile)
if err != nil {
return nil, errors.Wrapf(err, "failed to load KEK from %s", *kekFile)
}
return w, nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider returning the kek.Wrapper interface instead of the concrete *kek.FileWrapper to simplify future KMS integrations. This is a provider-side interface; application code should not depend on or type-assert it to ensure maintainability and prevent unintended dependencies.

Suggested change
func loadKEKWrapperFromFlag() (*kek.FileWrapper, error) {
if *kekFile == "" {
return nil, nil
}
w, err := kek.NewFileWrapper(*kekFile)
if err != nil {
return nil, errors.Wrapf(err, "failed to load KEK from %s", *kekFile)
}
return w, nil
func loadKEKWrapperFromFlag() (kek.Wrapper, error) {
if *kekFile == "" {
return nil, nil
}
w, err := kek.NewFileWrapper(*kekFile)
if err != nil {
return nil, errors.Wrapf(err, "failed to load KEK from %s", *kekFile)
}
return w, nil
}
References
  1. When an interface is designed as an optional extension for internal use by a backend or adapter, clarify its role as a 'provider-side interface' and explicitly state that application code should not depend on or type-assert it.

Comment thread main.go Outdated
// suppresses its option, leaving the applier in the Stage 6A
// posture for that axis. Extracted from buildShardGroups so the
// per-shard loop stays under the cyclop complexity budget.
func applierOptionsFor(kekWrapper *kek.FileWrapper, keystore *encryption.Keystore, sidecarPath string) []encryption.ApplierOption {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using the encryption.KEKUnwrapper interface for the kekWrapper parameter makes this helper more flexible. As this is a provider-side interface, application code should not depend on or type-assert it, improving maintainability and aligning with the requirements of encryption.NewApplier.

Suggested change
func applierOptionsFor(kekWrapper *kek.FileWrapper, keystore *encryption.Keystore, sidecarPath string) []encryption.ApplierOption {
func applierOptionsFor(kekWrapper encryption.KEKUnwrapper, keystore *encryption.Keystore, sidecarPath string) []encryption.ApplierOption {
References
  1. When an interface is designed as an optional extension for internal use by a backend or adapter, clarify its role as a 'provider-side interface' and explicitly state that application code should not depend on or type-assert it.

…um x2 + coderabbit minor

4 findings on round-1 (commit 1e15568), all addressed.

## codex P1 — Include sidecar-path in mutator gate

The Stage 6B-2 double-gate (--encryption-enabled AND --kekFile)
let an operator with both flags set but --encryptionSidecarPath
empty trigger a cluster halt:

  1. Mutator RPC accepted (Proposer + LeaderView wired)
  2. Proposal commits
  3. Every replica's applier rejects the entry because
     bootstrapAndRotationConfigured() requires WithKEK +
     WithKeystore + WithSidecarPath together; sidecarPath was
     empty so WithSidecarPath was never installed
  4. HaltApply fires on every node → cluster halt

Fix: extend encryptionMutatorsEnabled() to a triple gate:
  return *encryptionEnabled && *kekFile != "" && *encryptionSidecarPath != ""

Without all three, the RPC layer refuses with FailedPrecondition
before any proposal commits. The applier-side
bootstrapAndRotationConfigured() halt becomes unreachable in
practice (still load-bearing as defense-in-depth).

Flag help on --encryptionSidecarPath updated to mention the
triple gate.

## gemini medium (line 815) — loadKEKWrapperFromFlag interface

Return type changed from *kek.FileWrapper (concrete) to
kek.Wrapper (interface). Stage 9 KMS providers will satisfy
the same interface; the call site stays decoupled from the
file-mode provider.

## gemini medium (line 824) — applierOptionsFor parameter interface

Parameter type changed from *kek.FileWrapper to
encryption.KEKUnwrapper. The applier only needs Unwrap, and
any kek.Wrapper satisfies encryption.KEKUnwrapper structurally
because both declare Unwrap with the same signature.

buildShardGroups parameter type updated in step with these
two: now takes kek.Wrapper (so the variable flowing from
loadKEKWrapperFromFlag stays the same interface end-to-end).

## coderabbit minor — Tighten gate-on assertion

TestEncryptionAdmin_MutatingRPCEnabledWhenGateOn previously
accepted any non-FailedPrecondition status, which would mask
a transport failure (bufconn dial timeout, deadline exceeded,
canceled context) as a 'gate is no longer firing' pass. Added
explicit checks for codes.Unavailable / DeadlineExceeded /
Canceled — those signal infra failure, not gate behavior.

## Caller audit

- encryptionMutatorsEnabled: stricter — strictly fewer paths
  produce true. Operator-surface fail-CLOSED at the additional
  axis. Production caller (startRaftServers) is unchanged at
  the call site; the readback is purely stricter.

- loadKEKWrapperFromFlag return type: kek.Wrapper is the
  super-interface of *kek.FileWrapper. The one production
  caller (run()) assigns to a local var of inferred type, so
  no breakage.

- applierOptionsFor parameter type: encryption.KEKUnwrapper is
  satisfied by kek.Wrapper. The one production caller
  (buildShardGroups) passes the kek.Wrapper variable directly.

- buildShardGroups parameter type: kek.Wrapper supersedes
  *kek.FileWrapper. The one production caller passes the var
  from loadKEKWrapperFromFlag. The 3 test callers pass nil,
  which still satisfies the interface.

## Verification

- go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
- go build ./... — PASS
- golangci-lint run ./... — 0 issues on touched files
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-2 (commit 0f946c15): 4 findings on round-1 addressed.

# Source Finding Fix
1 codex P1 Mutator gate ignored sidecarPath → cluster halt possible if --encryption-enabled + --kekFile set but --encryptionSidecarPath empty encryptionMutatorsEnabled() now triples gate: all three flags must be non-empty
2 gemini medium (line 815) loadKEKWrapperFromFlag returned concrete *kek.FileWrapper Returns kek.Wrapper interface (Stage 9 KMS providers slot in without rewriting call site)
3 gemini medium (line 824) applierOptionsFor parameter was concrete Now takes encryption.KEKUnwrapper (applier-side narrow interface); kek.Wrapper satisfies it structurally
4 coderabbit minor Gate-on test accepted any non-FailedPrecondition status (could mask transport failures) Added explicit Unavailable / DeadlineExceeded / Canceled failure clauses

buildShardGroups param type updated in step: now takes kek.Wrapper (same interface end-to-end from loadKEKWrapperFromFlagbuildShardGroupsapplierOptionsFor).

Caller audit

  • Triple gate is stricter — strictly fewer paths produce true. Operator-surface fail-CLOSED at the additional axis. Production caller (startRaftServers) unchanged at call site.
  • Interface type changes are super-interface upgrades. All production callers (1 each) work without modification. 3 test callers pass nil which satisfies the interface.

Verification

  • go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
  • go build ./... — PASS
  • golangci-lint run ./... — 0 issues on touched files

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 2m 2s —— View job


Claude's Round-2 Review

Todo:


Round-2 Fix Verification

All four claimed fixes are confirmed correct.

Fix #1 — Triple gate (main_encryption_admin.go)

encryptionMutatorsEnabled() now returns *encryptionEnabled && *kekFile != "" && *encryptionSidecarPath != "". This exactly mirrors applier.go:178:

func (a *Applier) bootstrapAndRotationConfigured() bool {
    return a.kek != nil && a.keystore != nil && a.sidecarPath != ""
}

The gate is now a structural mirror of the applier's own readiness check. The cluster-halt scenario from round-1 Finding 1 is closed. ✓

Fix #2loadKEKWrapperFromFlag returns kek.Wrapper (main.go:815)

Return type is now the interface. The docstring correctly explains the KMS-provider extensibility rationale (Stage 9 slots in without touching buildShardGroups). ✓

Fix #3applierOptionsFor takes encryption.KEKUnwrapper (main.go:824)

The narrow applier-side interface is used. kek.Wrapper satisfies it structurally — kek.Wrapper declares Unwrap([]byte) ([]byte, error) and KEKUnwrapper is exactly that single method. The applier.go:85–90 comment confirms the cycle-safe reason for declaring a local interface rather than importing kek. ✓

Fix #4 — Gate-on test assertion tightening (main_encryption_admin_test.go:118–131)

The test now has two failure branches: FailedPrecondition → t.Errorf (gate fired, wrong) and Unavailable/DeadlineExceeded/Canceled → t.Fatalf (infra noise, wrong). Either masking path now fails loud. ✓

buildShardGroups param type — Now takes kek.Wrapper. All 3 test callers pass nil (satisfies the interface). Production caller passes the result of loadKEKWrapperFromFlag(), also kek.Wrapper. ✓


Remaining Notes (non-blocking)

1. callBootstrapAgainstNewServer still uses context.Background() (low, carry-over)

CodeRabbit's round-1 nitpick on main_encryption_admin_test.go:144–149 was not addressed. In a bufconn test this is extremely unlikely to hang (no network latency, no real Raft round-trip), but if the EncryptionAdminServer's internal validation path ever blocks, the test would hang indefinitely. A two-second context.WithTimeout would make failures deterministic. Not blocking — bufconn tests in this codebase consistently use context.Background().

2. keystore != nil guard in applierOptionsFor is production-dead-code (low, carry-over)

keystore := encryption.NewKeystore() is unconditionally called in run() before buildShardGroups, so the if keystore != nil branch in applierOptionsFor (main.go:839) is only reachable from tests. The behavior is correct but a silent partial-wiring is possible if a future caller passes nil for keystore while passing a non-nil KEK (the applier gets kek but no keystore, bootstrapAndRotationConfigured() returns false, no cluster halt — just a surprising silent fail-closed). The existing comment on applierOptionsFor explains the nil-guarding is for test callers, so this is acceptably documented. Not blocking.

3. Pre-existing: Keystore not in FSM snapshots (carry-over from round-1 Finding 4)

Still relevant for Stage 6D planning. After a snapshot, if the bootstrap Raft entry falls before the snapshot cutoff it won't replay, leaving the Keystore empty on restart. The Stage 6D storage-cipher plumbing will need either snapshot serialization or sidecar-based Keystore hydration at startup. Worth adding explicitly to the Stage 6D design doc before implementation starts.


Overall Assessment

All round-1 blocking and medium findings are resolved. The triple gate correctly mirrors the applier's own readiness predicate, the interface-return changes are structurally sound (and the cycle-safe rationale is documented), and the test assertions now fail loudly on infra noise. The remaining notes are non-blocking carry-overs. Ready to merge.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f946c1590

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main_encryption_admin.go
// is colocated with the registerEncryptionAdminServer helper
// that consumes it.
func encryptionMutatorsEnabled() bool {
return *encryptionEnabled && *kekFile != "" && *encryptionSidecarPath != ""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Block mutators until cluster-wide readiness is confirmed

encryptionMutatorsEnabled() only inspects local flags, so the first node restarted with all three settings will wire mutating RPCs even if peers are still on old config. In that mixed rollout window, a committed BootstrapEncryption/RotateDEK entry can reach followers whose appliers were built without full KEK+keystore+sidecar wiring (buildShardGroups/applierOptionsFor), causing ApplyBootstrap/ApplyRotation to return ErrKEKNotConfigured; the FSM marks that as ErrEncryptionApply, which halts apply on those replicas. This makes a single early admin call capable of stalling the cluster during staged restarts, so mutator enablement should be gated on cluster-wide capability/readiness rather than node-local flags alone.

Useful? React with 👍 / 👎.

…e scope limitation

codex r2 P1 on round-2 (commit 0f946c1): the triple-gate
encryptionMutatorsEnabled() is purely node-local; during a
staged rolling upgrade where some peers still run the pre-6B-2
binary or have the flags unset, a mutator proposal committed
by a fully-configured leader would reach those replicas'
appliers and HaltApply on the under-configured side.

## Resolution

This is a real concern but the cluster-wide readiness check
belongs at the §7.1 Phase-1 cutover RPC layer
(enable-storage-envelope) in Stage 6D, not in the node-local
flag gate. PR #762's milestone breakdown explicitly assigns
the Voters ∪ Learners capability fan-out gate to Stage 6D.

Adding a partial cluster-wide check in Stage 6B-2 would
duplicate the §7.1 capability gate and ship before the
operator-facing GetCapability fan-out is wired into the
admin CLI. The right resolution is to document the scope
limitation explicitly so:

  1. Operators reading the code understand the gate's scope
     and the rolling-upgrade discipline expected at this
     layer (same constraint as PR #765's 6A→6B caveat).
  2. The Stage 6D PR has a clear handoff point — the
     cluster-wide gate is its scope, and this docstring
     points forward to that.

## Doc-only

No code paths changed; the gate's three-flag readback is
unchanged. Single 30-line docstring expansion explaining:

  - the gate is THIS NODE local
  - what cluster-wide readiness means at §7.1
  - why Stage 6B-2 deliberately defers it to Stage 6D
  - the operator-discipline constraint matching PR #765

## Caller audit

Pure doc edit. No callers to audit.

## Verification

- go build ./... — PASS
- go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
- golangci-lint run ./... — 0 issues on touched files
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-3 (commit 44b2bb1f): documenting codex r2 P1's scope-boundary concern.

codex r2 P1

[encryptionMutatorsEnabled()] only inspects local flags, so the first node restarted with all three settings will wire mutating RPCs even if peers are still on old config. In that mixed rollout window, a committed BootstrapEncryption/RotateDEK entry can reach followers whose appliers were built without full KEK+keystore+sidecar wiring...

Resolution

This is a real concern, but the cluster-wide readiness check belongs at the §7.1 Phase-1 cutover RPC layer (enable-storage-envelope) in Stage 6D, not in the node-local flag gate.

PR #762's plan explicitly assigns the Voters ∪ Learners capability fan-out gate to Stage 6D:

6D | §6.6 enable-storage-envelope admin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gate ...

Adding a partial cluster-wide check at the node-local flag-gate layer in Stage 6B-2 would duplicate the §7.1 capability gate and ship before the operator-facing GetCapability fan-out is wired into the admin CLI. The decomposition argued at the design-doc level is:

  • Stage 6B-2 (this PR): node-local operator opt-in surface. The gate refuses RPCs when this NODE is not fully configured; this is the operator-controls layer.
  • Stage 6D: cluster-wide cutover gate via GetCapability fan-out. Refuses if any voter/learner is not encryption_capable; this is the cluster-safety layer.

The two layers are complementary. The rolling-upgrade discipline ("don't enable any mutator RPC until every member reports encryption_capable") is the same operator-discipline constraint documented in PR #765 (Stage 6A) for the 6A→6B caveat.

Fix: 30-line docstring expansion

The encryptionMutatorsEnabled() godoc now explicitly:

Operators reading the code now understand the gate's scope. The Stage 6D PR has a clear handoff point.

Caller audit

Doc-only. No code paths changed.

Other r2 notes (non-blocking, claude r2 verdict was 'Ready to merge')

  • context.Background() in test (claude r2 carry-over) — bufconn-pattern consistent with rest of codebase; skip.
  • keystore != nil dead-code branch (claude r2 carry-over) — production-unreachable, documented as test-callable; skip.
  • Stage 6D Keystore-snapshot concern (claude r2 carry-over) — design-doc concern for 6D scope; will address in 6D PR description.

Verification

  • go build ./... — PASS
  • go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
  • golangci-lint run ./... — 0 issues on touched files

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 1m 11s —— View job


Claude's Round-3 Review

Todo:

  • Set up review task list
  • Read main_encryption_admin.go — verify godoc expansion and scope-boundary argument
  • Confirm no code paths changed (doc-only claim)
  • Assess adequacy of scope-boundary documentation
  • Verify round-2 carry-over items still correctly deferred
  • Post final review

Round-3 Change Verification

Scope of change confirmed: commit 44b2bb1f touches only main_encryption_admin.go — one godoc block expanded, zero code paths altered. The encryptionMutatorsEnabled() function body is unchanged (return *encryptionEnabled && *kekFile != "" && *encryptionSidecarPath != ""). ✓


Scope-Boundary Argument Assessment

The expanded godoc addresses codex r2 P1 by documenting what the gate is, what it deliberately is not, and why. Evaluating each element:

"THIS NODE" emphasis at the function header — correct and prominent. An operator reading the function won't have to dig to find that this is a local check. ✓

HaltApply scenario reformulation — the revised text (bootstrapAndRotationConfigured() returns false → ApplyBootstrap halts apply with ErrKEKNotConfigured) is more accurate than the previous version and directly points to the concrete call site in applier.go. ✓

Three-bullet rationale for deferring to Stage 6D — each bullet is defensible:

  1. §7.1 Voters ∪ Learners fan-out via GetCapability — this is the structurally correct layer for cluster-wide capability gating. Adding an incomplete version here before GetCapability is wired would be both premature and redundant. ✓
  2. Rolling-restart discipline = same operator constraint as Stage 6A (PR feat(encryption): Stage 6A — §6.3 Applier scaffolding + FSM WithEncryption wiring #765) — the precedent exists; this is not introducing a new operator discipline. ✓
  3. Partial cluster-wide probe would duplicate §7.1 gate — correct. BootstrapEncryption is an operator-initiated, deliberate action (not an automated path); an operator who misconfigures a mixed-binary rolling upgrade violates the same discipline that already exists. ✓

PR #776 round-2 codex P1 cited as trigger — the audit trail is preserved in the docstring itself. Future readers can find the design decision without traversing git blame. ✓

One minor note (non-blocking): The PR #765 cross-reference is a PR number, not a design doc path. PR links can become stale (closed PRs stay accessible, but the context is less obvious than a docs/design/ file). If Stage 6A's rolling-upgrade discipline is captured in a design doc (e.g., docs/design/2026_*_partial_encryption*.md), a secondary reference to that path would make the constraint findable without GitHub access. This is a "nice to have" and is not blocking.


Round-2 Carry-Overs

All three deferred items remain correctly scoped:

Carry-over Status
context.Background() in callBootstrapAgainstNewServer Correctly deferred — bufconn pattern; deterministic in practice.
keystore != nil dead-code guard in applierOptionsFor Correctly deferred — production-unreachable, documented as test-callable.
Keystore not in FSM snapshots Correctly deferred to Stage 6D design doc.

Overall Assessment

Round-3 is a clean, doc-only change that directly and accurately addresses the codex r2 P1 concern. The scope-boundary argument is technically sound, the rationale is captured where it is actionable (at the function declaration), and the Stage 6D handoff point is clearly named. The carry-over items from round 2 are correctly deferred with documented rationale.

Ready to merge.

…eference

claude r3 verdict was 'Ready to merge' with one nice-to-have
note: the scope-boundary docstring cited PR #765 by number
but not the underlying design doc path. PR links can become
stale; a path to the docs/design/ file makes the constraint
findable without GitHub access.

Added a reference to docs/design/2026_04_29_partial_data_at_rest_encryption.md
(specifically the 6A rationale's 'Rolling 6A→6B upgrade
caveat' section). The PR #765 reference is preserved as the
historical trigger; the design doc path is the canonical
source of the rolling-upgrade discipline going forward.

Doc-only — no code paths touched.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-4 (commit aad73259): your r3 'nice to have' addressed — added a reference to docs/design/2026_04_29_partial_data_at_rest_encryption.md (specifically the 6A rationale's 'Rolling 6A→6B upgrade caveat' section) alongside the PR #765 reference.

Doc-only, 3 lines added, no code touched. Per your r3 verdict the PR is otherwise ready to merge; this just makes the rolling-upgrade discipline findable without GitHub access.

Verification

  • go build ./... — PASS (0 issues)

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 48s —— View job


Round-4 Review

Todo:

  • Set up review task list
  • Read main_encryption_admin.go — verify design doc reference added correctly
  • Read docs/design/2026_04_29_partial_data_at_rest_encryption.md — confirm referenced section exists
  • Post final review

Round-4 Change Verification

Scope of change confirmed: commit aad73259 replaces 1 line with 3 lines in the encryptionMutatorsEnabled() godoc — exactly as described. Zero code paths altered. ✓

Round-4 diff (in full):

-//     #765 (Stage 6A) for the 6A→6B rolling upgrade.
+//     #765 (Stage 6A) and in the Stage 6 plan in
+//     docs/design/2026_04_29_partial_data_at_rest_encryption.md
+//     (6A rationale, "Rolling 6A→6B upgrade caveat").

Design doc reference verified: docs/design/2026_04_29_partial_data_at_rest_encryption.md exists and the section header **Rolling 6A→6B upgrade caveat.** appears at line 84 of the doc — exactly where the godoc now points. The constraint text matches the scope-boundary argument in the godoc: operators must not set --encryption-enabled on a 6B+ node until every voter AND learner is on 6B+. ✓

The r3 "nice to have" is addressed accurately. The rolling-upgrade discipline is now findable without GitHub access, and the cross-reference path (docs/design/ rather than a PR link) survives in a clone without remote connectivity.


Overall Assessment

All four rounds of review complete. The change is doc-only, the referenced section exists and matches, and no carry-over items require action before merge. Ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant