GRA-216: Add graduation threshold experiment flag shim #188
…NG (GRA-113 cycle 3)

Raises the low-signal edit-distance floor from 0.04 to 0.07 for FORMAT and DRAFTING categories so synonym-swap and minor phrasing edits (0.04 ≤ ed < 0.07) are filtered before they create lessons. SECURITY and ACCURACY corrections are unaffected: they already bypass the gate via _is_meaningful_low_signal_change.

- Add _FORMAT_DRAFTING_EDIT_DISTANCE_FLOOR = 0.07 and _FORMAT_DRAFTING_CATEGORIES
- Gate logic now selects a category-specific floor before the low_signal_filtered check
- 6 new tests in test_dedup.py covering constants, filter boundary, pass-through for SECURITY/ACCURACY, and non-FORMAT/DRAFTING categories retaining the 0.04 floor

Co-Authored-By: Paperclip <noreply@paperclip.ing>
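For readers without the diff open, the gate described in this commit message might look roughly like the sketch below. Only `_FORMAT_DRAFTING_EDIT_DISTANCE_FLOOR` and `_FORMAT_DRAFTING_CATEGORIES` are named in the PR; `_DEFAULT_EDIT_DISTANCE_FLOOR`, `_BYPASS_CATEGORIES`, `select_floor`, and `is_low_signal_filtered` are hypothetical names, and the bypass set is only a stand-in for whatever `_is_meaningful_low_signal_change` actually checks.

```python
# Illustrative sketch, not the code in _core.py.
_DEFAULT_EDIT_DISTANCE_FLOOR = 0.04  # assumed name for the pre-existing floor
_FORMAT_DRAFTING_EDIT_DISTANCE_FLOOR = 0.07  # named in the PR description
_FORMAT_DRAFTING_CATEGORIES = frozenset({"FORMAT", "DRAFTING"})
# Assumption: stands in for the _is_meaningful_low_signal_change bypass.
_BYPASS_CATEGORIES = frozenset({"SECURITY", "ACCURACY"})


def select_floor(category: str) -> float:
    """Pick the edit-distance floor for a correction category."""
    if category.strip().upper() in _FORMAT_DRAFTING_CATEGORIES:
        return _FORMAT_DRAFTING_EDIT_DISTANCE_FLOOR
    return _DEFAULT_EDIT_DISTANCE_FLOOR


def is_low_signal_filtered(edit_distance: float, category: str) -> bool:
    """Return True when a correction is dropped before it can create a lesson."""
    if category.strip().upper() in _BYPASS_CATEGORIES:
        return False  # SECURITY/ACCURACY corrections always pass the gate
    return edit_distance < select_floor(category)
```

Under these assumptions, a DRAFTING edit with ed = 0.05 is filtered, while the same distance in any other category still passes the 0.04 floor.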
📝 Walkthrough
This PR introduces three independent feature additions: a weekly correction snapshot CLI tool that aggregates NDJSON event streams into deterministic JSON statistics; low-signal filtering logic in the core correction pipeline to suppress tiny, non-meaningful corrections while auto-exporting learned agents; and a refactoring of Beta-LB graduation threshold parsing into a dedicated module.
Changes
- Weekly Correction Snapshot Feature
- Low-Signal Filtering in Correction Pipeline
- Beta-LB Graduation Threshold Refactoring
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 OpenGrep (1.20.0): OpenGrep fatal error (exit code 2): ✔ Opengrep OSS Loading rules from local config...
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Gradata/scripts/weekly_correction_snapshot.py`:
- Around line 26-36: The function _is_graduation_accepted is treating generic
outcome/status/accepted fields as graduation results even when the event isn't
graduation-related; update it to first detect that the row is a graduation event
(e.g., event contains "graduation" or is in the graduation event set like
{"lesson.graduated","graduation.accepted","graduation.rejected"}) and only then
evaluate outcome/status/accepted for accepted logic. Make the same change for
the corresponding rejection logic (the similar block at lines referenced 39–49)
so outcome/status/accepted are only considered when the row is confirmed to be a
graduation event.
In `@Gradata/src/gradata/_core.py`:
- Line 120: Revert the default value of the min_severity parameter back to its
prior setting to avoid changing default gating behavior—restore min_severity
from "minor" to the original "as-is" in the function or class signature where
min_severity is declared (look for the min_severity parameter in _core.py) so
existing callers continue to receive "as-is" corrections by default.
In `@Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py`:
- Around line 19-37: The function read_beta_lb_threshold can return an
out-of-range default without clipping; update it so every return path returns a
normalized value in [0.0, 1.0] by clipping the default before returning when
raw_value is None, parsing fails, or threshold is non-finite. Modify
read_beta_lb_threshold to compute a clipped_default = min(max(default, 0.0),
1.0) (or reuse the same min/max logic used for threshold) and return
clipped_default instead of default in the branches that currently return
default; keep the same behavior for successfully parsed finite thresholds.
In `@Gradata/tests/test_dedup.py`:
- Around line 333-343: The test
test_format_drafting_floor_passes_larger_ed_corrections is brittle because it
depends on the live diff engine; instead patch or monkeypatch the compute_diff
function used by brain.correct to return a deterministic diff result with
edit_distance >= 0.07 (and appropriate token counts) so the low_signal_filtered
path is exercised reliably. Locate where brain.correct calls compute_diff (or
the module function named compute_diff) and in the test use unittest.mock.patch
or pytest monkeypatch to supply a fixed dict like {"edit_distance": 0.08, ...}
before calling brain.correct, then assert result.get("low_signal_filtered") is
not True; repeat the same approach for the other similar test referenced in the
comment.
- Line 223: The failing assertion shows second.get("observation_deduped") is
None because the test doesn't run the deterministic dedup step or seed any
randomness before checking the flag; update tests/test_dedup.py to ensure the
code path that marks the observation runs deterministically (call the semantic
dedup function used in the codebase or invoke the pipeline step that sets the
"observation_deduped" field for the `second` object) and/or seed/mock any
randomness or external services used by that dedup step so the flag is reliably
set, then assert second.get("observation_deduped") is True.
In `@Gradata/tests/test_weekly_correction_snapshot.py`:
- Around line 31-55: Add a regression test to ensure snapshot.aggregate ignores
stray status/outcome fields on non-graduation rows: in
test_aggregate_counts_and_top_categories_deterministically, append rows like
{"status": "accepted"} and {"outcome": "rejected"} (with no "event":
"graduation.*" or associated graduation row) and assert that
accepted_graduations and rejection_count remain the same as before; use the
existing snapshot.aggregate function and existing assertions for
total_corrections, acceptance_rate and top_rule_categories to confirm these
stray fields do not affect graduation metrics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 30280937-ce9f-4f04-b6f4-041e1330af75
📒 Files selected for processing (9)
.gitignore
Gradata/docs/weekly-correction-snapshot.md
Gradata/pyproject.toml
Gradata/scripts/weekly_correction_snapshot.py
Gradata/src/gradata/_core.py
Gradata/src/gradata/enhancements/self_improvement/_graduation.py
Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py
Gradata/tests/test_dedup.py
Gradata/tests/test_weekly_correction_snapshot.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: pytest macos-latest / py3.11
- GitHub Check: pytest ubuntu-latest / py3.12
- GitHub Check: pytest windows-latest / py3.12
- GitHub Check: pytest ubuntu-latest / py3.11
- GitHub Check: pytest windows-latest / py3.11
- GitHub Check: pytest macos-latest / py3.12
🧰 Additional context used
📓 Path-based instructions (3)
Gradata/**/pyproject.toml
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Maintain `dependencies = []` in pyproject.toml; the base package is pure Python + stdlib with all heavy dependencies gated as optional extras: embeddings, gemini, encrypted, ranking, adapters-mem0
Files:
Gradata/pyproject.toml
Gradata/src/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/src/**/*.py: Prefer `sentence-transformers` for local embeddings, `google-genai` for Gemini embeddings, `cryptography` for AES-GCM encrypted system.db, `bm25s` for BM25 rule ranking, and `mem0ai` for external memory adapters; guard all optional dependency imports with `try / except ImportError` at the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bare `except: pass`; use typed exceptions or at minimum `logger.warning(...)` with `exc_info=True` to avoid silent failure in a memory product
Never import from out-of-scope sibling directories `../Sprites/` or `../Hausgem/` within `gradata/*` code; that is a layering bug
Never leak private-sibling paths into public docs/code: no references to `../Sprites/`, `../Hausgem/`, email addresses, OneDrive paths, or Sprites-specific examples from inside `gradata/*`
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes
Files:
Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py
Gradata/src/gradata/_core.py
Gradata/src/gradata/enhancements/self_improvement/_graduation.py
Gradata/tests/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/tests/**/*.py: Set `BRAIN_DIR` environment variable via `tmp_path` in conftest.py for test isolation; ensure `_paths.py` module cache refreshes when calling `Brain.init()` directly inside tests
Add unit tests in `tests/test_*.py` for every CI push without LLM calls (deterministic); mark integration tests with `@pytest.mark.integration` and skip them by default (they hit real LLM APIs)
Files:
Gradata/tests/test_dedup.py
Gradata/tests/test_weekly_correction_snapshot.py
🪛 GitHub Actions: SDK CI / 0_pytest (py3.11).txt
Gradata/tests/test_dedup.py
[error] 223-223: Test failed in 'test_brain_correct_semantic_near_duplicate_is_deduped': expected second result to have observation_deduped == True, but second.get('observation_deduped') was None (AssertionError).
🪛 GitHub Actions: SDK CI / 1_pytest (py3.12).txt
Gradata/tests/test_dedup.py
[error] 223-223: Assertion failed in tests/test_dedup.py::test_brain_correct_semantic_near_duplicate_is_deduped. Expected second.get('observation_deduped') is True, but got None.
🪛 GitHub Actions: SDK CI / pytest (py3.11)
Gradata/tests/test_dedup.py
[error] 223-223: AssertionError: expected second.get("observation_deduped") is True, but value was None (deduplication near-duplicate not marked as deduped)
🪛 GitHub Actions: SDK CI / pytest (py3.12)
Gradata/tests/test_dedup.py
[error] 223-223: Assertion failed in test_brain_correct_semantic_near_duplicate_is_deduped: expected second.get('observation_deduped') is True, but got None.
🔇 Additional comments (5)
Gradata/src/gradata/enhancements/self_improvement/_graduation.py (1)
31-33: Good threshold-parser extraction and wiring.
This keeps `_read_beta_lb_config()` lean and moves env parsing/clamping concerns into a single shim without changing the default-path behavior.
Also applies to: 120-120
Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py (1)
13-17: Nice constantization of GRA-210 experiment knobs.
The naming and separation make the rollout intent explicit and keep call sites stable.
.gitignore (1)
178-178: Good targeted un-ignore entry.
This cleanly keeps `Gradata/scripts/*` ignored while allowing the new snapshot script to be tracked.
Gradata/pyproject.toml (1)
168-168: Pytest path update looks correct.
Adding `scripts` to `pythonpath` is a practical fix for test-time imports of script modules.
Gradata/docs/weekly-correction-snapshot.md (1)
1-37: Clear and actionable docs.
The usage examples and schema description are concise and give a solid contract for downstream consumers.
def _is_graduation_accepted(row: dict[str, Any]) -> bool:
    event = str(row.get("event", "")).strip().lower()
    outcome = str(row.get("outcome", "")).strip().lower()
    accepted_flag = row.get("accepted")
    status = str(row.get("status", "")).strip().lower()
    return (
        event in {"lesson.graduated", "graduation.accepted"}
        or outcome == "accepted"
        or accepted_flag is True
        or status in {"accepted", "graduated"}
    )
Graduation metrics are currently over-counted by non-graduation rows.
Lines 31–36 and 44–49 treat generic outcome/status/accepted fields as graduation outcomes without verifying the row is graduation-related, so unrelated events can skew accepted_graduations/rejection_count.
Proposed fix
+def _is_graduation_row(row: dict[str, Any]) -> bool:
+    event = str(row.get("event", "")).strip().lower()
+    kind = str(row.get("kind", "")).strip().lower()
+    return "graduation" in event or event.startswith("lesson.") or kind == "graduation"
+
 def _is_graduation_accepted(row: dict[str, Any]) -> bool:
     event = str(row.get("event", "")).strip().lower()
     outcome = str(row.get("outcome", "")).strip().lower()
     accepted_flag = row.get("accepted")
     status = str(row.get("status", "")).strip().lower()
     return (
         event in {"lesson.graduated", "graduation.accepted"}
-        or outcome == "accepted"
-        or accepted_flag is True
-        or status in {"accepted", "graduated"}
+        or (_is_graduation_row(row) and outcome == "accepted")
+        or (_is_graduation_row(row) and accepted_flag is True)
+        or (_is_graduation_row(row) and status in {"accepted", "graduated"})
     )

 def _is_rejection(row: dict[str, Any]) -> bool:
     event = str(row.get("event", "")).strip().lower()
     outcome = str(row.get("outcome", "")).strip().lower()
     accepted_flag = row.get("accepted")
     status = str(row.get("status", "")).strip().lower()
     return (
         event in {"graduation.rejected", "lesson.rejected"}
-        or outcome == "rejected"
-        or accepted_flag is False
-        or status == "rejected"
+        or (_is_graduation_row(row) and outcome == "rejected")
+        or (_is_graduation_row(row) and accepted_flag is False)
+        or (_is_graduation_row(row) and status == "rejected")
     )
Also applies to: 39-49
     approval_required: bool = False,
     dry_run: bool = False,
-    min_severity: str = "as-is",
+    min_severity: str = "minor",
Revert the default min_severity to avoid an unintentional behavior change.
Line 120 changes default gating to "minor", which suppresses "as-is" corrections for all existing callers and changes learning behavior by default.
Proposed fix
-    min_severity: str = "minor",
+    min_severity: str = "as-is",
def read_beta_lb_threshold(default: float = GRA_210_GRADUATION_THRESHOLD_DEFAULT) -> float:
    """Read the Beta-LB threshold override from env.

    Returns a float clipped to [0.0, 1.0], or ``default`` when parsing fails.
    """
    raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
    if raw_value is None:
        return default

    try:
        threshold = float(raw_value)
    except (TypeError, ValueError):
        return default

    if not math.isfinite(threshold):
        return default

    return min(max(threshold, 0.0), 1.0)
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Normalize default before returning to honor the function contract.
When env parsing fails (or env is unset), an out-of-range default can currently bypass clipping even though the docstring promises a [0.0, 1.0] result.
Proposed patch
 def read_beta_lb_threshold(default: float = GRA_210_GRADUATION_THRESHOLD_DEFAULT) -> float:
@@
-    raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
+    if not math.isfinite(default):
+        default = GRA_210_GRADUATION_THRESHOLD_DEFAULT
+    default = min(max(default, 0.0), 1.0)
+
+    raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
     if raw_value is None:
         return default
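To make the contract concrete, here is a self-contained sketch of the function with every return path normalized. The two constant values are placeholders, since the real values in `_graduation_flags.py` are not shown in this review, and `_clip_unit` is a hypothetical helper.

```python
import math
import os

# Placeholder values: the real constants are not visible in this review.
GRA_210_GRADUATION_THRESHOLD_ENV = "GRA_210_GRADUATION_THRESHOLD"
GRA_210_GRADUATION_THRESHOLD_DEFAULT = 0.5


def _clip_unit(value: float) -> float:
    """Clamp a finite float into [0.0, 1.0] (hypothetical helper)."""
    return min(max(value, 0.0), 1.0)


def read_beta_lb_threshold(default: float = GRA_210_GRADUATION_THRESHOLD_DEFAULT) -> float:
    """Read the threshold override from env; every path returns a value in [0.0, 1.0]."""
    # Normalize the fallback first so no later branch can leak an out-of-range value.
    if not math.isfinite(default):
        default = GRA_210_GRADUATION_THRESHOLD_DEFAULT
    default = _clip_unit(default)

    raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
    if raw_value is None:
        return default
    try:
        threshold = float(raw_value)
    except (TypeError, ValueError):
        return default
    if not math.isfinite(threshold):
        return default  # "inf" and "nan" parse as floats but are rejected here
    return _clip_unit(threshold)
```

With the env variable unset, `read_beta_lb_threshold(1.7)` would return 1.0 under this sketch instead of leaking 1.7 to the caller.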
    assert first.get("observation_deduped") is not True

    second = brain.correct(a2, b2, category="DRAFTING", session=2)
    assert second.get("observation_deduped") is True
CI blocker: semantic dedup assertion is failing now.
Line 223 is red in both py3.11 and py3.12 (second.get("observation_deduped") is None). This test needs deterministic setup (or corrected expectation) before merge.
As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".
def test_format_drafting_floor_passes_larger_ed_corrections(fresh_brain):
    """FORMAT/DRAFTING corrections with ed ≥ 0.07 must NOT be filtered."""
    brain = fresh_brain
    # Multi-word restructure in DRAFTING; edit distance should exceed 0.07.
    draft = "Maybe we could perhaps consider thinking about simplifying this."
    final = "Simplify this."
    result = brain.correct(draft, final, category="DRAFTING", session=11)
    assert result.get("low_signal_filtered") is not True, (
        f"Expected substantial DRAFTING edit to pass; ed={result.get('edit_distance')}"
    )
These threshold tests are brittle because they depend on live diff-engine behavior.
The assertions rely on expected edit-distance outcomes from natural-language rewrites without stubbing compute_diff, so small diff-engine changes can cause flaky failures.
As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".
Also applies to: 358-375
def test_aggregate_counts_and_top_categories_deterministically():
    rows = [
        {"event": "correction.created", "category": "Tone"},
        {"event": "correction.created", "category": "tone"},
        {"event": "correction.created", "category": "factual"},
        {"event": "correction.created", "category": " PROCESS "},
        {"kind": "correction", "category": ""},
        {"event": "lesson.graduated"},
        {"event": "graduation.accepted"},
        {"outcome": "accepted"},
        {"event": "graduation.rejected"},
        {"accepted": False},
    ]
    data = snapshot.aggregate(rows)
    assert data["total_corrections"] == 5
    assert data["accepted_graduations"] == 3
    assert data["rejection_count"] == 2
    assert data["acceptance_rate"] == 0.6
    assert data["top_rule_categories"] == [
        {"category": "tone", "count": 2},
        {"category": "factual", "count": 1},
        {"category": "process", "count": 1},
        {"category": "unknown", "count": 1},
    ]
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Add a regression case for non-graduation status/outcome rows.
Current coverage doesn’t guard against counting unrelated events that happen to carry status="accepted" / outcome="rejected". A small test here will prevent silent metric drift.
As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".
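The regression case could be sketched as below. The predicates mirror the guarded versions proposed for `weekly_correction_snapshot.py`; `_norm` and `count_graduations` are hypothetical helpers standing in for `snapshot.aggregate`, whose full implementation is not shown in this review.

```python
from typing import Any

_ACCEPT_EVENTS = {"lesson.graduated", "graduation.accepted"}
_REJECT_EVENTS = {"graduation.rejected", "lesson.rejected"}


def _norm(row: dict[str, Any], key: str) -> str:
    """Read a field as a trimmed, lower-cased string (hypothetical helper)."""
    return str(row.get(key, "")).strip().lower()


def _is_graduation_row(row: dict[str, Any]) -> bool:
    event = _norm(row, "event")
    return "graduation" in event or event.startswith("lesson.") or _norm(row, "kind") == "graduation"


def _is_graduation_accepted(row: dict[str, Any]) -> bool:
    if _norm(row, "event") in _ACCEPT_EVENTS:
        return True
    if not _is_graduation_row(row):
        return False  # generic outcome/status/accepted fields are ignored
    return (
        _norm(row, "outcome") == "accepted"
        or row.get("accepted") is True
        or _norm(row, "status") in {"accepted", "graduated"}
    )


def _is_rejection(row: dict[str, Any]) -> bool:
    if _norm(row, "event") in _REJECT_EVENTS:
        return True
    if not _is_graduation_row(row):
        return False
    return (
        _norm(row, "outcome") == "rejected"
        or row.get("accepted") is False
        or _norm(row, "status") == "rejected"
    )


def count_graduations(rows: list[dict[str, Any]]) -> tuple[int, int]:
    """Return (accepted_graduations, rejection_count) for a list of event rows."""
    accepted = sum(1 for row in rows if _is_graduation_accepted(row))
    rejected = sum(1 for row in rows if _is_rejection(row))
    return accepted, rejected
```

Note that once the predicates are guarded this way, the bare `{"outcome": "accepted"}` and `{"accepted": False}` rows in the existing test would no longer count as graduation outcomes, so its expected values of 3, 2, and 0.6 would need to change in the same commit.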
Implements GRA-216 for GRA-210.
Environment rollout for the experiment:
Model: [codex_local]