GRA-216: Add graduation threshold experiment flag shim by Gradata · Pull Request #188 · Gradata/gradata

Gradata · 2026-05-13T01:32:48Z

Implements GRA-216 for GRA-210.

Added gradata.enhancements.self_improvement._graduation_flags as a thin experiment shim for the Beta-LB threshold.
Wired _read_beta_lb_config() to use read_beta_lb_threshold(...).
No behavior change by default; default remains 0.75 unless GRADATA_BETA_LB_THRESHOLD overrides it.

Environment rollout for the experiment:

Set Railway staging env var GRADATA_BETA_LB_THRESHOLD=0.55 on gradata service.

Model: [codex_local]

…NG (GRA-113 cycle 3) Raises the low-signal edit-distance floor from 0.04 to 0.07 for FORMAT and DRAFTING categories so synonym-swap and minor phrasing edits (0.04 ≤ ed < 0.07) are filtered before they create lessons. SECURITY and ACCURACY corrections are unaffected — they already bypass the gate via _is_meaningful_low_signal_change. - Add _FORMAT_DRAFTING_EDIT_DISTANCE_FLOOR = 0.07 and _FORMAT_DRAFTING_CATEGORIES - Gate logic now selects a category-specific floor before the low_signal_filtered check - 6 new tests in test_dedup.py covering constants, filter boundary, pass-through for SECURITY/ACCURACY, and non-FORMAT/DRAFTING categories retaining the 0.04 floor Co-Authored-By: Paperclip <noreply@paperclip.ing>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

coderabbitai · 2026-05-13T01:33:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 22837c8f-c503-4e77-ba4c-81e6e29341a9

📥 Commits

Reviewing files that changed from the base of the PR and between 3fc53c6 and 0e03bd1.

📒 Files selected for processing (1)

Gradata/tests/test_byo_key_provider.py

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: pytest (py3.12)
GitHub Check: pytest windows-latest / py3.12
GitHub Check: pytest windows-latest / py3.11
GitHub Check: pytest ubuntu-latest / py3.11
GitHub Check: pytest macos-latest / py3.11
GitHub Check: pytest macos-latest / py3.12
GitHub Check: pytest ubuntu-latest / py3.12

🧰 Additional context used

📓 Path-based instructions (1)

Gradata/tests/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/tests/**/*.py: Set BRAIN_DIR environment variable via tmp_path in conftest.py for test isolation — ensure _paths.py module cache refreshes when calling Brain.init() directly inside tests
Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic); mark integration tests with @pytest.mark.integration and skip them by default (they hit real LLM APIs)

Files:

Gradata/tests/test_byo_key_provider.py

🔇 Additional comments (1)

Gradata/tests/test_byo_key_provider.py (1)

3-8: Dependency guard for optional httpx test runtime is clean.

pytest.importorskip("httpx") is an appropriate module-level guard here and keeps this suite deterministic when httpx is present.

📝 Walkthrough

New experiment shim module: Added gradata.enhancements.self_improvement._graduation_flags with constants and read_beta_lb_threshold(default=0.75) to control the Beta-LB graduation gating (GRA-210).
Refactored threshold logic: _read_beta_lb_config() now delegates parsing/validation/clamping to read_beta_lb_threshold().
Default behavior preserved: default threshold remains 0.75; overridden via GRADATA_BETA_LB_THRESHOLD and clamped to [0.0, 1.0].
Breaking change: brain_correct() default min_severity changed from "as-is" to "minor".
Low-signal filtering: introduced category-aware edit-distance floors (higher for FORMAT/DRAFTING; SECURITY/ACCURACY always pass) and suppression of very small/non-meaningful edits (emits low_signal_filtered).
Auto-export enhancement: brain_end_session() now auto-exports AGENTS.md by default, controllable via GRADATA_AUTO_EXPORT_AGENTS env var.
New CLI utility: scripts/weekly_correction_snapshot.py (parse_rows, aggregate, _read_lines, main) — transforms NDJSON event streams into deterministic compact JSON snapshots including acceptance rate, top 5 rule categories, and skipped_rows.
Tests added/expanded: 9 dedup/low-signal tests (tests/test_dedup.py) and 5 weekly snapshot tests (tests/test_weekly_correction_snapshot.py); plus minor test import change to skip BYO-key tests if httpx missing.
Minor infra/docs updates: pyproject.toml pytest pythonpath extended to include scripts/, added docs/weekly-correction-snapshot.md, and .gitignore adjusted to un-ignore the new script.

Walkthrough

This PR introduces three independent feature additions: a weekly correction snapshot CLI tool that aggregates NDJSON event streams into deterministic JSON statistics; low-signal filtering logic in the core correction pipeline to suppress tiny, non-meaningful corrections while auto-exporting learned agents; and a refactoring of Beta-LB graduation threshold parsing into a dedicated module.

Changes

Weekly Correction Snapshot Feature

Layer / File(s)	Summary
Snapshot script implementation `Gradata/scripts/weekly_correction_snapshot.py`	Core functions parse NDJSON lines into dicts, classify rows as corrections/accepted/rejected graduations based on event heuristics, aggregate totals and acceptance rates, and emit deterministic JSON to stdout with configurable file or stdin input.
Testing, documentation, and integration `Gradata/tests/test_weekly_correction_snapshot.py`, `Gradata/docs/weekly-correction-snapshot.md`, `Gradata/pyproject.toml`, `.gitignore`	Pytest functions validate parse_rows skips malformed JSON, aggregate computes zero-safe defaults and deterministic top categories, and main reads stdin and emits exact JSON; documentation specifies output schema; pythonpath extended to include `scripts`; `.gitignore` explicitly un-ignores the new script.

Low-Signal Filtering in Correction Pipeline

Layer / File(s)	Summary
Edit-distance floors and meaningfulness predicate `Gradata/src/gradata/_core.py`	Introduces edit-distance thresholds (base 0.04 and FORMAT/DRAFTING-specific 0.07), os import for environment variables, and _is_meaningful_low_signal_change() predicate that exempts ACCURACY/SECURITY and permits case-only or acronym/proper-noun-like changes.
Lesson-creation gating and auto-export `Gradata/src/gradata/_core.py`	Updates brain_correct() default min_severity from "as-is" to "minor"; enhances lesson creation to compute category-dependent floors, suppress low-signal corrections when not meaningful, set low_signal_filtered flag, and auto-export AGENTS.md post-graduation when GRADATA_AUTO_EXPORT_AGENTS is not disabled.
Low-signal and dedup test coverage `Gradata/tests/test_dedup.py`, `Gradata/tests/test_byo_key_provider.py`	Nine tests assert semantic near-duplicates are deduped, punctuation-only corrections filtered, dedup preserves lineage, FORMAT/DRAFTING uses higher threshold (0.07 vs 0.04), filtering behavior at various edit distances, SECURITY/ACCURACY always pass, and non-FORMAT/DRAFTING categories keep the base floor; plus a small test-module skip import change for BYO-key tests.

Beta-LB Graduation Threshold Refactoring

Layer / File(s)	Summary
Graduation flags module with threshold reading `Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py`	New module defines GRA-210 experiment constants, GRADATA_BETA_LB_THRESHOLD env var, 0.75 default, and read_beta_lb_threshold() which parses environment values, validates as finite float, falls back to a default, and clamps to [0.0, 1.0].
Graduation.py refactored to use flags module `Gradata/src/gradata/enhancements/self_improvement/_graduation.py`	Imports read_beta_lb_threshold and delegates Beta-LB threshold computation; removes previous inline float parsing, math.isfinite validation, and clamping logic along with the unused math import.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Gradata/gradata#31: Touches brain_end_session(...)/graduation-related flow, related to AGENTS.md export changes.
Gradata/gradata#85: Modifies brain_correct()/correction metadata paths, related to low-signal filtering and emitted event flags.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding a graduation threshold experiment flag shim for GRA-216.
Description check	✅ Passed	The description is directly related to the changeset, explaining the implementation of GRA-216, the new module, wiring details, and environment rollout instructions.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/gra-216-beta-lb-threshold

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.20.0)

OpenGrep fatal error (exit code 2):
┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

�[1m Loading rules from local config...�[0m
[00.15][ERROR]: Error: exception Glob.Lexer.Syntax_error("malformed glob pattern: missing ']'")
Raised at Glob__Lexer.syntax_error in file "libs/glob/Lexer.mll", line 8, characters 2-26
Called from Glob__Lexer.__ocaml_lex_token_rec in file "libs/glob/Lexer.mll", line 29, characters 26-53
Cal

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Gradata/scripts/weekly_correction_snapshot.py`:
- Around line 26-36: The function _is_graduation_accepted is treating generic
outcome/status/accepted fields as graduation results even when the event isn't
graduation-related; update it to first detect that the row is a graduation event
(e.g., event contains "graduation" or is in the graduation event set like
{"lesson.graduated","graduation.accepted","graduation.rejected"}) and only then
evaluate outcome/status/accepted for accepted logic. Make the same change for
the corresponding rejection logic (the similar block at lines referenced 39–49)
so outcome/status/accepted are only considered when the row is confirmed to be a
graduation event.

In `@Gradata/src/gradata/_core.py`:
- Line 120: Revert the default value of the min_severity parameter back to its
prior setting to avoid changing default gating behavior—restore min_severity
from "minor" to the original "as-is" in the function or class signature where
min_severity is declared (look for the min_severity parameter in _core.py) so
existing callers continue to receive "as-is" corrections by default.

In `@Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py`:
- Around line 19-37: The function read_beta_lb_threshold can return an
out-of-range default without clipping; update it so every return path returns a
normalized value in [0.0, 1.0] by clipping the default before returning when
raw_value is None, parsing fails, or threshold is non-finite. Modify
read_beta_lb_threshold to compute a clipped_default = min(max(default, 0.0),
1.0) (or reuse the same min/max logic used for threshold) and return
clipped_default instead of default in the branches that currently return
default; keep the same behavior for successfully parsed finite thresholds.

In `@Gradata/tests/test_dedup.py`:
- Around line 333-343: The test
test_format_drafting_floor_passes_larger_ed_corrections is brittle because it
depends on the live diff engine; instead patch or monkeypatch the compute_diff
function used by brain.correct to return a deterministic diff result with
edit_distance >= 0.07 (and appropriate token counts) so the low_signal_filtered
path is exercised reliably. Locate where brain.correct calls compute_diff (or
the module function named compute_diff) and in the test use unittest.mock.patch
or pytest monkeypatch to supply a fixed dict like {"edit_distance": 0.08, ...}
before calling brain.correct, then assert result.get("low_signal_filtered") is
not True; repeat the same approach for the other similar test referenced in the
comment.
- Line 223: The failing assertion shows second.get("observation_deduped") is
None because the test doesn't run the deterministic dedup step or seed any
randomness before checking the flag; update tests/test_dedup.py to ensure the
code path that marks the observation runs deterministically (call the semantic
dedup function used in the codebase or invoke the pipeline step that sets the
"observation_deduped" field for the `second` object) and/or seed/mock any
randomness or external services used by that dedup step so the flag is reliably
set, then assert second.get("observation_deduped") is True.

In `@Gradata/tests/test_weekly_correction_snapshot.py`:
- Around line 31-55: Add a regression test to ensure snapshot.aggregate ignores
stray status/outcome fields on non-graduation rows: in
test_aggregate_counts_and_top_categories_deterministically, append rows like
{"status": "accepted"} and {"outcome": "rejected"} (with no "event":
"graduation.*" or associated graduation row) and assert that
accepted_graduations and rejection_count remain the same as before; use the
existing snapshot.aggregate function and existing assertions for
total_corrections, acceptance_rate and top_rule_categories to confirm these
stray fields do not affect graduation metrics.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 30280937-ce9f-4f04-b6f4-041e1330af75

📥 Commits

Reviewing files that changed from the base of the PR and between 5a5c4d8 and 3fc53c6.

📒 Files selected for processing (9)

.gitignore
Gradata/docs/weekly-correction-snapshot.md
Gradata/pyproject.toml
Gradata/scripts/weekly_correction_snapshot.py
Gradata/src/gradata/_core.py
Gradata/src/gradata/enhancements/self_improvement/_graduation.py
Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py
Gradata/tests/test_dedup.py
Gradata/tests/test_weekly_correction_snapshot.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: pytest macos-latest / py3.11
GitHub Check: pytest ubuntu-latest / py3.12
GitHub Check: pytest windows-latest / py3.12
GitHub Check: pytest ubuntu-latest / py3.11
GitHub Check: pytest windows-latest / py3.11
GitHub Check: pytest macos-latest / py3.12

🧰 Additional context used

📓 Path-based instructions (3)

Gradata/**/pyproject.toml

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Maintain dependencies = [] in pyproject.toml — the base package is pure Python + stdlib with all heavy dependencies gated as optional extras: embeddings, gemini, encrypted, ranking, adapters-mem0

Files:

Gradata/pyproject.toml

Gradata/src/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/src/**/*.py: Prefer sentence-transformers for local embeddings, google-genai for Gemini embeddings, cryptography for AES-GCM encrypted system.db, bm25s for BM25 rule ranking, and mem0ai for external memory adapters — guard all optional dependency imports with try / except ImportError at the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product
Never import from out-of-scope sibling directories ../Sprites/ or ../Hausgem/ within gradata/* code — that is a layering bug
Never leak private-sibling paths into public docs/code — no references to ../Sprites/, ../Hausgem/, email addresses, OneDrive paths, or Sprites-specific examples from inside gradata/*
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes

Files:

Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py
Gradata/src/gradata/_core.py
Gradata/src/gradata/enhancements/self_improvement/_graduation.py

Gradata/tests/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/tests/**/*.py: Set BRAIN_DIR environment variable via tmp_path in conftest.py for test isolation — ensure _paths.py module cache refreshes when calling Brain.init() directly inside tests
Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic); mark integration tests with @pytest.mark.integration and skip them by default (they hit real LLM APIs)

Files:

Gradata/tests/test_dedup.py
Gradata/tests/test_weekly_correction_snapshot.py

🪛 GitHub Actions: SDK CI / 0_pytest (py3.11).txt

Gradata/tests/test_dedup.py

[error] 223-223: Test failed in 'test_brain_correct_semantic_near_duplicate_is_deduped': expected second result to have observation_deduped == True, but second.get('observation_deduped') was None (AssertionError).

🪛 GitHub Actions: SDK CI / 1_pytest (py3.12).txt

Gradata/tests/test_dedup.py

[error] 223-223: Assertion failed in tests/test_dedup.py::test_brain_correct_semantic_near_duplicate_is_deduped. Expected second.get('observation_deduped') is True, but got None.

🪛 GitHub Actions: SDK CI / pytest (py3.11)

Gradata/tests/test_dedup.py

[error] 223-223: AssertionError: expected second.get("observation_deduped") is True, but value was None (deduplication near-duplicate not marked as deduped)

🪛 GitHub Actions: SDK CI / pytest (py3.12)

Gradata/tests/test_dedup.py

[error] 223-223: Assertion failed in test_brain_correct_semantic_near_duplicate_is_deduped: expected second.get('observation_deduped') is True, but got None.

🔇 Additional comments (5)

Gradata/src/gradata/enhancements/self_improvement/_graduation.py (1)

31-33: Good threshold-parser extraction and wiring.

This keeps _read_beta_lb_config() lean and moves env parsing/clamping concerns into a single shim without changing the default-path behavior.

Also applies to: 120-120

Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py (1)

13-17: Nice constantization of GRA-210 experiment knobs.

The naming and separation make the rollout intent explicit and keep call sites stable.

.gitignore (1)

178-178: Good targeted un-ignore entry.

This cleanly keeps Gradata/scripts/* ignored while allowing the new snapshot script to be tracked.

Gradata/pyproject.toml (1)

168-168: Pytest path update looks correct.

Adding scripts to pythonpath is a practical fix for test-time imports of script modules.

Gradata/docs/weekly-correction-snapshot.md (1)

1-37: Clear and actionable docs.

The usage examples and schema description are concise and give a solid contract for downstream consumers.

coderabbitai · 2026-05-13T01:36:45Z

+def _is_graduation_accepted(row: dict[str, Any]) -> bool:
+    event = str(row.get("event", "")).strip().lower()
+    outcome = str(row.get("outcome", "")).strip().lower()
+    accepted_flag = row.get("accepted")
+    status = str(row.get("status", "")).strip().lower()
+    return (
+        event in {"lesson.graduated", "graduation.accepted"}
+        or outcome == "accepted"
+        or accepted_flag is True
+        or status in {"accepted", "graduated"}
+    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Graduation metrics are currently over-counted by non-graduation rows.

Lines 31–36 and 44–49 treat generic outcome/status/accepted fields as graduation outcomes without verifying the row is graduation-related, so unrelated events can skew accepted_graduations/rejection_count.

Proposed fix

+def _is_graduation_row(row: dict[str, Any]) -> bool: + event = str(row.get("event", "")).strip().lower() + kind = str(row.get("kind", "")).strip().lower() + return "graduation" in event or event.startswith("lesson.") or kind == "graduation" + def _is_graduation_accepted(row: dict[str, Any]) -> bool: event = str(row.get("event", "")).strip().lower() outcome = str(row.get("outcome", "")).strip().lower() accepted_flag = row.get("accepted") status = str(row.get("status", "")).strip().lower() return ( event in {"lesson.graduated", "graduation.accepted"} - or outcome == "accepted" - or accepted_flag is True - or status in {"accepted", "graduated"} + or (_is_graduation_row(row) and outcome == "accepted") + or (_is_graduation_row(row) and accepted_flag is True) + or (_is_graduation_row(row) and status in {"accepted", "graduated"}) ) def _is_rejection(row: dict[str, Any]) -> bool: event = str(row.get("event", "")).strip().lower() outcome = str(row.get("outcome", "")).strip().lower() accepted_flag = row.get("accepted") status = str(row.get("status", "")).strip().lower() return ( event in {"graduation.rejected", "lesson.rejected"} - or outcome == "rejected" - or accepted_flag is False - or status == "rejected" + or (_is_graduation_row(row) and outcome == "rejected") + or (_is_graduation_row(row) and accepted_flag is False) + or (_is_graduation_row(row) and status == "rejected") )

Also applies to: 39-49

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/scripts/weekly_correction_snapshot.py` around lines 26 - 36, The function _is_graduation_accepted is treating generic outcome/status/accepted fields as graduation results even when the event isn't graduation-related; update it to first detect that the row is a graduation event (e.g., event contains "graduation" or is in the graduation event set like {"lesson.graduated","graduation.accepted","graduation.rejected"}) and only then evaluate outcome/status/accepted for accepted logic. Make the same change for the corresponding rejection logic (the similar block at lines referenced 39–49) so outcome/status/accepted are only considered when the row is confirmed to be a graduation event.

coderabbitai · 2026-05-13T01:36:45Z

    approval_required: bool = False,
    dry_run: bool = False,
-    min_severity: str = "as-is",
+    min_severity: str = "minor",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Revert the default min_severity to avoid an unintentional behavior change.

Line 120 changes default gating to "minor", which suppresses "as-is" corrections for all existing callers and changes learning behavior by default.

Proposed fix

- min_severity: str = "minor", + min_severity: str = "as-is",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

min_severity: str = "minor",

min_severity: str = "as-is",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/src/gradata/_core.py` at line 120, Revert the default value of the min_severity parameter back to its prior setting to avoid changing default gating behavior—restore min_severity from "minor" to the original "as-is" in the function or class signature where min_severity is declared (look for the min_severity parameter in _core.py) so existing callers continue to receive "as-is" corrections by default.

coderabbitai · 2026-05-13T01:36:45Z

+def read_beta_lb_threshold(default: float = GRA_210_GRADUATION_THRESHOLD_DEFAULT) -> float:
+    """Read the Beta-LB threshold override from env.
+
+    Returns a float clipped to [0.0, 1.0], or ``default`` when parsing fails.
+    """
+
+    raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
+    if raw_value is None:
+        return default
+
+    try:
+        threshold = float(raw_value)
+    except (TypeError, ValueError):
+        return default
+
+    if not math.isfinite(threshold):
+        return default
+
+    return min(max(threshold, 0.0), 1.0)


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Normalize default before returning to honor the function contract.

When env parsing fails (or env is unset), an out-of-range default can currently bypass clipping even though the docstring promises a [0.0, 1.0] result.

Proposed patch

def read_beta_lb_threshold(default: float = GRA_210_GRADUATION_THRESHOLD_DEFAULT) -> float: @@ - raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV) + if not math.isfinite(default): + default = GRA_210_GRADUATION_THRESHOLD_DEFAULT + default = min(max(default, 0.0), 1.0) + + raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV) if raw_value is None: return default

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py` around lines 19 - 37, The function read_beta_lb_threshold can return an out-of-range default without clipping; update it so every return path returns a normalized value in [0.0, 1.0] by clipping the default before returning when raw_value is None, parsing fails, or threshold is non-finite. Modify read_beta_lb_threshold to compute a clipped_default = min(max(default, 0.0), 1.0) (or reuse the same min/max logic used for threshold) and return clipped_default instead of default in the branches that currently return default; keep the same behavior for successfully parsed finite thresholds.

coderabbitai · 2026-05-13T01:36:45Z

+    assert first.get("observation_deduped") is not True
+
+    second = brain.correct(a2, b2, category="DRAFTING", session=2)
+    assert second.get("observation_deduped") is True


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CI blocker: semantic dedup assertion is failing now.

Line 223 is red in both py3.11 and py3.12 (second.get("observation_deduped") is None). This test needs deterministic setup (or corrected expectation) before merge.

As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".

🧰 Tools

🪛 GitHub Actions: SDK CI / 0_pytest (py3.11).txt

[error] 223-223: Test failed in 'test_brain_correct_semantic_near_duplicate_is_deduped': expected second result to have observation_deduped == True, but second.get('observation_deduped') was None (AssertionError).

🪛 GitHub Actions: SDK CI / 1_pytest (py3.12).txt

[error] 223-223: Assertion failed in tests/test_dedup.py::test_brain_correct_semantic_near_duplicate_is_deduped. Expected second.get('observation_deduped') is True, but got None.

🪛 GitHub Actions: SDK CI / pytest (py3.11)

[error] 223-223: AssertionError: expected second.get("observation_deduped") is True, but value was None (deduplication near-duplicate not marked as deduped)

🪛 GitHub Actions: SDK CI / pytest (py3.12)

[error] 223-223: Assertion failed in test_brain_correct_semantic_near_duplicate_is_deduped: expected second.get('observation_deduped') is True, but got None.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/tests/test_dedup.py` at line 223, The failing assertion shows second.get("observation_deduped") is None because the test doesn't run the deterministic dedup step or seed any randomness before checking the flag; update tests/test_dedup.py to ensure the code path that marks the observation runs deterministically (call the semantic dedup function used in the codebase or invoke the pipeline step that sets the "observation_deduped" field for the `second` object) and/or seed/mock any randomness or external services used by that dedup step so the flag is reliably set, then assert second.get("observation_deduped") is True.

coderabbitai · 2026-05-13T01:36:45Z

+def test_format_drafting_floor_passes_larger_ed_corrections(fresh_brain):
+    """FORMAT/DRAFTING corrections with ed ≥ 0.07 must NOT be filtered."""
+    brain = fresh_brain
+    # Multi-word restructure in DRAFTING; edit distance should exceed 0.07.
+    draft = "Maybe we could perhaps consider thinking about simplifying this."
+    final = "Simplify this."
+    result = brain.correct(draft, final, category="DRAFTING", session=11)
+    assert result.get("low_signal_filtered") is not True, (
+        f"Expected substantial DRAFTING edit to pass; ed={result.get('edit_distance')}"
+    )
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These threshold tests are brittle because they depend on live diff-engine behavior.

The assertions rely on expected edit-distance outcomes from natural-language rewrites without stubbing compute_diff, so small diff-engine changes can cause flaky failures.

As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".

Also applies to: 358-375

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/tests/test_dedup.py` around lines 333 - 343, The test test_format_drafting_floor_passes_larger_ed_corrections is brittle because it depends on the live diff engine; instead patch or monkeypatch the compute_diff function used by brain.correct to return a deterministic diff result with edit_distance >= 0.07 (and appropriate token counts) so the low_signal_filtered path is exercised reliably. Locate where brain.correct calls compute_diff (or the module function named compute_diff) and in the test use unittest.mock.patch or pytest monkeypatch to supply a fixed dict like {"edit_distance": 0.08, ...} before calling brain.correct, then assert result.get("low_signal_filtered") is not True; repeat the same approach for the other similar test referenced in the comment.

coderabbitai · 2026-05-13T01:36:45Z

+def test_aggregate_counts_and_top_categories_deterministically():
+    rows = [
+        {"event": "correction.created", "category": "Tone"},
+        {"event": "correction.created", "category": "tone"},
+        {"event": "correction.created", "category": "factual"},
+        {"event": "correction.created", "category": "  PROCESS  "},
+        {"kind": "correction", "category": ""},
+        {"event": "lesson.graduated"},
+        {"event": "graduation.accepted"},
+        {"outcome": "accepted"},
+        {"event": "graduation.rejected"},
+        {"accepted": False},
+    ]
+    data = snapshot.aggregate(rows)
+    assert data["total_corrections"] == 5
+    assert data["accepted_graduations"] == 3
+    assert data["rejection_count"] == 2
+    assert data["acceptance_rate"] == 0.6
+    assert data["top_rule_categories"] == [
+        {"category": "tone", "count": 2},
+        {"category": "factual", "count": 1},
+        {"category": "process", "count": 1},
+        {"category": "unknown", "count": 1},
+    ]
+


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a regression case for non-graduation status/outcome rows.

Current coverage doesn’t guard against counting unrelated events that happen to carry status="accepted" / outcome="rejected". A small test here will prevent silent metric drift.

As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/tests/test_weekly_correction_snapshot.py` around lines 31 - 55, Add a regression test to ensure snapshot.aggregate ignores stray status/outcome fields on non-graduation rows: in test_aggregate_counts_and_top_categories_deterministically, append rows like {"status": "accepted"} and {"outcome": "rejected"} (with no "event": "graduation.*" or associated graduation row) and assert that accepted_graduations and rejection_count remain the same as before; use the existing snapshot.aggregate function and existing assertions for total_corrections, acceptance_rate and top_rule_categories to confirm these stray fields do not affect graduation metrics.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

data-engineer and others added 4 commits May 12, 2026 12:18

Add weekly correction outcome snapshot script and tests

6e286db

Fix weekly snapshot outcome classification and test import path

9f61abb

GRA-216: add graduation threshold experiment shim

3fc53c6

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

coderabbitai Bot added the feature label May 13, 2026

coderabbitai Bot requested changes May 13, 2026

View reviewed changes

Skip BYO key tests when optional httpx is unavailable

0e03bd1

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GRA-216: Add graduation threshold experiment flag shim#188

GRA-216: Add graduation threshold experiment flag shim#188
Gradata wants to merge 5 commits into
mainfrom
feat/gra-216-beta-lb-threshold

Gradata commented May 13, 2026

Uh oh!

greptile-apps Bot left a comment

Uh oh!

coderabbitai Bot commented May 13, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 13, 2026

Uh oh!

coderabbitai Bot May 13, 2026

Uh oh!

coderabbitai Bot May 13, 2026

Uh oh!

coderabbitai Bot May 13, 2026

Uh oh!

coderabbitai Bot May 13, 2026

Uh oh!

coderabbitai Bot May 13, 2026

Uh oh!

greptile-apps Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Gradata commented May 13, 2026

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 13, 2026 •

edited

Loading