Skip to content

GRA-216: Add graduation threshold experiment flag shim#188

Open
Gradata wants to merge 5 commits into
mainfrom
feat/gra-216-beta-lb-threshold
Open

GRA-216: Add graduation threshold experiment flag shim#188
Gradata wants to merge 5 commits into
mainfrom
feat/gra-216-beta-lb-threshold

Conversation

@Gradata
Copy link
Copy Markdown
Owner

@Gradata Gradata commented May 13, 2026

Implements GRA-216 for GRA-210.

  • Added gradata.enhancements.self_improvement._graduation_flags as a thin experiment shim for the Beta-LB threshold.
  • Wired _read_beta_lb_config() to use read_beta_lb_threshold(...).
  • No behavior change by default; default remains 0.75 unless GRADATA_BETA_LB_THRESHOLD overrides it.

Environment rollout for the experiment:

  • Set Railway staging env var GRADATA_BETA_LB_THRESHOLD=0.55 on gradata service.

Model: [codex_local]

data-engineer and others added 4 commits May 12, 2026 12:18
…NG (GRA-113 cycle 3)

Raises the low-signal edit-distance floor from 0.04 to 0.07 for FORMAT and
DRAFTING categories so synonym-swap and minor phrasing edits (0.04 ≤ ed < 0.07)
are filtered before they create lessons.  SECURITY and ACCURACY corrections are
unaffected — they already bypass the gate via _is_meaningful_low_signal_change.

- Add _FORMAT_DRAFTING_EDIT_DISTANCE_FLOOR = 0.07 and _FORMAT_DRAFTING_CATEGORIES
- Gate logic now selects a category-specific floor before the low_signal_filtered check
- 6 new tests in test_dedup.py covering constants, filter boundary, pass-through for
  SECURITY/ACCURACY, and non-FORMAT/DRAFTING categories retaining the 0.04 floor

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 22837c8f-c503-4e77-ba4c-81e6e29341a9

📥 Commits

Reviewing files that changed from the base of the PR and between 3fc53c6 and 0e03bd1.

📒 Files selected for processing (1)
  • Gradata/tests/test_byo_key_provider.py
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: pytest (py3.12)
  • GitHub Check: pytest windows-latest / py3.12
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest ubuntu-latest / py3.11
  • GitHub Check: pytest macos-latest / py3.11
  • GitHub Check: pytest macos-latest / py3.12
  • GitHub Check: pytest ubuntu-latest / py3.12
🧰 Additional context used
📓 Path-based instructions (1)
Gradata/tests/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/tests/**/*.py: Set BRAIN_DIR environment variable via tmp_path in conftest.py for test isolation — ensure _paths.py module cache refreshes when calling Brain.init() directly inside tests
Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic); mark integration tests with @pytest.mark.integration and skip them by default (they hit real LLM APIs)

Files:

  • Gradata/tests/test_byo_key_provider.py
🔇 Additional comments (1)
Gradata/tests/test_byo_key_provider.py (1)

3-8: Dependency guard for optional httpx test runtime is clean.

pytest.importorskip("httpx") is an appropriate module-level guard here and keeps this suite deterministic when httpx is present.


📝 Walkthrough
  • New experiment shim module: Added gradata.enhancements.self_improvement._graduation_flags with constants and read_beta_lb_threshold(default=0.75) to control the Beta-LB graduation gating (GRA-210).
  • Refactored threshold logic: _read_beta_lb_config() now delegates parsing/validation/clamping to read_beta_lb_threshold().
  • Default behavior preserved: default threshold remains 0.75; overridden via GRADATA_BETA_LB_THRESHOLD and clamped to [0.0, 1.0].
  • Breaking change: brain_correct() default min_severity changed from "as-is" to "minor".
  • Low-signal filtering: introduced category-aware edit-distance floors (higher for FORMAT/DRAFTING; SECURITY/ACCURACY always pass) and suppression of very small/non-meaningful edits (emits low_signal_filtered).
  • Auto-export enhancement: brain_end_session() now auto-exports AGENTS.md by default, controllable via GRADATA_AUTO_EXPORT_AGENTS env var.
  • New CLI utility: scripts/weekly_correction_snapshot.py (parse_rows, aggregate, _read_lines, main) — transforms NDJSON event streams into deterministic compact JSON snapshots including acceptance rate, top 5 rule categories, and skipped_rows.
  • Tests added/expanded: 9 dedup/low-signal tests (tests/test_dedup.py) and 5 weekly snapshot tests (tests/test_weekly_correction_snapshot.py); plus minor test import change to skip BYO-key tests if httpx missing.
  • Minor infra/docs updates: pyproject.toml pytest pythonpath extended to include scripts/, added docs/weekly-correction-snapshot.md, and .gitignore adjusted to un-ignore the new script.

Walkthrough

This PR introduces three independent feature additions: a weekly correction snapshot CLI tool that aggregates NDJSON event streams into deterministic JSON statistics; low-signal filtering logic in the core correction pipeline to suppress tiny, non-meaningful corrections while auto-exporting learned agents; and a refactoring of Beta-LB graduation threshold parsing into a dedicated module.

Changes

Weekly Correction Snapshot Feature

Layer / File(s) Summary
Snapshot script implementation
Gradata/scripts/weekly_correction_snapshot.py
Core functions parse NDJSON lines into dicts, classify rows as corrections/accepted/rejected graduations based on event heuristics, aggregate totals and acceptance rates, and emit deterministic JSON to stdout with configurable file or stdin input.
Testing, documentation, and integration
Gradata/tests/test_weekly_correction_snapshot.py, Gradata/docs/weekly-correction-snapshot.md, Gradata/pyproject.toml, .gitignore
Pytest functions validate parse_rows skips malformed JSON, aggregate computes zero-safe defaults and deterministic top categories, and main reads stdin and emits exact JSON; documentation specifies output schema; pythonpath extended to include scripts; .gitignore explicitly un-ignores the new script.

Low-Signal Filtering in Correction Pipeline

Layer / File(s) Summary
Edit-distance floors and meaningfulness predicate
Gradata/src/gradata/_core.py
Introduces edit-distance thresholds (base 0.04 and FORMAT/DRAFTING-specific 0.07), os import for environment variables, and _is_meaningful_low_signal_change() predicate that exempts ACCURACY/SECURITY and permits case-only or acronym/proper-noun-like changes.
Lesson-creation gating and auto-export
Gradata/src/gradata/_core.py
Updates brain_correct() default min_severity from "as-is" to "minor"; enhances lesson creation to compute category-dependent floors, suppress low-signal corrections when not meaningful, set low_signal_filtered flag, and auto-export AGENTS.md post-graduation when GRADATA_AUTO_EXPORT_AGENTS is not disabled.
Low-signal and dedup test coverage
Gradata/tests/test_dedup.py, Gradata/tests/test_byo_key_provider.py
Nine tests assert semantic near-duplicates are deduped, punctuation-only corrections filtered, dedup preserves lineage, FORMAT/DRAFTING uses higher threshold (0.07 vs 0.04), filtering behavior at various edit distances, SECURITY/ACCURACY always pass, and non-FORMAT/DRAFTING categories keep the base floor; plus a small test-module skip import change for BYO-key tests.

Beta-LB Graduation Threshold Refactoring

Layer / File(s) Summary
Graduation flags module with threshold reading
Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py
New module defines GRA-210 experiment constants, GRADATA_BETA_LB_THRESHOLD env var, 0.75 default, and read_beta_lb_threshold() which parses environment values, validates as finite float, falls back to a default, and clamps to [0.0, 1.0].
Graduation.py refactored to use flags module
Gradata/src/gradata/enhancements/self_improvement/_graduation.py
Imports read_beta_lb_threshold and delegates Beta-LB threshold computation; removes previous inline float parsing, math.isfinite validation, and clamping logic along with the unused math import.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Gradata/gradata#31: Touches brain_end_session(...)/graduation-related flow, related to AGENTS.md export changes.
  • Gradata/gradata#85: Modifies brain_correct()/correction metadata paths, related to low-signal filtering and emitted event flags.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding a graduation threshold experiment flag shim for GRA-216.
Description check ✅ Passed The description is directly related to the changeset, explaining the implementation of GRA-216, the new module, wiring details, and environment rollout instructions.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/gra-216-beta-lb-threshold

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.20.0)

OpenGrep fatal error (exit code 2):
┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

�[1m Loading rules from local config...�[0m
[00.15][ERROR]: Error: exception Glob.Lexer.Syntax_error("malformed glob pattern: missing ']'")
Raised at Glob__Lexer.syntax_error in file "libs/glob/Lexer.mll", line 8, characters 2-26
Called from Glob__Lexer.__ocaml_lex_token_rec in file "libs/glob/Lexer.mll", line 29, characters 26-53
Cal


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the feature label May 13, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Gradata/scripts/weekly_correction_snapshot.py`:
- Around line 26-36: The function _is_graduation_accepted is treating generic
outcome/status/accepted fields as graduation results even when the event isn't
graduation-related; update it to first detect that the row is a graduation event
(e.g., event contains "graduation" or is in the graduation event set like
{"lesson.graduated","graduation.accepted","graduation.rejected"}) and only then
evaluate outcome/status/accepted for accepted logic. Make the same change for
the corresponding rejection logic (the similar block at lines referenced 39–49)
so outcome/status/accepted are only considered when the row is confirmed to be a
graduation event.

In `@Gradata/src/gradata/_core.py`:
- Line 120: Revert the default value of the min_severity parameter back to its
prior setting to avoid changing default gating behavior—restore min_severity
from "minor" to the original "as-is" in the function or class signature where
min_severity is declared (look for the min_severity parameter in _core.py) so
existing callers continue to receive "as-is" corrections by default.

In `@Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py`:
- Around line 19-37: The function read_beta_lb_threshold can return an
out-of-range default without clipping; update it so every return path returns a
normalized value in [0.0, 1.0] by clipping the default before returning when
raw_value is None, parsing fails, or threshold is non-finite. Modify
read_beta_lb_threshold to compute a clipped_default = min(max(default, 0.0),
1.0) (or reuse the same min/max logic used for threshold) and return
clipped_default instead of default in the branches that currently return
default; keep the same behavior for successfully parsed finite thresholds.

In `@Gradata/tests/test_dedup.py`:
- Around line 333-343: The test
test_format_drafting_floor_passes_larger_ed_corrections is brittle because it
depends on the live diff engine; instead patch or monkeypatch the compute_diff
function used by brain.correct to return a deterministic diff result with
edit_distance >= 0.07 (and appropriate token counts) so the low_signal_filtered
path is exercised reliably. Locate where brain.correct calls compute_diff (or
the module function named compute_diff) and in the test use unittest.mock.patch
or pytest monkeypatch to supply a fixed dict like {"edit_distance": 0.08, ...}
before calling brain.correct, then assert result.get("low_signal_filtered") is
not True; repeat the same approach for the other similar test referenced in the
comment.
- Line 223: The failing assertion shows second.get("observation_deduped") is
None because the test doesn't run the deterministic dedup step or seed any
randomness before checking the flag; update tests/test_dedup.py to ensure the
code path that marks the observation runs deterministically (call the semantic
dedup function used in the codebase or invoke the pipeline step that sets the
"observation_deduped" field for the `second` object) and/or seed/mock any
randomness or external services used by that dedup step so the flag is reliably
set, then assert second.get("observation_deduped") is True.

In `@Gradata/tests/test_weekly_correction_snapshot.py`:
- Around line 31-55: Add a regression test to ensure snapshot.aggregate ignores
stray status/outcome fields on non-graduation rows: in
test_aggregate_counts_and_top_categories_deterministically, append rows like
{"status": "accepted"} and {"outcome": "rejected"} (with no "event":
"graduation.*" or associated graduation row) and assert that
accepted_graduations and rejection_count remain the same as before; use the
existing snapshot.aggregate function and existing assertions for
total_corrections, acceptance_rate and top_rule_categories to confirm these
stray fields do not affect graduation metrics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 30280937-ce9f-4f04-b6f4-041e1330af75

📥 Commits

Reviewing files that changed from the base of the PR and between 5a5c4d8 and 3fc53c6.

📒 Files selected for processing (9)
  • .gitignore
  • Gradata/docs/weekly-correction-snapshot.md
  • Gradata/pyproject.toml
  • Gradata/scripts/weekly_correction_snapshot.py
  • Gradata/src/gradata/_core.py
  • Gradata/src/gradata/enhancements/self_improvement/_graduation.py
  • Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py
  • Gradata/tests/test_dedup.py
  • Gradata/tests/test_weekly_correction_snapshot.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: pytest macos-latest / py3.11
  • GitHub Check: pytest ubuntu-latest / py3.12
  • GitHub Check: pytest windows-latest / py3.12
  • GitHub Check: pytest ubuntu-latest / py3.11
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest macos-latest / py3.12
🧰 Additional context used
📓 Path-based instructions (3)
Gradata/**/pyproject.toml

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Maintain dependencies = [] in pyproject.toml — the base package is pure Python + stdlib with all heavy dependencies gated as optional extras: embeddings, gemini, encrypted, ranking, adapters-mem0

Files:

  • Gradata/pyproject.toml
Gradata/src/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/src/**/*.py: Prefer sentence-transformers for local embeddings, google-genai for Gemini embeddings, cryptography for AES-GCM encrypted system.db, bm25s for BM25 rule ranking, and mem0ai for external memory adapters — guard all optional dependency imports with try / except ImportError at the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product
Never import from out-of-scope sibling directories ../Sprites/ or ../Hausgem/ within gradata/* code — that is a layering bug
Never leak private-sibling paths into public docs/code — no references to ../Sprites/, ../Hausgem/, email addresses, OneDrive paths, or Sprites-specific examples from inside gradata/*
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes

Files:

  • Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py
  • Gradata/src/gradata/_core.py
  • Gradata/src/gradata/enhancements/self_improvement/_graduation.py
Gradata/tests/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/tests/**/*.py: Set BRAIN_DIR environment variable via tmp_path in conftest.py for test isolation — ensure _paths.py module cache refreshes when calling Brain.init() directly inside tests
Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic); mark integration tests with @pytest.mark.integration and skip them by default (they hit real LLM APIs)

Files:

  • Gradata/tests/test_dedup.py
  • Gradata/tests/test_weekly_correction_snapshot.py
🪛 GitHub Actions: SDK CI / 0_pytest (py3.11).txt
Gradata/tests/test_dedup.py

[error] 223-223: Test failed in 'test_brain_correct_semantic_near_duplicate_is_deduped': expected second result to have observation_deduped == True, but second.get('observation_deduped') was None (AssertionError).

🪛 GitHub Actions: SDK CI / 1_pytest (py3.12).txt
Gradata/tests/test_dedup.py

[error] 223-223: Assertion failed in tests/test_dedup.py::test_brain_correct_semantic_near_duplicate_is_deduped. Expected second.get('observation_deduped') is True, but got None.

🪛 GitHub Actions: SDK CI / pytest (py3.11)
Gradata/tests/test_dedup.py

[error] 223-223: AssertionError: expected second.get("observation_deduped") is True, but value was None (deduplication near-duplicate not marked as deduped)

🪛 GitHub Actions: SDK CI / pytest (py3.12)
Gradata/tests/test_dedup.py

[error] 223-223: Assertion failed in test_brain_correct_semantic_near_duplicate_is_deduped: expected second.get('observation_deduped') is True, but got None.

🔇 Additional comments (5)
Gradata/src/gradata/enhancements/self_improvement/_graduation.py (1)

31-33: Good threshold-parser extraction and wiring.

This keeps _read_beta_lb_config() lean and moves env parsing/clamping concerns into a single shim without changing the default-path behavior.

Also applies to: 120-120

Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py (1)

13-17: Nice constantization of GRA-210 experiment knobs.

The naming and separation make the rollout intent explicit and keep call sites stable.

.gitignore (1)

178-178: Good targeted un-ignore entry.

This cleanly keeps Gradata/scripts/* ignored while allowing the new snapshot script to be tracked.

Gradata/pyproject.toml (1)

168-168: Pytest path update looks correct.

Adding scripts to pythonpath is a practical fix for test-time imports of script modules.

Gradata/docs/weekly-correction-snapshot.md (1)

1-37: Clear and actionable docs.

The usage examples and schema description are concise and give a solid contract for downstream consumers.

Comment on lines +26 to +36
def _is_graduation_accepted(row: dict[str, Any]) -> bool:
event = str(row.get("event", "")).strip().lower()
outcome = str(row.get("outcome", "")).strip().lower()
accepted_flag = row.get("accepted")
status = str(row.get("status", "")).strip().lower()
return (
event in {"lesson.graduated", "graduation.accepted"}
or outcome == "accepted"
or accepted_flag is True
or status in {"accepted", "graduated"}
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Graduation metrics are currently over-counted by non-graduation rows.

Lines 31–36 and 44–49 treat generic outcome/status/accepted fields as graduation outcomes without verifying the row is graduation-related, so unrelated events can skew accepted_graduations/rejection_count.

Proposed fix
+def _is_graduation_row(row: dict[str, Any]) -> bool:
+    event = str(row.get("event", "")).strip().lower()
+    kind = str(row.get("kind", "")).strip().lower()
+    return "graduation" in event or event.startswith("lesson.") or kind == "graduation"
+
 def _is_graduation_accepted(row: dict[str, Any]) -> bool:
     event = str(row.get("event", "")).strip().lower()
     outcome = str(row.get("outcome", "")).strip().lower()
     accepted_flag = row.get("accepted")
     status = str(row.get("status", "")).strip().lower()
     return (
         event in {"lesson.graduated", "graduation.accepted"}
-        or outcome == "accepted"
-        or accepted_flag is True
-        or status in {"accepted", "graduated"}
+        or (_is_graduation_row(row) and outcome == "accepted")
+        or (_is_graduation_row(row) and accepted_flag is True)
+        or (_is_graduation_row(row) and status in {"accepted", "graduated"})
     )

 def _is_rejection(row: dict[str, Any]) -> bool:
     event = str(row.get("event", "")).strip().lower()
     outcome = str(row.get("outcome", "")).strip().lower()
     accepted_flag = row.get("accepted")
     status = str(row.get("status", "")).strip().lower()
     return (
         event in {"graduation.rejected", "lesson.rejected"}
-        or outcome == "rejected"
-        or accepted_flag is False
-        or status == "rejected"
+        or (_is_graduation_row(row) and outcome == "rejected")
+        or (_is_graduation_row(row) and accepted_flag is False)
+        or (_is_graduation_row(row) and status == "rejected")
     )

Also applies to: 39-49

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/scripts/weekly_correction_snapshot.py` around lines 26 - 36, The
function _is_graduation_accepted is treating generic outcome/status/accepted
fields as graduation results even when the event isn't graduation-related;
update it to first detect that the row is a graduation event (e.g., event
contains "graduation" or is in the graduation event set like
{"lesson.graduated","graduation.accepted","graduation.rejected"}) and only then
evaluate outcome/status/accepted for accepted logic. Make the same change for
the corresponding rejection logic (the similar block at lines referenced 39–49)
so outcome/status/accepted are only considered when the row is confirmed to be a
graduation event.

approval_required: bool = False,
dry_run: bool = False,
min_severity: str = "as-is",
min_severity: str = "minor",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Revert the default min_severity to avoid an unintentional behavior change.

Line 120 changes default gating to "minor", which suppresses "as-is" corrections for all existing callers and changes learning behavior by default.

Proposed fix
-    min_severity: str = "minor",
+    min_severity: str = "as-is",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
min_severity: str = "minor",
min_severity: str = "as-is",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/_core.py` at line 120, Revert the default value of the
min_severity parameter back to its prior setting to avoid changing default
gating behavior—restore min_severity from "minor" to the original "as-is" in the
function or class signature where min_severity is declared (look for the
min_severity parameter in _core.py) so existing callers continue to receive
"as-is" corrections by default.

Comment on lines +19 to +37
def read_beta_lb_threshold(default: float = GRA_210_GRADUATION_THRESHOLD_DEFAULT) -> float:
"""Read the Beta-LB threshold override from env.

Returns a float clipped to [0.0, 1.0], or ``default`` when parsing fails.
"""

raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
if raw_value is None:
return default

try:
threshold = float(raw_value)
except (TypeError, ValueError):
return default

if not math.isfinite(threshold):
return default

return min(max(threshold, 0.0), 1.0)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Normalize default before returning to honor the function contract.

When env parsing fails (or env is unset), an out-of-range default can currently bypass clipping even though the docstring promises a [0.0, 1.0] result.

Proposed patch
 def read_beta_lb_threshold(default: float = GRA_210_GRADUATION_THRESHOLD_DEFAULT) -> float:
@@
-    raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
+    if not math.isfinite(default):
+        default = GRA_210_GRADUATION_THRESHOLD_DEFAULT
+    default = min(max(default, 0.0), 1.0)
+
+    raw_value = os.environ.get(GRA_210_GRADUATION_THRESHOLD_ENV)
     if raw_value is None:
         return default
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/enhancements/self_improvement/_graduation_flags.py`
around lines 19 - 37, The function read_beta_lb_threshold can return an
out-of-range default without clipping; update it so every return path returns a
normalized value in [0.0, 1.0] by clipping the default before returning when
raw_value is None, parsing fails, or threshold is non-finite. Modify
read_beta_lb_threshold to compute a clipped_default = min(max(default, 0.0),
1.0) (or reuse the same min/max logic used for threshold) and return
clipped_default instead of default in the branches that currently return
default; keep the same behavior for successfully parsed finite thresholds.

assert first.get("observation_deduped") is not True

second = brain.correct(a2, b2, category="DRAFTING", session=2)
assert second.get("observation_deduped") is True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CI blocker: semantic dedup assertion is failing now.

Line 223 is red in both py3.11 and py3.12 (second.get("observation_deduped") is None). This test needs deterministic setup (or corrected expectation) before merge.

As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".

🧰 Tools
🪛 GitHub Actions: SDK CI / 0_pytest (py3.11).txt

[error] 223-223: Test failed in 'test_brain_correct_semantic_near_duplicate_is_deduped': expected second result to have observation_deduped == True, but second.get('observation_deduped') was None (AssertionError).

🪛 GitHub Actions: SDK CI / 1_pytest (py3.12).txt

[error] 223-223: Assertion failed in tests/test_dedup.py::test_brain_correct_semantic_near_duplicate_is_deduped. Expected second.get('observation_deduped') is True, but got None.

🪛 GitHub Actions: SDK CI / pytest (py3.11)

[error] 223-223: AssertionError: expected second.get("observation_deduped") is True, but value was None (deduplication near-duplicate not marked as deduped)

🪛 GitHub Actions: SDK CI / pytest (py3.12)

[error] 223-223: Assertion failed in test_brain_correct_semantic_near_duplicate_is_deduped: expected second.get('observation_deduped') is True, but got None.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/tests/test_dedup.py` at line 223, The failing assertion shows
second.get("observation_deduped") is None because the test doesn't run the
deterministic dedup step or seed any randomness before checking the flag; update
tests/test_dedup.py to ensure the code path that marks the observation runs
deterministically (call the semantic dedup function used in the codebase or
invoke the pipeline step that sets the "observation_deduped" field for the
`second` object) and/or seed/mock any randomness or external services used by
that dedup step so the flag is reliably set, then assert
second.get("observation_deduped") is True.

Comment on lines +333 to +343
def test_format_drafting_floor_passes_larger_ed_corrections(fresh_brain):
"""FORMAT/DRAFTING corrections with ed ≥ 0.07 must NOT be filtered."""
brain = fresh_brain
# Multi-word restructure in DRAFTING; edit distance should exceed 0.07.
draft = "Maybe we could perhaps consider thinking about simplifying this."
final = "Simplify this."
result = brain.correct(draft, final, category="DRAFTING", session=11)
assert result.get("low_signal_filtered") is not True, (
f"Expected substantial DRAFTING edit to pass; ed={result.get('edit_distance')}"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These threshold tests are brittle because they depend on live diff-engine behavior.

The assertions rely on expected edit-distance outcomes from natural-language rewrites without stubbing compute_diff, so small diff-engine changes can cause flaky failures.

As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".

Also applies to: 358-375

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/tests/test_dedup.py` around lines 333 - 343, The test
test_format_drafting_floor_passes_larger_ed_corrections is brittle because it
depends on the live diff engine; instead patch or monkeypatch the compute_diff
function used by brain.correct to return a deterministic diff result with
edit_distance >= 0.07 (and appropriate token counts) so the low_signal_filtered
path is exercised reliably. Locate where brain.correct calls compute_diff (or
the module function named compute_diff) and in the test use unittest.mock.patch
or pytest monkeypatch to supply a fixed dict like {"edit_distance": 0.08, ...}
before calling brain.correct, then assert result.get("low_signal_filtered") is
not True; repeat the same approach for the other similar test referenced in the
comment.

Comment on lines +31 to +55
def test_aggregate_counts_and_top_categories_deterministically():
rows = [
{"event": "correction.created", "category": "Tone"},
{"event": "correction.created", "category": "tone"},
{"event": "correction.created", "category": "factual"},
{"event": "correction.created", "category": " PROCESS "},
{"kind": "correction", "category": ""},
{"event": "lesson.graduated"},
{"event": "graduation.accepted"},
{"outcome": "accepted"},
{"event": "graduation.rejected"},
{"accepted": False},
]
data = snapshot.aggregate(rows)
assert data["total_corrections"] == 5
assert data["accepted_graduations"] == 3
assert data["rejection_count"] == 2
assert data["acceptance_rate"] == 0.6
assert data["top_rule_categories"] == [
{"category": "tone", "count": 2},
{"category": "factual", "count": 1},
{"category": "process", "count": 1},
{"category": "unknown", "count": 1},
]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a regression case for non-graduation status/outcome rows.

Current coverage doesn’t guard against counting unrelated events that happen to carry status="accepted" / outcome="rejected". A small test here will prevent silent metric drift.

As per coding guidelines, "Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/tests/test_weekly_correction_snapshot.py` around lines 31 - 55, Add a
regression test to ensure snapshot.aggregate ignores stray status/outcome fields
on non-graduation rows: in
test_aggregate_counts_and_top_categories_deterministically, append rows like
{"status": "accepted"} and {"outcome": "rejected"} (with no "event":
"graduation.*" or associated graduation row) and assert that
accepted_graduations and rejection_count remain the same as before; use the
existing snapshot.aggregate function and existing assertions for
total_corrections, acceptance_rate and top_rule_categories to confirm these
stray fields do not affect graduation metrics.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant