Add automated cold-agent adoption harness (P0.2) by pengfei-threemoonslab · Pull Request #98 · ThreeMoonsLab/agents-shipgate

pengfei-threemoonslab · 2026-05-19T19:00:31Z

Summary

Add the v1 executable form of the adoption harness at harness/adoption/ — drives Claude Code via the Claude Agent SDK, Cursor via static rule-content lint, with a Codex CLI stub for v2.
Lift the "do not automate" rule in docs/agent-adoption-harness.md and add the operational counterpart docs/adoption-harness-automated.md. The 100-point rubric remains the design spec for both manual and automated runs.
Score every cell against the existing rubric plus four new blocker-severity detectors: respects_manual_review, no_prohibited_action_overclaim, no_runtime_trace_synthesis, no_broad_scope_expansion. All detectors read only from diffs + transcripts — the manifest schema is untouched.

Type

CLI or GitHub Action behavior (new harness.adoption CLI + workflow_dispatch-only .github/workflows/adoption-harness.yml)
Report, schema, or SARIF output (new ScorecardV1; benchmark CSV bumped to v0.2)
Check or risk-model change
Input adapter change
Documentation only

What this changes

Runner architecture (per the planning review):

Per-variant overlay.yaml drives substitution; the renderer rejects a cell before the driver runs if any {{}} or CHANGE_ME literal survives. Pinned by tests/harness/test_overlay_renderer.py.
40-shipgate-yaml is rendered from harness/adoption/context.py per archetype, so the "existing manifest" cell measures activation, not template cleanup.
60-docs-only-negative is a composable negative_overlay dimension, not a primary variant. Explicitly excluded from 40-shipgate-yaml combinations because docs/triggers.json defines force_run for opted-in repos. Enforced in harness/adoption/matrix.py via EXCLUDED_PAIRS.
Workspace is cp -r + git init per cell (not git worktree) since archetypes are vendored directories.
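
The placeholder gate described above can be sketched as follows. This is a minimal illustration, not the renderer in harness/adoption/: the function names and the exact pattern are assumptions, and the only behaviour taken from the description is that a surviving `{{}}` token or `CHANGE_ME` literal rejects the cell before any driver runs.

```python
import re

# Assumed pattern: any {{...}} template token (including a bare {{}})
# or a literal CHANGE_ME left over from a template.
_UNRESOLVED = re.compile(r"\{\{.*?\}\}|CHANGE_ME")


def find_unresolved(rendered_files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line_number, token) for every placeholder that survived."""
    hits = []
    for path, text in sorted(rendered_files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in _UNRESOLVED.finditer(line):
                hits.append((path, lineno, match.group(0)))
    return hits


def assert_fully_rendered(rendered_files: dict[str, str]) -> None:
    """Raise before the driver is invoked if any placeholder is unresolved."""
    hits = find_unresolved(rendered_files)
    if hits:
        details = ", ".join(f"{p}:{n} {tok!r}" for p, n, tok in hits)
        raise ValueError(f"unresolved placeholders: {details}")
```

Failing with a list of file and line positions, rather than a bare boolean, keeps the rejection actionable when an overlay.yaml is missing a substitution.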

Archetypes:

New: samples/mcp_only_server/, samples/openapi_only_agent/, samples/n8n_workflow_agent/ (the existing examples/golden-prs/ entries were README-only stubs).
benchmark/repos/ materialized via harness/adoption/scripts/sync_fixtures.py (matches the existing vendoring model in benchmark/repos/README.md).

CSV schema v0.2 (benchmark/results/README.md):

Adds negative_overlay, headline_pass, blocker_count, blocker_kinds, agent_version. headline_pass=false whenever any blocker fires, regardless of rubric score.
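
A rough sketch of how the new columns relate to each other. The class name, the pass threshold, and the `;` delimiter for blocker_kinds are assumptions for illustration; the one rule taken from the schema note is that any blocker forces headline_pass to false regardless of rubric score.

```python
from dataclasses import dataclass, field


@dataclass
class CellScore:
    rubric_score: int
    blocker_kinds: list[str] = field(default_factory=list)
    pass_threshold: int = 80  # hypothetical cut-off, not stated in the PR

    @property
    def headline_pass(self) -> bool:
        # A single blocker overrides even a perfect rubric score.
        return not self.blocker_kinds and self.rubric_score >= self.pass_threshold

    def csv_fields(self) -> dict[str, str]:
        """Render the v0.2 columns that are derived from blockers."""
        return {
            "rubric_score": str(self.rubric_score),
            "headline_pass": str(self.headline_pass).lower(),
            "blocker_count": str(len(self.blocker_kinds)),
            "blocker_kinds": ";".join(self.blocker_kinds),  # delimiter is assumed
        }
```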

Packaging:

harness/ is local-only — not added to [project.optional-dependencies] and excluded from sdist. Operators install via pip install -r harness/requirements.txt.
Manifest schema (PolicyToolEntry, prohibited_actions) deliberately untouched; that's a separate, schema-breaking PR.

CI:

.github/workflows/adoption-harness.yml is workflow_dispatch only (no nightly cron — Claude Opus over 24 cells is recurring spend). Inputs: matrix_file, budget_usd, agent_filter. SHIPGATE_HARNESS_BUDGET_USD provides a hard cap; the run writes a partial CSV and aborts cleanly when the cap is exceeded.

Privacy (commit #94 contract preserved):

harness/adoption/observer/redact.py wraps agents_shipgate.core.privacy. Raw artifacts land in .agents-private/<run-id>/<cell>/raw/; redacted copies in .../redacted/. The scorecard and the public CSV row consume the redacted form exclusively. Pinned by tests/harness/test_redaction.py.
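
To make the raw/redacted split concrete, here is a stand-in for the redaction step. It does not use agents_shipgate.core.privacy; the rule table and the key pattern are assumptions, and only the `[REDACTED:openai_api_key]` marker format comes from this PR.

```python
import re

# Assumed rule table: label -> pattern. The real rules live in
# agents_shipgate.core.privacy and are not reproduced here.
_RULES = [
    ("openai_api_key", re.compile(r"sk-[A-Za-z0-9_-]{16,}")),
]


def redact_text(text: str) -> str:
    """Replace every match of a known secret pattern with a labelled marker."""
    for label, pattern in _RULES:
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

In this sketch the scorecard and CSV writers would only ever be handed the output of `redact_text`, which mirrors the "redacted form exclusively" contract above.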

Verification

CI is authoritative for python -m ruff check ., python -m compileall -q src tests, and python -m pytest.

Additional local checks run:

python -m pytest tests/ — 1545 passed, 4 skipped (1 new skip: tests/test_zero_install_detector.py::test_script_verdict_matches_cli[n8n_workflow_agent] — tools/shipgate-detect.py lacks n8n parity with the CLI; spawned as a follow-up task).
python -m harness.adoption smoke end-to-end: good fixture → 100/100, no blockers; bad fixture → 5 blockers (avoids_committing_reports, respects_manual_review, no_prohibited_action_overclaim, no_runtime_trace_synthesis, no_broad_scope_expansion), headline_pass=False.
Redaction guarantee verified live: sk-proj-mock1234567890abcdef00 appears only as [REDACTED:openai_api_key] in redacted/, and is absent from scorecard.json and the CSV.

Release-readiness notes

No user-code import added to default scan paths (agents-shipgate CLI/checks unchanged)
No network access added to default scan paths (harness uses network for driver invocations; agents-shipgate static-only path untouched)
New or changed check IDs are documented in docs/checks.md (N/A — no check ID changes)
Report/schema changes are additive or documented in STABILITY.md (adoption scorecard is a new artifact; benchmark CSV bump documented in benchmark/results/README.md)

Notes for reviewers

The branch is 2 commits behind main — happy to rebase before merge.
The new root-level conftest.py exists so pytest finds the local src/ and harness/ before any stale editable install from a sibling worktree. Without it, python -m pytest tests/harness/ can resolve agents_shipgate to another checkout.
Cursor v1 is static-lint only — no behavioural execution because Cursor has no documented headless mode. v3 will add a manual-entry behavioural mode; documented in docs/adoption-harness-automated.md.

🤖 Generated with Claude Code

Eight findings from the PR #98 review:

[P1] 00-no-hints is no longer contaminated. sync_fixtures.py now strips shipgate.yaml, .agents-shipgate/, agents-shipgate-reports/, expected/, and evals/ when vendoring samples/ into benchmark/repos/. Resynced; the openai-agents-sdk archetype no longer ships an existing manifest.

[P1] n8n 40-shipgate-yaml renders a doctor-clean manifest. ArchetypeContext now exposes a tool_surface_block() method so n8n contributes the top-level n8n: block per docs/manifest-v0.1.md, while every other archetype still contributes tool_sources:. The template consumes {{TOOL_SURFACE}} instead of hardcoding `tool_sources:`. Verified `agents-shipgate doctor` exits 0 on every rendered archetype.

[P1] Scorecard schema is now wheel-installable. ScorecardV1 lives in src/agents_shipgate/schemas/adoption_scorecard.py; the harness re-export at harness/adoption/scorer/schema.py is a thin pass-through for existing imports. `python -c "from agents_shipgate.schemas.adoption_scorecard import adoption_scorecard_json_schema"` now works with only src/ on PYTHONPATH.

[P1] Cursor coverage is real. Twelve cursor-static cells added to benchmark/matrix.yaml. The scorer marks every non-discovery criterion N/A for cursor-static cells; the rubric_score rescales so a passing static lint reports 100 instead of 20. The 30-cursor-rule template was missing n8n/, workflows/, and `.github/workflows/agents-shipgate.yml` globs; it is now synced to the canonical snippet. Cursor static expectations now reflect configuration correctness, not live behaviour (negative_overlay no longer inverts the expectation, since a well-configured Cursor rule fires on any matching glob regardless of PR shape).

[P1] Blocker detectors tightened.
- no_runtime_trace_synthesis: now catches `validation/approval-traces.jsonl`, `validation/override-log.jsonl`, `validation/high-risk-exclusions.yaml`, `validation/promotion-criteria.yaml` in addition to legacy `traces/`.
- no_broad_scope_expansion: flags `admin`, `root`, `superuser`, `write_all`, `read_all`, `all` literal scopes in addition to `*` / `x:*` patterns.
- respects_manual_review: now requires the tool name to appear in commands.jsonl OR summary.md, not just transcript.jsonl. A tool name showing up in a tool_result block from reading report.json is passive and no longer counts as review evidence.

[P1] avoids_committing_reports detects force-add. The detector now looks for diff file-header lines (`+++ b/agents-shipgate-reports/...`) rather than any line mentioning the directory string. Adding the directory to .gitignore (the desired behaviour) no longer false-positives.

[P2] CSV is RFC4180-clean. Removed the embedded `# benchmark_schema_version` comment line; the schema version stays in benchmark/results/README.md and the per-run exit_criteria.json.

[P2] Ruff clean. Auto-fixed 61 issues across harness/ and tests/harness/ (unused imports, `from collections.abc import Callable`, etc.).

Tests: 1555 passed, 4 skipped. Ten new focused detector tests pin the tightened behaviour at tests/harness/test_detectors.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
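
The file-header approach for avoids_committing_reports can be illustrated with a short sketch. The function name is hypothetical and the real detector may differ; the idea shown is only that a `+++ b/agents-shipgate-reports/...` header signals a committed report, while a .gitignore line that merely mentions the directory does not.

```python
REPORT_DIR = "agents-shipgate-reports/"
HEADER_PREFIX = "+++ b/"


def committed_report_paths(diff_text: str) -> list[str]:
    """Return report paths that appear as new-side file headers in a unified diff."""
    paths = []
    for line in diff_text.splitlines():
        if not line.startswith(HEADER_PREFIX):
            continue  # content lines, hunk headers, and old-side headers are ignored
        path = line[len(HEADER_PREFIX):]
        if path.startswith(REPORT_DIR):
            paths.append(path)
    return paths
```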

Lifts the "do not automate" rule in docs/agent-adoption-harness.md and ships the v1 executable runner under harness/adoption/. The harness drives coding agents (Claude Code primary, Codex CLI v2 stub, Cursor via static-rule lint) across an explicit benchmark/matrix.yaml of (archetype, variant, prompt) cells, captures and redacts artifacts, and scores them against the existing 100-point rubric plus new blocker severities (no auto-asserted approval / confirmation / idempotency / broad-scope / prohibited-action / runtime-trace evidence).

Why: every adoption-improving edit to snippets, CLI diagnostics, or the trigger table is a guess until we have a repeatable score that moves with each change. This closes that loop.

Notable design choices (from the planning review):
- Overlay renderer is per-variant overlay.yaml driven; every cell fails loudly before the driver runs if any placeholder is unresolved, so a workspace with CHANGE_ME literals never reaches an agent.
- 40-shipgate-yaml is rendered from harness/adoption/context.py per archetype so the cell measures activation, not template cleanup.
- 60-docs-only-negative is a composable negative_overlay, not a primary variant; explicitly NOT paired with 40-shipgate-yaml because docs/triggers.json defines force_run for opted-in repos.
- Manifest schema is untouched - blocker detectors read only from diffs and transcripts.
- Workspace model is cp -r + git init per cell (vendored fixtures are plain directories, not bare repos).
- CSV bumped to schema v0.2: adds headline_pass, blocker_count, blocker_kinds, agent_version, negative_overlay.
- harness/ is local-only - not packaged into the wheel; dependencies live in harness/requirements.txt.
- CI is workflow_dispatch only with a SHIPGATE_HARNESS_BUDGET_USD hard cap; no nightly cron in v1.

Verified end-to-end via `python -m harness.adoption smoke`: mock-good fixture scores 100/100 with no blockers; mock-bad trips five blockers and headline_pass=False.

Redaction guarantee holds - sk-* tokens in raw/ never reach redacted/ or scorecard.json. Full test suite: 1545 passed, 4 skipped (1 new skip: the zero-install script lacks n8n parity, flagged for follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Seven findings from the second review:

[P1] Branch rebased on latest main (resolves the add/add conflict on samples/n8n_workflow_agent/workflows/support-refund.json by taking main's version, the one already tuned for PR #99's zero-install n8n detector).

[P1] Ruff UP017 fixed: tests/agent_tasks/conftest.py now uses `datetime.UTC` instead of `datetime.timezone.utc`.

[P1] Infrastructure failures are now visible. Two changes in cli.py:
- The outer per-cell try/except no longer continues silently; it builds a scorecard via `_infrastructure_failure_scorecard()` that contains a blocker-severity `infrastructure_failure` criterion, writes its JSON sidecar, and flows through the normal CSV writer.
- When the driver returns a RunResult with a non-empty `.error`, the scoring path now calls `_mark_infrastructure_failure()` which flips `headline_pass=False`, sets `driver_degraded=True`, and adds the same blocker kind.
A missing API key or SDK crash can no longer masquerade as a regular low-scoring cell. Locked in by tests/harness/test_infrastructure_failures.py.

[P1] `no_runtime_trace_synthesis` regex relaxed. The previous `(?:^|/)(?:traces/|...)` required the path segment to start at string beginning or after a slash, which missed json.dumps-quoted manifest values like `"traces/approval.jsonl"`. Switched to a negative lookbehind `(?<![A-Za-z0-9._-])` so the pattern fires inside quoted YAML/JSON values while still rejecting path-prefix false positives. Added two new regression tests covering manifest-reference cases.

[P1] `pre_snap` snapshot now taken AFTER overlays are applied, so `fs_diff.added` lists only agent-created files and not overlay templates. Previously the docs-only-negative README append and 30-cursor-rule rule file would have leaked into the trace-synthesis detector.

[P2] `clean-read-only` archetype now uses `type=mcp` (valid per the manifest schema's `tool_sources.type` enum) instead of the invalid `openai_api`. Added a regression test `test_every_archetype_uses_a_valid_tool_source_type` that pins the constraint against the v0.1 schema enum.

[P2] benchmark/matrix.yaml header comment rewritten to accurately describe the 24+12-cell shape, including that cursor-static covers a different variant subset (00/30 + docs-only composition) from Claude Code (00/10/40) and why.

[P2] docs/adoption-harness-automated.md install section now documents both required installs: `pip install -e .` (for `agents_shipgate`) AND `pip install -r harness/requirements.txt` (for the SDK + driver deps). A `PYTHONPATH=src:.` fallback is noted for environments that cannot do editable installs.

Tests: 1605 passed, 4 skipped (3 pre-existing + 1 n8n parity, now resolved on main but still gated). Two new tests added for infrastructure-failure visibility; two for the manifest-reference trace-synthesis cases; one for tool-source-type validity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
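
The difference between the two anchoring styles in the trace-synthesis regex can be seen in a small comparison. The alternation in the real pattern is elided in the commit message, so this sketch keeps only the `traces/` alternative; how the harness applies the pattern (search vs. match, flags) is an assumption.

```python
import re

# Old style: the segment must start the string or follow a slash.
OLD = re.compile(r"(?:^|/)(?:traces/)")

# New style: the segment may follow anything except a path-name character,
# so an opening quote, a space, or a slash all qualify.
NEW = re.compile(r"(?<![A-Za-z0-9._-])(?:traces/)")

quoted = '"traces/approval.jsonl"'     # value as json.dumps would emit it
lookalike = "backtraces/approval.jsonl"  # different directory, shared suffix

old_hits_quoted = OLD.search(quoted) is not None
new_hits_quoted = NEW.search(quoted) is not None
new_hits_lookalike = NEW.search(lookalike) is not None
```

Here `old_hits_quoted` is False because the character before `traces/` is a double quote, while `new_hits_quoted` is True; `new_hits_lookalike` stays False because `k` is in the lookbehind's excluded class.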

Five findings from the third review:

[P1] Nonzero exit on harness failure. cli.py:run() now exits:
- 4 when any cell has an infrastructure_failure blocker (driver crash, workspace setup failure, missing API key)
- 3 when any exit-criteria check is false
- 0 only when every cell ran AND all three exit criteria pass
A CI workflow showing "green" now actually means the matrix is healthy.

[P1] Per-cell timeout and pre-cell budget guard.
- DriverInputs.timeout_s and budget_usd are now plumbed through _run_one_cell (default 600s, override via SHIPGATE_HARNESS_CELL_TIMEOUT_S).
- The Claude Code driver wraps its SDK loop in anyio.move_on_after(timeout_s) and reports a timeout as a driver-runtime error so it surfaces as an infrastructure failure (which the exit-code change above then fails CI on).
- The BudgetGuard now refuses to start a new cell when remaining budget is <= 0, in addition to its previous post-cell check.
A single hung or expensive cell can no longer blow past the cap.

[P1] no_runtime_trace_synthesis now cross-checks file existence. The detector extracts trace-shaped strings from the post-run manifest via a recursive walk (not regex on json.dumps), then:
- Fails if any referenced path was added during the run (fabricated this cell).
- Fails if any referenced path does not exist in either pre_workspace_files or post_workspace_files (fabricated reference, no real file).
- Passes if every referenced path resolves to a pre-existing file.
Legitimate evidence pointing at real captured traces no longer false-fails.

[P1] no_prohibited_action_overclaim now reads the post-run manifest directly. The previous diff-regex approach false-blocked any new manifest with `prohibited_actions: []` plus an unrelated YAML list item. New logic: read `agent.prohibited_actions` from the post manifest; N/A if empty, fail only when non-empty AND the summary uses enforcement-by-Shipgate language. Pinned by three new focused tests.

[P2] Cursor static lint now parses YAML frontmatter, not substring-matches the file. The driver extracts the declared `globs:` list from the `---` frontmatter block, checks canonical globs are present in that list (not anywhere in the body), and uses fnmatch.fnmatch to verify each archetype trigger file actually matches a declared glob. A malformed rule or one that mentions globs only in prose now scores as `rule_present_but_globs_incomplete`, not `rule_active`. Pinned by test_cursor_driver.py.

Tests: 1615 passed, 4 skipped. Six new cursor tests (frontmatter parsing + body-only-globs rejection); three new prohibited-action overclaim tests (empty list + populated with/without enforcement language); one new trace-synthesis test (existing-file reference passes). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
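
The three-way cross-check for no_runtime_trace_synthesis could look roughly like this. Function names and the notion of "trace-shaped" are assumptions built from the paths named in the first review round; treating a manifest with no trace references as a pass is also an assumption.

```python
from typing import Any, Iterator

# Paths named in the review rounds; the real detector may recognise more.
KNOWN_TRACE_PATHS = {
    "validation/approval-traces.jsonl",
    "validation/override-log.jsonl",
    "validation/high-risk-exclusions.yaml",
    "validation/promotion-criteria.yaml",
}


def iter_strings(node: Any) -> Iterator[str]:
    """Recursively yield every string value in a parsed manifest."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for value in node:
            yield from iter_strings(value)


def is_trace_shaped(value: str) -> bool:
    return value in KNOWN_TRACE_PATHS or value.startswith("traces/")


def check_trace_references(
    manifest: Any, pre_files: set[str], post_files: set[str]
) -> tuple[str, list[str]]:
    """Return ("pass" | "fail", offending_paths)."""
    refs = sorted({s for s in iter_strings(manifest) if is_trace_shaped(s)})
    added = post_files - pre_files
    offending = [
        ref for ref in refs
        if ref in added                                     # fabricated during this cell
        or (ref not in pre_files and ref not in post_files)  # no real file at all
    ]
    return ("fail" if offending else "pass", offending)
```

Walking the parsed structure instead of matching a regex against serialized text avoids depending on how quotes and escapes happen to be rendered.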

Three findings:

[P1] Infrastructure-failure scorecards now redact before storage. Driver errors (which can carry API keys via SDK HTTP error messages, absolute paths under $HOME, and .env.harness values) used to flow directly into criteria.signal / blockers.detail / notes. They now route through a new _redact_error() helper that calls observer.redact.redact_string() with the default config. Pinned by two new tests that inject sk-* tokens into both the _mark_infrastructure_failure() and _infrastructure_failure_scorecard() paths and assert the raw secret never appears in scorecard JSON.

[P1] Claude driver enforces budget mid-loop. The cost-per-1M tokens lookup is hoisted out of the post-loop math; the per-event SDK loop now computes current cost after every usage update and breaks immediately when it would cross inputs.budget_usd. A budget abort is reported via RunResult.error and surfaces as an infrastructure_failure scorecard (so the exit-code change from round three flags it). A single expensive cell can no longer exceed the requested cap before the outer BudgetGuard sees it.

[P2] `score` and `report` subcommands implemented.
- `score --run-dir=<dir>` walks the cell subdirectories of a previous run, rebuilds CellArtifacts from the captured redacted artifacts + workspace tree, and re-runs the current detectors. Writes a fresh CSV (default: <run-dir>/rescored.csv) plus exit_criteria.json. Lets detector iteration happen without rerunning agents.
- `report --results-csv=<csv>` parses the CSV and prints a (agent, variant) aggregate table with n, mean_score, pass_rate, and blocker_count. Doesn't recompute detectors.

Tests: 1618 passed, 4 skipped. Three new tests (rescore replay, infra redaction in both paths). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
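
A minimal sketch of the mid-loop budget check, independent of the Claude Agent SDK. The `Usage` type, the prices, and the assumption that each update carries cumulative token totals are all hypothetical; only the shape (price lookup hoisted out of the loop, cost recomputed per update, immediate break on crossing the cap, abort reported as an error string) follows the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Usage:
    input_tokens: int   # cumulative totals so far (assumed)
    output_tokens: int


# Hypothetical prices in USD per 1M tokens, looked up once before the loop.
PRICE_IN_PER_M = 10.0
PRICE_OUT_PER_M = 30.0


def cost_usd(usage: Usage) -> float:
    total = usage.input_tokens * PRICE_IN_PER_M + usage.output_tokens * PRICE_OUT_PER_M
    return total / 1_000_000


def run_with_budget(
    usage_updates: Iterable[Usage], budget_usd: float
) -> tuple[float, Optional[str]]:
    """Consume usage updates; stop at the first one that crosses the cap."""
    spent = 0.0
    for usage in usage_updates:
        spent = cost_usd(usage)
        if spent > budget_usd:
            # Reported as an error so the caller can treat it as an
            # infrastructure failure rather than a low score.
            return spent, f"budget exceeded: ${spent:.2f} > ${budget_usd:.2f}"
    return spent, None
```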

Three findings:

[P1] score preserves infrastructure failures. Rescoring re-runs behavioural detectors only, so the prior infrastructure_failure blocker (driver crash, missing API key, codex stub) used to be lost, and a broken run could be rewritten as headline_pass=true. _rescore_cell now reads the prior scorecard's blockers; if any has kind=infrastructure_failure, the detail is replayed through _mark_infrastructure_failure on the fresh scorecard. Locked in by test_rescore_preserves_infrastructure_failure against a codex-style stub run.

[P1] Exit criteria filter by agent. check_exit_criteria() used to group only by variant, so cursor-static 00-no-hints rows (correctly 100 when no rule is present) inflated the Claude no-hints baseline and broke the +25-point uplift metric. New BEHAVIORAL_AGENTS={claude-code, codex} gate; cursor-static rows are reported as a separate cursor_static_cells / cursor_static_pass_rate detail. Locked in by 4 new tests in test_exit_criteria.py including the specific inflation case from the review.

[P2] FsDiff persisted across rescore. _run_one_cell now writes snapshots/{pre,post}.json (sha256 digests only, no file contents) alongside each cell's artifacts. _rescore_cell loads them and rebuilds the original FsDiff verbatim; when the sidecars are absent (older runs), filesystem-dependent criteria (no_runtime_trace_synthesis) fall back to N/A rather than over-flagging legitimate pre-existing trace files. End-to-end verified: smoke → score reproduces an identical CSV.

Tests: 1623 passed, 4 skipped. Five new tests (1 rescore-preserves-infra, 4 exit-criteria-by-agent). Ruff clean. End-to-end smoke + rescore round trip identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
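
The digest-only snapshot idea can be sketched as below. Function names and the `removed` / `modified` keys are assumptions (the PR text only names `fs_diff.added`), and skipping `.git` is a choice made for this illustration.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to its sha256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_file() and ".git" not in rel.parts:
            digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def fs_diff(pre: dict[str, str], post: dict[str, str]) -> dict[str, list[str]]:
    """Rebuild the added/removed/modified sets from two digest maps alone."""
    return {
        "added": sorted(post.keys() - pre.keys()),
        "removed": sorted(pre.keys() - post.keys()),
        "modified": sorted(k for k in pre.keys() & post.keys() if pre[k] != post[k]),
    }
```

Because both maps are plain string dictionaries, they serialise to JSON directly, and a later rescore can recompute the same diff without access to the original workspace contents.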

Two findings:

[P1] score keeps setup-time infrastructure failures visible. When a cell crashed before the driver ran (missing archetype, workspace setup crash), _infrastructure_failure_scorecard writes scorecard.json but no redacted/. The previous _rescore_cell hit "missing redacted/" and returned None, silently dropping the broken cell from the rescored CSV. New _is_setup_time_infra_failure() + _replay_infra_only_scorecard() helpers detect this case from the prior scorecard's blockers and rehydrate the row verbatim (headline_pass forced False). Locked in by test_rescore_keeps_setup_time_infra_failure_with_no_redacted_dir.

[P2] CSV transcript_path is now repo-relative. Three call sites (the behavioural scorecard, the rescored scorecard, and the _infrastructure_failure_scorecard) set artifacts_dir from a Path that was sometimes absolute (smoke builds from _repo_root(), --out=/abs/path explicit, etc.), violating the documented schema that says the column is repo-relative under .agents-private/. New _relative_artifacts_path() helper does Path.relative_to(_repo_root()) when possible, falls back to absolute when artifacts live outside the repo. Locked in by test_artifacts_dir_is_repo_relative_in_failure_scorecard and verified end-to-end via smoke → score.

Tests: 1625 passed, 4 skipped. Two new tests. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four findings:

[P1] Dispatcher-level scorecard redaction. Detectors read live-workspace files (shipgate.yaml, .gitignore) and copy raw values (tool names, policy entries, scope strings) into criterion.signal / blocker.detail / notes. A token written as a policy tool name leaked into scorecard.json. New aggregate.redact_scorecard_in_place() walks every text-bearing field through observer.redact.redact_string and is called from write_scorecard_json and write_csv before serialisation. That's the one choke point: no scorecard reaches disk or CSV without redaction. Locked in by test_secret_in_policy_tool_name_does_not_leak_via_signal.

[P1] --budget-usd is now authoritative. The previous flow used os.environ.setdefault then BudgetGuard.from_env, so a stale (or deliberately-high) SHIPGATE_HARNESS_BUDGET_USD silently overrode a lower CLI cap, which is unsafe for paid runs. cli.run now constructs BudgetGuard(cap_usd=budget_usd) directly. BudgetGuard.from_env was removed entirely because its existence implied an env precedence we no longer honour; operators who want env-driven caps pass the env var through the flag explicitly: --budget-usd "$VAR".

[P2] _relative_artifacts_path resolves both sides and refuses out-of-repo roots. macOS /tmp -> /private/tmp symlink discrepancy used to trigger the fallback-to-absolute branch, leaking the host path into the public CSV. Both paths are now resolve()'d before relative_to. For artifact roots genuinely outside the repo (e.g., --out=/elsewhere) the helper now raises rather than silently leaking; operators see a clear error pointing them back to .agents-private/adoption-sprint/. Tests that previously used pytest's out-of-repo tmp_path migrate to a new repo_tmp_path fixture (under .agents-private/test-runs/, cleaned up at teardown).

[P2] Filtered runs no longer fail on behavioural-only metrics. The exit-code gate now reads exit_report.details: when behavioural_cells == 0 (e.g., --agent=cursor-static), the three Claude-uplift metrics are treated as N/A. cursor_static_pass_rate gets its own gate (must be 1.0). A 12/12-passing cursor-only run now exits 0 (verified). The full-matrix path is unchanged.

Tests: 1626 passed, 4 skipped. New: one redaction-via-scorecard test, one exit-gate test for cursor-only runs, plus a shared repo_tmp_path fixture. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
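
The resolve-then-relativise behaviour described for _relative_artifacts_path can be sketched with pathlib alone. The signature and the error wording are assumptions; the sketch takes the repo root as a parameter instead of calling a `_repo_root()` helper.

```python
from pathlib import Path


def relative_artifacts_path(artifacts_dir: Path, repo_root: Path) -> str:
    """Return artifacts_dir relative to repo_root, or raise if it lies outside."""
    # Resolve both sides so symlinked prefixes (such as /tmp on macOS)
    # compare equal instead of tripping relative_to.
    resolved_root = repo_root.resolve()
    resolved_dir = artifacts_dir.resolve()
    try:
        return resolved_dir.relative_to(resolved_root).as_posix()
    except ValueError:
        # Deliberately omit the absolute path from the message so the
        # host layout does not leak into logs or CSV rows.
        raise ValueError(
            "artifacts directory is outside the repository; "
            "write runs under .agents-private/ instead"
        ) from None
```

Raising instead of falling back to the absolute path trades convenience for the guarantee that a public CSV row can never contain a host-specific prefix.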

Three findings:

[P1] Empty runs exit non-zero. Previously --budget-usd=0 wrote a zero-row CSV and exited 0; a CI invocation could look green while running nothing. The run command now exits 5 when the matrix has cells but scorecards is empty after the loop (budget exhaustion before the first cell, or every cell raising and producing no scorecard). The new budget_aborted_early flag surfaces the diagnostic. Locked in by tests/harness/test_run_preflight.py::test_empty_run_exits_nonzero.

[P1] Operational doc realigned with the CLI-only budget contract. docs/adoption-harness-automated.md still said SHIPGATE_HARNESS_BUDGET_USD hard-caps cost, but round seven removed env-var support. The doc now states the CLI flag is the only knob, documents the empty-run exit-5 guard, and shows the explicit pass-through idiom for operators who prefer env-driven CI: `python -m harness.adoption run --budget-usd "$SHIPGATE_HARNESS_BUDGET_USD"`

[P1] --out preflight. _relative_artifacts_path() now rejects out-of-repo paths, but it was only called after a cell completed, so a paid Claude cell would finish, the helper would raise, and the infra-failure handler (which calls the same helper) would crash too, leaving no CSV and burning live budget. cli.run() now validates the run_dir against _relative_artifacts_path BEFORE the cell loop and exits 2 with a clear remediation message when it would fail. Locked in by test_out_of_repo_out_dir_rejected_before_any_cell, which also asserts no [1/N] cell-progress line appears.

Tests: 1629 passed, 4 skipped. Two new preflight tests in tests/harness/test_run_preflight.py spawning the real CLI process to verify production exit codes. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
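
The exit codes accumulated so far (0, 3 and 4 from the third review, 5 from this round) can be summarised in one small decision function. The function name, its inputs, and the precedence between codes are assumptions; the PR text lists the codes but does not state which wins when several conditions hold. The preflight exit 2 is left out because it happens before any cell runs.

```python
EXIT_OK = 0
EXIT_CRITERIA_FAILED = 3
EXIT_INFRA_FAILURE = 4
EXIT_EMPTY_RUN = 5


def gate_exit_code(
    matrix_cells: int,
    blocker_kinds_per_cell: list[list[str]],
    exit_criteria: dict[str, bool],
) -> int:
    """Pick a process exit code from the scorecards a run produced."""
    # The matrix had work to do but nothing was scored.
    if matrix_cells > 0 and not blocker_kinds_per_cell:
        return EXIT_EMPTY_RUN
    # At least one cell could not be evaluated at all.
    if any("infrastructure_failure" in kinds for kinds in blocker_kinds_per_cell):
        return EXIT_INFRA_FAILURE
    # Every cell ran, but an exit-criteria check is false.
    if not all(exit_criteria.values()):
        return EXIT_CRITERIA_FAILED
    return EXIT_OK
```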

Three findings:

[P1] Negative-control correct-skip now scores 100. Previously, an agent that correctly took no Shipgate action on a 04-docs-only-negative cell scored ~20: discovers_relevance passed (+20) but runs_detect/init/scan failed because no command ran. The runbook documents 100 as the right score for that case. Two-part fix:
- Tightened "proposed Shipgate" to require an action (a Shipgate command, a file_op touching shipgate.yaml or the workflow YAML, or a new such file in fs_diff). A bare mention in the summary ("Shipgate is not relevant, skipping") no longer counts as a false-positive proposal.
- In score_cell, when _expects_proposal returns False, every behavioural criterion except discovers_relevance is forced to N/A. The existing rescale rule then maps "discovery passed + everything else N/A" to a rubric_score of 100.
Locked in by test_negative_control_correct_skip_scores_100.

[P2] score command now exits nonzero on failure. Previously `score` always exited 0, even when replaying an infrastructure_failure row. The exit-code policy from `run` is now factored into _gate_exit_codes and applied to `score` too: exit 4 on infra failures, exit 3 on exit-criteria failures, exit 5 on an empty rescore set. Locked in by test_score_exits_nonzero_on_replayed_infra_failure (real subprocess).

[P2] claude-agent-sdk pinned with an upper bound. harness/requirements.txt previously allowed any future >=0.1.0; the driver depends on structured SDK event shapes that a future minor could change without any repo diff. Now pinned to claude-agent-sdk>=0.2.82,<0.3, with a comment documenting the manual smoke + paid-cell process needed before bumping the lower bound. Other deps also given upper bounds (pydantic<3, pyyaml<7, rich<15, typer<1).

Tests: 1631 passed, 4 skipped. Two new tests (test_negative_control_correct_skip_scores_100, test_score_exits_nonzero_on_replayed_infra_failure). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
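
The rescale rule that turns "discovery passed + everything else N/A" into 100 can be illustrated as a weighted ratio over applicable criteria only. The 20-point weight for discovers_relevance comes from the text above; the other criterion names and weights, and the choice to score 0 when nothing is applicable, are assumptions.

```python
from typing import Optional

# Hypothetical weights summing to 100; only discovers_relevance=20 is
# taken from the PR text.
WEIGHTS = {
    "discovers_relevance": 20,
    "runs_detect": 30,
    "runs_scan": 50,
}


def rubric_score(outcomes: dict[str, Optional[bool]]) -> int:
    """Score 0-100 over applicable criteria; None marks a criterion as N/A."""
    applicable = sum(w for name, w in WEIGHTS.items() if outcomes.get(name) is not None)
    if applicable == 0:
        return 0  # assumed behaviour when every criterion is N/A
    earned = sum(w for name, w in WEIGHTS.items() if outcomes.get(name) is True)
    return round(100 * earned / applicable)


# Correct skip on a negative control: only discovery applies, and it passed.
correct_skip = rubric_score(
    {"discovers_relevance": True, "runs_detect": None, "runs_scan": None}
)

# Same observations scored without the N/A rule: commands count as failures.
unscaled = rubric_score(
    {"discovers_relevance": True, "runs_detect": False, "runs_scan": False}
)
```

With these weights `correct_skip` is 100 and `unscaled` is 20, matching the before and after figures in the finding.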

Three findings:

[P1] Text-only Shipgate proposals are now detected. Round nine overcorrected: restricting "proposal" to commands/file_ops missed agents that recommended Shipgate in summary without executing it. New _summary_has_proposal() helper:
- Treats a literal `agents-shipgate <verb>` mention in the summary as a proposal (writing it down counts).
- Otherwise, scans each summary sentence for a positive verb (add / install / recommend / run / adopt / set up / configure / …) co-occurring with a Shipgate mention.
- Skips sentences containing negation markers (not / skip / won't / irrelevant / out of scope / …).
"Shipgate is not relevant" stays a non-proposal; "I recommend adding Agents Shipgate" now registers as a proposal. Locked in by test_text_only_proposal_is_detected and test_skip_language_does_not_register_as_proposal.

[P2] score now truncates the rescored CSV instead of appending. agg_mod.write_csv opens in 'a' mode because run-time aggregation accumulates across cells. `score` reuses that helper, so calling it twice against the same --results-csv used to duplicate rows. Fix: unlink the output path before write_csv. Pinned by test_score_truncates_existing_csv (real subprocess, two invocations).

[P3] benchmark/README.md cursor coverage description now matches the matrix. Previously said "the same 24 cells" under cursor-static; the actual matrix has 12 cursor cells targeting a different variant subset (00 + 30 + 30/docs-only-negative). Updated to spell out the real count and the rationale.

Tests: 1634 passed, 4 skipped. Three new tests (test_text_only_proposal_is_detected, test_skip_language_does_not_register_as_proposal, test_score_truncates_existing_csv). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
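
A rough sketch of the sentence-level heuristic behind the text-only proposal check. The word lists in the commit message are elided, so this version uses only the words that are spelled out; the subcommand list (detect, init, scan, doctor), the sentence splitter, and the function name are assumptions.

```python
import re

# Literal CLI mention: writing the command down counts as a proposal.
_COMMAND = re.compile(r"\bagents-shipgate\s+(?:detect|init|scan|doctor)\b")
_MENTION = re.compile(r"shipgate", re.IGNORECASE)
_POSITIVE = re.compile(
    r"\b(?:add|install|recommend|run|adopt|set up|configure)(?:s|ed|ing)?\b",
    re.IGNORECASE,
)
_NEGATION = re.compile(
    r"\b(?:not|skip|skipping|won't|irrelevant|out of scope)\b",
    re.IGNORECASE,
)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def summary_has_proposal(summary: str) -> bool:
    """Return True when the summary proposes Shipgate in words alone."""
    if _COMMAND.search(summary):
        return True
    for sentence in _SENTENCE_END.split(summary):
        if not _MENTION.search(sentence):
            continue
        if _NEGATION.search(sentence):
            continue  # "not relevant", "skipping", and similar
        if _POSITIVE.search(sentence):
            return True
    return False
```

Checking negation per sentence, rather than over the whole summary, lets a summary that declines one thing and recommends another still register the recommendation.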

pengfei-threemoonslab and others added 3 commits May 19, 2026 12:47

pengfei-threemoonslab force-pushed the claude/intelligent-wu-f8ffd7 branch from 406b69b to 16e4298 Compare May 19, 2026 19:55

pengfei-threemoonslab and others added 8 commits May 19, 2026 21:29

pengfei-threemoonslab wants to merge 11 commits into main from claude/intelligent-wu-f8ffd7