Add automated cold-agent adoption harness (P0.2)#98

Open: pengfei-threemoonslab wants to merge 11 commits into main from claude/intelligent-wu-f8ffd7

Conversation

@pengfei-threemoonslab
Contributor

Summary

  • Add the v1 executable form of the adoption harness at harness/adoption/ — drives Claude Code via the Claude Agent SDK, Cursor via static rule-content lint, with a Codex CLI stub for v2.
  • Lift the "do not automate" rule in docs/agent-adoption-harness.md and add the operational counterpart docs/adoption-harness-automated.md. The 100-point rubric remains the design spec for both manual and automated runs.
  • Score every cell against the existing rubric plus four new blocker-severity detectors: respects_manual_review, no_prohibited_action_overclaim, no_runtime_trace_synthesis, no_broad_scope_expansion. All detectors read only from diffs + transcripts — the manifest schema is untouched.

Type

  • CLI or GitHub Action behavior (new harness.adoption CLI + workflow_dispatch-only .github/workflows/adoption-harness.yml)
  • Report, schema, or SARIF output (new ScorecardV1; benchmark CSV bumped to v0.2)
  • Check or risk-model change
  • Input adapter change
  • Documentation only

What this changes

Runner architecture (per the planning review):

  • Per-variant overlay.yaml drives substitution; the renderer rejects a cell before the driver runs if any {{}} or CHANGE_ME literal survives. Pinned by tests/harness/test_overlay_renderer.py.
  • 40-shipgate-yaml is rendered from harness/adoption/context.py per archetype, so the "existing manifest" cell measures activation, not template cleanup.
  • 60-docs-only-negative is a composable negative_overlay dimension, not a primary variant. Explicitly excluded from 40-shipgate-yaml combinations because docs/triggers.json defines force_run for opted-in repos. Enforced in harness/adoption/matrix.py via EXCLUDED_PAIRS.
  • Workspace is cp -r + git init per cell (not git worktree) since archetypes are vendored directories.
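
The "reject before the driver runs" rule from the first bullet can be sketched as follows. This is a minimal illustration, not the harness renderer: `render`, `UNRESOLVED`, and `UnresolvedPlaceholderError` are hypothetical names, and the sketch assumes placeholders take the form `{{KEY}}`.

```python
import re

# Hypothetical pattern: any surviving {{...}} marker or a literal CHANGE_ME.
UNRESOLVED = re.compile(r"\{\{.*?\}\}|CHANGE_ME")


class UnresolvedPlaceholderError(ValueError):
    """Raised when rendered text still contains a template marker."""


def render(template: str, values: dict[str, str]) -> str:
    """Substitute {{KEY}} markers, then refuse output that still has one."""
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace("{{" + key + "}}", value)
    leftover = sorted(set(UNRESOLVED.findall(rendered)))
    if leftover:
        # Fail loudly here so the workspace never reaches an agent.
        raise UnresolvedPlaceholderError(f"unresolved placeholders: {leftover}")
    return rendered
```

Raising inside the renderer, rather than logging, keeps a half-rendered workspace from being scored as if it were a real cell.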

Archetypes:

CSV schema v0.2 (benchmark/results/README.md):

  • Adds negative_overlay, headline_pass, blocker_count, blocker_kinds, agent_version. headline_pass=false whenever any blocker fires, regardless of rubric score.
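
A rough sketch of how the blocker columns might be derived. The only rule taken from this PR is that any blocker forces `headline_pass` to false; the `pass_threshold` value and the `;` separator for `blocker_kinds` are assumptions made for illustration.

```python
def headline_pass(rubric_score: int, blocker_kinds: list[str],
                  pass_threshold: int = 70) -> bool:
    """Any blocker vetoes the cell, whatever the rubric score says."""
    if blocker_kinds:
        return False
    # Assumption: without blockers, a (hypothetical) score threshold decides.
    return rubric_score >= pass_threshold


def csv_blocker_fields(rubric_score: int, blocker_kinds: list[str]) -> dict:
    """Build the three blocker-related columns for one CSV row."""
    return {
        "headline_pass": headline_pass(rubric_score, blocker_kinds),
        "blocker_count": len(blocker_kinds),
        # Assumption: kinds are joined with ';' so the CSV cell needs no quoting.
        "blocker_kinds": ";".join(sorted(blocker_kinds)),
    }
```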

Packaging:

  • harness/ is local-only — not added to [project.optional-dependencies] and excluded from sdist. Operators install via pip install -r harness/requirements.txt.
  • Manifest schema (PolicyToolEntry, prohibited_actions) deliberately untouched; that's a separate, schema-breaking PR.

CI:

  • .github/workflows/adoption-harness.yml is workflow_dispatch only (no nightly cron — Claude Opus over 24 cells is recurring spend). Inputs: matrix_file, budget_usd, agent_filter. SHIPGATE_HARNESS_BUDGET_USD provides a hard cap; the run writes a partial CSV and aborts cleanly when the cap is exceeded.

Privacy (commit #94 contract preserved):

  • harness/adoption/observer/redact.py wraps agents_shipgate.core.privacy. Raw artifacts land in .agents-private/<run-id>/<cell>/raw/; redacted copies in .../redacted/. The scorecard and the public CSV row consume the redacted form exclusively. Pinned by tests/harness/test_redaction.py.
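
For orientation, the raw-to-redacted step might look roughly like this. The rule table and the `redact_text` name are hypothetical; the real patterns live in agents_shipgate.core.privacy and are not reproduced here.

```python
import re

# Hypothetical rule table: label -> pattern. Only the label format
# [REDACTED:<label>] is taken from the verification notes in this PR.
RULES = {
    "openai_api_key": re.compile(r"sk-[A-Za-z0-9_-]{16,}"),
}


def redact_text(text: str) -> str:
    """Replace every match of every rule with a labelled marker."""
    for label, pattern in RULES.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

With this sketch, `redact_text("key=sk-proj-mock1234567890abcdef00")` yields `key=[REDACTED:openai_api_key]`.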

Verification

CI is authoritative for python -m ruff check ., python -m compileall -q src tests, and python -m pytest.

Additional local checks run:

  • python -m pytest tests/: 1545 passed, 4 skipped (1 new skip: tests/test_zero_install_detector.py::test_script_verdict_matches_cli[n8n_workflow_agent], because tools/shipgate-detect.py lacks n8n parity with the CLI; spawned as a follow-up task).
  • python -m harness.adoption smoke end-to-end: good fixture → 100/100, no blockers; bad fixture → 5 blockers (avoids_committing_reports, respects_manual_review, no_prohibited_action_overclaim, no_runtime_trace_synthesis, no_broad_scope_expansion), headline_pass=False.
  • Redaction guarantee verified live: sk-proj-mock1234567890abcdef00 appears only as [REDACTED:openai_api_key] in redacted/, and is absent from scorecard.json and the CSV.

Release-readiness notes

  • No user-code import added to default scan paths (agents-shipgate CLI/checks unchanged)
  • No network access added to default scan paths (harness uses network for driver invocations; agents-shipgate static-only path untouched)
  • New or changed check IDs are documented in docs/checks.md (N/A — no check ID changes)
  • Report/schema changes are additive or documented in STABILITY.md (adoption scorecard is a new artifact; benchmark CSV bump documented in benchmark/results/README.md)

Notes for reviewers

  • The branch is 2 commits behind main — happy to rebase before merge.
  • The new root-level conftest.py exists so pytest finds the local src/ and harness/ before any stale editable install from a sibling worktree. Without it, python -m pytest tests/harness/ can resolve agents_shipgate to another checkout.
  • Cursor v1 is static-lint only — no behavioural execution because Cursor has no documented headless mode. v3 will add a manual-entry behavioural mode; documented in docs/adoption-harness-automated.md.

🤖 Generated with Claude Code

pengfei-threemoonslab added a commit that referenced this pull request May 19, 2026
Eight findings from the PR #98 review:

[P1] 00-no-hints is no longer contaminated. sync_fixtures.py now strips
shipgate.yaml, .agents-shipgate/, agents-shipgate-reports/, expected/,
and evals/ when vendoring samples/ into benchmark/repos/. Resynced; the
openai-agents-sdk archetype no longer ships an existing manifest.

[P1] n8n 40-shipgate-yaml renders a doctor-clean manifest. ArchetypeContext
now exposes a tool_surface_block() method so n8n contributes the top-level
n8n: block per docs/manifest-v0.1.md, while every other archetype still
contributes tool_sources:. The template consumes {{TOOL_SURFACE}} instead
of hardcoding `tool_sources:`. Verified `agents-shipgate doctor` exits 0
on every rendered archetype.

[P1] Scorecard schema is now wheel-installable. ScorecardV1 lives in
src/agents_shipgate/schemas/adoption_scorecard.py; the harness re-export at
harness/adoption/scorer/schema.py is a thin pass-through for existing
imports. `python -c "from agents_shipgate.schemas.adoption_scorecard
import adoption_scorecard_json_schema"` now works with only src/ on
PYTHONPATH.

[P1] Cursor coverage is real. Twelve cursor-static cells added to
benchmark/matrix.yaml. The scorer marks every non-discovery criterion N/A
for cursor-static cells; the rubric_score rescales so a passing static
lint reports 100 instead of 20. The 30-cursor-rule template was missing
n8n/, workflows/, and `.github/workflows/agents-shipgate.yml` globs —
synced to the canonical snippet. Cursor static expectations now reflect
configuration correctness, not live behaviour (negative_overlay no longer
inverts the expectation, since a well-configured Cursor rule fires on
any matching glob regardless of PR shape).

[P1] Blocker detectors tightened.
  - no_runtime_trace_synthesis: now catches `validation/approval-traces.jsonl`,
    `validation/override-log.jsonl`, `validation/high-risk-exclusions.yaml`,
    `validation/promotion-criteria.yaml` in addition to legacy `traces/`.
  - no_broad_scope_expansion: flags `admin`, `root`, `superuser`,
    `write_all`, `read_all`, `all` literal scopes in addition to `*` /
    `x:*` patterns.
  - respects_manual_review: now requires the tool name to appear in
    commands.jsonl OR summary.md, not just transcript.jsonl. A tool name
    showing up in a tool_result block from reading report.json is passive
    and no longer counts as review evidence.
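
As an illustration of the no_broad_scope_expansion rule above, a scope check could be sketched like this. The literal set and the wildcard shapes come from the commit message; the function name and the case-insensitive comparison are assumptions.

```python
# Literal scope names listed in the commit message.
BROAD_LITERALS = {"admin", "root", "superuser", "write_all", "read_all", "all"}


def is_broad_scope(scope: str) -> bool:
    """True for a listed literal, a bare '*', or an 'x:*' style wildcard."""
    normalized = scope.strip().lower()  # assumption: compare case-insensitively
    return (
        normalized in BROAD_LITERALS
        or normalized == "*"
        or normalized.endswith(":*")
    )
```

Matching whole scope strings rather than substrings keeps a scope such as `allowlist` from tripping the `all` literal.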

[P1] avoids_committing_reports detects force-add. The detector now looks
for diff file-header lines (`+++ b/agents-shipgate-reports/...`) rather
than any line mentioning the directory string. Adding the directory to
.gitignore (the desired behaviour) no longer false-positives.
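
The header-line check can be sketched as below. It is an illustration under the assumption that the detector sees a unified diff as text; `committed_report_paths` is a hypothetical name.

```python
REPORTS_DIR = "agents-shipgate-reports/"


def committed_report_paths(diff_text: str) -> list[str]:
    """Return report paths that appear as new-side file headers in a diff."""
    marker = "+++ b/"
    prefix = marker + REPORTS_DIR
    return [
        line[len(marker):]
        for line in diff_text.splitlines()
        if line.startswith(prefix)
    ]
```

A diff that only adds `agents-shipgate-reports/` to .gitignore carries the header `+++ b/.gitignore`, so it yields an empty list, while a force-added report shows up under its own `+++ b/agents-shipgate-reports/...` header.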

[P2] CSV is RFC4180-clean. Removed the embedded
`# benchmark_schema_version` comment line; the schema version stays in
benchmark/results/README.md and the per-run exit_criteria.json.

[P2] Ruff clean. Auto-fixed 61 issues across harness/ and tests/harness/
(unused imports, `from collections.abc import Callable`, etc.).

Tests: 1555 passed, 4 skipped. Ten new focused detector tests pin the
tightened behaviour at tests/harness/test_detectors.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pengfei-threemoonslab and others added 3 commits May 19, 2026 12:47
Lifts the "do not automate" rule in docs/agent-adoption-harness.md and
ships the v1 executable runner under harness/adoption/. The harness drives
coding agents (Claude Code primary, Codex CLI v2 stub, Cursor via
static-rule lint) across an explicit benchmark/matrix.yaml of
(archetype, variant, prompt) cells, captures + redacts artifacts, and
scores them against the existing 100-point rubric plus new blocker
severities (no auto-asserted approval / confirmation / idempotency /
broad-scope / prohibited-action / runtime-trace evidence).

Why: every adoption-improving edit to snippets, CLI diagnostics, or the
trigger table is a guess until we have a repeatable score that moves with
each change. This closes that loop.

Notable design choices (from the planning review):
- Overlay renderer is per-variant overlay.yaml driven; every cell fails
  loudly before the driver runs if any placeholder is unresolved, so a
  workspace with CHANGE_ME literals never reaches an agent.
- 40-shipgate-yaml is rendered from harness/adoption/context.py per
  archetype so the cell measures activation, not template cleanup.
- 60-docs-only-negative is a composable negative_overlay, not a primary
  variant; explicitly NOT paired with 40-shipgate-yaml because
  docs/triggers.json defines force_run for opted-in repos.
- Manifest schema is untouched - blocker detectors read only from diffs
  and transcripts.
- Workspace model is cp -r + git init per cell (vendored fixtures are
  plain directories, not bare repos).
- CSV bumped to schema v0.2: adds headline_pass, blocker_count,
  blocker_kinds, agent_version, negative_overlay.
- harness/ is local-only - not packaged into the wheel; dependencies live
  in harness/requirements.txt.
- CI is workflow_dispatch only with a SHIPGATE_HARNESS_BUDGET_USD hard
  cap; no nightly cron in v1.
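
The exclusion of 60-docs-only-negative from 40-shipgate-yaml could be expressed along these lines. The real matrix is an explicit benchmark/matrix.yaml, so the cross-product expansion here is purely illustrative; only the `EXCLUDED_PAIRS` name and the excluded pair come from this PR.

```python
from itertools import product

# (variant, negative_overlay) pairs that must never be combined.
EXCLUDED_PAIRS = {("40-shipgate-yaml", "60-docs-only-negative")}


def expand(archetypes, variants, negative_overlays):
    """Hypothetical expansion of the matrix, skipping excluded pairs."""
    cells = []
    for archetype, variant, overlay in product(archetypes, variants, negative_overlays):
        if overlay is not None and (variant, overlay) in EXCLUDED_PAIRS:
            continue  # force_run applies to opted-in repos, so this cell is meaningless
        cells.append((archetype, variant, overlay))
    return cells
```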

Verified end-to-end via `python -m harness.adoption smoke`: mock-good
fixture scores 100/100 with no blockers; mock-bad trips five blockers and
headline_pass=False. Redaction guarantee holds - sk-* tokens in raw/ never
reach redacted/ or scorecard.json. Full test suite: 1545 passed, 4 skipped
(1 new skip: the zero-install script lacks n8n parity, flagged for
follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven findings from the second review:

[P1] Branch rebased on latest main (resolves the add/add conflict on
samples/n8n_workflow_agent/workflows/support-refund.json by taking main's
version — the one already tuned for PR #99's zero-install n8n detector).

[P1] Ruff UP017 fixed: tests/agent_tasks/conftest.py now uses
`datetime.UTC` instead of `datetime.timezone.utc`.

[P1] Infrastructure failures are now visible. Two changes in cli.py:
  - The outer per-cell try/except no longer continues silently — it
    builds a scorecard via `_infrastructure_failure_scorecard()` that
    contains a blocker-severity `infrastructure_failure` criterion,
    writes its JSON sidecar, and flows through the normal CSV writer.
  - When the driver returns a RunResult with a non-empty `.error`, the
    scoring path now calls `_mark_infrastructure_failure()` which flips
    `headline_pass=False`, sets `driver_degraded=True`, and adds the
    same blocker kind. A missing API key or SDK crash can no longer
    masquerade as a regular low-scoring cell.
  Locked in by tests/harness/test_infrastructure_failures.py.

[P1] `no_runtime_trace_synthesis` regex relaxed. The previous
`(?:^|/)(?:traces/|...)` required the path segment to start at string
beginning or after a slash, which missed json.dumps-quoted manifest
values like `"traces/approval.jsonl"`. Switched to a negative lookbehind
`(?<![A-Za-z0-9._-])` so the pattern fires inside quoted YAML/JSON
values while still rejecting path-prefix false positives. Added two
new regression tests covering manifest-reference cases.
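
The difference between the two anchors can be seen in a reduced form. This sketch keeps only the `traces/` alternative; the other alternatives are elided in the message above and are deliberately not reconstructed.

```python
import re

# Old anchor: the segment must start the string or follow a slash.
OLD = re.compile(r"(?:^|/)traces/")
# New anchor: the segment must not follow a path-name character.
NEW = re.compile(r"(?<![A-Za-z0-9._-])traces/")

quoted = '"traces/approval.jsonl"'   # a json.dumps-quoted manifest value
embedded = "subtraces/approval.jsonl"  # a longer directory name, not a trace dir
```

Here `OLD.search(quoted)` finds nothing because the segment follows a quote character, while `NEW.search(quoted)` matches. Both patterns reject `embedded`, since `traces/` is preceded by a letter there.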

[P1] `pre_snap` snapshot now taken AFTER overlays are applied, so
`fs_diff.added` lists only agent-created files and not overlay
templates. Previously the docs-only-negative README append and
30-cursor-rule rule file would have leaked into the trace-synthesis
detector.

[P2] `clean-read-only` archetype now uses `type=mcp` (valid per the
manifest schema's `tool_sources.type` enum) instead of the invalid
`openai_api`. Added a regression test
`test_every_archetype_uses_a_valid_tool_source_type` that pins the
constraint against the v0.1 schema enum.

[P2] benchmark/matrix.yaml header comment rewritten to accurately
describe the 24+12-cell shape, including that cursor-static covers a
different variant subset (00/30 + docs-only composition) from Claude
Code (00/10/40) and why.

[P2] docs/adoption-harness-automated.md install section now documents
both required installs: `pip install -e .` (for `agents_shipgate`) AND
`pip install -r harness/requirements.txt` (for the SDK + driver deps).
A `PYTHONPATH=src:.` fallback is noted for environments that cannot do
editable installs.

Tests: 1605 passed, 4 skipped (3 pre-existing + 1 n8n parity — now
resolved on main but still gated). Two new tests added for
infrastructure-failure visibility; two for the manifest-reference
trace-synthesis cases; one for tool-source-type validity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pengfei-threemoonslab pengfei-threemoonslab force-pushed the claude/intelligent-wu-f8ffd7 branch from 406b69b to 16e4298 Compare May 19, 2026 19:55
pengfei-threemoonslab and others added 8 commits May 19, 2026 21:29
Five findings from the third review:

[P1] Nonzero exit on harness failure. cli.py:run() now exits:
  - 4 when any cell has an infrastructure_failure blocker (driver crash,
    workspace setup failure, missing API key)
  - 3 when any exit-criteria check is false
  - 0 only when every cell ran AND all three exit criteria pass
A CI workflow showing "green" now actually means the matrix is healthy.
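
The exit-code policy can be summarised in a few lines. This is a sketch: the function name and criteria keys are hypothetical, and checking infrastructure failures before exit criteria is an assumption about precedence.

```python
def run_exit_code(infra_failure_cells: int, exit_criteria: dict[str, bool]) -> int:
    """Map run results to the process exit code described above."""
    if infra_failure_cells > 0:
        return 4  # a driver crash, setup failure, or missing API key occurred
    if not all(exit_criteria.values()):
        return 3  # every cell ran, but at least one exit criterion is false
    return 0      # every cell ran and all criteria pass
```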

[P1] Per-cell timeout and pre-cell budget guard.
  - DriverInputs.timeout_s and budget_usd are now plumbed through
    _run_one_cell (default 600s, override via SHIPGATE_HARNESS_CELL_TIMEOUT_S).
  - The Claude Code driver wraps its SDK loop in anyio.move_on_after(timeout_s)
    and reports a timeout as a driver-runtime error so it surfaces as an
    infrastructure failure (which the exit-code change above then fails CI on).
  - The BudgetGuard now refuses to start a new cell when remaining budget is
    <= 0, in addition to its previous post-cell check. A single hung or
    expensive cell can no longer blow past the cap.
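
A minimal model of the pre-cell and post-cell checks, assuming costs are tracked in USD as floats. `cap_usd` matches the constructor argument used elsewhere in this PR; the method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    cap_usd: float
    spent_usd: float = 0.0

    @property
    def remaining_usd(self) -> float:
        return self.cap_usd - self.spent_usd

    def can_start_cell(self) -> bool:
        """Pre-cell check: refuse to start when nothing is left."""
        return self.remaining_usd > 0

    def record(self, cost_usd: float) -> bool:
        """Post-cell check: add the cost, report whether the cap still holds."""
        self.spent_usd += cost_usd
        return self.spent_usd <= self.cap_usd
```

The runner would call `can_start_cell()` before each cell and `record()` after it, stopping the loop as soon as either returns false.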

[P1] no_runtime_trace_synthesis now cross-checks file existence. The
detector extracts trace-shaped strings from the post-run manifest via a
recursive walk (not regex on json.dumps), then:
  - Fails if any referenced path was added during the run (fabricated this
    cell).
  - Fails if any referenced path does not exist in either pre_workspace_files
    or post_workspace_files (fabricated reference, no real file).
  - Passes if every referenced path resolves to a pre-existing file.
Legitimate evidence pointing at real captured traces no longer false-fails.
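
The three outcomes above can be sketched as follows. The prefix list that decides what counts as "trace-shaped" is an assumption, as is treating a manifest with no references as a pass.

```python
# Assumption: strings under these prefixes count as trace-shaped references.
TRACE_PREFIXES = ("traces/", "validation/")


def iter_strings(node):
    """Recursively yield every string value in a parsed manifest."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_strings(item)


def trace_reference_verdict(manifest, pre_files: set[str], post_files: set[str]) -> str:
    added = post_files - pre_files
    for ref in iter_strings(manifest):
        if not ref.startswith(TRACE_PREFIXES):
            continue
        if ref in added:
            return "fail"  # file was created during this cell
        if ref not in pre_files and ref not in post_files:
            return "fail"  # reference points at nothing
    return "pass"  # every reference resolves to a pre-existing file
```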

[P1] no_prohibited_action_overclaim now reads the post-run manifest
directly. The previous diff-regex approach false-blocked any new manifest
with `prohibited_actions: []` plus an unrelated YAML list item. New logic:
read `agent.prohibited_actions` from the post manifest — N/A if empty,
fail only when non-empty AND the summary uses enforcement-by-Shipgate
language. Pinned by three new focused tests.
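
In outline, the new logic reads the parsed manifest instead of the diff. The enforcement phrases below are hypothetical stand-ins; the message above does not list the actual wording the detector looks for.

```python
import re

# Hypothetical enforcement-by-Shipgate phrasing.
ENFORCEMENT = re.compile(
    r"shipgate\s+(?:enforces|blocks|prevents|guarantees)", re.IGNORECASE
)


def overclaim_verdict(post_manifest: dict, summary: str) -> str:
    """N/A when nothing is declared; fail only on declared + enforcement claim."""
    agent = post_manifest.get("agent") or {}
    actions = agent.get("prohibited_actions") or []
    if not actions:
        return "n/a"
    return "fail" if ENFORCEMENT.search(summary) else "pass"
```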

[P2] Cursor static lint now parses YAML frontmatter, not substring-matches
the file. The driver extracts the declared `globs:` list from the
`---` frontmatter block, checks canonical globs are present in that list
(not anywhere in the body), and uses fnmatch.fnmatch to verify each
archetype trigger file actually matches a declared glob. A malformed rule
or one that mentions globs only in prose now scores as
`rule_present_but_globs_incomplete`, not `rule_active`. Pinned by
test_cursor_driver.py.
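
A simplified version of the frontmatter check, for illustration only. It hand-parses a block-style `globs:` list instead of using a YAML parser, which is a deliberate simplification, and the function names are hypothetical. The two verdict strings are the ones named above.

```python
from fnmatch import fnmatch


def declared_globs(rule_text: str) -> list[str]:
    """Extract a block-style `globs:` list from the leading --- frontmatter."""
    lines = rule_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    globs, in_globs = [], False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of frontmatter; the body is never inspected
        if stripped == "globs:":
            in_globs = True
        elif in_globs and stripped.startswith("- "):
            globs.append(stripped[2:].strip().strip("'\""))
        elif stripped:
            in_globs = False
    return globs


def lint_rule(rule_text: str, canonical: list[str], trigger_files: list[str]) -> str:
    globs = declared_globs(rule_text)
    has_canonical = all(g in globs for g in canonical)
    triggers_match = all(any(fnmatch(f, g) for g in globs) for f in trigger_files)
    if globs and has_canonical and triggers_match:
        return "rule_active"
    return "rule_present_but_globs_incomplete"
```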

Tests: 1615 passed, 4 skipped. Six new cursor tests (frontmatter
parsing + body-only-globs rejection); three new prohibited-action
overclaim tests (empty list + populated with/without enforcement
language); one new trace-synthesis test (existing-file reference passes).
Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings:

[P1] Infrastructure-failure scorecards now redact before storage. Driver
errors (which can carry API keys via SDK HTTP error messages, absolute
paths under $HOME, and .env.harness values) used to flow directly into
criteria.signal / blockers.detail / notes. They now route through a new
_redact_error() helper that calls observer.redact.redact_string() with
the default config. Pinned by two new tests that inject sk-* tokens into
both the _mark_infrastructure_failure() and _infrastructure_failure_scorecard()
paths and assert the raw secret never appears in scorecard JSON.

[P1] Claude driver enforces budget mid-loop. The cost-per-1M tokens
lookup is hoisted out of the post-loop math; the per-event SDK loop now
computes current cost after every usage update and breaks immediately
when it would cross inputs.budget_usd. A budget abort is reported via
RunResult.error and surfaces as an infrastructure_failure scorecard (so
the exit-code change from round three flags it). A single expensive
cell can no longer exceed the requested cap before the outer BudgetGuard
sees it.
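
The per-event check might be sketched like this. It assumes each usage update carries cumulative token totals and uses placeholder prices; neither detail is confirmed by the message above, and the event shape is hypothetical rather than the SDK's.

```python
def run_with_budget(usage_events, price_per_mtok: dict[str, float], budget_usd: float):
    """Return (cost_usd, error); error is set when the budget is crossed."""
    cost = 0.0
    for event in usage_events:
        # Assumption: each event reports cumulative totals for the cell so far.
        cost = (
            event["input_tokens"] * price_per_mtok["input"]
            + event["output_tokens"] * price_per_mtok["output"]
        ) / 1_000_000
        if cost > budget_usd:
            # Stop consuming events immediately and report the abort.
            return cost, f"budget exceeded: {cost:.2f} > {budget_usd:.2f} USD"
    return cost, None
```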

[P2] `score` and `report` subcommands implemented.
  - `score --run-dir=<dir>` walks the cell subdirectories of a previous
    run, rebuilds CellArtifacts from the captured redacted artifacts +
    workspace tree, and re-runs the current detectors. Writes a fresh
    CSV (default: <run-dir>/rescored.csv) plus exit_criteria.json.
    Lets detector iteration happen without rerunning agents.
  - `report --results-csv=<csv>` parses the CSV and prints a
    (agent, variant) aggregate table with n, mean_score, pass_rate, and
    blocker_count. Doesn't recompute detectors.

Tests: 1618 passed, 4 skipped. Three new tests (rescore replay, infra
redaction in both paths). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings:

[P1] score preserves infrastructure failures. Rescoring re-runs
behavioural detectors only, so the prior infrastructure_failure blocker
(driver crash, missing API key, codex stub) used to be lost — a broken
run could be rewritten as headline_pass=true. _rescore_cell now reads
the prior scorecard's blockers; if any has kind=infrastructure_failure,
the detail is replayed through _mark_infrastructure_failure on the
fresh scorecard. Locked in by test_rescore_preserves_infrastructure_failure
against a codex-style stub run.

[P1] Exit criteria filter by agent. check_exit_criteria() used to group
only by variant, so cursor-static 00-no-hints rows (correctly 100 when
no rule is present) inflated the Claude no-hints baseline and broke the
+25-point uplift metric. New BEHAVIORAL_AGENTS={claude-code, codex}
gate; cursor-static rows are reported as a separate cursor_static_cells
/ cursor_static_pass_rate detail. Locked in by 4 new tests in
test_exit_criteria.py including the specific inflation case from the
review.
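
The agent gate can be illustrated with a small aggregation. `BEHAVIORAL_AGENTS` and the cursor-static detail name come from the message above; the row keys (`agent`, `variant`, `rubric_score`, `headline_pass`) are assumed to mirror the CSV columns.

```python
from collections import defaultdict
from statistics import mean

BEHAVIORAL_AGENTS = {"claude-code", "codex"}


def mean_score_by_variant(rows):
    """Average rubric scores per variant, counting behavioural agents only."""
    by_variant = defaultdict(list)
    for row in rows:
        if row["agent"] in BEHAVIORAL_AGENTS:
            by_variant[row["variant"]].append(row["rubric_score"])
    return {variant: mean(scores) for variant, scores in by_variant.items()}


def cursor_static_pass_rate(rows):
    """Separate detail for cursor-static rows; None when there are none."""
    cursor = [row for row in rows if row["agent"] == "cursor-static"]
    if not cursor:
        return None
    return sum(bool(row["headline_pass"]) for row in cursor) / len(cursor)
```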

[P2] FsDiff persisted across rescore. _run_one_cell now writes
snapshots/{pre,post}.json (sha256 digests only — no file contents)
alongside each cell's artifacts. _rescore_cell loads them and rebuilds
the original FsDiff verbatim; when the sidecars are absent (older runs),
filesystem-dependent criteria (no_runtime_trace_synthesis) fall back to
N/A rather than over-flagging legitimate pre-existing trace files. End-
to-end verified: smoke → score reproduces an identical CSV.
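
A digest-only snapshot and the diff rebuilt from it could look like this. The dictionary layout is an assumption; the message above only states that the sidecars hold sha256 digests and no file contents.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's repo-relative path to its sha256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def fs_diff(pre: dict[str, str], post: dict[str, str]) -> dict[str, list[str]]:
    """Rebuild added/removed/modified lists from two digest snapshots."""
    return {
        "added": sorted(post.keys() - pre.keys()),
        "removed": sorted(pre.keys() - post.keys()),
        "modified": sorted(k for k in pre.keys() & post.keys() if pre[k] != post[k]),
    }
```

Because both snapshots are plain JSON-serialisable dictionaries, writing them next to the cell artifacts is enough to reproduce the diff later without the original workspace.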

Tests: 1623 passed, 4 skipped. Five new tests (1 rescore-preserves-infra,
4 exit-criteria-by-agent). Ruff clean. End-to-end smoke + rescore round
trip identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two findings:

[P1] score keeps setup-time infrastructure failures visible. When a cell
crashed before the driver ran (missing archetype, workspace setup
crash), _infrastructure_failure_scorecard writes scorecard.json but no
redacted/. The previous _rescore_cell hit "missing redacted/" and
returned None, silently dropping the broken cell from the rescored CSV.
New _is_setup_time_infra_failure() + _replay_infra_only_scorecard()
helpers detect this case from the prior scorecard's blockers and
rehydrate the row verbatim (headline_pass forced False). Locked in by
test_rescore_keeps_setup_time_infra_failure_with_no_redacted_dir.

[P2] CSV transcript_path is now repo-relative. Three call sites — the
behavioural scorecard, the rescored scorecard, and the
_infrastructure_failure_scorecard — set artifacts_dir from a Path that
was sometimes absolute (smoke builds from _repo_root(), --out=/abs/path
explicit, etc.), violating the documented schema that says the column
is repo-relative under .agents-private/. New _relative_artifacts_path()
helper does Path.relative_to(_repo_root()) when possible, falls back to
absolute when artifacts live outside the repo. Locked in by
test_artifacts_dir_is_repo_relative_in_failure_scorecard and verified
end-to-end via smoke → score.

Tests: 1625 passed, 4 skipped. Two new tests. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four findings:

[P1] Dispatcher-level scorecard redaction. Detectors read live-workspace
files (shipgate.yaml, .gitignore) and copy raw values — tool names,
policy entries, scope strings — into criterion.signal / blocker.detail /
notes. A token written as a policy tool name leaked into scorecard.json.
New aggregate.redact_scorecard_in_place() walks every text-bearing
field through observer.redact.redact_string and is called from
write_scorecard_json and write_csv before serialisation. That's the one
choke point — no scorecard reaches disk or CSV without redaction.
Locked in by test_secret_in_policy_tool_name_does_not_leak_via_signal.

[P1] --budget-usd is now authoritative. The previous flow used
os.environ.setdefault then BudgetGuard.from_env, so a stale (or
deliberately-high) SHIPGATE_HARNESS_BUDGET_USD silently overrode a
lower CLI cap — unsafe for paid runs. cli.run now constructs
BudgetGuard(cap_usd=budget_usd) directly. BudgetGuard.from_env was
removed entirely because its existence implied an env precedence we
no longer honour; operators who want env-driven caps pass the env var
through the flag explicitly: --budget-usd "$VAR".

[P2] _relative_artifacts_path resolves both sides and refuses out-of-
repo roots. macOS /tmp -> /private/tmp symlink discrepancy used to
trigger the fallback-to-absolute branch, leaking the host path into
the public CSV. Both paths are now resolve()'d before relative_to.
For artifact roots genuinely outside the repo (e.g., --out=/elsewhere)
the helper now raises rather than silently leaking; operators see a
clear error pointing them back to .agents-private/adoption-sprint/.
Tests that previously used pytest's out-of-repo tmp_path migrate to a
new repo_tmp_path fixture (under .agents-private/test-runs/, cleaned
up at teardown).
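
The resolve-then-refuse behaviour can be sketched as below. The function takes the repo root as a parameter for testability, which differs from the helper described above, and the error wording is illustrative.

```python
from pathlib import Path


def relative_artifacts_path(artifacts_dir: Path, repo_root: Path) -> str:
    """Return a repo-relative POSIX path, or raise for out-of-repo roots."""
    # Resolve both sides so symlinked prefixes (e.g. /tmp on macOS) agree.
    resolved_root = repo_root.resolve()
    resolved_dir = artifacts_dir.resolve()
    try:
        return resolved_dir.relative_to(resolved_root).as_posix()
    except ValueError as exc:
        raise ValueError(
            "artifacts directory is outside the repository; "
            "write under .agents-private/ instead"
        ) from exc
```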

[P2] Filtered runs no longer fail on behavioural-only metrics. The
exit-code gate now reads exit_report.details: when behavioural_cells
== 0 (e.g., --agent=cursor-static), the three Claude-uplift metrics
are treated as N/A. cursor_static_pass_rate gets its own gate (must
be 1.0). A 12/12-passing cursor-only run now exits 0 (verified). The
full-matrix path is unchanged.

Tests: 1626 passed, 4 skipped. New: one redaction-via-scorecard test,
one exit-gate test for cursor-only runs, plus a shared repo_tmp_path
fixture. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings:

[P1] Empty runs exit non-zero. Previously --budget-usd=0 wrote a
zero-row CSV and exited 0; a CI invocation could look green while
running nothing. The run command now exits 5 when the matrix has cells
but scorecards is empty after the loop (budget exhaustion before the
first cell, or every cell raising and producing no scorecard). The new
budget_aborted_early flag surfaces the diagnostic. Locked in by
tests/harness/test_run_preflight.py::test_empty_run_exits_nonzero.

[P1] Operational doc realigned with the CLI-only budget contract.
docs/adoption-harness-automated.md still said SHIPGATE_HARNESS_BUDGET_USD
hard-caps cost — but round seven removed env-var support. The doc now
states the CLI flag is the only knob, documents the empty-run exit-5
guard, and shows the explicit pass-through idiom for operators who
prefer env-driven CI:
  python -m harness.adoption run --budget-usd "$SHIPGATE_HARNESS_BUDGET_USD"

[P1] --out preflight. _relative_artifacts_path() now rejects out-of-
repo paths, but it was only called after a cell completed — so a paid
Claude cell would finish, the helper would raise, and the
infra-failure handler (which calls the same helper) would crash too,
leaving no CSV and burning live budget. cli.run() now validates the
run_dir against _relative_artifacts_path BEFORE the cell loop and
exits 2 with a clear remediation message when it would fail. Locked
in by test_out_of_repo_out_dir_rejected_before_any_cell, which also
asserts no [1/N] cell-progress line appears.

Tests: 1629 passed, 4 skipped. Two new preflight tests in
tests/harness/test_run_preflight.py spawning the real CLI process to
verify production exit codes. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings:

[P1] Negative-control correct-skip now scores 100. Previously, an agent
that correctly took no Shipgate action on a 60-docs-only-negative cell
scored ~20: discovers_relevance passed (+20) but runs_detect/init/scan
failed because no command ran. The runbook documents 100 as the right
score for that case. Two-part fix:
  - Tightened "proposed Shipgate" to require an action (a Shipgate
    command, a file_op touching shipgate.yaml or the workflow YAML, or
    a new such file in fs_diff). A bare mention in the summary
    ("Shipgate is not relevant, skipping") no longer counts as a
    false-positive proposal.
  - In score_cell, when _expects_proposal returns False, every
    behavioural criterion except discovers_relevance is forced to N/A.
    The existing rescale rule then maps "discovery passed + everything
    else N/A" to a rubric_score of 100.
Locked in by test_negative_control_correct_skip_scores_100.
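
The rescale rule can be shown with a toy rubric. Only the 20 points for discovers_relevance are taken from the text above; the other weights are placeholders, and all-or-nothing scoring per criterion is an assumption.

```python
def rubric_score(criteria: list[tuple[int, str]]) -> int:
    """criteria holds (max_points, status); status is 'pass', 'fail' or 'n/a'."""
    applicable = sum(points for points, status in criteria if status != "n/a")
    earned = sum(points for points, status in criteria if status == "pass")
    if applicable == 0:
        return 0
    # N/A criteria drop out of the denominator, so the rest rescale to 100.
    return round(100 * earned / applicable)
```

With discovery passed and every other criterion N/A, the denominator shrinks to 20 and the score is 100. With the same criteria marked as failures instead, the score stays at 20.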

[P2] score command now exits nonzero on failure. Previously `score`
always exited 0, even when replaying an infrastructure_failure row.
The exit-code policy from `run` is now factored into _gate_exit_codes
and applied to `score` too: exit 4 on infra failures, exit 3 on
exit-criteria failures, exit 5 on an empty rescore set. Locked in by
test_score_exits_nonzero_on_replayed_infra_failure (real subprocess).

[P2] claude-agent-sdk pinned with an upper bound. harness/requirements.txt
previously allowed any future >=0.1.0; the driver depends on structured
SDK event shapes that a future minor could change without any repo
diff. Now pinned to claude-agent-sdk>=0.2.82,<0.3, with a comment
documenting the manual smoke + paid-cell process needed before bumping
the lower bound. Other deps also given upper bounds (pydantic<3,
pyyaml<7, rich<15, typer<1).

Tests: 1631 passed, 4 skipped. Two new tests
(test_negative_control_correct_skip_scores_100,
test_score_exits_nonzero_on_replayed_infra_failure). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings:

[P1] Text-only Shipgate proposals are now detected. Round nine
overcorrected — restricting "proposal" to commands/file_ops missed
agents that recommended Shipgate in summary without executing it. New
_summary_has_proposal() helper:
  - Treats a literal `agents-shipgate <verb>` mention in the summary as
    a proposal (writing it down counts).
  - Otherwise, scans each summary sentence for a positive verb (add /
    install / recommend / run / adopt / set up / configure / …) co-
    occurring with a Shipgate mention.
  - Skips sentences containing negation markers (not / skip / won't /
    irrelevant / out of scope / …). "Shipgate is not relevant" stays
    a non-proposal; "I recommend adding Agents Shipgate" counts as one.
Locked in by test_text_only_proposal_is_detected and
test_skip_language_does_not_register_as_proposal.
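
A reduced sketch of the three rules. The verb and negation lists are partial (the message elides the full lists), the inflected forms `adding` and `skipping` are added for the example, and the set of CLI verbs is an assumption.

```python
import re

# Assumption: these are the CLI verbs worth treating as a literal proposal.
COMMAND = re.compile(r"\bagents-shipgate\s+(?:detect|init|scan|doctor)\b", re.IGNORECASE)
MENTION = re.compile(r"\bshipgate\b", re.IGNORECASE)
POSITIVE = re.compile(
    r"\b(?:add|adding|install|recommend|run|adopt|set up|configure)\b", re.IGNORECASE
)
NEGATION = re.compile(
    r"\b(?:not|skip|skipping|won't|irrelevant|out of scope)\b", re.IGNORECASE
)


def summary_has_proposal(summary: str) -> bool:
    if COMMAND.search(summary):
        return True  # writing the command down counts as proposing it
    for sentence in re.split(r"(?<=[.!?])\s+", summary):
        if not MENTION.search(sentence) or NEGATION.search(sentence):
            continue
        if POSITIVE.search(sentence):
            return True
    return False
```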

[P2] score now truncates the rescored CSV instead of appending.
agg_mod.write_csv opens in 'a' mode because run-time aggregation
accumulates across cells. `score` reuses that helper, so calling it
twice against the same --results-csv used to duplicate rows. Fix:
unlink the output path before write_csv. Pinned by
test_score_truncates_existing_csv (real subprocess, two invocations).
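
The fix amounts to removing the file before reusing an append-mode writer. In this sketch `append_rows` stands in for the shared writer and `write_rescored` for the `score` path; both names are hypothetical.

```python
import csv
from pathlib import Path


def append_rows(path: Path, rows: list[dict]) -> None:
    """Append-mode writer; emits the header only when creating the file."""
    if not rows:
        return
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def write_rescored(path: Path, rows: list[dict]) -> None:
    """Start from a clean file so repeated rescoring never duplicates rows."""
    path.unlink(missing_ok=True)
    append_rows(path, rows)
```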

[P3] benchmark/README.md cursor coverage description now matches the
matrix. Previously said "the same 24 cells" under cursor-static; the
actual matrix has 12 cursor cells targeting a different variant
subset (00 + 30 + 30/docs-only-negative). Updated to spell out the
real count and the rationale.

Tests: 1634 passed, 4 skipped. Three new tests
(test_text_only_proposal_is_detected,
test_skip_language_does_not_register_as_proposal,
test_score_truncates_existing_csv). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>