Add automated cold-agent adoption harness (P0.2) #98
Open
pengfei-threemoonslab wants to merge 11 commits into
Conversation
pengfei-threemoonslab
added a commit
that referenced
this pull request
May 19, 2026
Eight findings from the PR #98 review:
[P1] 00-no-hints is no longer contaminated. sync_fixtures.py now strips shipgate.yaml, .agents-shipgate/, agents-shipgate-reports/, expected/, and evals/ when vendoring samples/ into benchmark/repos/. Resynced; the openai-agents-sdk archetype no longer ships an existing manifest.
[P1] n8n 40-shipgate-yaml renders a doctor-clean manifest. ArchetypeContext now exposes a tool_surface_block() method so n8n contributes the top-level n8n: block per docs/manifest-v0.1.md, while every other archetype still contributes tool_sources:. The template consumes {{TOOL_SURFACE}} instead of hardcoding `tool_sources:`. Verified `agents-shipgate doctor` exits 0 on every rendered archetype.
[P1] Scorecard schema is now wheel-installable. ScorecardV1 lives in src/agents_shipgate/schemas/adoption_scorecard.py; the harness re-export at harness/adoption/scorer/schema.py is a thin pass-through for existing imports. `python -c "from agents_shipgate.schemas.adoption_scorecard import adoption_scorecard_json_schema"` now works with only src/ on PYTHONPATH.
[P1] Cursor coverage is real. Twelve cursor-static cells added to benchmark/matrix.yaml. The scorer marks every non-discovery criterion N/A for cursor-static cells; the rubric_score rescales so a passing static lint reports 100 instead of 20. The 30-cursor-rule template was missing n8n/, workflows/, and `.github/workflows/agents-shipgate.yml` globs; it is now synced to the canonical snippet. Cursor static expectations now reflect configuration correctness, not live behaviour (negative_overlay no longer inverts the expectation, since a well-configured Cursor rule fires on any matching glob regardless of PR shape).
[P1] Blocker detectors tightened.
- no_runtime_trace_synthesis: now catches `validation/approval-traces.jsonl`, `validation/override-log.jsonl`, `validation/high-risk-exclusions.yaml`, `validation/promotion-criteria.yaml` in addition to legacy `traces/`.
- no_broad_scope_expansion: flags `admin`, `root`, `superuser`, `write_all`, `read_all`, `all` literal scopes in addition to `*` / `x:*` patterns.
- respects_manual_review: now requires the tool name to appear in commands.jsonl OR summary.md, not just transcript.jsonl. A tool name showing up in a tool_result block from reading report.json is passive and no longer counts as review evidence.
[P1] avoids_committing_reports detects force-add. The detector now looks for diff file-header lines (`+++ b/agents-shipgate-reports/...`) rather than any line mentioning the directory string. Adding the directory to .gitignore (the desired behaviour) no longer false-positives.
[P2] CSV is RFC4180-clean. Removed the embedded `# benchmark_schema_version` comment line; the schema version stays in benchmark/results/README.md and the per-run exit_criteria.json.
[P2] Ruff clean. Auto-fixed 61 issues across harness/ and tests/harness/ (unused imports, `from collections.abc import Callable`, etc.).
Tests: 1555 passed, 4 skipped. Ten new focused detector tests pin the tightened behaviour at tests/harness/test_detectors.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
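The force-add check described in the commit above (matching diff file-header lines rather than any mention of the directory) might look roughly like the sketch below. The function name `commits_reports` and the sample diffs are hypothetical; the real detector in the harness may be structured differently.

```python
from __future__ import annotations

import re

# Matches only unified-diff "new file side" headers for paths under the
# reports directory. The helper name below is an assumption for illustration.
_REPORT_HEADER = re.compile(r"^\+\+\+ b/agents-shipgate-reports/", re.MULTILINE)


def commits_reports(diff_text: str) -> bool:
    """Return True when the diff touches a file under agents-shipgate-reports/."""
    return bool(_REPORT_HEADER.search(diff_text))


# A diff that only adds the directory to .gitignore (the desired behaviour).
GITIGNORE_ONLY = """\
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 *.pyc
+agents-shipgate-reports/
"""

# A diff that force-adds a report file.
FORCE_ADDED = """\
--- /dev/null
+++ b/agents-shipgate-reports/report.json
@@ -0,0 +1 @@
+{}
"""
```

Anchoring on the `+++ b/` header means an added `.gitignore` line that merely names the directory does not trip the check.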
Lifts the "do not automate" rule in docs/agent-adoption-harness.md and ships the v1 executable runner under harness/adoption/. The harness drives coding agents (Claude Code primary, Codex CLI v2 stub, Cursor via static-rule lint) across an explicit benchmark/matrix.yaml of (archetype, variant, prompt) cells, captures + redacts artifacts, and scores them against the existing 100-point rubric plus new blocker severities (no auto-asserted approval / confirmation / idempotency / broad-scope / prohibited-action / runtime-trace evidence).
Why: every adoption-improving edit to snippets, CLI diagnostics, or the trigger table is a guess until we have a repeatable score that moves with each change. This closes that loop.
Notable design choices (from the planning review):
- Overlay renderer is per-variant overlay.yaml driven; every cell fails loudly before the driver runs if any placeholder is unresolved, so a workspace with CHANGE_ME literals never reaches an agent.
- 40-shipgate-yaml is rendered from harness/adoption/context.py per archetype so the cell measures activation, not template cleanup.
- 60-docs-only-negative is a composable negative_overlay, not a primary variant; explicitly NOT paired with 40-shipgate-yaml because docs/triggers.json defines force_run for opted-in repos.
- Manifest schema is untouched - blocker detectors read only from diffs and transcripts.
- Workspace model is cp -r + git init per cell (vendored fixtures are plain directories, not bare repos).
- CSV bumped to schema v0.2: adds headline_pass, blocker_count, blocker_kinds, agent_version, negative_overlay.
- harness/ is local-only - not packaged into the wheel; dependencies live in harness/requirements.txt.
- CI is workflow_dispatch only with a SHIPGATE_HARNESS_BUDGET_USD hard cap; no nightly cron in v1.
Verified end-to-end via `python -m harness.adoption smoke`: mock-good fixture scores 100/100 with no blockers; mock-bad trips five blockers and headline_pass=False. Redaction guarantee holds - sk-* tokens in raw/ never reach redacted/ or scorecard.json.
Full test suite: 1545 passed, 4 skipped (1 new skip: the zero-install script lacks n8n parity, flagged for follow-up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
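The "fail loudly on unresolved placeholders" rule from the design notes above can be illustrated with a minimal sketch. The `render` function and its `{{KEY}}` substitution scheme are assumptions made for illustration; the actual overlay renderer is driven by overlay.yaml and is not shown in this PR text.

```python
from __future__ import annotations

import re

# Any surviving {{...}} token or CHANGE_ME literal counts as unresolved.
_PLACEHOLDER = re.compile(r"\{\{[^{}]*\}\}|CHANGE_ME")


def unresolved_placeholders(rendered: str) -> list[str]:
    """List every placeholder-like token still present in the rendered text."""
    return _PLACEHOLDER.findall(rendered)


def render(template: str, values: dict[str, str]) -> str:
    """Substitute {{KEY}} tokens, then refuse to return a partially rendered file."""
    out = template
    for key, value in values.items():
        out = out.replace("{{" + key + "}}", value)
    leftovers = unresolved_placeholders(out)
    if leftovers:
        # Hypothetical error type; the harness may use its own exception.
        raise ValueError(f"unresolved placeholders: {leftovers}")
    return out
```

Raising before any driver is invoked is what keeps a half-rendered workspace from ever reaching an agent.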
Seven findings from the second review:
[P1] Branch rebased on latest main (resolves the add/add conflict on samples/n8n_workflow_agent/workflows/support-refund.json by taking main's version, the one already tuned for PR #99's zero-install n8n detector).
[P1] Ruff UP017 fixed: tests/agent_tasks/conftest.py now uses `datetime.UTC` instead of `datetime.timezone.utc`.
[P1] Infrastructure failures are now visible. Two changes in cli.py:
- The outer per-cell try/except no longer continues silently; it builds a scorecard via `_infrastructure_failure_scorecard()` that contains a blocker-severity `infrastructure_failure` criterion, writes its JSON sidecar, and flows through the normal CSV writer.
- When the driver returns a RunResult with a non-empty `.error`, the scoring path now calls `_mark_infrastructure_failure()` which flips `headline_pass=False`, sets `driver_degraded=True`, and adds the same blocker kind.
A missing API key or SDK crash can no longer masquerade as a regular low-scoring cell. Locked in by tests/harness/test_infrastructure_failures.py.
[P1] `no_runtime_trace_synthesis` regex relaxed. The previous `(?:^|/)(?:traces/|...)` required the path segment to start at string beginning or after a slash, which missed json.dumps-quoted manifest values like `"traces/approval.jsonl"`. Switched to a negative lookbehind `(?<![A-Za-z0-9._-])` so the pattern fires inside quoted YAML/JSON values while still rejecting path-prefix false positives. Added two new regression tests covering manifest-reference cases.
[P1] `pre_snap` snapshot now taken AFTER overlays are applied, so `fs_diff.added` lists only agent-created files and not overlay templates. Previously the docs-only-negative README append and 30-cursor-rule rule file would have leaked into the trace-synthesis detector.
[P2] `clean-read-only` archetype now uses `type=mcp` (valid per the manifest schema's `tool_sources.type` enum) instead of the invalid `openai_api`. Added a regression test `test_every_archetype_uses_a_valid_tool_source_type` that pins the constraint against the v0.1 schema enum.
[P2] benchmark/matrix.yaml header comment rewritten to accurately describe the 24+12-cell shape, including that cursor-static covers a different variant subset (00/30 + docs-only composition) from Claude Code (00/10/40) and why.
[P2] docs/adoption-harness-automated.md install section now documents both required installs: `pip install -e .` (for `agents_shipgate`) AND `pip install -r harness/requirements.txt` (for the SDK + driver deps). A `PYTHONPATH=src:.` fallback is noted for environments that cannot do editable installs.
Tests: 1605 passed, 4 skipped (3 pre-existing + 1 n8n parity, now resolved on main but still gated). Two new tests added for infrastructure-failure visibility; two for the manifest-reference trace-synthesis cases; one for tool-source-type validity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
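To see why the negative lookbehind mentioned above fires inside quoted values while still rejecting longer path prefixes, consider this small sketch. Only the `traces/` alternative is shown because the commit message elides the rest of the alternation; `TRACE_REF` is a name chosen here for illustration.

```python
from __future__ import annotations

import re

# The lookbehind rejects a match when the preceding character could be part of
# a longer path segment (letters, digits, dot, underscore, hyphen). A quote,
# slash, whitespace, or start of string is allowed before "traces/".
TRACE_REF = re.compile(r"(?<![A-Za-z0-9._-])traces/[^\s\"']+")


def find_trace_refs(text: str) -> list[str]:
    """Return every trace-shaped path reference found in the text."""
    return TRACE_REF.findall(text)
```

For example, `find_trace_refs('{"evidence": "traces/approval.jsonl"}')` picks up the quoted path, whereas a segment such as `mytraces/` is left alone because the character before `traces/` is a letter.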
406b69b to
16e4298
Compare
Five findings from the third review:
[P1] Nonzero exit on harness failure. cli.py:run() now exits:
- 4 when any cell has an infrastructure_failure blocker (driver crash,
workspace setup failure, missing API key)
- 3 when any exit-criteria check is false
- 0 only when every cell ran AND all three exit criteria pass
A CI workflow showing "green" now actually means the matrix is healthy.
[P1] Per-cell timeout and pre-cell budget guard.
- DriverInputs.timeout_s and budget_usd are now plumbed through
_run_one_cell (default 600s, override via SHIPGATE_HARNESS_CELL_TIMEOUT_S).
- The Claude Code driver wraps its SDK loop in anyio.move_on_after(timeout_s)
and reports a timeout as a driver-runtime error so it surfaces as an
infrastructure failure (which the exit-code change above then fails CI on).
- The BudgetGuard now refuses to start a new cell when remaining budget is
<= 0, in addition to its previous post-cell check. A single hung or
expensive cell can no longer blow past the cap.
[P1] no_runtime_trace_synthesis now cross-checks file existence. The
detector extracts trace-shaped strings from the post-run manifest via a
recursive walk (not regex on json.dumps), then:
- Fails if any referenced path was added during the run (fabricated this
cell).
- Fails if any referenced path does not exist in either pre_workspace_files
or post_workspace_files (fabricated reference, no real file).
- Passes if every referenced path resolves to a pre-existing file.
Legitimate evidence pointing at real captured traces no longer false-fails.
[P1] no_prohibited_action_overclaim now reads the post-run manifest
directly. The previous diff-regex approach false-blocked any new manifest
with `prohibited_actions: []` plus an unrelated YAML list item. New logic:
read `agent.prohibited_actions` from the post manifest — N/A if empty,
fail only when non-empty AND the summary uses enforcement-by-Shipgate
language. Pinned by three new focused tests.
[P2] Cursor static lint now parses YAML frontmatter, not substring-matches
the file. The driver extracts the declared `globs:` list from the
`---` frontmatter block, checks canonical globs are present in that list
(not anywhere in the body), and uses fnmatch.fnmatch to verify each
archetype trigger file actually matches a declared glob. A malformed rule
or one that mentions globs only in prose now scores as
`rule_present_but_globs_incomplete`, not `rule_active`. Pinned by
test_cursor_driver.py.
Tests: 1615 passed, 4 skipped. Six new cursor tests (frontmatter
parsing + body-only-globs rejection); three new prohibited-action
overclaim tests (empty list + populated with/without enforcement
language); one new trace-synthesis test (existing-file reference passes).
Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
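The recursive walk and file-existence cross-check described for no_runtime_trace_synthesis in the commit above could be sketched as follows. The `looks_like_trace` heuristic, the string verdicts, and the choice to pass when the manifest has no trace references are assumptions for illustration only.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value in a nested dict/list structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_strings(item)


def looks_like_trace(value: str) -> bool:
    # Assumption: treat paths under traces/ or validation/ as trace-shaped.
    return value.startswith(("traces/", "validation/"))


def check_trace_refs(manifest: Any, pre_files: set[str], post_files: set[str]) -> str:
    """Return "fail" for fabricated references, otherwise "pass"."""
    added = post_files - pre_files
    for ref in (s for s in iter_strings(manifest) if looks_like_trace(s)):
        if ref in added:
            return "fail"  # file was created during this cell
        if ref not in pre_files and ref not in post_files:
            return "fail"  # reference points at no real file
    return "pass"  # every reference resolves to a pre-existing file
```

Walking the parsed manifest rather than regex-matching its serialized form avoids depending on quoting or key order.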
Three findings:
[P1] Infrastructure-failure scorecards now redact before storage. Driver
errors (which can carry API keys via SDK HTTP error messages, absolute
paths under $HOME, and .env.harness values) used to flow directly into
criteria.signal / blockers.detail / notes. They now route through a new
_redact_error() helper that calls observer.redact.redact_string() with
the default config. Pinned by two new tests that inject sk-* tokens into
both the _mark_infrastructure_failure() and _infrastructure_failure_scorecard()
paths and assert the raw secret never appears in scorecard JSON.
[P1] Claude driver enforces budget mid-loop. The cost-per-1M tokens
lookup is hoisted out of the post-loop math; the per-event SDK loop now
computes current cost after every usage update and breaks immediately
when it would cross inputs.budget_usd. A budget abort is reported via
RunResult.error and surfaces as an infrastructure_failure scorecard (so
the exit-code change from round three flags it). A single expensive
cell can no longer exceed the requested cap before the outer BudgetGuard
sees it.
[P2] `score` and `report` subcommands implemented.
- `score --run-dir=<dir>` walks the cell subdirectories of a previous
run, rebuilds CellArtifacts from the captured redacted artifacts +
workspace tree, and re-runs the current detectors. Writes a fresh
CSV (default: <run-dir>/rescored.csv) plus exit_criteria.json.
Lets detector iteration happen without rerunning agents.
- `report --results-csv=<csv>` parses the CSV and prints a
(agent, variant) aggregate table with n, mean_score, pass_rate, and
blocker_count. Doesn't recompute detectors.
Tests: 1618 passed, 4 skipped. Three new tests (rescore replay, infra
redaction in both paths). Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
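The mid-loop budget enforcement described in the commit above can be illustrated with a small sketch. The `Usage` type, the per-1M-token prices, and the `run_until_budget` function are all hypothetical stand-ins; the real driver consumes Claude Agent SDK events, whose shapes are not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Usage:
    """Cumulative token counts reported by one (hypothetical) usage update."""
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per 1M tokens; not real pricing.
PRICE_PER_1M = {"input": 3.0, "output": 15.0}


def cost_usd(usage: Usage) -> float:
    return (
        usage.input_tokens * PRICE_PER_1M["input"]
        + usage.output_tokens * PRICE_PER_1M["output"]
    ) / 1_000_000


def run_until_budget(updates: Iterable[Usage], budget_usd: float) -> Tuple[int, Optional[str]]:
    """Process updates until the running cost would cross the budget.

    Returns (updates_processed, error); error is None on a clean finish.
    """
    processed = 0
    for usage in updates:
        current = cost_usd(usage)
        if current > budget_usd:
            return processed, f"budget exceeded: {current:.2f} > {budget_usd:.2f} USD"
        processed += 1
    return processed, None
```

Returning the abort as an error value mirrors the commit's description of reporting a budget abort through `RunResult.error` so that it surfaces as an infrastructure failure.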
Three findings:
[P1] score preserves infrastructure failures. Rescoring re-runs
behavioural detectors only, so the prior infrastructure_failure blocker
(driver crash, missing API key, codex stub) used to be lost — a broken
run could be rewritten as headline_pass=true. _rescore_cell now reads
the prior scorecard's blockers; if any has kind=infrastructure_failure,
the detail is replayed through _mark_infrastructure_failure on the
fresh scorecard. Locked in by test_rescore_preserves_infrastructure_failure
against a codex-style stub run.
[P1] Exit criteria filter by agent. check_exit_criteria() used to group
only by variant, so cursor-static 00-no-hints rows (correctly 100 when
no rule is present) inflated the Claude no-hints baseline and broke the
+25-point uplift metric. New BEHAVIORAL_AGENTS={claude-code, codex}
gate; cursor-static rows are reported as a separate cursor_static_cells
/ cursor_static_pass_rate detail. Locked in by 4 new tests in
test_exit_criteria.py including the specific inflation case from the
review.
[P2] FsDiff persisted across rescore. _run_one_cell now writes
snapshots/{pre,post}.json (sha256 digests only — no file contents)
alongside each cell's artifacts. _rescore_cell loads them and rebuilds
the original FsDiff verbatim; when the sidecars are absent (older runs),
filesystem-dependent criteria (no_runtime_trace_synthesis) fall back to
N/A rather than over-flagging legitimate pre-existing trace files. End-
to-end verified: smoke → score reproduces an identical CSV.
Tests: 1623 passed, 4 skipped. Five new tests (1 rescore-preserves-infra,
4 exit-criteria-by-agent). Ruff clean. End-to-end smoke + rescore round
trip identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
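A digest-only snapshot of the kind persisted as snapshots/{pre,post}.json in the commit above might be built as in this sketch. The function names and the `removed` / `modified` keys are assumptions; only `added` is named in the PR text.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's repo-relative path to its sha256 digest (no contents kept)."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def fs_diff(pre: dict[str, str], post: dict[str, str]) -> dict[str, list[str]]:
    """Rebuild an added/removed/modified view from two digest snapshots."""
    return {
        "added": sorted(post.keys() - pre.keys()),
        "removed": sorted(pre.keys() - post.keys()),
        "modified": sorted(k for k in pre.keys() & post.keys() if pre[k] != post[k]),
    }
```

Because both snapshots are plain string-to-string mappings, they serialize to JSON directly and a later rescore can rebuild the same diff without access to the original workspace contents.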
Two findings:
[P1] score keeps setup-time infrastructure failures visible. When a cell crashed before the driver ran (missing archetype, workspace setup crash), _infrastructure_failure_scorecard writes scorecard.json but no redacted/. The previous _rescore_cell hit "missing redacted/" and returned None, silently dropping the broken cell from the rescored CSV. New _is_setup_time_infra_failure() + _replay_infra_only_scorecard() helpers detect this case from the prior scorecard's blockers and rehydrate the row verbatim (headline_pass forced False). Locked in by test_rescore_keeps_setup_time_infra_failure_with_no_redacted_dir.
[P2] CSV transcript_path is now repo-relative. Three call sites (the behavioural scorecard, the rescored scorecard, and the _infrastructure_failure_scorecard) set artifacts_dir from a Path that was sometimes absolute (smoke builds from _repo_root(), --out=/abs/path explicit, etc.), violating the documented schema that says the column is repo-relative under .agents-private/. New _relative_artifacts_path() helper does Path.relative_to(_repo_root()) when possible, falls back to absolute when artifacts live outside the repo. Locked in by test_artifacts_dir_is_repo_relative_in_failure_scorecard and verified end-to-end via smoke → score.
Tests: 1625 passed, 4 skipped. Two new tests. Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four findings:
[P1] Dispatcher-level scorecard redaction. Detectors read live-workspace files (shipgate.yaml, .gitignore) and copy raw values (tool names, policy entries, scope strings) into criterion.signal / blocker.detail / notes. A token written as a policy tool name leaked into scorecard.json. New aggregate.redact_scorecard_in_place() walks every text-bearing field through observer.redact.redact_string and is called from write_scorecard_json and write_csv before serialisation. That's the one choke point: no scorecard reaches disk or CSV without redaction. Locked in by test_secret_in_policy_tool_name_does_not_leak_via_signal.
[P1] --budget-usd is now authoritative. The previous flow used os.environ.setdefault then BudgetGuard.from_env, so a stale (or deliberately-high) SHIPGATE_HARNESS_BUDGET_USD silently overrode a lower CLI cap, which is unsafe for paid runs. cli.run now constructs BudgetGuard(cap_usd=budget_usd) directly. BudgetGuard.from_env was removed entirely because its existence implied an env precedence we no longer honour; operators who want env-driven caps pass the env var through the flag explicitly: --budget-usd "$VAR".
[P2] _relative_artifacts_path resolves both sides and refuses out-of-repo roots. macOS /tmp -> /private/tmp symlink discrepancy used to trigger the fallback-to-absolute branch, leaking the host path into the public CSV. Both paths are now resolve()'d before relative_to. For artifact roots genuinely outside the repo (e.g., --out=/elsewhere) the helper now raises rather than silently leaking; operators see a clear error pointing them back to .agents-private/adoption-sprint/. Tests that previously used pytest's out-of-repo tmp_path migrate to a new repo_tmp_path fixture (under .agents-private/test-runs/, cleaned up at teardown).
[P2] Filtered runs no longer fail on behavioural-only metrics. The exit-code gate now reads exit_report.details: when behavioural_cells == 0 (e.g., --agent=cursor-static), the three Claude-uplift metrics are treated as N/A. cursor_static_pass_rate gets its own gate (must be 1.0). A 12/12-passing cursor-only run now exits 0 (verified). The full-matrix path is unchanged.
Tests: 1626 passed, 4 skipped. New: one redaction-via-scorecard test, one exit-gate test for cursor-only runs, plus a shared repo_tmp_path fixture. Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
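The "resolve both sides, then refuse out-of-repo roots" behaviour described above for _relative_artifacts_path can be sketched as below. The function signature and error wording are assumptions; the real helper obtains the repository root from `_repo_root()` rather than taking it as a parameter.

```python
from __future__ import annotations

from pathlib import Path


def relative_artifacts_path(artifacts_dir: Path, repo_root: Path) -> str:
    """Return artifacts_dir relative to repo_root, or raise if it lies outside.

    Both paths are resolved first so symlinked prefixes (for example
    /tmp -> /private/tmp on macOS) compare equal.
    """
    resolved_root = repo_root.resolve()
    resolved_dir = artifacts_dir.resolve()
    try:
        return resolved_dir.relative_to(resolved_root).as_posix()
    except ValueError as exc:
        # Raising instead of falling back keeps absolute host paths out of
        # any file that is meant to be shared.
        raise ValueError(
            "artifacts directory is outside the repository; "
            "use a location under .agents-private/"
        ) from exc
```

`Path.relative_to` raises `ValueError` when the path is not under the given root, which is what the sketch relies on to detect the out-of-repo case.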
Three findings:
[P1] Empty runs exit non-zero. Previously --budget-usd=0 wrote a zero-row CSV and exited 0; a CI invocation could look green while running nothing. The run command now exits 5 when the matrix has cells but scorecards is empty after the loop (budget exhaustion before the first cell, or every cell raising and producing no scorecard). The new budget_aborted_early flag surfaces the diagnostic. Locked in by tests/harness/test_run_preflight.py::test_empty_run_exits_nonzero.
[P1] Operational doc realigned with the CLI-only budget contract. docs/adoption-harness-automated.md still said SHIPGATE_HARNESS_BUDGET_USD hard-caps cost, but round seven removed env-var support. The doc now states the CLI flag is the only knob, documents the empty-run exit-5 guard, and shows the explicit pass-through idiom for operators who prefer env-driven CI:
    python -m harness.adoption run --budget-usd "$SHIPGATE_HARNESS_BUDGET_USD"
[P1] --out preflight. _relative_artifacts_path() now rejects out-of-repo paths, but it was only called after a cell completed, so a paid Claude cell would finish, the helper would raise, and the infra-failure handler (which calls the same helper) would crash too, leaving no CSV and burning live budget. cli.run() now validates the run_dir against _relative_artifacts_path BEFORE the cell loop and exits 2 with a clear remediation message when it would fail. Locked in by test_out_of_repo_out_dir_rejected_before_any_cell, which also asserts no [1/N] cell-progress line appears.
Tests: 1629 passed, 4 skipped. Two new preflight tests in tests/harness/test_run_preflight.py spawning the real CLI process to verify production exit codes. Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
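Taken together, the exit codes introduced across these rounds (5 for an empty run, 4 for infrastructure failures, 3 for failed exit criteria, 0 otherwise) suggest a gate along the lines of the sketch below. The scorecard shape, the function name, and in particular the precedence between codes are assumptions; the commit messages do not state which code wins when several conditions hold.

```python
from __future__ import annotations


def gate_exit_code(scorecards: list[dict], exit_criteria: dict[str, bool]) -> int:
    """Map run results to a process exit code (assumed precedence: 5, 4, 3, 0)."""
    if not scorecards:
        return 5  # nothing ran, so a green result would be misleading
    if any("infrastructure_failure" in card.get("blocker_kinds", []) for card in scorecards):
        return 4  # driver crash, setup failure, missing API key, ...
    if not all(exit_criteria.values()):
        return 3  # at least one exit-criteria check is false
    return 0
```

The out-of-repo `--out` preflight (exit 2) is left out of the sketch because it happens before any cell runs and never reaches this gate.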
Three findings:
[P1] Negative-control correct-skip now scores 100. Previously, an agent
that correctly took no Shipgate action on a 60-docs-only-negative cell
scored ~20: discovers_relevance passed (+20) but runs_detect/init/scan
failed because no command ran. The runbook documents 100 as the right
score for that case. Two-part fix:
- Tightened "proposed Shipgate" to require an action (a Shipgate
command, a file_op touching shipgate.yaml or the workflow YAML, or
a new such file in fs_diff). A bare mention in the summary
("Shipgate is not relevant, skipping") no longer counts as a
false-positive proposal.
- In score_cell, when _expects_proposal returns False, every
behavioural criterion except discovers_relevance is forced to N/A.
The existing rescale rule then maps "discovery passed + everything
else N/A" to a rubric_score of 100.
Locked in by test_negative_control_correct_skip_scores_100.
[P2] score command now exits nonzero on failure. Previously `score`
always exited 0, even when replaying an infrastructure_failure row.
The exit-code policy from `run` is now factored into _gate_exit_codes
and applied to `score` too: exit 4 on infra failures, exit 3 on
exit-criteria failures, exit 5 on an empty rescore set. Locked in by
test_score_exits_nonzero_on_replayed_infra_failure (real subprocess).
[P2] claude-agent-sdk pinned with an upper bound. harness/requirements.txt
previously allowed any future >=0.1.0; the driver depends on structured
SDK event shapes that a future minor could change without any repo
diff. Now pinned to claude-agent-sdk>=0.2.82,<0.3, with a comment
documenting the manual smoke + paid-cell process needed before bumping
the lower bound. Other deps also given upper bounds (pydantic<3,
pyyaml<7, rich<15, typer<1).
Tests: 1631 passed, 4 skipped. Two new tests
(test_negative_control_correct_skip_scores_100,
test_score_exits_nonzero_on_replayed_infra_failure). Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
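The rescale rule referenced in the commit above (discovery passed plus everything else N/A maps to 100) can be illustrated with a short sketch. The 20-point weight for discovers_relevance comes from the text; the other weights and the function signature are placeholders, and returning 0 when nothing is applicable is an assumption.

```python
from __future__ import annotations

from typing import Optional


def rubric_score(criteria: list[tuple[int, Optional[bool]]]) -> int:
    """Score out of 100 over applicable criteria only.

    Each entry is (weight, passed); passed=None marks the criterion N/A.
    """
    applicable = [(weight, passed) for weight, passed in criteria if passed is not None]
    total = sum(weight for weight, _ in applicable)
    if total == 0:
        return 0  # assumption: no applicable criteria scores zero
    earned = sum(weight for weight, passed in applicable if passed)
    return round(100 * earned / total)
```

With a 20-point discovery criterion passing and the remaining 80 points marked N/A, the score is 100; if those remaining criteria instead count as failures, the same cell scores 20, which is the gap the commit describes.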
Three findings:
[P1] Text-only Shipgate proposals are now detected. Round nine
overcorrected — restricting "proposal" to commands/file_ops missed
agents that recommended Shipgate in summary without executing it. New
_summary_has_proposal() helper:
- Treats a literal `agents-shipgate <verb>` mention in the summary as
a proposal (writing it down counts).
- Otherwise, scans each summary sentence for a positive verb (add /
install / recommend / run / adopt / set up / configure / …) co-
occurring with a Shipgate mention.
- Skips sentences containing negation markers (not / skip / won't /
irrelevant / out of scope / …). "Shipgate is not relevant" stays
a non-proposal; "I recommend adding Agents Shipgate" is a proposal.
Locked in by test_text_only_proposal_is_detected and
test_skip_language_does_not_register_as_proposal.
[P2] score now truncates the rescored CSV instead of appending.
agg_mod.write_csv opens in 'a' mode because run-time aggregation
accumulates across cells. `score` reuses that helper, so calling it
twice against the same --results-csv used to duplicate rows. Fix:
unlink the output path before write_csv. Pinned by
test_score_truncates_existing_csv (real subprocess, two invocations).
[P3] benchmark/README.md cursor coverage description now matches the
matrix. Previously said "the same 24 cells" under cursor-static; the
actual matrix has 12 cursor cells targeting a different variant
subset (00 + 30 + 30/docs-only-negative). Updated to spell out the
real count and the rationale.
Tests: 1634 passed, 4 skipped. Three new tests
(test_text_only_proposal_is_detected,
test_skip_language_does_not_register_as_proposal,
test_score_truncates_existing_csv). Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
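The sentence-level heuristic described for _summary_has_proposal in the commit above might look roughly like this sketch. The verb and negation lists are abbreviated in the commit message, so the lists here are partial and illustrative; the command verbs (detect, init, scan, doctor) are the ones mentioned elsewhere in this PR, and the sentence splitter is a simplification.

```python
from __future__ import annotations

import re

# A literal CLI invocation in the summary counts as a proposal on its own.
COMMAND = re.compile(r"agents-shipgate\s+(?:detect|init|scan|doctor)\b")
SHIPGATE = re.compile(r"shipgate", re.IGNORECASE)
# Partial lists; the full sets are elided in the commit message.
POSITIVE = re.compile(
    r"\b(?:add|adding|install|recommend|run|adopt|set up|configure)\b", re.IGNORECASE
)
NEGATION = re.compile(
    r"\b(?:not|skip|skipping|won't|irrelevant|out of scope)\b", re.IGNORECASE
)


def summary_has_proposal(summary: str) -> bool:
    """Return True when the summary proposes Shipgate, even without running it."""
    if COMMAND.search(summary):
        return True
    for sentence in re.split(r"(?<=[.!?])\s+", summary):
        if not SHIPGATE.search(sentence):
            continue
        if NEGATION.search(sentence):
            continue  # "Shipgate is not relevant" stays a non-proposal
        if POSITIVE.search(sentence):
            return True
    return False
```

Checking negation per sentence rather than per summary lets a report that skips Shipgate in one sentence and recommends it in another still register as a proposal.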
Summary
- Ship the v1 executable runner under `harness/adoption/`: it drives Claude Code via the Claude Agent SDK, Cursor via static rule-content lint, with a Codex CLI stub for v2.
- Lift the "do not automate" rule in `docs/agent-adoption-harness.md` and add the operational counterpart `docs/adoption-harness-automated.md`. The 100-point rubric remains the design spec for both manual and automated runs.
- New blocker severities: `respects_manual_review`, `no_prohibited_action_overclaim`, `no_runtime_trace_synthesis`, `no_broad_scope_expansion`. All detectors read only from diffs + transcripts; the manifest schema is untouched.

Type
- (`harness.adoption` CLI + `workflow_dispatch`-only `.github/workflows/adoption-harness.yml`)
- (`ScorecardV1`; benchmark CSV bumped to v0.2)

What this changes
Runner architecture (per the planning review):
- Per-variant `overlay.yaml` drives substitution; the renderer rejects a cell before the driver runs if any `{{}}` or `CHANGE_ME` literal survives. Pinned by `tests/harness/test_overlay_renderer.py`.
- `40-shipgate-yaml` is rendered from `harness/adoption/context.py` per archetype, so the "existing manifest" cell measures activation, not template cleanup.
- `60-docs-only-negative` is a composable `negative_overlay` dimension, not a primary variant. Explicitly excluded from `40-shipgate-yaml` combinations because `docs/triggers.json` defines `force_run` for opted-in repos. Enforced in `harness/adoption/matrix.py` via `EXCLUDED_PAIRS`.
- Workspace model is `cp -r` + `git init` per cell (not `git worktree`) since archetypes are vendored directories.

Archetypes:
- `samples/mcp_only_server/`, `samples/openapi_only_agent/`, `samples/n8n_workflow_agent/` (the existing `examples/golden-prs/` entries were README-only stubs).
- `benchmark/repos/` materialized via `harness/adoption/scripts/sync_fixtures.py` (matches the existing vendoring model in `benchmark/repos/README.md`).

CSV schema v0.2 (`benchmark/results/README.md`):
- Adds `negative_overlay`, `headline_pass`, `blocker_count`, `blocker_kinds`, `agent_version`.
- `headline_pass=false` whenever any blocker fires, regardless of rubric score.

Packaging:
- `harness/` is local-only: not added to `[project.optional-dependencies]` and excluded from sdist. Operators install via `pip install -r harness/requirements.txt`.
- Manifest schema (`PolicyToolEntry`, `prohibited_actions`) deliberately untouched; that's a separate, schema-breaking PR.

CI:
- `.github/workflows/adoption-harness.yml` is `workflow_dispatch` only (no nightly cron, since Claude Opus over 24 cells is recurring spend). Inputs: `matrix_file`, `budget_usd`, `agent_filter`.
- `SHIPGATE_HARNESS_BUDGET_USD` provides a hard cap; the run writes a partial CSV and aborts cleanly when the cap is exceeded.

Privacy (commit #94 contract preserved):
- `harness/adoption/observer/redact.py` wraps `agents_shipgate.core.privacy`. Raw artifacts land in `.agents-private/<run-id>/<cell>/raw/`; redacted copies in `.../redacted/`. The scorecard and the public CSV row consume the redacted form exclusively. Pinned by `tests/harness/test_redaction.py`.

Verification
CI is authoritative for `python -m ruff check .`, `python -m compileall -q src tests`, and `python -m pytest`.
Additional local checks run:
- `python -m pytest tests/`: 1545 passed, 4 skipped (1 new skip: `tests/test_zero_install_detector.py::test_script_verdict_matches_cli[n8n_workflow_agent]`, because `tools/shipgate-detect.py` lacks n8n parity with the CLI; spawned as a follow-up task).
- `python -m harness.adoption smoke` end-to-end: good fixture → 100/100, no blockers; bad fixture → 5 blockers (`avoids_committing_reports`, `respects_manual_review`, `no_prohibited_action_overclaim`, `no_runtime_trace_synthesis`, `no_broad_scope_expansion`), `headline_pass=False`.
- `sk-proj-mock1234567890abcdef00` appears only as `[REDACTED:openai_api_key]` in `redacted/`, and is absent from `scorecard.json` and the CSV.

Release-readiness notes
- (`agents-shipgate` CLI/checks unchanged)
- `docs/checks.md` (N/A: no check ID changes)
- `STABILITY.md` (adoption scorecard is a new artifact; benchmark CSV bump documented in `benchmark/results/README.md`)

Notes for reviewers
- `main` - happy to rebase before merge.
- `conftest.py` exists so pytest finds the local `src/` and `harness/` before any stale editable install from a sibling worktree. Without it, `python -m pytest tests/harness/` can resolve `agents_shipgate` to another checkout.
- `docs/adoption-harness-automated.md`.

🤖 Generated with Claude Code