feat(core): per-test regression as default exit for agentv compare; add --update-baseline flag #1158

@christso

Description

Part of #1155.

Objective

Fix an outlier behavior in AgentV: the pairwise mode of agentv compare currently derives its exit code from the mean delta (determineExitCode(comparison.summary.meanDelta) at apps/cli/src/commands/compare/index.ts:259,532). The industry convention (Jest, Betterer, Braintrust, Langfuse) is per-test regression detection: any test that drops below baseline - threshold fails CI. Mean aggregation can hide a serious per-test drop behind unrelated improvements in other tests.

Additionally, add --update-baseline (short -u) so the regression-gating workflow has a write-back-on-success step. This completes the loop against AgentV's existing sidecar <eval>.baseline.jsonl convention.

Design latitude

Breaking change: default exit logic

  • Pairwise mode (two positional files, or single-manifest --baseline/--candidate): exit 1 if any per-test score in candidate drops below baseline_score - threshold. Remove the mean-delta exit path.
  • Matrix / N-way mode (N targets in one manifest, no --baseline): unchanged. Per-test regression detection doesn't naturally apply when each test has N scores, one per target.
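The pairwise rule above can be sketched as follows. This is a minimal illustration, not AgentV's implementation: the Scores type, findRegressions, and pairwiseExitCode are hypothetical names, and skipping tests that appear in only one file is an assumption, since the issue does not say how unmatched tests should be treated. The toy data shows a case where the mean delta is positive while one test regresses.

```typescript
// All names here are hypothetical; AgentV's real types and helpers may differ.
type Scores = Map<string, number>; // test id -> score

interface Regression {
  testId: string;
  baseline: number;
  candidate: number;
}

// Collect every test whose candidate score falls below baseline - threshold.
// Assumption: tests missing from the candidate are skipped.
function findRegressions(baseline: Scores, candidate: Scores, threshold: number): Regression[] {
  const regressions: Regression[] = [];
  baseline.forEach((baseScore, testId) => {
    const candScore = candidate.get(testId);
    if (candScore === undefined) return;
    if (candScore < baseScore - threshold) {
      regressions.push({ testId, baseline: baseScore, candidate: candScore });
    }
  });
  return regressions;
}

// Exit 1 if any single test regressed, regardless of the mean.
function pairwiseExitCode(baseline: Scores, candidate: Scores, threshold: number): 0 | 1 {
  return findRegressions(baseline, candidate, threshold).length > 0 ? 1 : 0;
}

// Toy eval: two tests improve, one drops sharply.
const toyBaseline: Scores = new Map([["t1", 0.5], ["t2", 0.5], ["t3", 1.0]]);
const toyCandidate: Scores = new Map([["t1", 1.0], ["t2", 1.0], ["t3", 0.5]]);

// Mean delta, computed the way a mean-based gate would see the run.
let deltaSum = 0;
toyBaseline.forEach((baseScore, testId) => {
  deltaSum += (toyCandidate.get(testId) ?? baseScore) - baseScore;
});
const meanDelta = deltaSum / toyBaseline.size;

console.log("mean delta is positive:", meanDelta > 0);
console.log("per-test exit code:", pairwiseExitCode(toyBaseline, toyCandidate, 0.1));
```

In this toy run a mean-based gate would see an overall improvement, while the per-test check flags t3 and returns exit code 1.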

New flag: --update-baseline / -u

  • Only meaningful in pairwise mode with a file-path first positional.
  • On success (no regression): overwrite the first positional with the candidate's scores (the contents of the second positional).
  • On failure: do not touch the baseline file.
  • Fail with a clear error if used in matrix mode or if the first positional is not a writable path.
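One possible shape for the write-back step is sketched below. It is illustrative only: updateBaseline, CompareMode, and the regressionCount parameter are hypothetical, the JSONL field names in the demo are made up, and validating the flag before looking at the comparison result is a design assumption rather than something the issue specifies.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical mode tag; AgentV's internal representation may differ.
type CompareMode = "pairwise" | "matrix";

// Returns true if the baseline file was overwritten, false if it was left alone.
function updateBaseline(
  mode: CompareMode,
  baselinePath: string,
  candidatePath: string,
  regressionCount: number,
): boolean {
  if (mode !== "pairwise") {
    throw new Error("--update-baseline is only supported in pairwise mode");
  }
  try {
    // Throws if the path is missing or not writable by this process.
    fs.accessSync(baselinePath, fs.constants.W_OK);
  } catch {
    throw new Error(`--update-baseline: not a writable path: ${baselinePath}`);
  }
  // On failure, do not touch the baseline file.
  if (regressionCount > 0) return false;
  // On success, the candidate's contents become the new baseline.
  fs.copyFileSync(candidatePath, baselinePath);
  return true;
}

// Demo against temporary files (field names are illustrative only).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "agentv-demo-"));
const demoBaseline = path.join(demoDir, "my-eval.baseline.jsonl");
const demoCandidate = path.join(demoDir, "my-eval.jsonl");
fs.writeFileSync(demoBaseline, '{"test_id":"t1","score":0.5}\n');
fs.writeFileSync(demoCandidate, '{"test_id":"t1","score":0.75}\n');

const wroteOnFailure = updateBaseline("pairwise", demoBaseline, demoCandidate, 1);
const wroteOnSuccess = updateBaseline("pairwise", demoBaseline, demoCandidate, 0);
console.log(wroteOnFailure, wroteOnSuccess);
```

Checking the mode and writability up front means a misused flag is reported even when the comparison itself would have failed, which seems closest to the "error clearly" requirement.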

Resulting CLI

# Pairwise, per-test exit by default (behavior change from today):
agentv compare baseline.jsonl candidate.jsonl

# Regression-gating workflow against a sidecar baseline:
agentv compare evals/my-eval.baseline.jsonl runs/latest/my-eval.jsonl --update-baseline

No --ratchet, --snapshot, --per-test, or --baseline-file flags are added: positional ordering establishes baseline vs. candidate, per-test is the default, and --update-baseline is the only new flag.

Acceptance signals

  • Pairwise mode catches an injected single-test regression on a toy eval (previously mean-delta would have let it through).
  • Pairwise mode still exits 0 when all per-test deltas are within threshold.
  • Matrix mode exit behavior unchanged.
  • --update-baseline writes candidate scores to the first positional only on success.
  • --update-baseline errors cleanly in matrix mode or with non-writable first positional.
  • Release notes / changelog call out the default behavior change for anyone relying on mean-delta exit semantics.

Non-goals

  • Not a hosted dashboard or visualization layer.
  • Not a remote baseline store — sidecar file + git remains the persistence model.
  • Not automatic sidecar resolution (matching result files to their sidecar baselines in a directory) in this issue; that can be a follow-up if users need it.

Labels: core (anything pertaining to core functionality of AgentV), enhancement (new feature or request)