Skip to content

feat: lapp MCP server v0.2 — editor, grep, benchmark, distribution#1

Merged
php-workx merged 72 commits into
mainfrom
feat/mcp-server-v0.2
Apr 17, 2026
Merged

feat: lapp MCP server v0.2 — editor, grep, benchmark, distribution#1
php-workx merged 72 commits into
mainfrom
feat/mcp-server-v0.2

Conversation

@php-workx
Copy link
Copy Markdown
Owner

@php-workx php-workx commented Apr 15, 2026

Summary

All development since the initial v0.1.0 skeleton, bringing lapp from a basic hash-line MCP server to a production-grade file editing tool with benchmarking.

Core features

  • Editor: precise line-anchored editing with indentation preservation, multiline block lookup/replace (lapp_replace_block), BOM round-trip, no-op detection, atomic rename with Windows fallback, concurrent lock handling
  • Grep: fixed-string (literal=true) and regex search with punctuation tolerance
  • CLI: --root, --limit, --block, --allow, --only-tools, --version flags
  • Instructions: lapp-tools.md routing policy for CLAUDE.md and OpenCode instructions[]

Benchmark

  • SWE-bench A/B harness for token-efficiency measurement
  • Multi-model parallel runs with per-model result dirs
  • V1 and V2 suite definitions with metric summaries
  • OpenCode runner integration

Distribution spec

  • specs/distribution_v1.md: comprehensive setup tool design covering 7+ agent targets, per-client config formats, skill installation, and instructions injection

Test plan

  • Verify go test ./... passes
  • Run benchmark suite against at least one model
  • Confirm MCP server starts with go run ./cmd/lapp and responds to tool calls
  • Check that lapp_grep literal=true handles punctuation correctly
  • Verify lapp_replace_block preserves indentation on multiline replacements

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Structured, limited grep (text/literal/structured), multiline block find/replace/insert/apply helpers, CLI flag to expose only selected tools, end-to-end benchmark tooling (prepare/fetch/run/report) with V1/V2 suites.
  • Bug Fixes

    • Preserve BOM and literal \n, fixed file-write permissions and clearer Windows write errors, validate out-of-range anchors, no-temp behavior for no-op edits, preserve indentation on replacements.
  • Documentation

    • Expanded architecture, plans, usage guides, tool policy, and benchmark specs.
  • Tests

    • Expanded integration/unit coverage, version-flag test, CI workflow, pre-commit/pre-push hooks.
  • Chores

    • Lint/format configs, toolchain/tool manifests, gitleaks/golangci/ruff configs.

php-workx added 30 commits April 4, 2026 07:48
Security (P0/P1):
- handleGrep: validate path param via EvalSymlinks+root prefix check before
  WalkDir; prevents arbitrary filesystem traversal outside --root
- handleGrep: per-file CheckPath in WalkDir callback enforces block list and
  symlink resolution on every matched file (.env, *.key etc. now silently skipped)
- handleGrep: use fileio.ReadFile instead of os.ReadFile so BOM is stripped
  before hashing; grep refs now match lapp_read/lapp_edit refs exactly
- handleWrite: temp file now uses random 8-hex-char suffix (fileio.RandomHex)
  and O_EXCL|0600 per §9.1; prevents collision on case-insensitive filesystems

Correctness (P2):
- editor.ApplyEdits: self-correct now only triggers when model supplied at least
  one non-empty ref that fails ParseRef (hasAnyRef && !hasValidHashlineRef);
  structurally empty edits now correctly return ERR_INVALID_EDIT
- fileio: add ERR_PERMISSION_DENIED constant; ReadFile and WriteFile now map
  os.IsPermission errors to ErrPermissionDenied instead of ErrFileNotFound/
  ErrWriteFailed
- fileio.CleanupOrphans: removes *.lapp.tmp files older than 5 minutes on
  startup; called from main.go before server.New (§9.1)

Maintenance (P3):
- go mod tidy: remove // indirect annotation from mcp-go direct dependency
- Export fileio.RandomHex (was unexported randomHex)

Tests (+9):
- TestGrepPathOutsideRoot, TestGrepBlockedFile, TestGrepBOMFileHashConsistency
- TestWriteAtomicNoOrphan, TestReadFilePermissionDenied
… document Note deviation

- lap-wce3: lower go.mod minimum from 1.25.0 to 1.24 (spec §2 requires 1.24+)
- lap-y20r: add lapp_grep to stderr hint and README CLAUDE.md entry
- lap-jnt8: remove unused ReadRequest/WriteRequest types from types.go
- lap-8csq: document SelfCorrectResult.Note always-set as known stateless deviation
…nsert_before, ERR_LINE_OUT_OF_RANGE

- lap-ccwb: NormalizeNewlines now only replaces literal \n when content has
  zero real newlines; preserves intentional backslash-n in code content
- lap-sl1c: validateOne rejects BOF (0:) and EOF (EOF:) anchors on insert_before
  per §6.1; those anchors are insert_after-only
- lap-1u6m: verifyHashes distinguishes out-of-range refs (OutOfRange=true) from
  hash mismatches; ValidateEdits returns ERR_LINE_OUT_OF_RANGE when any ref
  targets a non-existent line, taking priority over ERR_HASH_MISMATCH

Tests: +14 new test functions covering all four spec cases per ticket
…on message

- lap-1kgr: handleGrep capped at 100 files and DefaultLimit output lines;
  returns truncation note with count when cap is reached
- lap-frjk: handleWrite chmod temp file to 0644 before rename so new source
  files have conventional permissions (was 0600, overly restrictive)
- lap-srw6: split WriteFile rename into renameAtomic() with build-tag files;
  rename_windows.go detects ERROR_SHARING_VIOLATION and returns actionable
  message per §9.1; rename_unix.go keeps existing POSIX behavior
…ersion flag

- lap-u5rj: TestBOMRoundTrip — read BOM file, edit line 2, confirm BOM bytes
  still at offset 0 and content updated correctly
- lap-pixp: TestNoOpLeavesNoTempFile — no *.lapp.tmp files created when
  all edits produce identical content (ERR_NO_OP short-circuits before write)
- lap-dvpt: TestConcurrentEditsSerializedByLock — two goroutines edit same
  file concurrently; file remains coherent (no corruption)
- lap-2gb3: TestVersionFlag — subprocess test builds binary, runs --version,
  verifies non-empty stdout and exit 0
- parseHashRef in server_test.go: tighten regex from [A-Z]{2} to
  [ZPMQVRWSNKTXJBYH]{2} (spec alphabet only) — was silently accepting invalid
  hash chars in every round-trip test
- minInt helper: remove redundant function; use built-in min (available since
  Go 1.21, well within our go 1.24 minimum)
- BuildSelfCorrectResult: rename local var cap→displayLines to avoid shadowing
  the built-in cap() function
- handleGrep: use fileio.Err* constants (not editor.Err*) for path errors in
  the grep handler — all other handlers already used fileio.* consistently
- handleGrep: count separator '...' lines toward outputLines cap so effective
  response size cannot exceed DefaultLimit regardless of gap density
- WriteFile: add os.IsPermission checks to Stat and OpenFile failure paths,
  returning ErrPermissionDenied instead of opaque ErrWriteFailed
- handleWrite: check os.IsPermission on createErr and writeErr; return
  ErrPermissionDenied instead of raw 'cannot write file: ...' strings
… LCS guard, ctxLines cap, test hardening

- ERR_LINE_OUT_OF_RANGE: collect all out-of-range line numbers and report
  them together so the model can fix the entire batch in one retry
- FormatMismatchError: handle out-of-range entries with actionable guidance
  instead of an empty remapping entry
- generateDiff: guard LCS against O(n×m) memory blowup on files >5000 lines;
  fall back to positional heuristic and omit diff body
- RenameAtomic: export (rename_unix.go, rename_windows.go, fileio.go) so
  handleWrite can reuse it — Windows sharing-violation message now fires on
  lapp_write paths too, not just WriteFile paths
- handleGrep: cap ctxLines at 10 to prevent context-line explosion
- server_test: add extractTextNoFail for goroutine-safe result extraction;
  harden TestNoOpLeavesNoTempFile WalkDir error handling; rewrite
  TestConcurrentEditsSerializedByLock with full read→edit cycle and
  explicit content coherence check
…urement

Three scripts + a placeholder instances file:

  benchmark/select.py   — filter SWE-bench Verified (HuggingFace) to a small
                          set of single/dual-file patch instances; writes
                          instances.json (run once, result committed)

  benchmark/run.py      — for each instance: clone repo at base_commit, run
                          Config A (standard Read/Edit/Write tools) then Config B
                          (lapp_read/lapp_edit/lapp_write/lapp_grep) via the
                          claude CLI; write results/{instance_id}.json

  benchmark/report.py   — read results/ and print comparison table: output
                          tokens, input tokens, turns, patch similarity, cost

Measurement approach:
  - output tokens is the primary metric (lapp's core claim)
  - input tokens tracked separately (lapp adds hashline overhead on reads)
  - patch similarity (difflib on changed lines) is a proxy for correctness;
    low score != wrong fix — agent may take a different valid approach
  - Bash-based edits bypass tool accounting; noted in report footer

Usage:
  pip install datasets
  python benchmark/select.py          # populate instances.json (once)
  python benchmark/run.py             # A/B runs (~10 min per instance)
  python benchmark/report.py         # comparison table
…ling, stdin prompt

- work_dir.resolve() — /var/folders/... is a symlink to /private/var/...; lapp
  rejects paths that don't match its EvalSymlinks'd root (same bug as pm-20260404-002)
- capture_diff: use rglob to recurse into subdirectories; diff --label for clean output
- MAX_TURNS: 10→20 to give lapp enough turns for read+hash+edit sequences
- error_max_turns: preserve token counts instead of discarding as a hard error
- prepare.py: BENCHMARK_IDS env var support; fix duplicate FILES_DIR.mkdir
- fetch_instances.py renamed from select.py (shadowed stdlib select module)
- run.py: prompt via stdin to avoid Commander.js variadic --allowedTools consuming it
Without grep, Config B was forced to paginate large files (lapp_read 2000
lines at a time) to locate the target line. This inflated turns 3-4x on
large files and dominated the token comparison unfairly.

With grep: model finds the exact LINE#HASH in 1 call, edits in 1 call.
matplotlib: 9 turns → 4, output tokens +92% → -27%.
Total output tokens across 3 instances: essentially parity (+0.3%).
- Remove @morphllm/opencode-morph-plugin from ~/.config/opencode/opencode.json
  and its instructions reference; uninstalled the npm package
- Add AGENT=opencode support to run.py:
  - _oc_config(): merges lapp MCP into existing opencode config via
    OPENCODE_CONFIG_CONTENT (instead of replacing it — the root cause of
    the earlier broken test)
  - run_opencode(): parses tool_use/step_finish event stream; aggregates
    tokens across turns; records tools_used for adoption analysis
  - run_instance(): dispatches to run_opencode or run_claude based on AGENT
  - OPENCODE_MODEL env var selects model (default: opencode/gpt-5-nano)
  - OC-specific prompts use lowercase tool names and lapp_lapp_* MCP names
- Add import copy for config deep-merge
Mirrors the Morph plugin pattern: an always-on routing policy loaded via
the OpenCode 'instructions' array that tells the model when and how to
use lapp tools vs native tools.

Decision table covers: file size thresholds, targeted vs full-context edit
workflows, new-file creation, hash mismatch recovery, fallback policy, and
anti-patterns.

To enable in OpenCode, add to ~/.config/opencode/opencode.json:
  {
    "instructions": ["/path/to/lapp/instructions/lapp-tools.md"]
  }

Finding: routing works correctly with the instruction loaded.
gpt-5-nano is too small to reliably execute the grep→edit workflow;
a capable model (Sonnet, GPT-4o) is needed to measure net token impact.
The old prompts gave the agent the exact unified diff to apply.
This disadvantages lapp (must still read for hashes while native Edit
can use old_string straight from the diff) and is unrealistic (agents
receive task descriptions, not patches).

New design:
- prepare.py writes _task.txt alongside _patch.diff (SWE-bench
  problem_statement — the real GitHub issue text)
- build_prompt() reads _task.txt; _patch.diff retained for scoring only
- All 4 prompt templates (A/B × claude/opencode) use {task} not {diff}
- Agent must read the file first in both configs — equal footing
- The only measurement difference is now the edit step itself:
  old_string reproduction (A) vs. hash anchor (B)

Also corrects the 8K instruction overhead claim: the instruction file is
~750 tokens, not 8K. Higher totals in prior comparison came from
additional turns accumulating context, not instruction file size.
- lapp-tools.md: 3288→940 chars, ~750→210 tokens. Same routing rules,
  no prose. Decision table + two workflow patterns + four rules.

- Minimax-m2.7:cloud results (OpenCode, real-world task prompts):
  astropy: A=2403 B=1278 (-46.8%) — lapp wins clearly, 100% sim vs 4%
  django:  A=4388 B=9113 (+107.7%) — lapp struggles on complex regex,
           minimax takes 10 turns with repeated grep/read cycles

  astropy result confirms lapp's advantage on capable models.
  django result shows model difficulty with regex content in hash matching.
The benchmark exposed that lapp_grep fails when the search term itself
contains regex metacharacters — e.g. searching for the string
'(?:\S+(?:\S*)?@)?' causes the regex engine to interpret the parens,
backslashes, and quantifiers as regex syntax rather than literal chars.

Fix: add a boolean 'literal' parameter. When true, the pattern is escaped
with regexp.QuoteMeta() before compilation, producing a fixed-string match
identical to 'grep -F'. Default false preserves existing behaviour.

Changes:
- server.go: register 'literal' param; apply QuoteMeta in handleGrep
- server_test.go: callGrepLiteral helper + two tests:
  TestGrepLiteralMatchesRegexMetachars — finds a line whose content is
    itself a regex (parens, \S+, ?, @)
  TestGrepLiteralInvalidRegexPattern — an invalid regex accepted as literal
- instructions/lapp-tools.md: add literal=true row to decision table and
  code-search workflow pattern (~266 tokens total, up from 210)
- run.py: RESULTS_SUBDIR env var routes results to results/<subdir>/
- report.py: --dir <name> flag selects results subdir; RESULTS_SUBDIR also works

Results with 3 ollama-cloud models on astropy + django (real-world task prompts):

  glm-5.1:       -47.0% output tokens overall — best result; clean 3-call
                 workflow (grep→read→edit) on astropy in 4 turns each
  minimax-m2.7:  +32.4% on astropy (mixed lapp+native); django Config A timed out
  qwen3-coder:   -77.3% on astropy (lapp wins) but Config B timed out on django;
                 qwen uses 22 turns on Config A (very chatty), bash calls present
Timing:
- run_opencode: wall_ms (subprocess wall time), turns_ms (per-turn
  duration from step_start/step_finish event timestamps)
- run_claude: wall_ms from duration_ms in JSON output; turns_ms=[] since
  stream-json is not used
- _err(): wall_ms=0, turns_ms=[] for error records
- report.py: A/B wall and t/turn columns; --dir flag for subdir selection

Task redesign:
- parse_patch_change(diff): parses unified diff into [{file,old,new}] —
  the exact content to find and the exact replacement
- build_prompt(): uses _patch.diff (not _task.txt); constructs a targeted
  'find this exact content, replace with this' prompt
- _indent(): helper for readable prompt formatting
- All 4 prompt templates simplified to {filepath} + {changes}

Result with glm-5.1 on targeted prompts:
  Both instances: 3 turns each (read + edit + verify), files correctly changed
  A wall: 52-57s total  A t/turn: ~14s (dominated by one slow API call)
  B wall: 10-21s total  B t/turn: 0.5-0.8s (grep+edit very fast)
  The timing data reveals the bottleneck: one 42s turn in Config A is the
  model generating a large old_string for the edit call.
- TIMEOUT 300s -> 120s per config; >2 minutes for a targeted single-file edit is benchmark-fail
- Add fetched fixture files for astropy__astropy-13033 and pydata__xarray-3151 so the multiline benchmark slice is reproducible
Models were copying refs directly from lapp_read/lapp_grep output in the form
LINE#HASH:CONTENT and then sending anchors like  into lapp_edit.
The editor parser expected  exactly, so ApplyEdits classified these as
having no valid hashline refs and returned SELF_CORRECT, causing retry loops
and fallback to native read/edit in devstral and kimi traces.

Fix:
- add normalizeRef() in internal/editor/editor.go
- trim one trailing colon from refs that contain a hash component
- keep special anchors  and  unchanged
- apply normalization in:
  - ApplyEdits self-correct detection
  - validateOne() insert-after/insert-before anchor parsing
  - parseAddressing() anchor/start/end parsing
  - hash verification now uses normalized refs via pe.edit

Tests:
- TestRoundTrip_ReadEditAcceptsColonSuffixedAnchor
- TestRoundTrip_ReadEditAcceptsColonSuffixedRangeRefs
- full ok  	github.com/lapp-dev/lapp/cmd/lapp	(cached)
ok  	github.com/lapp-dev/lapp/internal/editor	(cached)
ok  	github.com/lapp-dev/lapp/internal/fileio	(cached)
ok  	github.com/lapp-dev/lapp/internal/server	(cached)
ok  	github.com/lapp-dev/lapp/pkg/hashline	(cached) passes
Root cause from kimi trace:
- model correctly set literal=true
- but still regex-escaped punctuation in the search term, e.g.
  cright\[-right\.shape\[0\]:, -right\.shape\[1\]:\] = 1
- literal grep searched for a string containing backslashes that does not
  exist in the file, missed, then broadened the search and burned extra turns

Fix:
- add literalVariants() in internal/server/server.go
- for literal=true, try two fixed-string variants per line:
  1. exact pattern
  2. punctuation de-escaped from regex style (\[ -> [, \. -> ., etc.)
- keep existing exact literal behavior; the de-escaped variant only helps when
  the exact literal does not match file content

Tests:
- TestGrepLiteralMatchesRegexEscapedPunctuation covers the exact kimi failure
- full go test ./... passes

Impact:
- kimi astropy B-side: 7 turns -> 3, 55s -> 38s
- kimi overall: +84.9% output -> -6.5% output with 100% similarity on both tasks
- timeout: 120s per config is benchmark-fail for targeted single-file edits
- multiline B prompts now explicitly require start/end range replacement
  instead of single-anchor replace
- build_prompt now emits first-line/last-line markers for multiline changes
  and handles insertion-only/deletion-only hunks without crashing
- qwen480 removed from active result set by operator decision (not tracked)
The stricter benchmark prompts ("exactly one grep/edit", "do not drop the hash")
backfired on qwen-next and minimax, increasing turns, output tokens, and
error rates. Restore the simpler prompt regime while keeping the useful
"Do not explain your steps" instruction.

Also removed stale local result dirs qwen-next-v3 and minimax-v4 (not tracked).
Establish a real benchmark contract in code instead of ad-hoc prose:

- benchmark/v1_suites.json
  - versioned suite manifest
  - official V1 suites:
    * targeted-singleline
    * targeted-multiline-pure-replacement
  - explicit primary metrics:
    correctness_similarity, wall_ms, num_turns, output_tokens, input_tokens

- benchmark/run.py
  - suite-aware runner via --suite <name>
  - stores benchmark_version, suite, agent, and model metadata in result JSON
  - stores correctness_similarity per side alongside output/input/wall/turns

- benchmark/v1_report.py
  - summarizes the four first-class metrics side-by-side per model and suite
  - no weighted score; correctness, wall time, turn count, and token count are
    all reported as primary benchmark outputs
  - falls back to diff-vs-reference similarity when older results do not yet
    store correctness_similarity

This moves the benchmark from exploratory slices toward a robust V1 framework
that can be rerun consistently across harnesses and models.
- run.py now writes results to canonical paths:
  benchmark/results/<suite>/<agent__model-slug>/
  instead of ad-hoc flat directories with -v2/-v3 suffixes
- adds helpers for model slugging and canonical results directory selection
- v1_report.py now discovers nested suite/model result dirs recursively
  and ignores benchmark/results/_legacy/
- V1 summary now prints the actual model identifier from result metadata
  rather than legacy directory names

This removes the confusing exploratory naming from the V1 benchmark view while
preserving old ad-hoc result dirs under a legacy bucket.
Tooling:
- add lapp_find_block for exact multiline block lookup returning start/end refs
- add lapp_grep structured output mode (format=text|structured)
- accept full display-line refs in lapp_edit normalization
- improve self-correct messaging for missing #HASH refs
- keep existing colon-suffixed ref and regex-escaped literal grep fixes

Tests:
- structured grep output tests
- multiline find_block tests
- full display-line ref acceptance tests
- explicit self-correct message test
- full go test ./... passes

Benchmark:
- add grep-format variant plumbing (metadata + prompt support + canonical result path suffix)
- canonical V1 result layout by suite/model
- v1 report improved to human-readable table with first-class metrics side by side
- official V1 multiline suite fetched and run under canonical paths
- saved detailed implementation plan under docs/plans/

Validation:
- go test ./...
- python3 -m py_compile benchmark/run.py benchmark/v1_report.py benchmark/prepare.py
- python3 benchmark/v1_report.py
Minimax was still underperforming after the ref/grep fixes because it often
produced the correct replacement text with the wrong leading whitespace on the
first lapp_edit call, then had to read back and repair the line in a second
edit turn.

Fix:
- in editor.applyOne, when applying a single-line EditReplace on a single
  anchored line, preserve the original line's leading indentation and apply it
  to the replacement text
- this keeps the edit local to the anchored line and prevents accidental block
  shifts caused by model-generated whitespace drift

Test:
- TestApplyEdits_SingleReplacePreservesIndentation
- full go test ./... passes

Benchmark impact (official V1 targeted-singleline, Minimax):
- B-side correctness: 75% -> 100%
- output: 1778 -> 613 (-65.5% vs prior broken run; now -21.6% vs A)
- input: 230399 -> 119492 (-48.1% vs prior broken run; now -39.2% vs A)
- wall: 52s -> 41s (and native A no longer dominates the comparison)
Follow-up on multiline Minimax investigation:
- lapp_find_block now supports normalize_whitespace (default true) so block
  matching tolerates shared indentation shifts instead of requiring exact
  leading whitespace
- benchmark multiline prompts now tell models to use lapp_find_block with
  normalize_whitespace=true for multiline changes
- normalizeRef now accepts LINE:HASH as a model separator mistake and
  normalizes it to LINE#HASH
- add server regressions for:
  - whitespace-normalized block matching
  - colon-separated hash refs (e.g. 245:BS)

Validation:
- go test ./...
- python3 -m py_compile benchmark/run.py benchmark/v1_report.py benchmark/prepare.py

Impact:
- these fixes remove additional formatting friction for multiline workflows,
  but Minimax still times out on the official multiline xarray-3095 case,
  indicating the remaining issue is broad edit planning / over-search rather
  than a simple protocol mismatch.
…ement

Adds a higher-level multiline edit primitive that composes exact block lookup
with one range replacement inside lapp, reducing the model's need to choreograph
find_block + start/end edit calls manually.

Changes:
- register lapp_replace_block in server
- implement handleReplaceBlock(path, old_content, new_content, normalize_whitespace)
- reuse shared block matching helper findBlockRanges()
- lapp_find_block now defaults to normalize_whitespace=true
- multiline benchmark prompts now prefer lapp_replace_block / lapp_lapp_replace_block
- add tests for:
  - exact block replacement
  - whitespace-normalized block replacement
  - ambiguous block match error

Validation:
- go test ./internal/server -run 'Test(ReplaceBlock|FindBlock)' -v
- go test ./...

Benchmark effect:
- official V1 multiline rerun shows improved xarray-3151 behavior for several
  models (Gemma, Kimi, Minimax) but xarray-3095 still times out broadly,
  indicating a remaining hard benchmark case rather than a simple tool-protocol bug.
Remove pydata__xarray-3095 from the official V1 multiline suite.

Rationale:
- despite being a real-world case, it is a broad multi-region refactor that
  frequently times out even on native-tool baselines
- it does not fit the intended scope of targeted-multiline-pure-replacement
- keeping it in V1 was conflating tool-efficiency benchmarking with larger
  semantic planning/refactor ability

The official V1 multiline suite now contains the clean replacement case:
- pydata__xarray-3151

xarray-3095 remains useful as a future harder suite candidate, but is no
longer part of the canonical V1 benchmark.
Follow-up on Minimax/Gemma/GLM investigation.

Changes:
- lapp_replace_block now rebases the new block onto the old block's shared
  base indentation while preserving relative internal indentation
- this is the multiline analogue of the single-line indentation preservation
  fix that previously rescued Minimax on the single-line suite
- benchmark V1 multiline suite now includes a second clean canonical case:
  matplotlib__matplotlib-20676
- add fixture files for the new multiline case
- add stronger replace_block normalization test covering nested indentation

Validation:
- go test ./internal/server -run 'Test(ReplaceBlock|FindBlock)' -v
- go test ./...

Benchmark impact:
- Minimax official multiline suite: 1/2 -> 2/2 complete, B correctness 12% -> 94%,
  wall 22s -> 43s total for 2 cases, output 373+400 -> 999 total
- Gemma official multiline suite: B correctness improves materially on xarray-3151,
  but matplotlib case still times out
- GLM official multiline suite remains unstable and times out on both cases
php-workx and others added 2 commits April 16, 2026 09:48
Dev tool dependencies (golangci-lint v2, x/sys v0.42+) require
Go 1.25+, so go.mod cannot be lowered to 1.24. The spec's Go 1.24
floor is met for building the binary, but dev tooling needs 1.25+.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI's govulncheck reported 15 vulnerabilities in the Go 1.25.0
standard library (crypto/x509, crypto/tls, encoding/asn1, etc.),
all fixed in go1.25.2 through go1.25.9. Adding `toolchain go1.25.9`
makes actions/setup-go install the patched version via
go-version-file, resolving all findings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (3)
benchmark/prepare.py (1)

90-93: ⚠️ Potential issue | 🟠 Major

Don't treat any non-empty fixture directory as complete.

If a previous fetch failed after writing only some files, this branch marks the instance as "already fetched" on the next run. benchmark/run.py then benchmarks against incomplete fixtures, and _patch.diff / _task.txt can be regenerated on top of that partial state. Gate this early return on a completeness check or a success sentinel written only after the full instance is prepared.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/prepare.py` around lines 90 - 93, Currently prepare.py treats any
non-empty dest_dir (FILES_DIR / iid) as "already fetched"; change this to check
a success sentinel or explicit completeness marker instead. Update the
early-return in the fetch/prep logic to only return when dest_dir exists AND a
sentinel file (e.g., ".complete" or "FETCH_SUCCESS") is present; ensure the code
that finishes preparing an instance writes that sentinel atomically (write temp
file then rename) after all files are written. Refer to dest_dir, FILES_DIR and
iid in prepare.py and add the sentinel creation at the end of the successful
prepare routine and the sentinel check in the beginning to avoid false positives
from partial fetches.
benchmark/run.py (1)

973-984: ⚠️ Potential issue | 🟠 Major

LAPP_STRATEGY is still effectively V2-only.

This branch only resolves the strategy override for version == "v2", so ad-hoc/V1 runs still fall back to the grep defaults in run_instance(). As a result, LAPP_STRATEGY=native-edit or lapp-replace-block does not do what the interface advertises outside V2, and the non-V2 native-edit path still unnecessarily depends on lapp being installed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/run.py` around lines 973 - 984, The LAPP_STRATEGY override is only
resolved for suite_version == "v2", so non-V2 runs still use grep defaults and
may incorrectly require the lapp binary; fix by resolving the strategy
unconditionally and wiring it into run_instance. Change the v2_strategy/v2-only
logic to compute a single `strategy` (use or extend resolve_v2_strategy or add a
new resolve_strategy that validates against V2_STRATEGIES and the allowed non-V2
options), replace uses of `v2_strategy` with `strategy`, and update the
`needs_lapp` computation (currently using `lapp_binary` and `needs_lapp`) so
that when `strategy == "native-edit"` or other pure-Python strategies
`needs_lapp` is false; ensure run_instance consumes this `strategy` instead of
falling back to grep defaults.
internal/server/server.go (1)

677-695: ⚠️ Potential issue | 🟠 Major

Structured grep can still exceed the match cap inside one file.

The pre-check only stops traversal before the next file. If the current file has 10k matches, this loop appends all 10k entries to structuredMatches before WalkDir exits, so maxStructuredMatches is not actually enforced.

Possible fix
 		if format == fmtStructured {
 			for _, m := range matches {
+				if len(structuredMatches) >= maxStructuredMatches {
+					return errCapReached
+				}
 				before := []string{}
 				after := []string{}
 				for k := max(1, m-ctxLines); k < m; k++ {
 					before = append(before, hashline.FormatLine(lines[k-1], k))
 				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/server.go` around lines 677 - 695, The structuredMatches
slice can grow beyond maxStructuredMatches while iterating matches for a single
file; inside the loop over matches add a guard that checks
len(structuredMatches) >= maxStructuredMatches and stops appending further
grepStructuredMatch entries (and breaks out of the matches loop), and ensure the
WalkDir traversal also stops early (e.g., return a sentinel error or set a flag
the outer walker checks) so no more files are processed once
maxStructuredMatches is reached; key symbols to modify: the for _, m := range
matches loop, structuredMatches, maxStructuredMatches, grepStructuredMatch, and
the surrounding WalkDir/visitor handling so the cap is enforced globally.
🧹 Nitpick comments (3)
specs/distribution_v1.md (1)

39-52: Add language specifiers to fenced code blocks.

Several code blocks in this spec lack language specifiers (lines 39, 75, 168, 176, 406, 412, 629, 647, 939). Adding them improves syntax highlighting and accessibility.

Example fix
-```
-npx -y `@morphllm/morph-setup`
+```text
+npx -y `@morphllm/morph-setup`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@specs/distribution_v1.md` around lines 39 - 52, The fenced code blocks in
specs/distribution_v1.md are missing language specifiers (e.g., the block
starting with "npx -y `@morphllm/morph-setup`" and other blocks noted at lines 39,
75, 168, 176, 406, 412, 629, 647, 939); update each triple-backtick fence to
include an appropriate language tag (use "text" for plain CLI/psuedo-output,
"bash" or "sh" for shell commands, or other specific languages where applicable)
so the blocks become ```text or ```bash (etc.), ensuring each opening fence
matches a closing fence. Include the same language tag for multiline examples
and keep content unchanged aside from adding the specifier.
.github/workflows/ci.yml (1)

89-89: Inconsistent govulncheck invocation with justfile.

The CI uses the full module path golang.org/x/vuln/cmd/govulncheck while the justfile uses govulncheck. Both work, but consider aligning for consistency.

Suggested fix for consistency
-      - run: go tool -modfile=tools.mod golang.org/x/vuln/cmd/govulncheck ./...
+      - run: go tool -modfile=tools.mod govulncheck ./...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml at line 89, The CI step uses the full module
invocation "go tool -modfile=tools.mod golang.org/x/vuln/cmd/govulncheck ./..."
which is inconsistent with the justfile's use of "govulncheck"; update the CI
step to call the same command name used in the justfile (replace the full module
path invocation with "govulncheck" while preserving the modfile usage) or
alternatively align the justfile to use the full module path—ensure you modify
the exact CI step containing "go tool -modfile=tools.mod
golang.org/x/vuln/cmd/govulncheck ./..." or the justfile entry for "govulncheck"
so both use the same invocation.
.golangci.yml (1)

59-63: Track complexity threshold reduction as technical debt.

The TODO comments indicate these thresholds should be lowered significantly (gocognit: 120→30, cyclop: 60→22). Consider creating a tracking issue to ensure this technical debt is addressed after handleGrep and related functions are refactored.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.golangci.yml around lines 59 - 63, The TODOs raising gocognit and cyclop
thresholds are temporary technical debt; create a tracking issue (or task) that
documents lowering gocognit from 120→30 and cyclop from 60→22 after refactoring,
link the issue in the TODO comments near the existing entries for gocognit and
cyclop, and reference the functions that must be refactored (handleGrep,
generateDiff, handleInsertBlock) so reviewers can follow progress and close the
task once those functions are fixed and thresholds are reduced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/files/pydata__xarray-3151/xarray/core/combine.py`:
- Around line 217-223: The except handler in
benchmark/files/pydata__xarray-3151/xarray/core/combine.py currently does an
implicit re-raise after raising a new ValueError; update that raise in the
except block (the ValueError(...) call shown) to use explicit exception chaining
(e.g., raise ValueError(...) from err or from None depending on whether you want
to preserve the original exception) so it no longer triggers Ruff B904; locate
the except block around the xarray.combine_nested path where the ValueError is
raised and add the "from ..." clause to the raise.

In `@benchmark/prepare.py`:
- Around line 195-198: The script prints failed fixture IDs but still exits
success; modify the failure handling in the if failed block to call sys.exit(1)
after printing the error message so the process returns a non-zero exit code;
ensure the prepare.py module has sys imported (or add an import if missing) and
update the branch that prints "Failed: {', '.join(failed)}" to terminate via
sys.exit(1).

In `@benchmark/report.py`:
- Around line 201-224: The code currently skips rows with side-level errors
(err_a or err_b) from the aggregates but does not increment the global
error_count, so those failures are not reflected in the final summary; update
the block that builds the flag (using err_a and err_b) to also increment
error_count for each side-level failure (e.g., if err_a: error_count += 1; if
err_b: error_count += 1) before the aggregate-skip conditional, ensuring you
reference the existing variables err_a, err_b, and error_count (and keep the
rest of the print/aggregate logic unchanged).

In `@benchmark/run.py`:
- Around line 514-543: _oc_config currently reads and merges the user's
OPENCODE_CONFIG which leaks developer-specific MCPs and instructions; instead
construct a minimal, deterministic benchmark config (start from an explicit
minimal dict rather than json.loads(OPENCODE_CONFIG.read_text())) and only
inject the lapp MCP under cfg["mcp"]["lapp"] and the single instructions_path
(from INSTRUCTIONS_FILE) when present; remove or guard any code that merges
arbitrary keys from OPENCODE_CONFIG so Config A/B are reproducible across
machines.

In `@internal/editor/editor.go`:
- Around line 705-719: The positional fallback overcounts when lines are shifted
(e.g., a single top insertion makes every downstream line appear changed);
update the fallback in the block guarded by maxLCSLines to compute a minimal
O(n) changed-line count by finding the longest common prefix and longest common
suffix between original and updated (use variables original, updated, shorter,
maxLCSLines), then compute changed = (len(original)-prefix-suffix) +
(len(updated)-prefix-suffix) (clamp to >=0) instead of counting per-position
mismatches; keep the same textual placeholder diff (the fmt.Sprintf return) but
return this corrected changed value so LinesChanged is accurate for large files.

In `@internal/server/server.go`:
- Around line 461-463: The call to os.MkdirAll(filepath.Dir(cleaned), ...) can
create directories outside the sandbox if an existing parent is a symlink;
before calling MkdirAll you must validate the entire parent chain for cleaned so
no component is a symlink that escapes the configured root. Implement a check
that walks from the configured root up to filepath.Dir(cleaned) (resolving each
path component with os.Lstat or similar), verifying each existing component is
not a symlink and that the resolved path remains within the root, and only then
call os.MkdirAll; reference the variables/functions cleaned, filepath.Dir,
os.MkdirAll and the existing CheckPath logic so the validation runs prior to
directory creation.

---

Duplicate comments:
In `@benchmark/prepare.py`:
- Around line 90-93: Currently prepare.py treats any non-empty dest_dir
(FILES_DIR / iid) as "already fetched"; change this to check a success sentinel
or explicit completeness marker instead. Update the early-return in the
fetch/prep logic to only return when dest_dir exists AND a sentinel file (e.g.,
".complete" or "FETCH_SUCCESS") is present; ensure the code that finishes
preparing an instance writes that sentinel atomically (write temp file then
rename) after all files are written. Refer to dest_dir, FILES_DIR and iid in
prepare.py and add the sentinel creation at the end of the successful prepare
routine and the sentinel check in the beginning to avoid false positives from
partial fetches.

In `@benchmark/run.py`:
- Around line 973-984: The LAPP_STRATEGY override is only resolved for
suite_version == "v2", so non-V2 runs still use grep defaults and may
incorrectly require the lapp binary; fix by resolving the strategy
unconditionally and wiring it into run_instance. Change the v2_strategy/v2-only
logic to compute a single `strategy` (use or extend resolve_v2_strategy or add a
new resolve_strategy that validates against V2_STRATEGIES and the allowed non-V2
options), replace uses of `v2_strategy` with `strategy`, and update the
`needs_lapp` computation (currently using `lapp_binary` and `needs_lapp`) so
that when `strategy == "native-edit"` or other pure-Python strategies
`needs_lapp` is false; ensure run_instance consumes this `strategy` instead of
falling back to grep defaults.

In `@internal/server/server.go`:
- Around line 677-695: The structuredMatches slice can grow beyond
maxStructuredMatches while iterating matches for a single file; inside the loop
over matches add a guard that checks len(structuredMatches) >=
maxStructuredMatches and stops appending further grepStructuredMatch entries
(and breaks out of the matches loop), and ensure the WalkDir traversal also
stops early (e.g., return a sentinel error or set a flag the outer walker
checks) so no more files are processed once maxStructuredMatches is reached; key
symbols to modify: the for _, m := range matches loop, structuredMatches,
maxStructuredMatches, grepStructuredMatch, and the surrounding WalkDir/visitor
handling so the cap is enforced globally.

---

Nitpick comments:
In @.github/workflows/ci.yml:
- Line 89: The CI step uses the full module invocation "go tool
-modfile=tools.mod golang.org/x/vuln/cmd/govulncheck ./..." which is
inconsistent with the justfile's use of "govulncheck"; update the CI step to
call the same command name used in the justfile (replace the full module path
invocation with "govulncheck" while preserving the modfile usage) or
alternatively align the justfile to use the full module path—ensure you modify
the exact CI step containing "go tool -modfile=tools.mod
golang.org/x/vuln/cmd/govulncheck ./..." or the justfile entry for "govulncheck"
so both use the same invocation.

In @.golangci.yml:
- Around line 59-63: The TODOs raising gocognit and cyclop thresholds are
temporary technical debt; create a tracking issue (or task) that documents
lowering gocognit from 120→30 and cyclop from 60→22 after refactoring, link the
issue in the TODO comments near the existing entries for gocognit and cyclop,
and reference the functions that must be refactored (handleGrep, generateDiff,
handleInsertBlock) so reviewers can follow progress and close the task once
those functions are fixed and thresholds are reduced.

In `@specs/distribution_v1.md`:
- Around line 39-52: The fenced code blocks in specs/distribution_v1.md are
missing language specifiers (e.g., the block starting with "npx -y
`@morphllm/morph-setup`" and other blocks noted at lines 39, 75, 168, 176, 406,
412, 629, 647, 939); update each triple-backtick fence to include an appropriate
language tag (use "text" for plain CLI/psuedo-output, "bash" or "sh" for shell
commands, or other specific languages where applicable) so the blocks become
```text or ```bash (etc.), ensuring each opening fence matches a closing fence.
Include the same language tag for multiline examples and keep content unchanged
aside from adding the specifier.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fafd06c7-1062-4574-bd0e-9f6ba77204d0

📥 Commits

Reviewing files that changed from the base of the PR and between 0639a66 and 75b031c.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • tools.sum is excluded by !**/*.sum
📒 Files selected for processing (37)
  • .githooks/pre-commit
  • .githooks/pre-push
  • .github/workflows/ci.yml
  • .gitignore
  • .gitleaks.toml
  • .golangci.yml
  • .tickets/lap-1kgr.md
  • .tickets/lap-frjk.md
  • .tickets/lap-wce3.md
  • README.md
  • benchmark/files/astropy__astropy-13033/_task.txt
  • benchmark/files/django__django-10097/_task.txt
  • benchmark/files/pydata__xarray-3095/_task.txt
  • benchmark/files/pydata__xarray-3151/xarray/core/combine.py
  • benchmark/prepare.py
  • benchmark/report.py
  • benchmark/requirements.txt
  • benchmark/run.py
  • benchmark/v1_report.py
  • benchmark/v2_report.py
  • benchmark/v2_suites.json
  • cmd/lapp/main.go
  • cmd/lapp/main_test.go
  • go.mod
  • instructions/lapp-tools.md
  • internal/editor/editor.go
  • internal/editor/editor_test.go
  • internal/editor/types.go
  • internal/fileio/fileio.go
  • internal/fileio/fileio_test.go
  • internal/fileio/rename_windows.go
  • internal/server/server.go
  • internal/server/server_test.go
  • justfile
  • pkg/hashline/hashline_test.go
  • specs/distribution_v1.md
  • tools.mod
✅ Files skipped from review due to trivial changes (15)
  • .githooks/pre-push
  • .githooks/pre-commit
  • .gitignore
  • pkg/hashline/hashline_test.go
  • .gitleaks.toml
  • tools.mod
  • .tickets/lap-frjk.md
  • internal/fileio/fileio_test.go
  • .tickets/lap-wce3.md
  • benchmark/files/pydata__xarray-3095/_task.txt
  • benchmark/files/django__django-10097/_task.txt
  • benchmark/files/astropy__astropy-13033/_task.txt
  • .tickets/lap-1kgr.md
  • benchmark/v2_suites.json
  • instructions/lapp-tools.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • go.mod
  • internal/fileio/rename_windows.go
  • cmd/lapp/main.go

Comment thread benchmark/files/pydata__xarray-3151/xarray/core/combine.py
Comment thread benchmark/prepare.py
Comment thread benchmark/report.py
Comment thread benchmark/run.py
Comment thread internal/editor/editor.go Outdated
Comment thread internal/server/server.go
php-workx and others added 10 commits April 16, 2026 11:54
prepare.py prints failed fixture IDs but exits 0, so CI/pipeline
callers cannot detect partial failure. Added sys.exit(1) after
printing the failure list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rows with err_a or err_b were excluded from aggregates but not
counted in error_count, making the summary line misleading (side
errors were invisible in "N complete pair(s), M error(s)").
Now each side-level error increments error_count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cap check only ran between files in WalkDir, so a single file
with many matches could push structuredMatches well beyond
maxStructuredMatches. Added a break guard inside the match loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI used the full module path (golang.org/x/vuln/cmd/govulncheck)
while the justfile uses the short tool name (govulncheck). Both
resolve via tools.mod but the short form is consistent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously any non-empty dest_dir was treated as "already fetched",
silently skipping incomplete fetches from interrupted runs. Now a
._complete sentinel is written atomically after all files are
fetched, and only directories with the sentinel are skipped.
Also: exit non-zero on partial failures (sys.exit(1)).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_oc_config now constructs a minimal deterministic config instead of
merging the user's OPENCODE_CONFIG, so Config A/B are reproducible
across machines without leaking developer-specific MCPs.

LAPP_STRATEGY is now resolved unconditionally (not just for v2
suites), so non-v2 runs can use native-edit to skip the lapp binary.
Renamed v2_strategy → strategy throughout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The positional fallback counted per-index mismatches, which
overcounts when lines are shifted (e.g. one top insertion makes
every downstream line appear changed). Now computes longest common
prefix/suffix and derives changed = (len(a)-prefix-suffix) +
(len(b)-prefix-suffix), giving an accurate count. Removed now-unused
abs helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If a symlink inside root points outside root, MkdirAll would follow
it and create orphan directories outside the sandbox. Now
EvalSymlinks resolves the parent before MkdirAll, and the path is
rejected if it escapes root. CheckPath still runs afterward for
the authoritative check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create lap-v8i8 to track lowering gocognit (120→30) and cyclop
(60→22) once handleGrep, generateDiff, and handleInsertBlock are
refactored. Reference the ticket in .golangci.yml TODOs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All 36 bare opening code fences in distribution_v1.md now have
appropriate language tags (text, markdown, json, toml, typescript,
bash, console) for proper rendering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

♻️ Duplicate comments (1)
internal/server/server.go (1)

460-473: ⚠️ Potential issue | 🔴 Critical

The symlink precheck still misses partially-existing parent chains.

EvalSymlinks(parentDir) only helps when the whole parent path already exists. If an existing component inside root is a symlink to somewhere else and a deeper child is missing, this check is skipped and MkdirAll can still create directories outside the sandbox. Walk the existing parent chain component-by-component with Lstat/EvalSymlinks before calling MkdirAll.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/server.go` around lines 460 - 473, The current symlink
precheck uses EvalSymlinks(parentDir) which only works if the full parent path
exists; instead implement a component-by-component walk from s.cfg.Root up to
filepath.Dir(cleaned) using Lstat/EvalSymlinks on each existing component (stop
when a component does not exist), and if any resolved component path does not
start with s.cfg.Root+string(os.PathSeparator) return
mcp.NewToolResultError(fileio.ErrPathOutsideRoot); perform this check before
calling os.MkdirAll (keep references to cleaned, parentDir, s.cfg.Root,
os.MkdirAll and CheckPath to locate where to insert the walk).
🧹 Nitpick comments (5)
benchmark/run.py (3)

364-365: Rename ambiguous variable l to line.

The variable l can be confused with 1 or I in some fonts.

Proposed fix
 def _indent(text: str) -> str:  # prefix every line with 4 spaces for readability
-    return "\n".join(f"    {l}" for l in text.splitlines())
+    return "\n".join(f"    {line}" for line in text.splitlines())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/run.py` around lines 364 - 365, Rename the ambiguous loop variable
`l` in the _indent function to `line` to avoid confusion with characters like
"1" or "I"; update the generator expression inside def _indent(text: str) -> str
to use `line` (e.g., f"    {line}" for line in text.splitlines()) so the
function name _indent and behavior remain unchanged but variable naming is
clearer.

617-619: Remove duplicate pass statement.

There are two consecutive pass statements in the exception handler.

Proposed fix
         except json.JSONDecodeError:
             pass
-            pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/run.py` around lines 617 - 619, There are two consecutive pass
statements in the exception handler for json.JSONDecodeError; remove the
duplicate so the except json.JSONDecodeError: block contains only a single pass
(or the intended handling), i.e., edit the except json.JSONDecodeError handler
in run.py to delete the extra pass and leave a single pass (or replace with the
actual error handling if appropriate).

791-791: Remove extraneous f prefixes from strings without placeholders.

Lines 791, 800, and 932 use f-strings but contain no placeholders.

Proposed fix
-            print(f"    [A] Read → Edit  (claude)", flush=True)
+            print("    [A] Read → Edit  (claude)", flush=True)
...
-                print(f"    [B] Read → Edit  (claude)", flush=True)
+                print("    [B] Read → Edit  (claude)", flush=True)
...
-        print(f"ERROR: instances.json not found — run fetch_instances.py first.",
+        print("ERROR: instances.json not found — run fetch_instances.py first.",

Also applies to: 800-800, 932-932

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/run.py` at line 791, The print statements using f-strings but
without placeholders (e.g., the call that prints "    [A] Read → Edit  (claude)"
and the similar prints at the other occurrences) should drop the unnecessary
f-prefix; locate the print calls in benchmark/run.py that use f"..." for static
strings (the three prints flagged) and replace them with plain string literals
("...") so they are regular prints without formatting overhead.
benchmark/report.py (1)

222-224: Consider splitting semicolon-separated statements.

Python style guides recommend one statement per line. While the current form groups related accumulators visually, splitting improves readability.

Proposed fix
         if not err_a and not err_b:
-            sum_a_out  += a_out;  sum_b_out  += b_out
-            sum_a_in   += a_in;   sum_b_in   += b_in
-            sum_a_cost += a_cost; sum_b_cost += b_cost
+            sum_a_out += a_out
+            sum_b_out += b_out
+            sum_a_in += a_in
+            sum_b_in += b_in
+            sum_a_cost += a_cost
+            sum_b_cost += b_cost
             sim_a_vals.append(sim_a)
             sim_b_vals.append(sim_b)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/report.py` around lines 222 - 224, The three semicolon-separated
accumulator lines (sum_a_out += a_out; sum_b_out += b_out, sum_a_in += a_in;
sum_b_in += b_in, sum_a_cost += a_cost; sum_b_cost += b_cost) should be split
into individual statements for readability; update the block that updates
sum_a_out, sum_b_out, sum_a_in, sum_b_in, sum_a_cost, and sum_b_cost so each +=
is on its own line (e.g., separate lines for sum_a_out += a_out and sum_b_out +=
b_out) while preserving the same semantics.
benchmark/prepare.py (1)

50-52: Clarify operator precedence and simplify condition.

The and/or mix without explicit parentheses is confusing. The condition src is not None or (line.startswith("+++ ") and dst is None) appears to always be true when dst is None (which it is by default). This likely simplifies to just checking line.startswith("+++ ").

Proposed simplification
-        elif line.startswith("+++ ") and src is not None or (
-            line.startswith("+++ ") and dst is None
-        ):
+        elif line.startswith("+++ "):

If the intent is to only process +++ lines when src was set (modified files) or when starting fresh, consider clarifying with explicit parentheses and a comment explaining the edge case being handled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/prepare.py` around lines 50 - 52, The mixed and/or condition in the
elif is ambiguous; replace the current condition with an explicit, parenthesized
expression and a clarifying comment — e.g. change the line to `elif
line.startswith("+++ ") and (src is not None or dst is None):` and add a short
comment explaining the edge case (dst defaults to None, so this branch means
process "+++" lines when a src was seen or when starting fresh); if the intended
behavior is to always process "+++" lines regardless, simplify further to `elif
line.startswith("+++ ")` and remove the extra checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/prepare.py`:
- Around line 1-3: Move the shebang '#!/usr/bin/env python3' to the very first
line of the file so it's recognized when executed directly; ensure the existing
'import os' stays after the shebang (e.g., place '#!/usr/bin/env python3' above
the 'import os' statement) and keep no blank lines before the shebang.

In `@internal/editor/editor.go`:
- Around line 643-648: The single-line indentation preservation currently runs
for any one-line replace (pe.startLine == pe.endLine) and should be limited to
anchor-based replaces; update the condition that calls preserveSingleLineIndent
so it also checks that the edit has an anchor (e.Anchor != "")—i.e., change the
if surrounding preserveSingleLineIndent to require pe.startLine == pe.endLine &&
pe.startLine >= 1 && pe.startLine <= len(lines) && e.Anchor != "" so only
anchor-based replaces preserve the original line indentation.

In `@internal/server/server.go`:
- Around line 816-818: The code currently discards the @@ hunk coordinates when
encountering lines starting with "@@ " (in the block handling where flush() is
called and inHunk=true), which causes ambiguous matching; modify the parser to
parse and save the old hunk range/offset from the @@ header (extract the
original start line and length) and pass those coordinates into lapp_apply_patch
(or try applying to that exact range first) before falling back to the existing
content-based search; make the same change in the second occurrence around the
1220-1230 region so both parsing locations preserve and prefer the hunk
coordinates when attempting to apply the patch.

In `@specs/distribution_v1.md`:
- Around line 377-378: The spec is inconsistent about OpenCode MCP
configuration; make a single normative V1 flow: state that OpenCode does NOT
require a file-based MCP config and instead auto-detects servers, and update all
mentions (including sections referencing mcpServers and instructions-only) to
require the installer to write the OpenCode binary to PATH and perform skill
registration (register the skill) rather than writing mcpServers config. Replace
any conflicting text that instructs writing MCP config under mcpServers or
treating instructions-only as sufficient with the unified flow that OpenCode
auto-discovers MCP servers and installers must place the binary on PATH and
register the skill.
- Around line 864-865: Update the spec entries currently titled "Binary
tampering" that only require verifying `--version` after download to mandate
cryptographic integrity checks before execution: require verification of a
published SHA256 (or stronger) checksum and an optional signature (e.g.,
detached GPG/PKCS#7) against a trusted, versioned keyring and fail the install
if verification fails; also update the duplicate occurrences of the "Binary
tampering" rows (the other table entry) to match this behavior and state that
`--version` may be used only as an additional sanity check after cryptographic
verification completes.
- Around line 311-314: The spec contains a conflicting policy for the --root
option: one sentence mandates defaulting --root to the current working directory
at setup time, while another recommends omitting --root and letting lapp default
to os.Getwd() at runtime; resolve this by choosing one behavior and updating the
wording consistently across the section (refer to the symbol --root and the
function/os behavior os.Getwd() and the binary name lapp): either state that
--root is omitted and lapp will use os.Getwd() by default, or state that --root
is set to the cwd at setup time and document how clients and global installs
receive that value, then remove the contradictory sentence so the section
uniformly describes the chosen approach.
- Around line 717-718: The spec incorrectly references the nonexistent built-in
module "node:tar"; update the text around the pipeline (zlib.createGunzip() →
tar.extract()) to call out the external tar npm package instead (e.g., use
import { extract } from 'tar' or require('tar') and invoke extract) or
explicitly state an alternative implementation using node:zlib
(zlib.createGunzip()) combined with a custom tar parser; replace the "node:tar"
mention with "tar (npm package)" and adjust the example invocation from
tar.extract() to the external package usage or the custom-parser approach so
readers know it's not a built-in.
- Around line 233-234: Each detectInstalled arrow function is checking
existsSync("~/...") which treats "~" literally; update each detectInstalled to
use path.join(os.homedir(), "<subdir>") instead (e.g.,
existsSync(path.join(os.homedir(), ".claude"))), and ensure the module
imports/uses the os and path modules (require or import os and path) so
path.join and os.homedir() are available; apply this change to all seven
detectInstalled occurrences referenced in the diff.

---

Duplicate comments:
In `@internal/server/server.go`:
- Around line 460-473: The current symlink precheck uses EvalSymlinks(parentDir)
which only works if the full parent path exists; instead implement a
component-by-component walk from s.cfg.Root up to filepath.Dir(cleaned) using
Lstat/EvalSymlinks on each existing component (stop when a component does not
exist), and if any resolved component path does not start with
s.cfg.Root+string(os.PathSeparator) return
mcp.NewToolResultError(fileio.ErrPathOutsideRoot); perform this check before
calling os.MkdirAll (keep references to cleaned, parentDir, s.cfg.Root,
os.MkdirAll and CheckPath to locate where to insert the walk).

---

Nitpick comments:
In `@benchmark/prepare.py`:
- Around line 50-52: The mixed and/or condition in the elif is ambiguous;
replace the current condition with an explicit, parenthesized expression and a
clarifying comment — e.g. change the line to `elif line.startswith("+++ ") and
(src is not None or dst is None):` and add a short comment explaining the edge
case (dst defaults to None, so this branch means process "+++" lines when a src
was seen or when starting fresh); if the intended behavior is to always process
"+++" lines regardless, simplify further to `elif line.startswith("+++ ")` and
remove the extra checks.

In `@benchmark/report.py`:
- Around line 222-224: The three semicolon-separated accumulator lines
(sum_a_out += a_out; sum_b_out += b_out, sum_a_in += a_in; sum_b_in += b_in,
sum_a_cost += a_cost; sum_b_cost += b_cost) should be split into individual
statements for readability; update the block that updates sum_a_out, sum_b_out,
sum_a_in, sum_b_in, sum_a_cost, and sum_b_cost so each += is on its own line
(e.g., separate lines for sum_a_out += a_out and sum_b_out += b_out) while
preserving the same semantics.

In `@benchmark/run.py`:
- Around line 364-365: Rename the ambiguous loop variable `l` in the _indent
function to `line` to avoid confusion with characters like "1" or "I"; update
the generator expression inside def _indent(text: str) -> str to use `line`
(e.g., f"    {line}" for line in text.splitlines()) so the function name _indent
and behavior remain unchanged but variable naming is clearer.
- Around line 617-619: There are two consecutive pass statements in the
exception handler for json.JSONDecodeError; remove the duplicate so the except
json.JSONDecodeError: block contains only a single pass (or the intended
handling), i.e., edit the except json.JSONDecodeError handler in run.py to
delete the extra pass and leave a single pass (or replace with the actual error
handling if appropriate).
- Line 791: The print statements using f-strings but without placeholders (e.g.,
the call that prints "    [A] Read → Edit  (claude)" and the similar prints at
the other occurrences) should drop the unnecessary f-prefix; locate the print
calls in benchmark/run.py that use f"..." for static strings (the three prints
flagged) and replace them with plain string literals ("...") so they are regular
prints without formatting overhead.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2edecf2d-6a07-465d-9727-fab1dec7ca0b

📥 Commits

Reviewing files that changed from the base of the PR and between 75b031c and c0ca6fe.

📒 Files selected for processing (10)
  • .github/workflows/ci.yml
  • .golangci.yml
  • .tickets/lap-v8i8.md
  • benchmark/prepare.py
  • benchmark/report.py
  • benchmark/run.py
  • go.mod
  • internal/editor/editor.go
  • internal/server/server.go
  • specs/distribution_v1.md
✅ Files skipped from review due to trivial changes (3)
  • .github/workflows/ci.yml
  • .tickets/lap-v8i8.md
  • .golangci.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

Comment thread benchmark/prepare.py Outdated
Comment thread internal/editor/editor.go Outdated
Comment thread internal/server/server.go
Comment thread specs/distribution_v1.md Outdated
Comment thread specs/distribution_v1.md Outdated
Comment thread specs/distribution_v1.md Outdated
Comment thread specs/distribution_v1.md Outdated
Comment thread specs/distribution_v1.md Outdated
php-workx and others added 5 commits April 16, 2026 15:07
Added ruff.toml excluding benchmark/files/ (SWE-bench fixtures are
upstream code snapshots that must not be modified). Fixed all
remaining ruff findings in harness code: E402 import ordering,
E702 semicolons, E741 ambiguous variable, F401 unused import,
F541 stray f-prefixes, W293 trailing whitespace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…uation

Replace simple EvalSymlinks(parent) with a component-by-component walk
using Lstat to detect symlink escapes mid-path. Add oldStart field to
patchHunk with parseHunkOldStart() to extract @@ header line coordinates,
enabling disambiguation when a hunk matches multiple locations. Also fix
gocritic emptyStringTest lint (len(s)==0 → s=="").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
preserveSingleLineIndent now only applies when e.Anchor is set.
Range-based replaces (replace_block, apply_patch) intentionally set
indentation and should not have it silently overridden.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- existsSync("~/.claude") → existsSync(path.join(os.homedir(), ".claude"))
- Resolve conflicting --root policy: omit --root (matches binary behavior)
- Clarify OpenCode auto-discovers stdio MCP servers on PATH
- node:tar → the tar npm package (node:tar does not exist)
- Binary tampering: SHA-256 checksum verification, --version as sanity check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
specs/distribution_v1.md (1)

240-276: ⚠️ Potential issue | 🟠 Major

Six detectInstalled functions still use literal ~ paths.

Node.js fs.existsSync() does not expand ~. Line 233 was corrected to use path.join(os.homedir(), ".claude"), but lines 240, 247, 254, 261, 268, and 275 still use literal ~ which will always return false.

🛠️ Proposed fixes for all remaining instances
   codex: {
     name: "codex",
     displayName: "Codex",
     skillsDir: ".codex/skills",
     globalSkillsDir: "~/.codex/skills",
-    detectInstalled: () => existsSync("~/.codex"),
+    detectInstalled: () => existsSync(path.join(os.homedir(), ".codex")),
   },
   cursor: {
     name: "cursor",
     displayName: "Cursor",
     skillsDir: ".cursor/skills",
     globalSkillsDir: "~/.cursor/skills",
-    detectInstalled: () => existsSync("~/.cursor"),
+    detectInstalled: () => existsSync(path.join(os.homedir(), ".cursor")),
   },
   windsurf: {
     name: "windsurf",
     displayName: "Windsurf",
     skillsDir: ".windsurf/skills",
     globalSkillsDir: "~/.codeium/windsurf/skills",
-    detectInstalled: () => existsSync("~/.codeium/windsurf"),
+    detectInstalled: () => existsSync(path.join(os.homedir(), ".codeium", "windsurf")),
   },
   opencode: {
     name: "opencode",
     displayName: "OpenCode",
     skillsDir: ".opencode/skills",
     globalSkillsDir: "~/.config/opencode/skills",
-    detectInstalled: () => existsSync("~/.config/opencode"),
+    detectInstalled: () => existsSync(path.join(os.homedir(), ".config", "opencode")),
   },
   amp: {
     name: "amp",
     displayName: "Amp",
     skillsDir: ".agents/skills",
     globalSkillsDir: "~/.config/agents/skills",
-    detectInstalled: () => existsSync("~/.config/amp"),
+    detectInstalled: () => existsSync(path.join(os.homedir(), ".config", "amp")),
   },
   antigravity: {
     name: "antigravity",
     displayName: "Antigravity",
     skillsDir: ".agent/skills",
     globalSkillsDir: "~/.gemini/antigravity/skills",
-    detectInstalled: () => existsSync("~/.gemini/antigravity"),
+    detectInstalled: () => existsSync(path.join(os.homedir(), ".gemini", "antigravity")),
   },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@specs/distribution_v1.md` around lines 240 - 276, The detectInstalled arrow
functions for cursor, windsurf, opencode, amp, and antigravity use literal "~"
which fs.existsSync does not expand; update each detectInstalled (e.g.,
cursor.detectInstalled, windsurf.detectInstalled, opencode.detectInstalled,
amp.detectInstalled, antigravity.detectInstalled) to call
existsSync(path.join(os.homedir(), "<dir>")) using the same hidden-dir string
(e.g., ".cursor", ".codeium/windsurf", ".config/opencode", ".config/amp",
".gemini/antigravity") so paths are resolved against the user home directory
(assume path and os are available/imported as used elsewhere).
🧹 Nitpick comments (3)
specs/distribution_v1.md (1)

231-232: Consider clarifying that globalSkillsDir values require home directory expansion.

The globalSkillsDir values like "~/.claude/skills" are string literals that must be expanded at runtime (similar to detectInstalled). Document that the implementation should use path.join(os.homedir(), ...) when resolving these paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@specs/distribution_v1.md` around lines 231 - 232, Update the spec and
implementation notes to clarify that values for globalSkillsDir (e.g.,
"~/.claude/skills") require home-directory expansion at runtime: state that code
resolving globalSkillsDir should expand "~" using the OS home directory (use
os.homedir()) and join the remainder with path.join rather than treating the
string as a literal; mention the same behavior as detectInstalled and reference
the symbols skillsDir and globalSkillsDir so implementers know to apply
expansion when reading globalSkillsDir.
internal/server/server.go (2)

22-22: Version constant may be stale.

The version = "0.1.0" constant is used as the MCP server version (line 58). If this should reflect lapp v0.2, consider updating it or sourcing from a build-time variable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/server.go` at line 22, The hardcoded const version = "0.1.0"
is likely stale; update it or make it configurable at build time. Either change
the constant to the correct release (e.g., version = "0.2.0") so the MCP server
reports the right version, or replace the const with a package-level variable
(e.g., var Version = "dev") and document that it should be overridden via
-ldflags at build time; update any references that read the old constant (the
MCP server version usage) to use the new symbol.

162-183: Consider logging marshaling errors instead of discarding them.

Lines 166 and 181 discard the error from json.Marshal. While these structs are unlikely to fail marshaling, silently ignoring errors can mask unexpected issues (e.g., if future changes add unmarshalable fields).

Optional: Log marshaling failures
 func marshalGuardedStalePayload(payload editor.StaleRefRepairResult, apply func(*editor.StaleRefRepairResult)) string {
 	if apply != nil {
 		apply(&payload)
 	}
-	jsonBytes, _ := json.Marshal(payload)
+	jsonBytes, err := json.Marshal(payload)
+	if err != nil {
+		// Should never happen for well-formed structs
+		return fmt.Sprintf(`{"status":"error","message":"marshal failed: %s"}`, err)
+	}
 	return string(jsonBytes)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/server.go` around lines 162 - 183, Both
marshalGuardedStalePayload and marshalSuccessResponse currently ignore the error
returned by json.Marshal; change them to capture the error, and if marshaling
fails log the error (use the project's logger if available, e.g.,
serverLogger/processLogger, otherwise log.Printf) with contextual info
identifying the function and payload, then return a safe fallback (e.g., an
empty JSON string or a small JSON object with an error message) instead of
silently discarding the error. Ensure you update marshalGuardedStalePayload and
marshalSuccessResponse to check the err from json.Marshal, log it, and return
the chosen fallback.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/report.py`:
- Around line 233-236: The current avg_sim_a/avg_sim_b (computed from
sim_a_vals/sim_b_vals) are unweighted means but the report/docs claim an
aggregate weighted by patch size; change the aggregation to a weighted average
using each instance's patch size (e.g., use the changed-line count from
reference_patch or another size field correlated with sim_* entries) by
computing sum(sim_i * size_i) / sum(size_i) (guarding against zero total
weight), or alternatively rename the metric and docs to "mean similarity";
update the computation that sets avg_sim_a and avg_sim_b to use the weighted
formula (and ensure you collect the corresponding size list aligned with
sim_a_vals/sim_b_vals) and adjust any downstream labels/comments accordingly.

In `@benchmark/run.py`:
- Around line 130-137: The workspace copy and diff capture include harness
metadata files `_task.txt` and `._complete`; update setup_work_dir() where files
are copied (the loop that checks item.name and copies to dst) to skip items
named `_task.txt` and `._complete` in addition to `_patch.diff`, and also modify
capture_diff() where work_dir.rglob() is used to ignore those same filenames so
they are neither scanned nor included in diffs; reference the copy loop (the
item.name checks) and the work_dir.rglob() loop to locate the two places to
change.
- Around line 875-885: canonical_results_dir currently omits the resolved
strategy, causing non-v2 strategies to share directories; modify
canonical_results_dir to include the resolved strategy in the output path (e.g.,
append a slugified strategy token to model_part similar to how variant and
grep-variant are handled), ensuring you only add it for non-v2 strategies (check
strategy != "v2") and obtain the resolved strategy from the appropriate source
(pass it in as an argument or call the existing resolver) so native-edit,
lapp-text-grep, lapp-replace-block produce distinct directories.

---

Duplicate comments:
In `@specs/distribution_v1.md`:
- Around line 240-276: The detectInstalled arrow functions for cursor, windsurf,
opencode, amp, and antigravity use literal "~" which fs.existsSync does not
expand; update each detectInstalled (e.g., cursor.detectInstalled,
windsurf.detectInstalled, opencode.detectInstalled, amp.detectInstalled,
antigravity.detectInstalled) to call existsSync(path.join(os.homedir(),
"<dir>")) using the same hidden-dir string (e.g., ".cursor",
".codeium/windsurf", ".config/opencode", ".config/amp", ".gemini/antigravity")
so paths are resolved against the user home directory (assume path and os are
available/imported as used elsewhere).

---

Nitpick comments:
In `@internal/server/server.go`:
- Line 22: The hardcoded const version = "0.1.0" is likely stale; update it or
make it configurable at build time. Either change the constant to the correct
release (e.g., version = "0.2.0") so the MCP server reports the right version,
or replace the const with a package-level variable (e.g., var Version = "dev")
and document that it should be overridden via -ldflags at build time; update any
references that read the old constant (the MCP server version usage) to use the
new symbol.
- Around line 162-183: Both marshalGuardedStalePayload and
marshalSuccessResponse currently ignore the error returned by json.Marshal;
change them to capture the error, and if marshaling fails log the error (use the
project's logger if available, e.g., serverLogger/processLogger, otherwise
log.Printf) with contextual info identifying the function and payload, then
return a safe fallback (e.g., an empty JSON string or a small JSON object with
an error message) instead of silently discarding the error. Ensure you update
marshalGuardedStalePayload and marshalSuccessResponse to check the err from
json.Marshal, log it, and return the chosen fallback.

In `@specs/distribution_v1.md`:
- Around line 231-232: Update the spec and implementation notes to clarify that
values for globalSkillsDir (e.g., "~/.claude/skills") require home-directory
expansion at runtime: state that code resolving globalSkillsDir should expand
"~" using the OS home directory (use os.homedir()) and join the remainder with
path.join rather than treating the string as a literal; mention the same
behavior as detectInstalled and reference the symbols skillsDir and
globalSkillsDir so implementers know to apply expansion when reading
globalSkillsDir.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ae6be403-2241-4deb-8d1e-0d97c42ceb63

📥 Commits

Reviewing files that changed from the base of the PR and between c0ca6fe and f3eaedb.

📒 Files selected for processing (7)
  • benchmark/prepare.py
  • benchmark/report.py
  • benchmark/run.py
  • internal/editor/editor.go
  • internal/server/server.go
  • ruff.toml
  • specs/distribution_v1.md
✅ Files skipped from review due to trivial changes (1)
  • ruff.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/editor/editor.go

Comment thread benchmark/report.py
Comment thread benchmark/run.py
Comment thread benchmark/run.py Outdated
php-workx and others added 3 commits April 16, 2026 18:16
Replace hardcoded const version = "0.1.0" with var Version = "dev"
overridden via -ldflags at build time. Add server.Version to justfile
ldflags. Capture and log json.Marshal errors in
marshalGuardedStalePayload and marshalSuccessResponse instead of
discarding them silently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…egy in results dir

- setup_work_dir: skip _task.txt and ._complete in addition to _patch.diff
- capture_diff: filter ._complete and _task.txt from work_dir scan
- report.py: correct docstring from "weighted-average" to "mean" similarity
- canonical_results_dir: add strategy parameter so non-v2 strategies
  get distinct results directories instead of colliding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sDir note

All 7 detectInstalled functions now use path.join(os.homedir(), ...)
instead of literal ~ which Node.js does not expand. Added note that
globalSkillsDir values containing ~ also require runtime expansion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
specs/distribution_v1.md (3)

65-66: ⚠️ Potential issue | 🟠 Major

Resolve remaining --root contradiction across examples and policy.

Line 65 and Line 300/307 still present --root as default behavior, but Line 313 defines V1 default as omitting --root. Keep one normative behavior throughout to avoid divergent implementations.

Suggested spec cleanup
-| **Config entry** | `command: "npx"`, `args: ["-y", "@morphllm/morphmcp"]` | `command: "/path/to/lapp"`, `args: ["--root", "<cwd>"]` |
+| **Config entry** | `command: "npx"`, `args: ["-y", "@morphllm/morphmcp"]` | `command: "/path/to/lapp"` (omit `--root` by default) |

-For lapp, the MCP server entry points to the installed binary with `--root` set to the project directory:
+For lapp, the MCP server entry points to the installed binary (omit `--root` by default so runtime cwd is used):

Also applies to: 300-313

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@specs/distribution_v1.md` around lines 65 - 66, The spec currently
contradicts itself about whether the V1 default includes the --root flag; pick
one normative behavior and make it consistent: update the V1 default statement
(the paragraph referencing "V1 default") to state whether --root is included or
omitted, then update all examples and the two example rows that reference
`command: "/path/to/lapp", args: ["--root", "<cwd>"]` to match that chosen
default (either remove --root from examples and examples table if default omits
it, or add it everywhere if default includes it), and remove any
duplicate/conflicting mentions so the `--root` behavior is unambiguous across
the spec and examples.

449-455: ⚠️ Potential issue | 🟠 Major

OpenCode V1 flow is still internally inconsistent.

Line 377 defines “no MCP config entry for OpenCode,” but Line 452-455/Line 535/Line 783 still describe V1 paths that imply file MCP config or “instructions-only” ambiguity. Pick one normative V1 flow and remove alternatives in the normative sections.

Suggested spec cleanup
-- **Option B (Instructions-only, recommended for V1):** Since lapp's MCP server is a stdio binary (not a JS SDK), OpenCode can discover it via the standard MCP config mechanism. We:
-  1. Write the MCP server config to `~/.config/opencode/opencode.json` under the `mcpServers` key (if OpenCode supports it)
-  2. Copy `instructions/lapp-tools.md` to `~/.config/opencode/instructions/lapp-tools.md`
-  3. Add the path to the `instructions` array in `opencode.json`
+- **V1 (normative):** Do not write an OpenCode `mcpServers` entry. Install lapp on PATH, copy `instructions/lapp-tools.md` to `~/.config/opencode/instructions/lapp-tools.md`, and register it in `instructions[]`.

-| **OpenCode** | skip (or via plugin) | `~/.config/opencode/opencode.json` → `instructions[]` array | npm plugin (V2) | `~/.config/opencode/skills/` | symlink | Copy `instructions/lapp-tools.md` to `~/.config/opencode/instructions/` |
+| **OpenCode** | skip | `~/.config/opencode/opencode.json` → `instructions[]` array | npm plugin (V2) | `~/.config/opencode/skills/` | symlink | Copy `instructions/lapp-tools.md` to `~/.config/opencode/instructions/` |

-**V2:** Create a full `@lapp-dev/opencode-lapp-plugin` npm package that registers lapp's tools via the OpenCode plugin SDK (`@opencode-ai/plugin`). This would allow richer integration (custom tool metadata, environment setup). For V1, instructions-only is sufficient since lapp is a stdio MCP server that OpenCode discovers via standard config.
+**V2:** Create a full `@lapp-dev/opencode-lapp-plugin` npm package for richer OpenCode integration.

Also applies to: 535-536, 783-784

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@specs/distribution_v1.md` around lines 449 - 455, The spec is inconsistent
about the OpenCode V1 normative flow (references to “no MCP config entry for
OpenCode” conflict with the two alternatives under "Option A (Full plugin)" and
"Option B (Instructions-only)"); pick a single V1 normative flow (either the
plugin-based flow represented by "Option A (Full plugin)" or the
instructions-only flow represented by "Option B (Instructions-only), recommended
for V1") and remove the other alternative and any hedging language; update and
unify the wording in the sections that reference this behavior (the statement
"no MCP config entry for OpenCode" and the paragraphs currently under "Option
A/Option B" and the later mentions at the locations that currently correspond to
the paragraphs referencing V1 paths) so they consistently describe the chosen
flow, and ensure the references to where MCP server config and instructions are
written (the opencode.json mcpServers key and instructions path) match the
single chosen normative flow.

127-133: ⚠️ Potential issue | 🟠 Major

Add tar to declared dependencies to match the extraction design.

Line 717 explicitly requires the tar npm package to extract the downloaded tarball via tar.extract(), but it is not declared in the dependencies block (lines 127-133). The implementation spec cannot be built as written.

Suggested spec patch
   "dependencies": {
     "@clack/prompts": "^0.9.1",
     "chalk": "^5.4.1",
     "commander": "^13.1.0",
     "gray-matter": "^4.0.3",
     "json5": "^2.2.3",
-    "simple-git": "^3.27.0"
+    "simple-git": "^3.27.0",
+    "tar": "^7.4.3"
   },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@specs/distribution_v1.md` around lines 127 - 133, The spec refers to using
the tar package (tar.extract()) but the dependencies list in the manifest (the
block containing "@clack/prompts", "chalk", "commander", etc.) does not include
"tar"; add "tar": "<appropriate-version>" to that dependencies object so the
implementation can require tar and call tar.extract() as specified.
🧹 Nitpick comments (4)
benchmark/report.py (2)

47-47: Docstring uses EN DASH instead of HYPHEN-MINUS.

The docstring contains an EN DASH character () instead of a standard hyphen (-). While this doesn't affect functionality, it may cause issues with some tools or IDEs.

📝 Suggested fix
-    Returns 0.0–1.0. A score of 1.0 means the agent produced exactly the
+    Returns 0.0-1.0. A score of 1.0 means the agent produced exactly the
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/report.py` at line 47, Replace the EN DASH in the docstring text
"Returns 0.0–1.0." with a standard hyphen-minus so it reads "Returns 0.0-1.0.";
locate the docstring in benchmark/report.py (search for the string "Returns
0.0–1.0."), update the character, and save the file to ensure tools and IDEs
parse the range correctly.

128-132: Consider handling JSON decode errors gracefully.

If a result file contains malformed JSON, json.loads() will raise a JSONDecodeError and crash the report. Since benchmark runs may be interrupted, partially written files could exist.

🛡️ Suggested fix
     for p in paths:
         if not p.exists():
             print(f"WARNING: {p} not found — skipping", file=sys.stderr)
             continue
-        results.append(json.loads(p.read_text()))
+        try:
+            results.append(json.loads(p.read_text()))
+        except json.JSONDecodeError as e:
+            print(f"WARNING: {p} is not valid JSON — skipping: {e}", file=sys.stderr)
+            continue
     return results
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/report.py` around lines 128 - 132, Wrap the JSON parsing in the
loop over paths (the for p in paths: block) with a try/except that catches
json.JSONDecodeError (and optionally ValueError) when calling
json.loads(p.read_text()); on error print a warning to sys.stderr including the
path p and the exception message and then continue so a malformed/partial result
file is skipped instead of crashing the report generation; ensure
results.append(...) only happens on successful parse.
benchmark/run.py (1)

359-360: Minor style nit: extra blank line before return.

There's an extra blank line before return "" which is inconsistent with the rest of the function.

📝 Suggested fix
             f"Only if {replace_tool} fails with multiple matches or stale refs, use {find_tool} once to recover the exact range, then {edit_tool}."
         )
-
-
     return ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/run.py` around lines 359 - 360, Remove the extra blank line
immediately before the final return statement so the `return ""` is directly
contiguous with the preceding code block; update the trailing whitespace around
the `return ""` statement in the same function in benchmark/run.py to match the
file's existing style.
internal/server/server.go (1)

70-77: Bound s.guard growth for long-running sessions.

Each new canonical path gets a fileGuardState, but entries are never evicted. Over time this can grow unbounded in memory for wide file scans/edits.

Lightweight cleanup approach
+func (s *Server) maybeDeleteState(path string, st *fileGuardState) {
+	if st.recentFindBlocks == 0 && st.recentSearchOps == 0 && st.staleCount == 0 && st.lastAnchorSig == "" {
+		delete(s.guard, path)
+	}
+}

Then call it where counters are consumed/reset (for example after successful edit/replace/insert/apply-patch), and clear stale fields when a write succeeds for that file.

Also applies to: 87-117, 118-162

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/server.go` around lines 70 - 77, The s.guard map in
Server.stateFor currently creates unbounded fileGuardState entries for every
path; add bounded-growth eviction and cleanup: implement a size limit or TTL for
entries stored in s.guard and an LRU or time-based eviction policy, and add a
cleanup method on the Server (e.g., pruneStaleGuards) to remove old entries;
call that cleanup from the locations where fileGuardState counters are
consumed/reset (the functions that perform edit/replace/insert/apply-patch) and
when a write succeeds for a given path clear/reset that file's stale fields
inside its fileGuardState so the map doesn't retain stale state. Ensure
concurrency safety by using the same locking used around s.guard when
adding/removing entries and reference fileGuardState, Server.stateFor, and the
write-success code paths when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/server/server.go`:
- Around line 806-808: The findBlockResult currently returns all matches and
full previews causing huge responses; update the response model and handlers to
cap results and expose truncation metadata. Add fields to findBlockResult such
as TotalMatches (int) and Truncated (bool)/ReturnedCount (int), and define
constants like maxFindBlockMatches and maxPreviewLength; in handleFindBlock (and
the similar logic around the block-match code at 1087-1105) only append up to
maxFindBlockMatches matches, truncate each preview to maxPreviewLength, set
TotalMatches to the full match count, set ReturnedCount to the number returned,
and set Truncated=true if any matches or previews were trimmed. Ensure the JSON
keys are exported and included in the response so clients can detect truncation.
- Around line 471-497: The current check-then-act gap is that you validate the
parent chain then call os.MkdirAll(filepath.Dir(cleaned), ...) which allows a
concurrent symlink swap to escape the sandbox; replace the single MkdirAll call
with a safe, component-by-component creation that re-checks each component under
s.cfg.Root as you create it: iterate the same parts used earlier (starting from
s.cfg.Root), for each part do os.Lstat(cur); if it exists ensure it's not a
symlink (or EvalSymlinks(cur) still has the root prefix), if it does not exist
call os.Mkdir(cur, 0o755) (not MkdirAll) and then immediately
re-Lstat/EvalSymlinks to confirm the new component still resolves under root;
return the same tool error on any resolution outside the sandbox. Use the
existing symbols cur, rel, os.Lstat, filepath.EvalSymlinks and remove the single
os.MkdirAll call.

In `@specs/distribution_v1.md`:
- Around line 864-865: The spec currently lists SHA-256 checksum verification
both in the V1 baseline ("Binary tampering | Verify SHA-256 checksum against
published hash after download, then verify `--version`...") and again under the
"future" mitigations, which duplicates/dilutes the requirement; keep SHA-256
checksum verification as a V1 requirement by removing the duplicate from the
future section and instead move/replace the future item to describe enhancements
only (e.g., signature verification, stronger provenance checks or automated
integrity signing). Update the "Binary tampering" entries so V1 contains only
the checksum + `--version` sanity check text and the future section references
only enhancements such as signature-based verification or verified release
metadata.

---

Duplicate comments:
In `@specs/distribution_v1.md`:
- Around line 65-66: The spec currently contradicts itself about whether the V1
default includes the --root flag; pick one normative behavior and make it
consistent: update the V1 default statement (the paragraph referencing "V1
default") to state whether --root is included or omitted, then update all
examples and the two example rows that reference `command: "/path/to/lapp",
args: ["--root", "<cwd>"]` to match that chosen default (either remove --root
from examples and examples table if default omits it, or add it everywhere if
default includes it), and remove any duplicate/conflicting mentions so the
`--root` behavior is unambiguous across the spec and examples.
- Around line 449-455: The spec is inconsistent about the OpenCode V1 normative
flow (references to “no MCP config entry for OpenCode” conflict with the two
alternatives under "Option A (Full plugin)" and "Option B (Instructions-only)");
pick a single V1 normative flow (either the plugin-based flow represented by
"Option A (Full plugin)" or the instructions-only flow represented by "Option B
(Instructions-only), recommended for V1") and remove the other alternative and
any hedging language; update and unify the wording in the sections that
reference this behavior (the statement "no MCP config entry for OpenCode" and
the paragraphs currently under "Option A/Option B" and the later mentions at the
locations that currently correspond to the paragraphs referencing V1 paths) so
they consistently describe the chosen flow, and ensure the references to where
MCP server config and instructions are written (the opencode.json mcpServers key
and instructions path) match the single chosen normative flow.
- Around line 127-133: The spec refers to using the tar package (tar.extract())
but the dependencies list in the manifest (the block containing
"@clack/prompts", "chalk", "commander", etc.) does not include "tar"; add "tar":
"<appropriate-version>" to that dependencies object so the implementation can
require tar and call tar.extract() as specified.

---

Nitpick comments:
In `@benchmark/report.py`:
- Line 47: Replace the EN DASH in the docstring text "Returns 0.0–1.0." with a
standard hyphen-minus so it reads "Returns 0.0-1.0."; locate the docstring in
benchmark/report.py (search for the string "Returns 0.0–1.0."), update the
character, and save the file to ensure tools and IDEs parse the range correctly.
- Around line 128-132: Wrap the JSON parsing in the loop over paths (the for p
in paths: block) with a try/except that catches json.JSONDecodeError (and
optionally ValueError) when calling json.loads(p.read_text()); on error print a
warning to sys.stderr including the path p and the exception message and then
continue so a malformed/partial result file is skipped instead of crashing the
report generation; ensure results.append(...) only happens on successful parse.

In `@benchmark/run.py`:
- Around line 359-360: Remove the extra blank line immediately before the final
return statement so the `return ""` is directly contiguous with the preceding
code block; update the trailing whitespace around the `return ""` statement in
the same function in benchmark/run.py to match the file's existing style.

In `@internal/server/server.go`:
- Around line 70-77: The s.guard map in Server.stateFor currently creates
unbounded fileGuardState entries for every path; add bounded-growth eviction and
cleanup: implement a size limit or TTL for entries stored in s.guard and an LRU
or time-based eviction policy, and add a cleanup method on the Server (e.g.,
pruneStaleGuards) to remove old entries; call that cleanup from the locations
where fileGuardState counters are consumed/reset (the functions that perform
edit/replace/insert/apply-patch) and when a write succeeds for a given path
clear/reset that file's stale fields inside its fileGuardState so the map
doesn't retain stale state. Ensure concurrency safety by using the same locking
used around s.guard when adding/removing entries and reference fileGuardState,
Server.stateFor, and the write-success code paths when making these changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1b5b9bce-e862-4fc2-a4a7-ba7bb8b5dea0

📥 Commits

Reviewing files that changed from the base of the PR and between f3eaedb and 554cd03.

📒 Files selected for processing (5)
  • benchmark/report.py
  • benchmark/run.py
  • internal/server/server.go
  • justfile
  • specs/distribution_v1.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • justfile

Comment thread internal/server/server.go
Comment thread internal/server/server.go
Comment thread specs/distribution_v1.md
php-workx and others added 3 commits April 17, 2026 09:29
handleFindBlock now limits results to maxFindBlockMatches (500) and
includes TotalMatches and Truncated fields in the JSON response so
clients can detect when results were truncated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- report.py: Replace EN DASH with hyphen in docstring
- report.py: Handle JSON decode errors gracefully in load_results
- run.py: Remove extra blank line before return in strategy_hint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dedup SHA-256

- Make --root omission consistent across comparison table and MCP entry
  example (line 65, 300-307)
- Remove Option A/B hedging for OpenCode V1; state instructions-only as
  normative, npm plugin as V2 only
- Add tar npm package to dependencies list
- Remove duplicate "Checksum verification" from future considerations
  (already in V1 security table)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@php-workx
Copy link
Copy Markdown
Owner Author

Addressing nitpick comments from review #4122569056:

  1. report.py: EN DASH — Fixed in commit b754612. Replaced EN DASH with hyphen in docstring.
  2. report.py: JSON decode errors — Fixed in commit b754612. load_results() now catches json.JSONDecodeError and prints a warning instead of crashing.
  3. run.py: extra blank line — Fixed in commit b754612. Removed double blank line before return "" in strategy_hint.
  4. server.go: guard map unbounded growth — Deferring. The s.guard map tracks per-file multiline edit state for an MCP server session. Entries are lightweight (4 fields) and bounded by the number of distinct files edited in a single session. An LRU/TTL eviction policy adds complexity for a problem that doesn't manifest in practice — MCP server sessions are short-lived and the map is GC'd when the process exits. Noted for future hardening if long-running sessions become a concern.

@php-workx php-workx merged commit c838bbe into main Apr 17, 2026
8 checks passed
@php-workx php-workx deleted the feat/mcp-server-v0.2 branch April 17, 2026 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant