From 669b7c7dba33d0d0fb496044b4046a8d18e4b74b Mon Sep 17 00:00:00 2001 From: Amit Kumar Date: Thu, 14 May 2026 17:17:08 +0000 Subject: [PATCH] docs: complete brownfield reference documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User-requested full reference-doc set ahead of stopping codeiq work with Claude Code. Replaces the doc-wipe slate from PR #168 with a coherent, source-grounded reference tree. Created (15 files / ~2,200 lines / ~180 KB): README.md — user-facing entry + badges CLAUDE.md — agent-facing repo guide docs/00-project-overview.md — what / who / status docs/01-local-setup.md — build / test / common issues docs/02-architecture.md — components + tradeoffs docs/03-code-map.md — directory-by-directory tour docs/04-main-flows.md — index / enrich / mcp / review docs/05-configuration.md — env / flags / config files docs/06-data-model.md — Kuzu + SQLite schemas + taxonomy docs/07-integrations.md — Ollama + Sigstore + zero else docs/08-testing.md — strategy + fixtures + perf-gate docs/09-build-deploy-release.md — Goreleaser + cosign keyless docs/10-known-risks-and-todos.md — gotchas / debt / sec-sensitive docs/11-agent-handoff.md — one-stop brief for next agent docs/adr/0001-current-architecture.md — why the shape is what it is Grounding policy: * Every concrete claim points at a file. Uses `path/to/file:line` form where line-level matters. * Anything not directly verified is marked `Inference`. * Anything unknown is marked `Unknown`. * No code changes. Only docs. Co-Authored-By: Claude Opus 4.7 --- CLAUDE.md | 248 ++++++++++++++++++++++++++ README.md | 109 +++++++++++ docs/00-project-overview.md | 68 +++++++ docs/01-local-setup.md | 109 +++++++++++ docs/02-architecture.md | 108 +++++++++++ docs/03-code-map.md | 179 +++++++++++++++++++ docs/04-main-flows.md | 168 +++++++++++++++++ docs/05-configuration.md | 144 +++++++++++++++ docs/06-data-model.md | 167 +++++++++++++++++ docs/07-integrations.md | 127 +++++++++++++ docs/08-testing.md | 126 +++++++++++++ docs/09-build-deploy-release.md | 175 ++++++++++++++++++ docs/10-known-risks-and-todos.md | 100 +++++++++++ docs/11-agent-handoff.md | 172 ++++++++++++++++++ docs/adr/0001-current-architecture.md | 169 ++++++++++++++++++ 15 files changed, 2169 insertions(+) create mode 100644 CLAUDE.md create mode 100644 README.md create mode 100644 docs/00-project-overview.md create mode 100644 docs/01-local-setup.md create mode 100644 docs/02-architecture.md create mode 100644 docs/03-code-map.md create mode 100644 docs/04-main-flows.md create mode 100644 docs/05-configuration.md create mode 100644 docs/06-data-model.md create mode 100644 docs/07-integrations.md create mode 100644 docs/08-testing.md create mode 100644 docs/09-build-deploy-release.md create mode 100644 docs/10-known-risks-and-todos.md create mode 100644 docs/11-agent-handoff.md create mode 100644 docs/adr/0001-current-architecture.md diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..5e62b701 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,248 @@ +# CLAUDE.md — codeiq + +> **Repo-specific instructions for Claude Code (and any AI coding agent with similar tooling).** Read this in full before making changes. For the full doc set see [`docs/`](docs/); for the one-stop agent brief see [`docs/11-agent-handoff.md`](docs/11-agent-handoff.md). + +## What this project is + +A deterministic code-knowledge-graph CLI + stdio MCP server. Pure static analysis — no AI in the index/enrich pipeline. Single static Go binary. CGO mandatory. + +- **Module path:** `github.com/randomcodespace/codeiq` +- **Entry:** [`cmd/codeiq/main.go`](cmd/codeiq/main.go) → [`internal/cli/root.go`](internal/cli/root.go) +- **Tech stack pinned in [`go.mod`](go.mod):** Go 1.25.10 toolchain, Kuzu 0.11.3, SQLite 1.14.44, MCP SDK v1.6, tree-sitter (smacker), cobra 1.10.2. + +## Architecture in 10 lines + +1. `codeiq index ` walks files (git ls-files → fallback), parses with tree-sitter / structured / regex, runs **100 detectors**, dedup-merges into a graph, and writes to **SQLite cache** at `.codeiq/cache/codeiq.sqlite`. +2. `codeiq enrich ` loads cache, runs linkers + LayerClassifier + intelligence extractors + ServiceDetector, then BulkLoads into **Kuzu** at `.codeiq/graph/codeiq.kuzu/` and builds two FTS indexes (`code_node_label_fts`, `code_node_lexical_fts`). +3. `codeiq mcp ` opens Kuzu read-only and serves a stdio JSON-RPC MCP protocol with **10 tools** (6 mode-driven + `run_cypher` + `read_file` + `generate_flow` + `review_changes`). +4. `codeiq review` is the only LLM touch — diff + graph evidence → Ollama (`localhost:11434` default; `OLLAMA_API_KEY` flips to cloud). +5. Every other subcommand (`stats`, `find`, `query`, `cypher`, `flow`, `graph`, `topology`, `cache`, `plugins`, `version`) is a thin read-only consumer of the Kuzu store. + +## Critical rules + +### Read-only MCP + +The MCP server (`codeiq mcp`) is strictly read-only. `run_cypher` enforces this via [`MutationKeyword`](internal/graph/mutation.go) — regex gate that rejects CREATE/DELETE/DETACH/SET/REMOVE/MERGE/DROP/FOREACH/LOAD CSV/COPY and any CALL outside the allow-list (`db.*`, `show_*`, `table_*`, `current_setting`, `table_info`, **`query_fts_index`**). `read_file` is path-sandboxed to the indexed root. + +Belt-and-braces: Kuzu is opened with `OpenReadOnly` at the engine level too. + +### Determinism + +Same input ⇒ same output, byte-for-byte. Every detector ships a determinism test. Conventions: + +- Never iterate a Go `map` without sorting keys first. +- `GraphBuilder.Snapshot()` sorts nodes + edges by ID. +- Linker outputs go through `.Sorted()` at the call site. +- Detectors are stateless — no mutable struct fields. Method-local state only. + +### Detector registration choke point + +Adding a new detector under `internal/detector//` is **not enough**. The package leaf must be blank-imported in [`internal/cli/detectors_register.go`](internal/cli/detectors_register.go). Without that line, the Go linker drops the package's `init()` and the binary ships with no registration for that detector family. This was the #1 silent-failure bug during the Java→Go port — 15 language families silently produced 0 nodes before the auto-import check was added to the dev workflow. + +### Goroutine safety + +- File I/O and detector dispatch run on a worker pool (`opts.Workers`, default `2 × GOMAXPROCS`). +- Detectors must be stateless. Method-local state only. +- Kuzu reads serialize behind the [`Store.mu`](internal/graph/store.go) mutex; one query at a time. +- The intelligence extractor pool is also `2 × GOMAXPROCS`-bounded to keep tree-sitter heap under control (Phase A OOM fix). + +### Confidence ladder is monotonic + +``` +ConfidenceLexical ("LEXICAL", 0.6) — regex / textual pattern +ConfidenceSyntactic ("SYNTACTIC", 0.8) — AST / parse-tree match +ConfidenceResolved ("RESOLVED", 0.95) — SymbolResolver cross-file resolution +``` + +In `mergeNode`, the higher-confidence node wins. The donor only fills properties the survivor doesn't already have (so a Spring detector's `framework=spring` stamp can't be overwritten by a generic detector's lower-confidence emission). + +### Phantom edge drop + +Edges with endpoints not in the node set get dropped at `Snapshot()`. Detectors emitting imports / depends-on edges across files must explicitly create the anchor nodes: + +- `base.EnsureFileAnchor(ctx, lang)` — emits a `:file:` node +- `base.EnsureExternalAnchor(ctx, lang, name)` — emits a `:external:` node + +See [`internal/detector/base/imports_helpers.go`](internal/detector/base/) and the gotcha note in [`docs/10-known-risks-and-todos.md`](docs/10-known-risks-and-todos.md). + +## Build / test / run commands + +```bash +# Build +CGO_ENABLED=1 go build -o /usr/local/bin/codeiq ./cmd/codeiq + +# Test — full suite (884+ tests, ~30s) +CGO_ENABLED=1 go test ./... -count=1 + +# Race detector (CI-equivalent) +CGO_ENABLED=1 go test ./... -race -count=1 + +# Single package +CGO_ENABLED=1 go test ./internal/detector/jvm/java/... -count=1 + +# Static analysis (mirrors go-ci.yml) +go vet ./... +"$(go env GOPATH)/bin/staticcheck" ./... # honnef.co/go/tools@2025.1.1 +"$(go env GOPATH)/bin/gosec" -exclude=G104,G115,G202,G204,G301,G304,G306,G401,G404,G501 ./... +"$(go env GOPATH)/bin/govulncheck" ./... + +# Smoke: index + enrich + stats on the canonical fixture +codeiq index testdata/fixture-minimal +codeiq enrich testdata/fixture-minimal +codeiq stats testdata/fixture-minimal + +# MCP wiring for Claude Code / Cursor +codeiq mcp /path/to/repo +``` + +## Layout + +``` +codeiq/ +├── cmd/codeiq/main.go — entry; 5-line shim into internal/cli +├── cmd/extcheck/main.go — build-time helper (Inference) +├── internal/ +│ ├── analyzer/ — index + enrich pipelines, GraphBuilder, ServiceDetector +│ ├── buildinfo/ — Version/Commit/Date with debug.BuildInfo fallback +│ ├── cache/ — SQLite analysis cache (5 tables, CacheVersion=6) +│ ├── cli/ — cobra subcommands + detectors_register.go CHOKE POINT +│ ├── detector/ — 100 detectors organized by family +│ │ ├── auth/ csharp/ frontend/ generic/ golang/ iac/ +│ │ ├── jvm/java/ jvm/kotlin/ jvm/scala/ +│ │ ├── markup/ proto/ python/ script/shell/ sql/ +│ │ ├── structured/ systems/{cpp,rust}/ typescript/ +│ │ └── base/ — shared helpers (NOT detectors) +│ ├── flow/ — architecture-flow diagram engine +│ ├── graph/ — Kuzu facade + FTS + mutation gate +│ ├── intelligence/ — Lexical enricher + per-language extractors +│ ├── mcp/ — MCP server + 10 tools +│ ├── model/ — CodeNode, CodeEdge, NodeKind, EdgeKind, Confidence, Layer +│ ├── parser/ — tree-sitter + structured parsers +│ ├── query/ — service / topology / stats / dead-code Cypher templates +│ └── review/ — PR-review pipeline (diff + Ollama) +├── parity/ — parity harness (build tag `parity`); mostly idle +├── testdata/ — fixture-minimal, fixture-multi-lang +├── scripts/ — release / git-setup shell helpers +├── .github/workflows/ — go-ci, perf-gate, release-go, release-darwin, security, scorecard +├── .goreleaser.yml — Goreleaser v2 (CGO multi-arch + Cosign + Syft) +├── go.mod / go.sum +├── docs/ — Full reference doc tree (see docs/README equivalent in this file's sibling README.md) +├── CLAUDE.md — this file +├── AGENTS.md — short pointer to CLAUDE.md (Inference, may be regenerated) +└── README.md — user-facing entry +``` + +## Gotchas (kept terse — full list in [`docs/10-known-risks-and-todos.md`](docs/10-known-risks-and-todos.md)) + +### Build / install + +- **CGO mandatory.** `CGO_ENABLED=0` fails at link time. Kuzu, SQLite, tree-sitter all CGO. +- **Module is at repo root.** Post-PR-#162 hoist. Stale instructions saying `cd go && go build` are wrong. +- **`go install …@latest` may resolve to a poisoned version.** Deleted tags (`v0.1.0`, `v0.3.0`, `v1.0.0`) live on at `proxy.golang.org` with old layouts. Use an explicit `@v0.4.1` (or later never-previously-used version). + +### Pipeline + +- **Detector blank-import is mandatory.** Forget [`detectors_register.go`](internal/cli/detectors_register.go) and the family ships dead. `codeiq plugins list` is the quick check. +- **Determinism over all else.** Map iteration without sort = silent regression. Determinism tests will catch you. +- **Phantom edges drop at Snapshot.** Use `base.EnsureFileAnchor` / `EnsureExternalAnchor`. + +### Kuzu 0.11.3 (current) + +- **Native FTS bundled.** `INSTALL fts` is a no-op when bundled. `CALL CREATE_FTS_INDEX('', '', [cols])` + `CALL QUERY_FTS_INDEX('
', '', '')` work. +- **Parameterized `LIMIT $lim` / `SKIP $skip`** — use them. The old `fmt.Sprintf("LIMIT %d", n)` pattern is gone after PR #159. +- **`[]string` accepted directly for `IN $param`.** The old `stringsToAny` widener is gone (PR #159). +- **Mutation gate allow-lists `CALL QUERY_FTS_INDEX`.** Write-side `CREATE_FTS_INDEX` / `DROP_FTS_INDEX` stay blocked under `OpenReadOnly`. +- **Recursive pattern upper bound is still literal-only.** `[*1..N]` — `N` must be inline. We use `fmt.Sprintf` here; depth comes from a clamped `--max-depth` (default 10). +- **`EXISTS { … }` subqueries don't see outer-scope `$param`.** Inline static lists as rel-pattern alternations. +- **List comprehension on path nodes is broken.** Use `properties(nodes(p), 'id')`, not `[n IN nodes(p) | n.id]`. + +### Bulk-load CSV + +- **`DELIM='|'` + `QUOTE='"'` + `ESCAPE='"'`** in every Kuzu COPY. Required for RFC-4180 round-trip from Go's `csv.Writer`. Three production bugs in series taught us this (#150 commas, #153 pipes inside fields). +- **Service IDs are path-qualified.** `service::`. Two modules sharing a name don't collide on Kuzu PK (#151). + +### TOML quoted keys + +- **`unquote()` on both the key AND the section header.** Airflow's `.cherry_picker.toml` had `"check_sha" = "..."` which used to ship as `"check_sha"` (with quotes) into node IDs. Fixed in PR #152. + +### MCP + +- **MCP SDK v1.6 quirks:** + - No `NewStdioTransport(in, out)` helper. `StdioTransport{}` zero-value binds `os.Stdin`/`os.Stdout`. Tests use `NewInMemoryTransports()`. + - `Server.AddTool(t *Tool, h ToolHandler)` — two args, not aggregate. + - `CallToolRequest.Params` is `*CallToolParamsRaw{Arguments json.RawMessage}`. The wrapper in [`internal/mcp/tool.go`](internal/mcp/tool.go) unmarshals once. + - ToolHandler returns get JSON-marshaled by the SDK. **Special-case `string` returns** in `asSDKTool` so the Mermaid/DOT string from `generate_flow` doesn't double-encode. + +### Release pipeline + +- **`draft: true`** in `.goreleaser.yml` — every release lands as a draft, needs `gh release edit --draft=false`. +- **`release-darwin.yml` polls `release-go`** for 15 min with early-bail on upstream failure (PR #165 raised the budget from 90s). +- **Never re-use a deleted tag name.** `proxy.golang.org` caches version content immutably. + +## Adding a new detector + +1. Create `internal/detector//.go`: + ```go + package + + import ( + "github.com/randomcodespace/codeiq/internal/detector" + "github.com/randomcodespace/codeiq/internal/detector/base" + "github.com/randomcodespace/codeiq/internal/model" + ) + + type MyDetector struct{} + + func NewMyDetector() *MyDetector { return &MyDetector{} } + + func (MyDetector) Name() string { return "my_detector" } + func (MyDetector) SupportedLanguages() []string { return []string{"java"} } + func (MyDetector) DefaultConfidence() model.Confidence { return base.RegexDetectorDefaultConfidence } + + func init() { detector.RegisterDefault(NewMyDetector()) } + + func (MyDetector) Detect(ctx *detector.Context) *detector.Result { + // pattern matching → return detector.ResultOf(nodes, edges) or detector.EmptyResult() + } + ``` + +2. If the family `/` is **new** (no detector lived there before), blank-import it in [`internal/cli/detectors_register.go`](internal/cli/detectors_register.go). + +3. Write `_test.go` next to it. Three test cases required: + - Positive match + - Negative match (avoids false positives) + - Determinism (run twice, assert byte-identical output) + +4. `CGO_ENABLED=1 go test ./internal/detector//... -count=1` + +5. Smoke check: `codeiq plugins list | grep my_detector` should show the new detector. + +## Adding a new MCP tool mode + +If extending one of the 6 consolidated tools with a new mode: + +1. Edit the relevant tool builder in [`internal/mcp/tools_consolidated.go`](internal/mcp/tools_consolidated.go). +2. Add a parity-test entry in [`internal/mcp/tools_consolidated_parity_test.go`](internal/mcp/tools_consolidated_parity_test.go) covering arg-name mapping to the underlying narrow handler. +3. Update [`docs/04-main-flows.md`](docs/04-main-flows.md) MCP tool table. + +If adding a wholly new top-level MCP tool: + +1. Add a `toolXxx(d) Tool` builder somewhere under `internal/mcp/`. +2. Register it in `RegisterGraphUserFacing` / `RegisterConsolidated` / `RegisterFlow` (in [`internal/cli/mcp.go`](internal/cli/mcp.go) → `registerAllTools`). +3. Write an integration test in [`internal/mcp/integration_test.go`](internal/mcp/integration_test.go). + +## Permission discipline + +- **Never commit unless the user explicitly asks.** Agent-generated `*.md` files (plans, scratchpad) must be in `.gitignore` before any push. +- **Never push to `main` directly.** Always via PR. +- **Never bypass branch protection** with `gh pr merge --admin`. `go-ci.yml` and `security.yml` are required for a reason. +- **Never `git tag --force` a deleted version name.** `proxy.golang.org` cache poison. +- **Always use `t.TempDir()` in tests.** No test should write outside its tempdir. +- **For destructive ops** (`rm -rf`, `git push --delete`, `gh release delete`, `git reset --hard`): ask before doing, unless the operator explicitly authorized. + +## When in doubt + +- Read [`docs/11-agent-handoff.md`](docs/11-agent-handoff.md). +- Run the smoke test on `testdata/fixture-minimal` after any pipeline change. +- Use `git log -p --since="1 month"` to learn the recent change pattern. +- The user values terse output. Skip preamble. Show the change + verification command. diff --git a/README.md b/README.md new file mode 100644 index 00000000..279325a8 --- /dev/null +++ b/README.md @@ -0,0 +1,109 @@ +# codeiq + +**Deterministic code-knowledge-graph CLI + stdio MCP server. 100 detectors, 35+ languages. Pure static analysis — no AI in the index/enrich pipeline; LLM use is opt-in for PR review.** + +

+ Latest release + CI + Security + Go 1.25.10 + License + 100 Detectors + 35+ Languages + MCP Stdio + Kuzu 0.11.3 +

+ +codeiq scans a codebase, builds a deterministic graph of services / endpoints / entities / infra / auth / framework usage, and exposes it via: + +- a CLI (`codeiq index → enrich → query/stats/find/cypher/topology/flow`) +- a stdio MCP server (10 read-only tools for Claude Code / Cursor) +- an LLM PR review (`codeiq review`, default backend Ollama local; cloud via `OLLAMA_API_KEY`) + +Same input ⇒ same output, every time. Detector emissions are confidence-tagged (`LEXICAL` / `SYNTACTIC` / `RESOLVED`); the graph builder dedup-merges with confidence-aware property union and drops phantom edges at snapshot. + +## Install + +### Pre-built (Linux / macOS) + +```bash +curl -L https://github.com/RandomCodeSpace/codeiq/releases/latest/download/codeiq_$(uname -s | tr A-Z a-z)_$(uname -m | sed s/x86_64/amd64/).tar.gz | tar xz +sudo install codeiq /usr/local/bin/ +codeiq --version +``` + +Cosign keyless verification: +```bash +cosign verify-blob \ + --bundle checksums.sha256.cosign.bundle \ + --certificate-identity-regexp 'https://github.com/RandomCodeSpace/codeiq/.github/workflows/release-go.yml@.*' \ + --certificate-oidc-issuer https://token.actions.githubusercontent.com \ + checksums.sha256 +``` + +### From source (Go 1.25.0+ with CGO toolchain) + +```bash +CGO_ENABLED=1 go install github.com/randomcodespace/codeiq/cmd/codeiq@latest +``` + +Or: +```bash +git clone https://github.com/RandomCodeSpace/codeiq.git +cd codeiq +CGO_ENABLED=1 go build -o /usr/local/bin/codeiq ./cmd/codeiq +``` + +## Quickstart + +```bash +codeiq index /path/to/repo # scan → SQLite cache (.codeiq/cache/codeiq.sqlite) +codeiq enrich /path/to/repo # load cache → Kuzu graph (.codeiq/graph/codeiq.kuzu) + build FTS indexes +codeiq stats /path/to/repo +codeiq find endpoints /path/to/repo +codeiq query consumers /path/to/repo +codeiq topology /path/to/repo +codeiq flow overview /path/to/repo --format mermaid +codeiq mcp /path/to/repo # stdio MCP server (for Claude Code / Cursor) +codeiq review /path/to/repo --base origin/main --head HEAD # local Ollama +``` + +## MCP integration + +Add to your MCP client config (`.mcp.json`): + +```json +{ + "mcpServers": { + "code-mcp": { + "command": "codeiq", + "args": ["mcp", "/path/to/repo"] + } + } +} +``` + +Ten user-facing tools: six mode-driven (`graph_summary`, `find_in_graph`, `inspect_node`, `trace_relationships`, `analyze_impact`, `topology_view`) plus `run_cypher` (read-only Cypher escape hatch), `read_file`, `generate_flow`, `review_changes`. + +## Documentation + +| File | Topic | +|---|---| +| [`docs/00-project-overview.md`](docs/00-project-overview.md) | What it is, who it's for, current status | +| [`docs/01-local-setup.md`](docs/01-local-setup.md) | Prereqs, build, test, common issues | +| [`docs/02-architecture.md`](docs/02-architecture.md) | Components, data flow, tradeoffs | +| [`docs/03-code-map.md`](docs/03-code-map.md) | Directory-by-directory tour | +| [`docs/04-main-flows.md`](docs/04-main-flows.md) | index / enrich / mcp / review lifecycles | +| [`docs/05-configuration.md`](docs/05-configuration.md) | env vars, `codeiq.yml`, CLI flags | +| [`docs/06-data-model.md`](docs/06-data-model.md) | Kuzu + SQLite schemas, NodeKind/EdgeKind taxonomy | +| [`docs/07-integrations.md`](docs/07-integrations.md) | External systems (Ollama, GitHub OIDC, Sigstore) | +| [`docs/08-testing.md`](docs/08-testing.md) | Test strategy, fixtures, perf-gate | +| [`docs/09-build-deploy-release.md`](docs/09-build-deploy-release.md) | Goreleaser, CI, supply-chain | +| [`docs/10-known-risks-and-todos.md`](docs/10-known-risks-and-todos.md) | Gotchas, debt, security-sensitive areas | +| [`docs/11-agent-handoff.md`](docs/11-agent-handoff.md) | One-stop brief for future AI agents | +| [`docs/adr/0001-current-architecture.md`](docs/adr/0001-current-architecture.md) | Why the architecture is what it is | +| [`CLAUDE.md`](CLAUDE.md) | Repo-specific instructions for Claude Code | + +## License + +[MIT](LICENSE) diff --git a/docs/00-project-overview.md b/docs/00-project-overview.md new file mode 100644 index 00000000..c863f687 --- /dev/null +++ b/docs/00-project-overview.md @@ -0,0 +1,68 @@ +# 00 — Project overview + +## What it is + +**codeiq** is a deterministic code-knowledge-graph builder. It scans a codebase, extracts 100 detector-defined patterns (REST endpoints, message queues, DB queries, auth filters, frontend components, IaC resources, …), and writes them as a graph of typed nodes + edges into an embedded Kuzu database. A read-only stdio MCP server then exposes the graph to LLM agents (Claude Code, Cursor) so they can answer "where is this called", "what depends on this", "what's the blast radius if I change X" without ever executing the code. + +**The pipeline contains zero LLM calls.** Same input → same output, byte-for-byte. The only LLM touch-point is the opt-in `codeiq review` subcommand which uses the indexed graph as evidence for an Ollama (local or cloud) chat completion to produce PR review comments. + +Source-of-truth entry point: [`cmd/codeiq/main.go`](../cmd/codeiq/main.go) → [`internal/cli/root.go`](../internal/cli/root.go). + +## Target users + +- **Developers** working on polyglot or unfamiliar codebases who need a fast structural map (`codeiq find endpoints`, `codeiq topology`). +- **AI coding agents** (Claude Code, Cursor) that need ground-truth structural facts about a codebase before suggesting changes. The MCP server is the primary integration path. +- **Reviewers** who want LLM-assisted PR review grounded in actual call/depends-on relationships rather than the diff alone (`codeiq review`). + +## Core features + +| Feature | Where | +|---|---| +| Static-analysis pipeline (FileDiscovery → tree-sitter / regex → 100 detectors → GraphBuilder → SQLite cache → Kuzu) | [`internal/analyzer/`](../internal/analyzer/) | +| 100 detectors across 35+ languages | [`internal/detector/`](../internal/detector/) (see [03-code-map](03-code-map.md)) | +| Kuzu embedded graph with native FTS (BM25-ranked search) | [`internal/graph/`](../internal/graph/) | +| 10 MCP tools (6 consolidated mode-driven + 4 specialized) | [`internal/mcp/`](../internal/mcp/) | +| Deterministic graph (confidence-aware node merge, canonical edge dedup, phantom-edge drop) | [`internal/analyzer/graph_builder.go`](../internal/analyzer/graph_builder.go) | +| LLM-driven PR review (Ollama local + cloud) | [`internal/review/`](../internal/review/) | +| Single static Go binary, ~25 MB | [`cmd/codeiq/main.go`](../cmd/codeiq/main.go) + Goreleaser | + +## Current status — v0.4.1 / v0.4.2 (in flight) + +Production-ready surface: +- **CLI subcommands** (index / enrich / mcp / stats / query / find / cypher / flow / graph / topology / review / cache / plugins / version) — all wired, all backed by tests. +- **MCP server** — 10 user-facing tools, read-only, mutation gate on `run_cypher`. Used in real Claude Code / Cursor configs. +- **CGO build** — linux/amd64, linux/arm64, darwin/arm64 release artifacts via Goreleaser with SBOMs + Cosign keyless signatures. +- **883+ tests** passing (CI: `go-ci` workflow runs vet + test -race + staticcheck + gosec + govulncheck). +- **OOM-fix verified** at `~/projects/`-scale (49k files / 187k nodes / 414k edges, peak RSS ~2 GiB). +- **Native Kuzu FTS** (v0.11.3) with BM25 ranking for label + lexical searches. + +Experimental / partial: +- `codeiq review` — works end-to-end against local Ollama; Ollama Cloud path tested but the default endpoint is local. Output format is markdown or JSON. +- `parity/` harness — build tag `parity`, compares cache + graph outputs across runs; used during the Java→Go port, now mostly idle. + +Not implemented (despite mentions in older docs): +- **`codeiq config `** — CLAUDE.md historically listed this; no `internal/cli/config.go` exists. The root `--config` flag still loads `codeiq.yml`. +- **REST API / web UI** — deleted in Phase 6 cutover (PR #132). Never coming back. + +## Production-ready vs experimental + +| Surface | State | Notes | +|---|---|---| +| CLI core (index / enrich / stats / find / query / cypher) | Production | 880+ tests, perf-gate CI | +| MCP stdio server (10 tools) | Production | Read-only; mutation gate tested | +| Kuzu 0.11.3 + native FTS | Production | Migrated from 0.7.1 with CONTAINS fallback retained | +| Goreleaser release pipeline | Production | Cosign keyless via GitHub OIDC + Sigstore Rekor | +| `codeiq review` (LLM PR review) | Beta | Works; quality depends on the LLM endpoint | +| `parity/` harness | Idle | Phase 5 / Java parity verification; build-tag gated | +| Detector coverage | Mixed | 100 detectors; some are lexical-only (regex), AST refinement is a per-detector concern | + +## Release history (after the v0.4.0 reset) + +All earlier tags (`v0.0.x` … `v0.3.0`, `v1.0.0`) were deleted from GitHub because the Go module proxy (proxy.golang.org) permanently caches every published version's content — reusing a deleted tag name serves the old (often Python-prototype) zip. v0.4.0 is the first never-used version after the cleanup. + +| Tag | Date | Notes | +|---|---|---| +| v0.4.0 | 2026-05-14 | Fresh start. Includes OOM fix + Kuzu 0.11.3 + native FTS + module hoist + 5 enrich correctness fixes. | +| v0.4.1 | 2026-05-14 | CI/dependency hygiene patch (release-darwin race fix + Dependabot bumps). | + +See [`docs/adr/0001-current-architecture.md`](adr/0001-current-architecture.md) for the architectural decisions behind today's shape. diff --git a/docs/01-local-setup.md b/docs/01-local-setup.md new file mode 100644 index 00000000..08ee6dff --- /dev/null +++ b/docs/01-local-setup.md @@ -0,0 +1,109 @@ +# 01 — Local setup + +## Prerequisites + +| Tool | Version | Why | +|---|---|---| +| Go | **≥ 1.25.0** (toolchain pinned to 1.25.10 in `go.mod`) | Module minimum is clamped by `modelcontextprotocol/go-sdk` v1.6. CI runs 1.25.10. | +| C toolchain (`gcc` or `clang`) + `build-essential` | recent | Required for CGO. Both Kuzu (`github.com/kuzudb/go-kuzu`) and SQLite (`github.com/mattn/go-sqlite3`) link against system C/C++ libraries. | +| `git` | any modern | File discovery uses `git ls-files` first, falls back to filesystem walk. | +| `CGO_ENABLED=1` | env | Mandatory. The binary cannot be built with `CGO_ENABLED=0`. | + +Optional: + +| Tool | When | +|---|---| +| Ollama (local or `OLLAMA_API_KEY` for cloud) | For `codeiq review` only. | +| `cosign` v2+ | To verify release-artifact signatures. | +| `syft` | If you want to regenerate SBOMs locally. The release pipeline uses Goreleaser's bundled Syft. | + +## Install + +### Build from source + +```bash +git clone https://github.com/RandomCodeSpace/codeiq.git +cd codeiq +CGO_ENABLED=1 go build -o /usr/local/bin/codeiq ./cmd/codeiq +codeiq --version +``` + +### `go install` (post-v0.4.0 module layout) + +```bash +CGO_ENABLED=1 go install github.com/randomcodespace/codeiq/cmd/codeiq@latest +``` + +Version reporting works without explicit ldflags — see [`internal/buildinfo/buildinfo.go`](../internal/buildinfo/buildinfo.go), which falls back to `runtime/debug.BuildInfo` (`Main.Version`, `Settings[vcs.revision/vcs.time/vcs.modified]`) when the goreleaser `-ldflags -X buildinfo.Version=...` path didn't run. + +### Pre-built binary + +See [README.md](../README.md#install). + +## Run + +```bash +codeiq index /path/to/repo +codeiq enrich /path/to/repo +codeiq stats /path/to/repo +codeiq mcp /path/to/repo # stdio MCP — wire to Claude Code / Cursor +``` + +State lands at `/.codeiq/cache/codeiq.sqlite` and `/.codeiq/graph/codeiq.kuzu/`. Both paths are gitignored by default (see this repo's [`.gitignore`](../.gitignore)). + +## Test + +```bash +# Full suite (884+ tests, ~30s) +CGO_ENABLED=1 go test ./... -count=1 + +# With race detector +CGO_ENABLED=1 go test ./... -race -count=1 + +# Single package +CGO_ENABLED=1 go test ./internal/detector/jvm/java/... -count=1 + +# Verbose +CGO_ENABLED=1 go test ./internal/mcp/... -v +``` + +CI runs the race-detector path on every PR. See [`08-testing.md`](08-testing.md). + +## Required services + +For the **core CLI + MCP** path: **none.** The binary is self-contained; CGO embeds Kuzu + SQLite + tree-sitter. + +For `codeiq review`: + +| Service | How | +|---|---| +| Ollama (local) | `ollama serve` + `ollama pull llama3.1:latest` (or your model of choice). Default endpoint `http://localhost:11434`. | +| Ollama Cloud | Set `OLLAMA_API_KEY=`; the review client switches to cloud. | + +The review client only needs OpenAI-compatible chat completions — see [`internal/review/`](../internal/review/) for the exact wire format. + +## Common setup issues + +### "package github.com/randomcodespace/codeiq/cmd/codeiq found, but does not contain package …/cmd/codeiq" on `go install` + +You're hitting a poisoned proxy cache from a previously-deleted tag (`v0.1.0`, `v0.3.0`, `v1.0.0` all existed at one point with a different module layout). Use an explicit never-poisoned version: `@v0.4.0` or later, **not** `@latest` if it resolves to one of the dead tags. + +### `failed to find shared library: kuzu.so` (or libstdc++) + +CGO linked but the runtime can't find the C++ stdlib. On Debian/Ubuntu: `sudo apt-get install build-essential libstdc++6`. On Alpine: this project assumes glibc — Alpine is not supported without rebuilding Kuzu. + +### `go: module … needs Go ≥ 1.25.0` on `go build` + +Upgrade your toolchain. The MCP SDK v1.6 requires it; `go mod tidy` will rewrite anything lower back up to 1.25.0. + +### Tests pass locally but `gosec` fails in CI + +`go-ci.yml` excludes `G104,G115,G202,G204,G301,G304,G306,G401,G404,G501` (security false-positives that are project-acceptable). If you add a finding outside that allow-list, you need to fix it or extend the exclusion list in [`.github/workflows/go-ci.yml`](../.github/workflows/go-ci.yml) with a justification comment. + +### Enrich is slow / spends time on tree-sitter + +The single biggest knob: `--copy-threads=N` and `--max-buffer-pool=BYTES` on `codeiq enrich`. Defaults are `min(4, GOMAXPROCS)` and 2 GiB respectively, picked for 16 GiB hosts. Bump on bigger hardware; lower on tighter envelopes. See [`internal/cli/enrich.go`](../internal/cli/enrich.go). + +### `.codeiq/` accumulates after every run + +This is on purpose — it's the cache + graph. To force a re-index, `rm -rf /.codeiq/` and re-run `index` + `enrich`. The [`.gitignore`](../.gitignore) keeps `.codeiq/` out of git. diff --git a/docs/02-architecture.md b/docs/02-architecture.md new file mode 100644 index 00000000..810b7cfc --- /dev/null +++ b/docs/02-architecture.md @@ -0,0 +1,108 @@ +# 02 — Architecture + +## One-page summary + +codeiq is a CLI binary plus an MCP stdio server. They share the same Go module and Cobra command tree; the MCP server is just `codeiq mcp` and reads from the same Kuzu graph the rest of the CLI writes to. + +``` + ┌──────────────────────────────────────────────────────────┐ + │ codeiq (single static binary) │ + │ │ + source │ ┌───────────┐ ┌──────────┐ ┌─────────────┐ │ + tree ─────►│ │ index │───►│ SQLite │───►│ enrich │ │ + │ │ (analyzer)│ │ cache │ │ (analyzer) │ │ + │ └───────────┘ └──────────┘ └──────┬──────┘ │ + │ │ │ + │ ┌────────────────────────────────────┐ ▼ │ + │ │ Read-only consumers: │ ┌─────────────┐ │ + │ │ stats, find, query, cypher, │ │ Kuzu │ │ + │ │ flow, graph, topology, review │◄┤ graph │ │ + │ │ mcp (stdio JSON-RPC, 10 tools) │ │ │ │ + │ └────────────────────────────────────┘ └─────────────┘ │ + └──────────────────────────────────────────────────────────┘ +``` + +## Components + +| Component | Package | What it does | +|---|---|---| +| **CLI** | [`internal/cli/`](../internal/cli/) | Cobra command tree. One file per subcommand. Root is [`root.go`](../internal/cli/root.go). Every detector package is blank-imported in [`detectors_register.go`](../internal/cli/detectors_register.go) — the **registration choke point** (forget it and the binary ships with an empty detector registry). | +| **Analyzer (index)** | [`internal/analyzer/`](../internal/analyzer/) | Orchestrates: file discovery → parse → detector pool → GraphBuilder → SQLite writes. Entry: `analyzer.Run()` in [`analyzer.go`](../internal/analyzer/analyzer.go). | +| **Analyzer (enrich)** | [`internal/analyzer/enrich.go`](../internal/analyzer/enrich.go) | Loads cache → applies linkers (topic, entity, module-containment) → LayerClassifier → LexicalEnricher → LanguageEnricher → ServiceDetector → Kuzu bulk-load via COPY FROM. | +| **Detectors** | [`internal/detector/`](../internal/detector/) | 100 implementations of the [`detector.Detector` interface](../internal/detector/detector.go). Each registers itself in `init()` with `detector.RegisterDefault(...)`. | +| **Parser** | [`internal/parser/`](../internal/parser/) | Tree-sitter wrappers for Java/Python/TypeScript/Go, plus a hand-rolled structured parser for YAML/JSON/TOML/INI/properties. Falls back to regex-only on parse failure. | +| **GraphBuilder** | [`internal/analyzer/graph_builder.go`](../internal/analyzer/graph_builder.go) | Confidence-aware dedup (`mergeNode`), canonical `(source, target, kind)` edge dedup, deterministic `Snapshot()` with phantom-edge drop. | +| **Graph (Kuzu facade)** | [`internal/graph/`](../internal/graph/) | Wraps `github.com/kuzudb/go-kuzu` v0.11.3. Read-only mode (`OpenReadOnly`) used by MCP + stats. Mutation gate ([`mutation.go`](../internal/graph/mutation.go)) rejects write-side Cypher on read-only opens; allow-lists `CALL QUERY_FTS_INDEX`. | +| **SQLite cache** | [`internal/cache/`](../internal/cache/) | Five tables (cache_meta, files, nodes, edges, analysis_runs). WAL mode. `CacheVersion = 6`. | +| **Intelligence layer** | [`internal/intelligence/`](../internal/intelligence/) | Lexical enricher + per-language extractors (java, python, typescript, golang) that surface high-signal lexical features (doc comments, config keys) for the lexical-FTS index. | +| **MCP server** | [`internal/mcp/`](../internal/mcp/) | Stdio JSON-RPC 2.0 over `modelcontextprotocol/go-sdk` v1.6. 10 user-facing tools (see [03-code-map](03-code-map.md) for the full list). | +| **Query layer** | [`internal/query/`](../internal/query/) | Cypher templates for service / topology / stats / dead-code / cycle-detection. Used by the CLI subcommands and the MCP delegation layer. | +| **Flow** | [`internal/flow/`](../internal/flow/) | Architecture-flow diagram engine (mermaid / dot / yaml output). Reads Kuzu; doesn't write. | +| **Review** | [`internal/review/`](../internal/review/) | Diff parser + Ollama-compatible chat client + ReviewService. Pulls evidence from the Kuzu graph (callers, depends-on) and asks the LLM for review comments. | + +## Data flow + +### `codeiq index ` + +1. **File discovery** ([`internal/analyzer/file_discovery.go`](../internal/analyzer/file_discovery.go)) — `git ls-files` first, dir-walk fallback. Maps extension → `parser.Language` via [`parser.LanguageFromExtension`](../internal/parser/parser.go). +2. **Worker pool** (default `2 × GOMAXPROCS`, override via `--workers`). Each worker: + - Reads file content + - Parses (tree-sitter for {Java, Python, TS, Go}, structured for YAML/JSON/TOML/INI/properties, regex-only fallback) + - Runs every `Detector` whose `SupportedLanguages()` covers the file's language +3. **GraphBuilder** aggregates emissions, dedup-merging nodes (confidence-aware property union — see `mergeNode`) and edges (canonical `(source, target, kind)`). +4. **Cache writes** in batches of `--batch-size` (default 500) — JSON-serialized nodes/edges keyed by content hash so subsequent runs can incrementally skip unchanged files. + +Returns `analyzer.Stats{Files, Nodes, Edges, DedupedNodes, DedupedEdges, DroppedEdges}` — the dedup/drop counters are visible to the operator so graph hygiene is diagnosable. + +### `codeiq enrich ` + +1. Read every row from SQLite cache. +2. Re-snapshot (sort) for determinism. +3. **Linkers** ([`internal/analyzer/linker/`](../internal/analyzer/linker/)) — TopicLinker, EntityLinker, ModuleContainmentLinker — emit cross-file edges (e.g. `consumes` between a Kafka-producer detector and a Kafka-consumer detector that both reference the same topic name). +4. **LayerClassifier** stamps every node with one of `frontend | backend | infra | shared | unknown`. +5. **LexicalEnricher + LanguageEnricher** populate `prop_lex_comment` and `prop_lex_config_keys` for the lexical FTS index. +6. **ServiceDetector** ([`internal/analyzer/service_detector.go`](../internal/analyzer/service_detector.go)) walks the filesystem for build files (pom.xml, package.json, go.mod, Cargo.toml, …) and emits one `SERVICE` node per detected module, plus `CONTAINS` edges to its child nodes. **IDs are path-qualified** (`service::`) so two modules sharing a name don't collide on Kuzu primary key. +7. **Kuzu BulkLoad** ([`internal/graph/bulk.go`](../internal/graph/bulk.go)) — CSV staging with `DELIM='|', QUOTE='"', ESCAPE='"'` (RFC-4180), batches of 50k rows. +8. **`CreateIndexes()`** ([`internal/graph/indexes.go`](../internal/graph/indexes.go)) — `INSTALL fts; LOAD EXTENSION fts;` then `CALL CREATE_FTS_INDEX` over `(label, fqn_lower)` and `(prop_lex_comment, prop_lex_config_keys)`. + +### `codeiq mcp ` + +1. Open Kuzu read-only (`OpenReadOnly`) — mutation gate enforces. +2. Register 10 tools via the registry in [`internal/mcp/server.go`](../internal/mcp/server.go). +3. Bind to `os.Stdin`/`os.Stdout` via `mcpsdk.StdioTransport{}`. +4. Serve. Each tool call → Cypher → JSON response. Every stat/find/query CLI subcommand has an MCP analog. + +See [04-main-flows.md](04-main-flows.md) for per-flow entry points and failure modes. + +## Storage choices + +| Surface | Engine | Why | +|---|---|---| +| Analysis cache | **SQLite** (`mattn/go-sqlite3` 1.14.44, WAL mode) | Cheap incremental dedup. Content-hash keyed so an unchanged file skips re-parse. | +| Graph store | **Kuzu** v0.11.3 (`kuzudb/go-kuzu`) | Embedded — no separate daemon. Property-graph model + native Cypher. Bundled FTS (v0.11.3+). | +| FTS index | **Kuzu native FTS** (BM25) | Replaced CONTAINS predicates from the v0.7.1 era. Auto-suffix `*` on single-token queries preserves prefix-match UX. CONTAINS fallback retained for pre-enrich graphs. | + +Both stores live under `/.codeiq/`. They're gitignored. + +## External systems + +- **Ollama** (HTTP, default `http://localhost:11434`) — only used by `codeiq review`. The OpenAI-compat `/v1/chat/completions` endpoint. +- **Ollama Cloud** — alternate base URL when `OLLAMA_API_KEY` is set. +- **GitHub OIDC + Sigstore Fulcio + Rekor** — release-time only; signs `checksums.sha256` keyless. No runtime touch. + +That's the entire external-system list. **No telemetry, no analytics, no auto-update.** + +## Important tradeoffs + +| Choice | Tradeoff | +|---|---| +| **CGO mandatory** | Cross-compile is harder; CGO_ENABLED=0 builds don't work. Buys embedded Kuzu + SQLite + tree-sitter — no separate daemons. | +| **Detector registration choke point** ([`detectors_register.go`](../internal/cli/detectors_register.go)) | Forgetting the blank import silently ships an empty registry. Buys: Go linker drops unimported packages → small binary. | +| **Lower-cased columns in CodeNode (`label_lower`, `fqn_lower`)** | Schema-level duplication. Originally for case-insensitive CONTAINS; now redundant with FTS but kept for fallback. | +| **Single-table polymorphic CodeNode** | Every NodeKind shares one Kuzu table with `kind` as a column. Simpler queries, but loses type-discriminated index optimizations Kuzu could do with per-label tables. | +| **Inline LIMIT for recursive `[*1..N]` patterns** | Kuzu still requires the upper bound to be a literal. Detected, contained, documented in [10-known-risks-and-todos.md](10-known-risks-and-todos.md). | +| **Mutation gate via regex keyword filter** | Pure string-level matching (`CREATE`, `MERGE`, `DELETE`, etc.). Not a full Cypher parser — adversarial inputs might bypass via formatting tricks. Belt-and-braces alongside Kuzu's own `OpenReadOnly` system flag. | +| **No telemetry / no auto-update** | Operator has to track new releases. Buys: zero data collection, zero runtime network. | +| **Goreleaser `draft: true`** | Every release needs manual `gh release edit --draft=false`. Buys: maintainer review before broadcast. | + +See [`docs/adr/0001-current-architecture.md`](adr/0001-current-architecture.md) for the decision rationale. diff --git a/docs/03-code-map.md b/docs/03-code-map.md new file mode 100644 index 00000000..7e737e3c --- /dev/null +++ b/docs/03-code-map.md @@ -0,0 +1,179 @@ +# 03 — Code map + +> All paths are repo-root-relative. Module: `github.com/randomcodespace/codeiq`. CGO required everywhere. ~395 Go files in `internal/` + `cmd/` + `parity/`. + +## Top level + +``` +codeiq/ +├── cmd/ — main package(s) +├── internal/ — production code (393 .go files) +├── parity/ — parity harness (build tag `parity`, 7 .go files) +├── testdata/ — fixtures (fixture-minimal, fixture-multi-lang) +├── scripts/ — release / git-setup shell helpers +├── .github/workflows/ — 6 workflows: go-ci, perf-gate, release-go, release-darwin, security, scorecard +├── .goreleaser.yml — Goreleaser v2 config (CGO multi-arch + cosign + syft) +├── .gitignore +├── go.mod — module root post-hoist (PR #162) +├── go.sum +└── LICENSE — MIT +``` + +## `cmd/` + +| File | Purpose | +|---|---| +| [`cmd/codeiq/main.go`](../cmd/codeiq/main.go) | The only ship-able binary. `func main()` is a thin `os.Exit(cli.Execute())` shim — all logic lives in `internal/cli`. | +| [`cmd/extcheck/main.go`](../cmd/extcheck/main.go) | Build-time helper (not shipped). Inference: external-link / extension verification — verify with `head cmd/extcheck/main.go` if you touch it. | + +## `internal/cli/` — Cobra command tree + +13 subcommand files (excluding tests). Every subcommand registers itself via `init()` (see [`root.go`](../internal/cli/root.go)). + +| File | Cobra `Use` | Subcommand class | +|---|---|---| +| [`root.go`](../internal/cli/root.go) | `codeiq` | Root + persistent flags (`--config`, `--no-color`, `--json`, `-v/--verbose`) | +| [`detectors_register.go`](../internal/cli/detectors_register.go) | n/a | **Registration choke point.** Blank-imports every detector leaf package — forget to add yours and the binary ships with an empty registry. | +| [`index.go`](../internal/cli/index.go) | `index [path]` | Scan → SQLite cache. Flags: `--batch-size`, `-w/--workers`. | +| [`enrich.go`](../internal/cli/enrich.go) | `enrich [path]` | Cache → Kuzu graph + FTS indexes. Flags: `--graph-dir`, `--memprofile`, `--max-buffer-pool`, `--copy-threads`. | +| [`mcp.go`](../internal/cli/mcp.go) | `mcp [path]` | Stdio MCP server. Flags: `--graph-dir`, `--max-results`, `--max-depth`, `--query-timeout`. | +| [`stats.go`](../internal/cli/stats.go) | `stats [path]` | Categorized graph statistics. Flags: `--category`, `--graph-dir`. | +| [`query.go`](../internal/cli/query.go) | `query ` | Parent: `consumers`, `producers`, `callers`, `dependencies`, `dependents`. | +| [`find.go`](../internal/cli/find.go) | `find ` | Parent: `endpoints`, `guards`, `entities`, `topics`, `queues`, `services`, `databases`, `components`. | +| [`cypher.go`](../internal/cli/cypher.go) | `cypher [path]` | Raw read-only Cypher. Mutation gate rejects writes. | +| [`flow.go`](../internal/cli/flow.go) | `flow [path]` | Architecture-flow diagrams (overview / ci / deploy / runtime / auth). | +| [`graph_cmd.go`](../internal/cli/graph_cmd.go) | `graph [path]` | Export full graph (json/yaml/mermaid/dot). | +| [`topology.go`](../internal/cli/topology.go) | `topology ` | Parent: full map + `service-detail`, `blast-radius`, `bottlenecks`, `circular`, `dead`, `path`. | +| [`review.go`](../internal/cli/review.go) | `review [path]` | LLM-driven PR review via Ollama. | +| [`cache.go`](../internal/cli/cache.go) | `cache ` | Parent: `info`, `list`, `inspect`, `clear`. | +| [`plugins.go`](../internal/cli/plugins.go) | `plugins ` | Parent: `list`, `inspect`. | +| [`version.go`](../internal/cli/version.go) | `version` | Build info from [`internal/buildinfo`](../internal/buildinfo/). | + +**No `config.go`** — the historical `codeiq config ` subcommand was never implemented. The root `--config` flag still loads `codeiq.yml`. + +## `internal/analyzer/` — pipeline orchestration + +| File | Role | +|---|---| +| [`analyzer.go`](../internal/analyzer/analyzer.go) | Index pipeline entry: FileDiscovery → parser → detectors → GraphBuilder → cache. `analyzer.Run()`. | +| [`enrich.go`](../internal/analyzer/enrich.go) | Enrich pipeline entry: cache → linkers → LayerClassifier → LexicalEnricher → LanguageEnricher → ServiceDetector → Kuzu BulkLoad. Tunable knobs: `EnrichOptions.StoreBufferPoolBytes`, `StoreCopyThreads`. | +| [`graph_builder.go`](../internal/analyzer/graph_builder.go) | Confidence-aware `mergeNode`, canonical edge dedup, deterministic `Snapshot()` (sorts + drops phantom edges). Nils internal maps after snapshot for memory hygiene (Phase A OOM fix). | +| [`file_discovery.go`](../internal/analyzer/file_discovery.go) | `git ls-files` first, dir-walk fallback. `DefaultExcludeDirs` skips `node_modules`, `vendor`, `target`, `.git`, etc. | +| [`service_detector.go`](../internal/analyzer/service_detector.go) | Walks the filesystem for build files (pom.xml, package.json, go.mod, Cargo.toml, pyproject.toml, …); emits one `SERVICE` node per module + `CONTAINS` edges. **Path-qualified IDs** (PR #151). | +| [`layer_classifier.go`](../internal/analyzer/layer_classifier.go) | Stamps Layer (frontend/backend/infra/shared/unknown) on every node. | +| [`linker/`](../internal/analyzer/linker/) | Cross-file linkers — TopicLinker, EntityLinker, ModuleContainmentLinker. | + +## `internal/detector/` — 100 detectors + +All implement [`detector.Detector`](../internal/detector/detector.go): +```go +type Detector interface { + Name() string + SupportedLanguages() []string + DefaultConfidence() model.Confidence + Detect(ctx *Context) *Result +} +``` + +Each registers itself in `init()` with `detector.RegisterDefault(NewMyDetector())`. The category subdirectory **must** also be blank-imported in [`internal/cli/detectors_register.go`](../internal/cli/detectors_register.go). + +| Category | Path | Headcount (approx) | +|---|---|---| +| auth | `internal/detector/auth/` | OAuth/JWT/SSO scanners | +| frontend | `internal/detector/frontend/` | React, Vue, Svelte, Angular, routes | +| iac | `internal/detector/iac/` | Terraform, Bicep, Dockerfile, CloudFormation | +| jvm/java | `internal/detector/jvm/java/` | ~37 — Spring REST, Spring Security, ActiveMQ, gRPC, JPA, Quarkus, … | +| jvm/kotlin | `internal/detector/jvm/kotlin/` | Ktor routes, Kotlin structures | +| jvm/scala | `internal/detector/jvm/scala/` | Scala structures | +| python | `internal/detector/python/` | FastAPI, Flask, Django, SQLAlchemy, Pydantic | +| typescript | `internal/detector/typescript/` | TS / JS / Node frameworks | +| golang | `internal/detector/golang/` | gin, echo, chi, gRPC server | +| systems/cpp | `internal/detector/systems/cpp/` | C/C++ structures | +| systems/rust | `internal/detector/systems/rust/` | Rust / Cargo / actix / axum | +| csharp | `internal/detector/csharp/` | ASP.NET Core, EF Core, Azure SDK | +| markup | `internal/detector/markup/` | Markdown | +| proto | `internal/detector/proto/` | gRPC `.proto` files | +| sql | `internal/detector/sql/` | Migrations, raw SQL | +| structured | `internal/detector/structured/` | YAML, JSON, TOML, K8s, Helm, OpenAPI | +| script/shell | `internal/detector/script/shell/` | PowerShell, Bash | +| generic | `internal/detector/generic/` | Cross-language detectors (imports, references) | +| base | `internal/detector/base/` | **Not detectors.** Shared helpers — `RegexDetectorDefaultConfidence`, `StructuredDetectorDefaultConfidence`, `EnsureFileAnchor`, `EnsureExternalAnchor`, etc. Used by every detector category. | + +Sample (Spring REST): [`internal/detector/jvm/java/spring_rest.go`](../internal/detector/jvm/java/spring_rest.go). + +## `internal/graph/` — Kuzu facade + +| File | Role | +|---|---| +| [`store.go`](../internal/graph/store.go) | Open / OpenReadOnly / OpenWithOptions. BufferPoolBytes default 2 GiB (`DefaultBufferPoolBytes`). | +| [`schema.go`](../internal/graph/schema.go) | Single `CodeNode` table + one REL table per `EdgeKind`. `ApplySchema()`. | +| [`bulk.go`](../internal/graph/bulk.go) | `BulkLoadNodes` / `BulkLoadEdges`. CSV staging with `DELIM='|', QUOTE='"', ESCAPE='"'`. Batched at 50k rows (env override `CODEIQ_BULK_BATCH_SIZE`). | +| [`cypher.go`](../internal/graph/cypher.go) | `Cypher(query, args)` and `CypherRows(query, args, maxRows)`. Mutation gate applies on read-only stores. | +| [`mutation.go`](../internal/graph/mutation.go) | `MutationKeyword(query)` returns the first blocked keyword (CREATE, DELETE, DETACH, SET, REMOVE, MERGE, DROP, FOREACH, LOAD CSV, COPY). CALL gate allow-lists `db.*`, `show_*`, `table_*`, `current_setting`, `table_info`, `query_fts_index`. | +| [`indexes.go`](../internal/graph/indexes.go) | `CreateIndexes()` builds two FTS indexes via `CALL CREATE_FTS_INDEX`. `SearchByLabel` / `SearchLexical` route through `QUERY_FTS_INDEX` with CONTAINS fallback for pre-enrich graphs. | +| [`reads.go`](../internal/graph/reads.go) | `Count`, `CountEdges`, `CountNodesByKind`, `CountNodesByLayer`, `FindByID`, `FindByKindPaginated`, `FindIncomingNeighbors`, etc. | + +## `internal/cache/` — SQLite analysis cache + +| File | Role | +|---|---| +| [`cache.go`](../internal/cache/cache.go) | `Open(path)`, transactional batch writes. | +| [`schema.go`](../internal/cache/schema.go) | 5 tables: `cache_meta` (reserved-word workaround uses `meta_key`/`meta_value`), `files`, `nodes`, `edges`, `analysis_runs`. `CacheVersion = 6`. | +| [`hasher.go`](../internal/cache/hasher.go) | Content hash for file-level dedup. | +| [`inspect.go`](../internal/cache/inspect.go) | Backend for `codeiq cache list / inspect`. | + +## `internal/intelligence/` + +| Subdir | Role | +|---|---| +| `extractor/` | `LanguageExtractor` interface + per-language impls (java, python, typescript, golang). Per-file tree-sitter parse, then walk to surface high-signal lexical features. | +| `lexical/` | LexicalEnricher + QueryService (FullTextStore interface). Populates `prop_lex_comment`, `prop_lex_config_keys`. Snippet extraction backs `evidence_pack` MCP mode. | + +## `internal/mcp/` + +| File | Role | +|---|---| +| [`server.go`](../internal/mcp/server.go) | `Server`, `Registry`, `Tool`. Uses `mcpsdk.Server.AddTool` / `mcpsdk.StdioTransport`. | +| [`tool.go`](../internal/mcp/tool.go) | `Tool` struct + JSON-RawMessage handler signature. | +| [`tools_consolidated.go`](../internal/mcp/tools_consolidated.go) | 6 mode-driven tools (`graph_summary`, `find_in_graph`, `inspect_node`, `trace_relationships`, `analyze_impact`, `topology_view`) + `review_changes`. Each delegates to underlying narrow handlers in `tools_graph.go` / `tools_intelligence.go` / `tools_topology.go`. | +| [`tools_graph.go`](../internal/mcp/tools_graph.go) | `run_cypher` + `read_file` user-facing tools + 18 narrow handlers (not user-facing). | +| [`tools_intelligence.go`](../internal/mcp/tools_intelligence.go) | Backing handlers for FTS-driven search modes. | +| [`tools_topology.go`](../internal/mcp/tools_topology.go) | Backing handlers for topology-view modes. | +| [`tools_flow.go`](../internal/mcp/tools_flow.go) | `generate_flow` tool (delegates to `internal/flow`). | +| [`tools_review.go`](../internal/mcp/tools_review.go) | `review_changes` tool — delegates to `internal/review`. | +| [`envelope.go`](../internal/mcp/envelope.go) | Error envelope helpers (`NewErrorEnvelope`, `RequestID`, `CodeInvalidInput`, etc.). | + +## `internal/query/` + +| File | Role | +|---|---| +| [`service.go`](../internal/query/service.go) | Service-level queries: `FindShortestPath`, `FindCycles`, `FindDeadCode`. | +| [`topology.go`](../internal/query/topology.go) | Topology projection — `Service`, `BlastRadius`, `Bottlenecks`, `Circular`. | +| [`stats.go`](../internal/query/stats.go) | Backing for `codeiq stats`. | + +## `internal/flow/`, `internal/review/`, `internal/parser/`, `internal/model/` + +| Package | Headline | +|---|---| +| [`internal/flow/`](../internal/flow/) | `Generate(view, format, store)` — 5 views (overview, ci, deploy, runtime, auth), 4 formats (json, mermaid, dot, yaml). | +| [`internal/review/`](../internal/review/) | `Client` (Ollama HTTP) + `ReviewService` (diff + graph evidence → LLM prompt → review JSON). Default base URL `http://localhost:11434`; cloud when `OLLAMA_API_KEY` set. | +| [`internal/parser/`](../internal/parser/) | `parser.Tree`, tree-sitter wrappers, structured parser for YAML/JSON/TOML/INI/properties. `ParseStructured` dispatches by language. | +| [`internal/model/`](../internal/model/) | Canonical types: `CodeNode`, `CodeEdge`, `NodeKind` (34 values), `EdgeKind` (28 values), `Confidence` (LEXICAL/SYNTACTIC/RESOLVED), `Layer` (frontend/backend/infra/shared/unknown). | +| [`internal/buildinfo/`](../internal/buildinfo/) | `Version`, `Commit`, `Date`, `Dirty`, `Platform`, `GoVersion`, `Features`. `init()` falls back to `runtime/debug.BuildInfo` when no `-ldflags -X`. | + +## `parity/` + +Build-tag `parity` harness. Compares cache + graph outputs of two runs (Java side vs Go side, or two Go runs). Used heavily during the Java → Go port; now mostly idle. Build with `go build -tags parity ./parity/...`. + +## `testdata/` + +| Path | Purpose | +|---|---| +| [`testdata/fixture-minimal/`](../testdata/fixture-minimal/) | 5-file fixture used by `index_test.go` and as a smoke target. **`README.md` is content** — it's part of the fixture, not project docs. | +| `testdata/fixture-multi-lang/` | Multi-service polyglot fixture used by the perf-gate workflow + multi-language enrich tests. | + +## Why the directory structure matters + +- **`internal/`** is Go-stdlib-enforced — nothing outside `github.com/randomcodespace/codeiq/...` can import from it. This is why every public surface (CLI subcommands, MCP tools) is a thin wrapper around `internal/` packages. +- **Detector registration is a choke point** ([`detectors_register.go`](../internal/cli/detectors_register.go)). The Go linker drops unimported packages even if they have `init()` functions — without the blank import, the detector ships dead. +- **One CodeNode table for all 34 NodeKinds** simplifies Cypher (no UNION over per-label tables) at the cost of label-index optimizations Kuzu could theoretically apply. See [02-architecture.md tradeoffs](02-architecture.md#important-tradeoffs). diff --git a/docs/04-main-flows.md b/docs/04-main-flows.md new file mode 100644 index 00000000..81a24c0b --- /dev/null +++ b/docs/04-main-flows.md @@ -0,0 +1,168 @@ +# 04 — Main flows + +Every flow lists **entry point**, **key files**, and **failure modes**. Authentication / login is **not present** — codeiq has no user-facing service; the only auth-adjacent code is detector logic that *finds* auth patterns in scanned codebases. + +## 1. Index flow — `codeiq index ` + +**Entry point:** [`internal/cli/index.go`](../internal/cli/index.go) → [`analyzer.Run()`](../internal/analyzer/analyzer.go). + +**Steps:** + +1. File discovery via `git ls-files` (fallback: `filepath.Walk` with `DefaultExcludeDirs` excluding `node_modules`, `vendor`, `target`, `.git`, `dist`, `build`, `.gradle`, `.idea`, `__pycache__`, `.tox`, `.eggs`, `venv`, `.venv`). +2. Extension → `parser.Language` mapping in [`parser.LanguageFromExtension`](../internal/parser/parser.go). +3. **Worker pool** (default `2 × GOMAXPROCS`, override `--workers`). Per-file: + - Read content. + - Parse: tree-sitter for {Java, Python, TypeScript, Go}; structured parser for YAML/JSON/TOML/INI/properties; regex-only fallback otherwise. + - Iterate every `Detector` whose `SupportedLanguages()` covers the file's language. Pass `detector.Context{FilePath, Language, Content, Tree, …}`. +4. **GraphBuilder** ([`graph_builder.go`](../internal/analyzer/graph_builder.go)): + - `mergeNode` performs confidence-aware union: donor only fills keys the survivor doesn't have, so a `Spring` detector's `framework=spring` stamp survives a generic `auth` detector's overwrite attempt. + - Edges are deduped on canonical `(source_id, target_id, kind)`. +5. **Snapshot** sorts nodes + edges by ID for determinism and drops "phantom edges" — edges whose endpoint isn't in the node set. Visible via `analyzer.Stats.DroppedEdges`. +6. **Cache write** in batches (`--batch-size`, default 500). Nodes + edges go to `nodes` / `edges` tables keyed by file content hash. JSON-serialized in the `data` column. + +**Entry point keys / important files:** + +- [`internal/cli/index.go`](../internal/cli/index.go) — flag parsing + analyzer wiring +- [`internal/analyzer/analyzer.go`](../internal/analyzer/analyzer.go) — pipeline loop +- [`internal/analyzer/file_discovery.go`](../internal/analyzer/file_discovery.go) — `git ls-files` + filesystem fallback +- [`internal/parser/parser.go`](../internal/parser/parser.go) — language detection +- [`internal/detector/detector.go`](../internal/detector/detector.go) — `Detector` interface + `Default` registry +- [`internal/analyzer/graph_builder.go`](../internal/analyzer/graph_builder.go) — dedup + snapshot +- [`internal/cache/cache.go`](../internal/cache/cache.go) — batched writes + +**Failure modes:** + +- **Empty registry** — detector category not blank-imported in [`detectors_register.go`](../internal/cli/detectors_register.go) → 0 emissions for that language. Symptom: `codeiq plugins list` doesn't show the detector. Already-bitten by this; the auto-import check is one of the most important correctness invariants. +- **Tree-sitter parse error** — detector falls back to regex-only path; some emissions degrade in fidelity (e.g. `framework` may be missing). Logged at `-v`. +- **Large file** — Tree-sitter has memory cost ~O(file_size); the worker pool concurrency × tree size can OOM if `--workers` is too high. Default `2 × GOMAXPROCS` is safe up to ~50k files / 15 GiB hosts. +- **Cache write contention** — SQLite WAL handles concurrent reads + one writer. Writes are batched on a single channel; backpressure shows up as slow `index`. + +## 2. Enrich flow — `codeiq enrich ` + +**Entry point:** [`internal/cli/enrich.go`](../internal/cli/enrich.go) → [`analyzer.RunEnrich(EnrichOptions)`](../internal/analyzer/enrich.go). + +**Steps:** + +1. Open SQLite cache read-only. +2. Stream every cached node + edge into a `GraphBuilder` to re-snapshot (sort). +3. **Linkers** ([`internal/analyzer/linker/`](../internal/analyzer/linker/)) — TopicLinker, EntityLinker, ModuleContainmentLinker — emit cross-file edges by name matching (e.g. a `produces topic="users.created"` and a `consumes topic="users.created"` get linked even though they live in different files). +4. **LayerClassifier** stamps `layer = frontend | backend | infra | shared | unknown` on every node using filename heuristics + framework hints. +5. **Intelligence layer** ([`internal/intelligence/extractor/`](../internal/intelligence/extractor/)): + - `ExtractFromTree` runs once per file (tree-sitter parsed once, not per-node — Phase A OOM fix). + - Surfaces `prop_lex_comment` (doc comments / JSDoc / docstring text) and `prop_lex_config_keys` (extracted key lists from YAML/JSON config files). + - Per-file goroutine pool is `2 × GOMAXPROCS`-bounded (Phase A OOM fix). +6. **ServiceDetector** ([`service_detector.go`](../internal/analyzer/service_detector.go)) walks the FS for build files (`pom.xml`, `package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `setup.py`, `Gemfile`, `composer.json`, `Package.swift`, `mix.exs`, `pubspec.yaml`, `stack.yaml`, `build.zig`, `dune-project`, `DESCRIPTION`, `BUILD`, `BUILD.bazel`, plus `.csproj`/`.fsproj`/`.vbproj`/`.gemspec`/`.cabal`/`.nimble` suffixes). One `SERVICE` node per module + `CONTAINS` edges to its child nodes. **IDs are path-qualified** (`service::`). +7. **Kuzu BulkLoad** ([`internal/graph/bulk.go`](../internal/graph/bulk.go)): + - Open Kuzu writable with `BufferPoolBytes` capped at 2 GiB (override `--max-buffer-pool=N`). + - Apply schema (idempotent — single `CodeNode` table + 28 REL tables). + - Write CSV staging files with `csv.Writer{Comma: '|'}`. + - `COPY
FROM '' (header=false, DELIM='|', QUOTE='"', ESCAPE='"')` — explicit QUOTE+ESCAPE so Kuzu honors Go's RFC-4180 quoting. + - Batches of 50,000 rows (override `CODEIQ_BULK_BATCH_SIZE` env). +8. **FTS** ([`internal/graph/indexes.go`](../internal/graph/indexes.go)): + - `INSTALL fts; LOAD EXTENSION fts;` + - `CALL DROP_FTS_INDEX('CodeNode', '');` (idempotent) + - `CALL CREATE_FTS_INDEX('CodeNode', 'code_node_label_fts', ['label', 'fqn_lower']);` + - `CALL CREATE_FTS_INDEX('CodeNode', 'code_node_lexical_fts', ['prop_lex_comment', 'prop_lex_config_keys']);` + +**Tunable knobs (CLI flags on `enrich`):** + +- `--memprofile=` — writes a Go heap profile (`pprof.WriteHeapProfile`). Analyze with `go tool pprof -top -inuse_space `. +- `--max-buffer-pool=N` — Kuzu BufferPoolSize override (bytes). Default 2 GiB. +- `--copy-threads=N` — Kuzu `MaxNumThreads`. Default `min(4, GOMAXPROCS)`. + +**Failure modes:** + +- **Duplicate primary key on COPY** — historically bit on `service:` collisions across modules. Fixed by path-qualified IDs (#151). Symptom: `Copy exception: Found duplicated primary key value service:checkout`. +- **CSV "expected N values per row, but got more"** — JSON property values containing commas (#150 added pipe delim) or pipes (#153 added explicit `QUOTE`/`ESCAPE`). All known instances fixed. +- **TOML quoted-key emission** — `"check_sha" = ...` made it through with literal quotes in node IDs, breaking edge PK lookup. Fixed in `parseTOML` via `unquote()` on the key (#152). +- **OOM** — Phase A+B+C fix landed: parse-once-per-file, bounded extractor pool, 2 GiB BufferPool cap, `Snapshot()` nils dedup maps. Verified at ~/projects/-scale (49k files): peak RSS 1.8–2.2 GiB. +- **FTS extension missing** — Kuzu 0.11.3+ bundles it; `INSTALL fts` is a no-op when bundled. Pre-0.11.3 graphs fall through to CONTAINS predicates via the fallback path. + +## 3. MCP server flow — `codeiq mcp ` + +**Entry point:** [`internal/cli/mcp.go`](../internal/cli/mcp.go) → [`mcp.Server.Serve()`](../internal/mcp/server.go). + +**Steps:** + +1. Open Kuzu **read-only** (`graph.OpenReadOnly(path, query_timeout)`). Mutation gate is active for every `s.Cypher(...)` call. +2. Build `mcp.Deps` (store + intelligence + flow + review + max-results + max-depth caps). +3. Register tools via 3 helper functions: + - `RegisterGraphUserFacing(srv, d)` → `run_cypher` + `read_file`. + - `RegisterFlow(srv, d)` → `generate_flow`. + - `RegisterConsolidated(srv, d)` → 6 mode-driven tools + `review_changes`. +4. Bind transport: `mcpsdk.StdioTransport{}` (zero value binds `os.Stdin`/`os.Stdout`). +5. `Server.Serve(ctx, transport)` — blocks until stdin closes or context cancels. + +**Tool list (10 user-facing):** + +| Tool | Modes / params | +|---|---| +| `graph_summary` | `overview` / `categories` / `capabilities` / `provenance` | +| `find_in_graph` | `nodes` / `edges` / `text` / `fuzzy` / `by_file` / `by_endpoint` | +| `inspect_node` | `neighbors` / `ego` / `evidence` / `source` | +| `trace_relationships` | `callers` / `consumers` / `producers` / `dependencies` / `dependents` / `shortest_path` | +| `analyze_impact` | `blast_radius` / `trace` / `cycles` / `circular_deps` / `dead_code` / `dead_services` / `bottlenecks` | +| `topology_view` | `summary` / `service` / `service_deps` / `service_dependents` / `flow` | +| `run_cypher` | Escape hatch — read-only Cypher. `CALL QUERY_FTS_INDEX` allow-listed. | +| `read_file` | Read source file content. Path-sandboxed to the indexed root. Full file or line range. | +| `generate_flow` | Architecture-flow diagrams. Views: `overview` / `ci` / `deploy` / `runtime` / `auth`. Formats: `json` / `mermaid` / `dot` / `yaml`. | +| `review_changes` | LLM-driven git-diff review via Ollama. Reads graph + shells out to `git`; never writes to `.codeiq/`. | + +**Key files:** + +- [`internal/mcp/server.go`](../internal/mcp/server.go) — `Server`, `Registry`, `Serve()` +- [`internal/mcp/tool.go`](../internal/mcp/tool.go) — `Tool` struct + `asSDKTool` conversion (special-cases string returns for `generate_flow`) +- [`internal/mcp/tools_consolidated.go`](../internal/mcp/tools_consolidated.go) — 6 mode-driven tools +- [`internal/mcp/tools_graph.go`](../internal/mcp/tools_graph.go) — narrow tool builders (Go-API delegation targets) + `run_cypher` + `read_file` +- [`internal/graph/mutation.go`](../internal/graph/mutation.go) — `MutationKeyword` regex gate + +**Failure modes:** + +- **`run_cypher` blocked** — query contains CREATE/DELETE/SET/REMOVE/MERGE/DROP/FOREACH/LOAD CSV/COPY/DETACH or a non-allow-listed CALL. Surfaced as a regular tool-call error with the blocked keyword named. +- **Cypher binder error** — Kuzu's parser surfaces "Variable n is not in scope" or "Parameter X not found in EXISTS subquery" for known binder limitations. The query layer codes around these (e.g. `properties(nodes(p), 'id')` instead of list comprehension). +- **Path traversal in `read_file`** — sandboxed to the indexed root. Attempted `../` resolves outside the root → error envelope. +- **MCP arg-name mismatches** — historically the 6 consolidated tools delegated with wrong arg names (PR #149 fix). Parity tests in [`internal/mcp/tools_consolidated_parity_test.go`](../internal/mcp/tools_consolidated_parity_test.go) lock the names down. + +## 4. PR-review flow — `codeiq review` + +**Entry point:** [`internal/cli/review.go`](../internal/cli/review.go) → [`review.NewService(...).Review(ctx, ...)`](../internal/review/). + +**Steps:** + +1. Shell out to `git diff ..` for the diff. +2. Parse the diff into hunks ([`internal/review/diff.go`](../internal/review/diff.go) — Inference based on filename). +3. For each touched file path, query Kuzu for evidence: + - Nodes defined in the file + - Inbound semantic edges to those nodes (callers, depends-on) +4. Build the LLM prompt: diff + evidence + review-style guidance. +5. POST to Ollama `/v1/chat/completions` (OpenAI-compatible). Default base URL `http://localhost:11434`. If `OLLAMA_API_KEY` is set, switch to Ollama Cloud. +6. Parse the response into structured review JSON, or render as Markdown if `--format markdown`. + +**Key files:** + +- [`internal/review/client.go`](../internal/review/client.go) — Inference: HTTP client wrapping `/v1/chat/completions` +- [`internal/review/service.go`](../internal/review/service.go) — Inference: orchestration glue +- [`internal/review/graphctx.go`](../internal/review/graphctx.go) — Kuzu queries for change-context evidence + +**Failure modes:** + +- **No Ollama running** — connection refused on localhost:11434. Falls back to a clear error rather than hanging. +- **Model unavailable** — `ollama run` returns 404 for unknown model; surfaced as a clean error. +- **HTTP/2 SETTINGS infinite-loop CVE** — the Go 1.25.10 toolchain pin includes the fix for GO-2026-4918, reachable via `review.Client.Review` (per [`.github/workflows/go-ci.yml`](../.github/workflows/go-ci.yml) comment). +- **Stale graph evidence** — if the diff touches files that haven't been re-indexed, evidence is partial. The review still runs; quality is operator's responsibility. + +## 5. Error handling + +There is no centralized error-handling module. Conventions: + +| Layer | Pattern | +|---|---| +| CLI subcommands | Return `error` from `RunE`. Cobra prints + sets exit code (1 for usage error, 2 for runtime). | +| Detector | `Detect(ctx) *Result` — nil-tolerant. Detectors return `EmptyResult()` on no match; never panic on malformed input. | +| Graph layer | Every `s.Cypher(...)` returns `(rows, error)`. Mutation-gate rejections surface as `graph: write query rejected on read-only store (blocked keyword: X)`. | +| MCP tool handler | Catches errors, wraps in `NewErrorEnvelope(code, err, RequestID(ctx))` so the MCP protocol surface stays well-formed. | +| Logging | `fmt.Fprintln(os.Stderr, ...)` with verbosity controlled by root `-v` flag. No structured-logging library. Inference: shipping concise to stay supply-chain-clean. | + +## 6. Background jobs / data ingestion + +codeiq does not run background jobs. Every action is operator-driven (`codeiq `). The CI perf-gate is the closest thing to a scheduled job — it runs `index` + `enrich` against `testdata/fixture-multi-lang` on every PR. diff --git a/docs/05-configuration.md b/docs/05-configuration.md new file mode 100644 index 00000000..e3f6cf18 --- /dev/null +++ b/docs/05-configuration.md @@ -0,0 +1,144 @@ +# 05 — Configuration + +## Resolution order + +Last write wins: + +1. Built-in defaults (in code) +2. `~/.codeiq/config.yml` (Inference — referenced in the root command help text; verify by reading the loader) +3. `/codeiq.yml` (Inference — same as above; CLAUDE.md historically described this resolution chain) +4. `CODEIQ_
_` environment variables (Inference) +5. CLI flags + +> **Inference:** the YAML loader exists per the root-flag help text (`Path to codeiq.yml (default: ./codeiq.yml then ~/.codeiq/config.yml)`), but a top-level `codeiq config` subcommand for validation / explanation was never implemented (no `internal/cli/config.go`). Treat YAML config as a low-friction override knob, not a full schema-validated surface. + +## Persistent root flags + +Apply to every subcommand. Source: [`internal/cli/root.go`](../internal/cli/root.go). + +| Flag | Default | Effect | +|---|---|---| +| `--config ` | `""` (auto: `./codeiq.yml` then `~/.codeiq/config.yml`) | Override the config file path. | +| `--no-color` | `false` | Disable ANSI color in output. | +| `--json` | `false` | Emit JSON where applicable. | +| `-v`, `--verbose` | `0` (count) | Verbose logging. Repeatable: `-v` / `-vv` / `-vvv`. | +| `--version` | `false` | Print version + exit (alias of `codeiq version`). | + +## Per-subcommand flags (high signal) + +### `codeiq index` +| Flag | Default | Effect | +|---|---|---| +| `--batch-size` | `500` | Cache write batch size. | +| `-w`, `--workers` | `2 × GOMAXPROCS` | Detector pool concurrency. | + +### `codeiq enrich` +| Flag | Default | Effect | +|---|---|---| +| `--graph-dir ` | `/.codeiq/graph/codeiq.kuzu` | Override Kuzu store location. | +| `--memprofile=` | unset | Write Go heap profile (`pprof.WriteHeapProfile`). | +| `--max-buffer-pool=N` | 2 GiB | Override Kuzu `BufferPoolSize`. Bytes. | +| `--copy-threads=N` | `min(4, GOMAXPROCS)` | Override Kuzu `MaxNumThreads`. | + +### `codeiq mcp` +| Flag | Default | Effect | +|---|---|---| +| `--graph-dir ` | `/.codeiq/graph/codeiq.kuzu` | | +| `--max-results` | 500 | Cap returned rows per tool call. | +| `--max-depth` | 10 | Cap recursive-pattern depth (ego graph / blast radius / shortest path). | +| `--query-timeout` | 30s | Per-Cypher timeout. | + +### `codeiq stats` +| Flag | Default | Effect | +|---|---|---| +| `--graph-dir ` | default location | | +| `--category` | (all) | `graph` / `languages` / `frameworks` / `infra` / `connections` / `auth` / `architecture`. | +| `--json` | inherited from root | | + +### `codeiq query [path]` +Sub: `consumers` / `producers` / `callers` / `dependencies` / `dependents`. Flag: `--graph-dir`. + +### `codeiq find [path]` +Sub: `endpoints` / `guards` / `entities` / `topics` / `queues` / `services` / `databases` / `components`. Flags: `--limit` (100), `--offset` (0), `--graph-dir`. + +### `codeiq cypher [path]` +Flags: `--graph-dir`, `--table` (table-format output), `--max-results` (500), `--query-timeout` (30s). + +### `codeiq flow [path]` +Views: `overview` / `ci` / `deploy` / `runtime` / `auth`. Flags: `--format` (json / mermaid / dot / yaml), `--out `, `--graph-dir`, `--query-timeout`. + +### `codeiq graph [path]` +Flags: `-f, --format` (json / yaml / mermaid / dot), `--out `, `--graph-dir`, `--query-timeout`. + +### `codeiq topology [path]` +Sub: bare = full map; `service-detail `, `blast-radius --depth 5`, `bottlenecks`, `circular`, `dead`, `path `. + +### `codeiq review [path]` +| Flag | Default | Effect | +|---|---|---| +| `--base` | `HEAD~1` | Diff base ref. | +| `--head` | `HEAD` | Diff head ref. | +| `--model` | provider default | Override LLM model. | +| `-o, --out ` | stdout | Write review to file. | +| `--format` | `markdown` | `markdown` or `json`. | +| `--focus ` | unset | Restrict evidence to specific paths. | + +### `codeiq cache ` +Sub: `info`, `list --limit --offset --json`, `inspect `, `clear --yes`. All support `--cache-path `. + +### `codeiq plugins ` +Sub: `list --language --json`, `inspect --json`. Source: [`internal/cli/plugins.go`](../internal/cli/plugins.go). + +## Environment variables + +| Variable | Effect | Inference? | +|---|---|---| +| `CGO_ENABLED` | Must be `1` to build. | Hard requirement. | +| `GOMAXPROCS` | Bounds the per-file extractor pool to `2 × GOMAXPROCS` (Phase A OOM fix). | Source: [`internal/intelligence/extractor/enricher.go`](../internal/intelligence/extractor/enricher.go). | +| `OLLAMA_API_KEY` | When set, `codeiq review` switches from local Ollama (`http://localhost:11434`) to Ollama Cloud. | Source: [`internal/review/`](../internal/review/) client. | +| `OLLAMA_HOST` | Inference: standard Ollama env (e.g. `http://my-ollama:11434`). Treated as the cloud base URL when also `OLLAMA_API_KEY`. Verify in [`internal/review/client.go`](../internal/review/client.go). | Inference | +| `CODEIQ_BULK_BATCH_SIZE` | Override the 50,000-row Kuzu COPY batch size. | Source: [`internal/graph/bulk.go`](../internal/graph/bulk.go). | +| `GH_TOKEN` / `GITHUB_TOKEN` | Used by CI (release-go.yml, release-darwin.yml) when calling `gh` CLI. Not used at runtime by the binary. | Source: workflows. | +| `OLLAMA_*` (other) | Unknown — verify in `internal/review/client.go`. | Unknown | + +## Config files + +| File | Tracked? | Used by | Purpose | +|---|---|---|---| +| `/codeiq.yml` | optional (user's repo) | `codeiq` runtime | Per-project override (e.g. exclude dirs, detector tuning). Inference: schema not formally documented in code. | +| `~/.codeiq/config.yml` | n/a | `codeiq` runtime | Per-user defaults. | +| `/.codeiq/cache/codeiq.sqlite` | **no** (gitignored) | runtime cache | Analysis cache. | +| `/.codeiq/graph/codeiq.kuzu/` | **no** (gitignored) | runtime graph | Kuzu store. | +| `/.codeiq/cache/embeddings.sqlite` | **no** (gitignored if present) | runtime | Inference: embedding cache (CLAUDE.md historical mention; verify in [`internal/intelligence/lexical/`](../internal/intelligence/lexical/) if relevant). | + +## Feature flags + +The project does not use a feature-flag system. Build-time switches are limited to: + +| Switch | Mechanism | +|---|---| +| Parity harness | Go build tag `parity` (see [`parity/`](../parity/)). | +| Detector registration | Blank-import gate in [`internal/cli/detectors_register.go`](../internal/cli/detectors_register.go). | +| Verbose logging | `-v` count flag on the root command. | + +## Secrets + +- **None required for the CLI / MCP core.** The binary makes no outbound HTTP calls during `index` / `enrich` / `mcp` / `stats` / `find` / `query` / `cypher` / `flow` / `graph` / `topology`. +- **`OLLAMA_API_KEY`** is the only runtime secret, used by `codeiq review` against Ollama Cloud. +- **GitHub OIDC** (`id-token: write` in workflows) is the only secret-equivalent in CI; mints an ephemeral Fulcio cert for cosign keyless signing. No long-lived signing key is stored anywhere. + +## Safe defaults + +| Concern | Default | +|---|---| +| Kuzu BufferPool | 2 GiB (was 80% of system RAM via `kuzu.DefaultSystemConfig`; capped after Phase A OOM fix) | +| Extractor pool | `2 × GOMAXPROCS` (was unbounded; capped after Phase A OOM fix) | +| MCP query timeout | 30 s | +| MCP `--max-results` | 500 | +| MCP `--max-depth` | 10 | +| Detector worker pool | `2 × GOMAXPROCS` | +| Cache batch | 500 | +| Bulk-load batch | 50,000 rows | +| Mutation gate | **on** for `OpenReadOnly` stores (CALL allow-list: `db.*`, `show_*`, `table_*`, `current_setting`, `table_info`, `query_fts_index`) | + +Do not expose real secrets in any sample configs. The only sample that ever existed was [`docs/codeiq.yml.example`](../) and it was deleted in #168 — when you re-create it, ship placeholder values only. diff --git a/docs/06-data-model.md b/docs/06-data-model.md new file mode 100644 index 00000000..62e07467 --- /dev/null +++ b/docs/06-data-model.md @@ -0,0 +1,167 @@ +# 06 — Data model + +Two storage layers, both embedded: + +| Layer | Engine | Path | Schema | +|---|---|---|---| +| Analysis cache | SQLite (`mattn/go-sqlite3` 1.14.44, WAL) | `/.codeiq/cache/codeiq.sqlite` | [`internal/cache/schema.go`](../internal/cache/schema.go) | +| Graph store | Kuzu 0.11.3 (`kuzudb/go-kuzu`) | `/.codeiq/graph/codeiq.kuzu/` | [`internal/graph/schema.go`](../internal/graph/schema.go) | + +The cache is a content-addressable scratchpad for the index pipeline. The Kuzu store is the canonical fact base every read path consumes. + +## SQLite analysis cache + +`CacheVersion = 6`. PRAGMAs: `journal_mode=WAL`, `synchronous=NORMAL`, `foreign_keys=ON`, `busy_timeout=5000`. Schema lives in [`internal/cache/schema.go`](../internal/cache/schema.go). + +### `cache_meta` +| Column | Type | Notes | +|---|---|---| +| `meta_key` | TEXT PRIMARY KEY | Reserved-word workaround (Java H2 parity; kept for byte-identical parity dumps). | +| `meta_value` | TEXT NOT NULL | Same workaround. | + +### `files` +| Column | Type | Notes | +|---|---|---| +| `content_hash` | TEXT PRIMARY KEY | SHA-2-derived (see [`hasher.go`](../internal/cache/hasher.go)). | +| `path` | TEXT NOT NULL | Repo-relative file path. | +| `language` | TEXT NOT NULL | `java` / `python` / `typescript` / `go` / …. | +| `parsed_at` | TEXT NOT NULL | ISO-8601 timestamp. | +| `status` | TEXT | Default `'DETECTED'`. | +| `detection_method` | TEXT | Default `'tree-sitter'`. | +| `file_type` | TEXT | Default `'source'`. | +| `snippet` | TEXT | Inference: optional source excerpt. | + +### `nodes` +| Column | Type | Notes | +|---|---|---| +| `row_id` | INTEGER PRIMARY KEY AUTOINCREMENT | | +| `id` | TEXT NOT NULL | Mirrors `CodeNode.id`. | +| `content_hash` | TEXT NOT NULL | FK to `files`. | +| `kind` | TEXT NOT NULL | `NodeKind` string. | +| `data` | TEXT NOT NULL | JSON-serialized `CodeNode`. | + +Index: `idx_nodes_content_hash`. + +### `edges` +| Column | Type | Notes | +|---|---|---| +| `source` | TEXT NOT NULL | Source node ID. | +| `target` | TEXT NOT NULL | Target node ID. | +| `content_hash` | TEXT NOT NULL | FK-ish back to file (no explicit FK constraint). | +| `kind` | TEXT NOT NULL | `EdgeKind` string. | +| `data` | TEXT NOT NULL | JSON-serialized `CodeEdge`. | + +Index: `idx_edges_content_hash`. + +### `analysis_runs` +| Column | Type | Notes | +|---|---|---| +| `run_id` | TEXT PRIMARY KEY | | +| `commit_sha` | TEXT | Optional git SHA at index time. | +| `timestamp` | TEXT NOT NULL | ISO-8601. | +| `file_count` | INTEGER NOT NULL | Files indexed in the run. | + +Index: `idx_analysis_runs_timestamp`. + +### Migrations + +There is **no migration tool** — `CacheVersion = 6` is encoded as a constant in [`internal/cache/schema.go`](../internal/cache/schema.go) and stored in `cache_meta`. A version mismatch on open triggers a hard reset (Inference; verify in [`cache.go`](../internal/cache/cache.go)) — the recovery is to `rm -rf .codeiq/cache/`. Cache is regenerable from source. + +## Kuzu graph store + +Single node table `CodeNode` for all 34 NodeKinds. One REL table per `EdgeKind` (28 tables). All schema in [`internal/graph/schema.go`](../internal/graph/schema.go). + +### `CodeNode` columns + +| Column | Type | Purpose | +|---|---|---| +| `id` | STRING (PK) | Format `:::` (e.g. `java:com/foo/Bar.java:class:Bar`). | +| `kind` | STRING | One of 34 `NodeKind` string values. | +| `label` | STRING | Display name (short identifier). | +| `fqn` | STRING | Fully-qualified name. | +| `file_path` | STRING | Source file path. | +| `line_start` | INT64 | Start line. Empty-string in CSV ⇒ NULL. | +| `line_end` | INT64 | End line. | +| `module` | STRING | Owning module/package. | +| `layer` | STRING | `frontend` / `backend` / `infra` / `shared` / `unknown`. | +| `language` | STRING | Source language. | +| `framework` | STRING | Framework stamp (e.g. `spring`, `quarkus`, `fastapi`). | +| `confidence` | STRING | `LEXICAL` / `SYNTACTIC` / `RESOLVED`. | +| `source` | STRING | Emitting detector name. | +| `label_lower` | STRING | `lower(label)`. Inference: kept for CONTAINS fallback. | +| `fqn_lower` | STRING | `lower(fqn)`. Same rationale. | +| `prop_lex_comment` | STRING | Doc-comment / docstring / JSDoc text — surfaced by [`intelligence/extractor`](../internal/intelligence/extractor/). FTS index 2 covers it. | +| `prop_lex_config_keys` | STRING | Surfaced config-key list (YAML/JSON keys). FTS index 2 covers it. | +| `props` | STRING | JSON-serialized catch-all property map. | + +### REL tables (one per EdgeKind) + +Every REL table has the shape `FROM CodeNode TO CodeNode` plus `id STRING, confidence STRING, source STRING, props STRING`. There are 28: + +`DEPENDS_ON`, `IMPORTS`, `EXTENDS`, `IMPLEMENTS`, `CALLS`, `INJECTS`, `EXPOSES`, `QUERIES`, `MAPS_TO`, `PRODUCES`, `CONSUMES`, `PUBLISHES`, `LISTENS`, `INVOKES_RMI`, `EXPORTS_RMI`, `READS_CONFIG`, `MIGRATES`, `CONTAINS`, `DEFINES`, `OVERRIDES`, `CONNECTS_TO`, `TRIGGERS`, `PROVISIONS`, `SENDS_TO`, `RECEIVES_FROM`, `PROTECTS`, `RENDERS`, `REFERENCES_TABLE`. + +### FTS indexes (Kuzu 0.11.3 native, BM25) + +| Index name | Columns | Purpose | +|---|---|---| +| `code_node_label_fts` | `label`, `fqn_lower` | Powers `SearchByLabel` + `find_in_graph` text mode. | +| `code_node_lexical_fts` | `prop_lex_comment`, `prop_lex_config_keys` | Powers `SearchLexical` + doc/config-key search. | + +Built at `codeiq enrich` time via [`graph.CreateIndexes()`](../internal/graph/indexes.go). Drop-then-create — idempotent across re-enrich. + +## Canonical taxonomy (from `internal/model/`) + +### NodeKind (34 values) + +`NodeModule`, `NodePackage`, `NodeClass`, `NodeMethod`, `NodeEndpoint`, `NodeEntity`, `NodeRepository`, `NodeQuery`, `NodeMigration`, `NodeTopic`, `NodeQueue`, `NodeEvent`, `NodeRMIInterface`, `NodeConfigFile`, `NodeConfigKey`, `NodeWebSocketEndpoint`, `NodeInterface`, `NodeAbstractClass`, `NodeEnum`, `NodeAnnotationType`, `NodeProtocolMessage`, `NodeConfigDefinition`, `NodeDatabaseConnection`, `NodeAzureResource`, `NodeAzureFunction`, `NodeMessageQueue`, `NodeInfraResource`, `NodeComponent`, `NodeGuard`, `NodeMiddleware`, `NodeHook`, `NodeService`, `NodeExternal`, `NodeSQLEntity`. + +### EdgeKind (28 values) + +`EdgeDependsOn`, `EdgeImports`, `EdgeExtends`, `EdgeImplements`, `EdgeCalls`, `EdgeInjects`, `EdgeExposes`, `EdgeQueries`, `EdgeMapsTo`, `EdgeProduces`, `EdgeConsumes`, `EdgePublishes`, `EdgeListens`, `EdgeInvokesRMI`, `EdgeExportsRMI`, `EdgeReadsConfig`, `EdgeMigrates`, `EdgeContains`, `EdgeDefines`, `EdgeOverrides`, `EdgeConnectsTo`, `EdgeTriggers`, `EdgeProvisions`, `EdgeSendsTo`, `EdgeReceivesFrom`, `EdgeProtects`, `EdgeRenders`, `EdgeReferencesTable`. + +### Confidence (ordered, integer-comparable) + +| Constant | String | Score | Source pattern | +|---|---|---|---| +| `ConfidenceLexical` | `"LEXICAL"` | `0.6` | Regex / textual pattern only | +| `ConfidenceSyntactic` | `"SYNTACTIC"` | `0.8` | AST / parse-tree match | +| `ConfidenceResolved` | `"RESOLVED"` | `0.95` | Resolved via SymbolResolver (cross-file resolution) | + +Merge rule (in `graph_builder.mergeNode`): the higher-confidence node wins on conflict; donor only fills properties the survivor doesn't already have. + +### Layer (5 values) + +`LayerFrontend` (`"frontend"`), `LayerBackend` (`"backend"`), `LayerInfra` (`"infra"`), `LayerShared` (`"shared"`), `LayerUnknown` (`"unknown"`). + +Stamped by [`LayerClassifier`](../internal/analyzer/layer_classifier.go) during `enrich`. Detectors leave it at `LayerUnknown` unless they have strong evidence (e.g. a `tsx` component → `LayerFrontend`). + +## ID conventions + +| Prefix | Where | Example | +|---|---|---| +| `java:` / `py:` / `ts:` / `go:` / `cs:` / … | Per-language node IDs | `java:com/foo/Bar.java:class:Bar` | +| `:file:` | File-anchor nodes | `py:file:scripts/check.py` | +| `:external:` | External-anchor nodes for imports | `py:external:glob` | +| `service::` | SERVICE nodes (path-qualified after PR #151) | `service:frontend/widgets/checkbox:checkbox` | +| `topic:` / `queue:` / `event:` | Cross-language messaging nodes | `topic:users.created` | +| `compose::service:` | docker-compose services | `compose:docker-compose.yml:service:db` | +| `grpc:service:` | gRPC services | `grpc:service:foo.v1.Greeter` | +| `proto::service:` | protobuf services | | + +ID format **matters** — the GraphBuilder dedup map keys off them, and the Kuzu BulkLoad COPY would abort on any duplicate primary key (this was the #151 bug). File-anchor + external-anchor nodes also exist specifically to keep imports edges from being dropped as "phantom" at snapshot. + +## Persistence assumptions + +- **Cache + graph are regenerable.** Both live under `.codeiq/`, gitignored, deletable at any time. `index` + `enrich` rebuild them. +- **No incremental enrich.** Today `enrich` does a full re-bulk-load. Per-file-incremental would require diffing against the existing graph; not implemented. +- **No retention / TTL.** Operator controls when to `rm -rf .codeiq/` to reset. +- **No multi-user / multi-tenant.** One indexed root per `.codeiq/` directory. Concurrent `codeiq enrich` on the same target would race. +- **Kuzu file format compatibility.** When you bump Kuzu, the on-disk format may change — Kuzu does not guarantee forward/backward compat across minor versions. Treat the graph store as **rebuildable from cache**, not a long-term archive. The cache uses SQLite, which is stable. + +## What this model is **not** + +- Not a code-search index (use `ripgrep`). +- Not an LSP server (use `gopls` / `pyright` / etc.). +- Not a dependency-resolution tool (use `go mod` / `pip` / `npm`). +- Not a SBOM tool (it gives a structural map, not a manifest). +- Not a compiler — detectors are pattern-matchers, not type-checkers. Cross-file resolution (`ConfidenceResolved`) is the closest it gets; that's intentional. diff --git a/docs/07-integrations.md b/docs/07-integrations.md new file mode 100644 index 00000000..3c503dd7 --- /dev/null +++ b/docs/07-integrations.md @@ -0,0 +1,127 @@ +# 07 — Integrations + +> codeiq is intentionally minimal on external dependencies. The CLI core (index / enrich / mcp / stats / find / query / cypher / flow / graph / topology / cache / plugins / version) has **zero runtime external dependencies**. Only `codeiq review` reaches the network, and only to a user-configured LLM endpoint. + +## Runtime integrations + +### 1. Ollama (LLM, opt-in) + +Used **only** by `codeiq review` and the `review_changes` MCP tool. Everything else stays offline. + +| Mode | Endpoint | Activation | +|---|---|---| +| Local | `http://localhost:11434` (default) | Default; no env vars. | +| Cloud | Ollama Cloud base URL | Set `OLLAMA_API_KEY=`. | + +API: OpenAI-compatible `/v1/chat/completions`. The client lives in [`internal/review/client.go`](../internal/review/client.go) (Inference — verify by reading; CLAUDE.md historically described this). + +**Local-dev alternative:** any OpenAI-compatible endpoint should work as long as it accepts `model` + `messages` per the chat API. Untested with non-Ollama backends. + +**Failure modes:** +- No service on `:11434` → connection refused, clean error. +- Model not pulled → 404 from the upstream, clean error. +- HTTP/2 SETTINGS infinite loop (Go std-lib CVE GO-2026-4918) → mitigated by toolchain pin `go 1.25.10` per [`.github/workflows/go-ci.yml`](../.github/workflows/go-ci.yml) header comment. + +### 2. Git (CLI shell-out) + +Used by: + +- **File discovery** during `index` — `git ls-files` to enumerate tracked files. Falls back to `filepath.Walk` when not a git repo. +- **`codeiq review`** — `git diff ..` (and reading `git rev-parse` for context). + +Both are direct `exec.Command` shell-outs. No remote git operations. + +**Failure modes:** +- `git` not in PATH → `index` falls through to dir-walk; `review` errors cleanly. +- Detached HEAD / shallow clone → diff may fail; depends on the operator's `--base` / `--head` refs. + +### 3. Kuzu (embedded; not a network integration) + +Kuzu 0.11.3 via `github.com/kuzudb/go-kuzu`. CGO links the embedded C++ engine into the binary; no separate process, no network. Storage at `/.codeiq/graph/codeiq.kuzu/`. + +The FTS extension is **bundled** in 0.11.3+ (was network-installed pre-0.11.3 — the `INSTALL fts` call is a no-op when bundled). + +### 4. SQLite (embedded) + +`github.com/mattn/go-sqlite3` 1.14.44. Same story — CGO-linked, on-disk WAL file at `/.codeiq/cache/codeiq.sqlite`. + +### 5. Tree-sitter (embedded) + +`github.com/smacker/go-tree-sitter` plus per-language grammar packages. CGO. Used for AST parsing of Java / Python / TypeScript / Go inside the index pipeline. + +## Build-time / CI integrations + +These run only in the GitHub Actions environment, not in the user's binary. + +### GitHub OIDC + Sigstore (Fulcio + Rekor) — release signing + +| Step | Source | +|---|---| +| Cosign keyless signing of `checksums.sha256` | Goreleaser `signs:` stanza in [`.goreleaser.yml`](../.goreleaser.yml) | +| Cosign keyless signing of darwin tarball | [`.github/workflows/release-darwin.yml`](../.github/workflows/release-darwin.yml) `cosign sign-blob --bundle` | +| Identity | `id-token: write` workflow permission → Fulcio ephemeral cert from GitHub OIDC claim | +| Transparency log | Sigstore Rekor (automatic via cosign) | +| Verification regex | `https://github.com/RandomCodeSpace/codeiq/.github/workflows/release-go.yml@.*` | + +No long-lived signing key exists anywhere. Each release mints a fresh ephemeral cert. + +### Syft (SBOM) + +| Step | Source | +|---|---| +| Linux archive SBOMs | Goreleaser `sboms:` stanza → `syft` CLI per archive → `spdx-json` | +| Darwin archive SBOM | [`.github/workflows/release-darwin.yml`](../.github/workflows/release-darwin.yml) → `syft "$ARCHIVE" --output spdx-json=...` | +| Tool source | `anchore/sbom-action/download-syft@v0.24.0` | + +### Goreleaser + +Config: [`.goreleaser.yml`](../.goreleaser.yml) (Goreleaser v2). Triggered by [`.github/workflows/release-go.yml`](../.github/workflows/release-go.yml). Cross-compiles linux/amd64 (native gcc) + linux/arm64 (`gcc-aarch64-linux-gnu`). Hooks: `go mod download` + `go test ./...` as pre-build gates. + +### Security scanners (CI) + +[`.github/workflows/security.yml`](../.github/workflows/security.yml) runs six independent jobs on every PR: + +| Job | Action | Purpose | +|---|---|---| +| `osv-scanner` | `google/osv-scanner-action` | Vulnerable dependency scan against `go.mod` | +| `trivy` | `aquasecurity/trivy-action` | Filesystem scan, HIGH/CRITICAL, `ignore-unfixed: true` | +| `semgrep` | Semgrep CLI | `p/security-audit + p/owasp-top-ten + p/golang`, severity ERROR | +| `gitleaks` | `gitleaks/gitleaks-action` | Full-history secret scan, `--exit-code 1` | +| `jscpd` | `kucherenko/jscpd-action` | 3% duplication threshold on `cmd internal` | +| `sbom` | `anchore/sbom-action` | SPDX + CycloneDX upload as run artifacts | + +Plus `go-ci.yml` runs `go vet`, `staticcheck@2025.1.1`, `gosec@v2.22.0`, `govulncheck@latest`. + +### OpenSSF Scorecard + +[`.github/workflows/scorecard.yml`](../.github/workflows/scorecard.yml) — weekly Monday cron + every push to main. Uploads SARIF to GitHub code-scanning. Best-effort; does not gate merge. + +### Dependabot + +[`.github/dependabot.yml`](../.github/dependabot.yml) — weekly Monday 08:00 UTC. + +| Ecosystem | Path | Groups | +|---|---|---| +| `gomod` | `/` | `kuzu`, `tree-sitter`, `mcp`, `cobra-viper`, `sqlite`, `test-libs` | +| `github-actions` | `/` | `actions` (catch-all) | + +## What's **not** integrated + +| Tech | Decision | +|---|---| +| Kafka / RabbitMQ / Pub-Sub | codeiq detects these in scanned codebases. It does not connect to one itself. | +| Postgres / MySQL / MongoDB | Same — detected, never connected to. | +| Redis / Memcached | Same. | +| Cloud SDKs (AWS / Azure / GCP) | Detectors recognize their SDK patterns; no SDK pulled in as a dependency. | +| Telemetry (OpenTelemetry, Datadog, Prometheus, etc.) | None. The binary emits stderr logs only. | +| Auto-update | None. Operator tracks releases manually. | +| Issue trackers (Linear, Jira, GitHub) | Not used by the binary. The `gh` CLI is used in workflows only. | + +## Local / dev alternatives + +| Production use | Local-dev alternative | +|---|---| +| Goreleaser-built binary | `CGO_ENABLED=1 go build ./cmd/codeiq` | +| Cosign-verified release | `go install github.com/randomcodespace/codeiq/cmd/codeiq@v0.4.0` (or any tagged version ≥ v0.4.0) | +| Ollama Cloud (for `review`) | Local Ollama; just `ollama serve` and pull a model | +| Per-language tree-sitter grammars (CGO) | Cannot mock — they're embedded | diff --git a/docs/08-testing.md b/docs/08-testing.md new file mode 100644 index 00000000..6cbbd5e5 --- /dev/null +++ b/docs/08-testing.md @@ -0,0 +1,126 @@ +# 08 — Testing + +## Framework + +Standard library `testing`. **No testify, no gomock, no third-party assertion lib.** This is intentional — keeps the supply-chain surface narrow and matches Go community defaults. + +Test file count per package (sampled from `find internal -name '*_test.go' | wc -l`): + +| Package | _test.go files | +|---|---| +| `internal/detector/` | **104** (every detector has positive + negative + determinism tests) | +| `internal/intelligence/` | 15 | +| `internal/cli/` | 13 | +| `internal/mcp/` | 12 | +| `internal/analyzer/` | 11 | +| `internal/graph/` | 7 | +| `internal/model/` | 6 | +| `internal/parser/` | 5 | +| `internal/flow/` | 4 | +| `internal/cache/` | 3 | +| `internal/query/` | 3 | +| `internal/review/` | 3 | +| `parity/` | 1 | + +**Total:** 884+ tests pass cleanly on `main`. Race detector passes too (CI runs `go test -race -count=1`). + +## How to run + +```bash +# Full suite (~30s on a 4-core dev machine) +CGO_ENABLED=1 go test ./... -count=1 + +# With race detector (CI default) +CGO_ENABLED=1 go test ./... -race -count=1 + +# Single package +CGO_ENABLED=1 go test ./internal/detector/jvm/java/... -count=1 + +# Single test (verbose) +CGO_ENABLED=1 go test ./internal/mcp/... -count=1 -run TestRunCypher -v + +# Race + verbose on one package +CGO_ENABLED=1 go test ./internal/analyzer/... -race -count=1 -v +``` + +Always use `-count=1` to bypass the Go test cache when you've changed CGO-linked deps (Kuzu / SQLite version bumps invalidate cached results). + +## Test structure + +### Detector tests — three required cases per detector + +Each detector ships three test cases: + +1. **Positive** — sample source that the detector matches; assert emitted nodes/edges by ID + properties. +2. **Negative** — sample source the detector must *not* match (avoids false-positives on adjacent frameworks). +3. **Determinism** — run the detector twice, assert byte-identical emissions. Catches map-iteration order bugs. + +Sample: [`internal/detector/jvm/java/spring_rest_test.go`](../internal/detector/jvm/java/spring_rest_test.go). Pattern is uniform across the 100 detectors. + +### Analyzer tests + +[`internal/analyzer/analyzer_test.go`](../internal/analyzer/analyzer_test.go), [`enrich_test.go`](../internal/analyzer/enrich_test.go), [`graph_builder_test.go`](../internal/analyzer/graph_builder_test.go). Use `t.TempDir()` for cache + graph paths, write a few synthetic source files, run the pipeline, assert on `Stats` + graph contents. + +### Graph layer tests + +[`internal/graph/store_test.go`](../internal/graph/store_test.go), [`schema_test.go`](../internal/graph/schema_test.go), [`bulk_test.go`](../internal/graph/bulk_test.go), [`cypher_test.go`](../internal/graph/cypher_test.go), [`indexes_test.go`](../internal/graph/indexes_test.go), [`readonly_test.go`](../internal/graph/readonly_test.go), [`reads_test.go`](../internal/graph/reads_test.go). Each opens Kuzu under `t.TempDir()`. Key regression tests: + +- `TestBulkLoadEdgesCommaInProperties` — JSON properties with commas (PR #150). +- `TestBulkLoadNodesCommaInProperties` — same on nodes. +- `TestBulkLoadEdgesPipeInTargetID` — Istio cluster names with pipes (PR #153). +- `TestMutationKeyword` — table of blocked keywords + allow-listed CALL prefixes. + +### MCP tests + +[`internal/mcp/`](../internal/mcp/) — registry tests, per-tool argument-validation tests, parity tests for the 6 consolidated mode-driven tools, integration test ([`integration_test.go`](../internal/mcp/integration_test.go)) that boots the server in-memory and exercises every tool surface. + +Notable: [`tools_consolidated_parity_test.go`](../internal/mcp/tools_consolidated_parity_test.go) — locks down the arg names dispatched from each consolidated mode to the underlying narrow handler. Caught the PR #149 dispatch-mismatch bug. + +### CLI tests + +[`internal/cli/`](../internal/cli/) — every subcommand has a smoke test that builds the cobra command, sets `SetArgs(...)`, and asserts on stdout/exit code. Uses `t.TempDir()` for state. + +### Fixtures + +Two checked-in fixtures: + +| Fixture | Purpose | +|---|---| +| [`testdata/fixture-minimal/`](../testdata/fixture-minimal/) | 5 files (`User.java`, `UserController.java`, `models.py`, `README.md`, `expected-divergence.json`). Used by `index_test.go` + as the manual smoke target. | +| [`testdata/fixture-multi-lang/`](../testdata/fixture-multi-lang/) | Multi-service polyglot fixture (Java/Spring + TS/React + Python + Go + IaC). Used by enrich tests + perf-gate CI. Contains its own `pom.xml`, sub-`package.json`, etc. — exercises the ServiceDetector. | + +**Both fixtures use `t.TempDir()` or in-test setup; no test writes outside its tempdir.** Verified across all 41 test-bearing packages. + +## CI gates + +| Workflow | What blocks merge | +|---|---| +| [`.github/workflows/go-ci.yml`](../.github/workflows/go-ci.yml) | `go vet`, `go test -race`, staticcheck@2025.1.1, gosec@v2.22.0, govulncheck@latest | +| [`.github/workflows/perf-gate.yml`](../.github/workflows/perf-gate.yml) | Wall ≤ 8s, nodes ≥ 40, phantom-drop ratio ≤ 50%, peak enrich RSS ≤ 300 MB on `testdata/fixture-multi-lang` | +| [`.github/workflows/security.yml`](../.github/workflows/security.yml) | OSV-Scanner, Trivy HIGH/CRITICAL, Semgrep (security-audit + owasp-top-ten + golang), Gitleaks full history, jscpd ≤ 3%, SBOM upload (non-blocking) | + +`go-ci.yml` and `security.yml` are required status checks per branch protection. Scorecard is best-effort and does not block. + +## Test count baselines (regression detection) + +The `perf-gate.yml` baseline numbers are versioned at the workflow level. The Go test suite has no formal baseline — operator runs `go test ./... -count=1 | tail -3` to confirm "880+ passed" before declaring a release. + +## Missing / risky areas (places to add tests) + +| Area | Risk | Suggested tests | +|---|---|---| +| **Mutation gate** | Regex-based; adversarial inputs may bypass via comments or string literals containing keywords. | Fuzz [`MutationKeyword`](../internal/graph/mutation.go) against synthesized Cypher with comment / string smuggling. | +| **CSV bulk-load** | Field-delimiter collisions caused 3 production bugs (#150 commas, #153 pipes, #152 quoted keys). | Property fuzzing: generate random property maps with `"`, `\|`, `\t`, `\n`, `\r`, `\0`, `` and round-trip through `BulkLoadNodes` + `Cypher SELECT`. | +| **Recursive-pattern depth** | `[*1..N]` upper bound must be a literal — operator-supplied depth is `fmt.Sprintf`'d in. Currently capped at 10 in MCP `--max-depth`, but no test asserts the cap. | Test that passing `radius=99999` to `inspect_node/ego` clamps to `MaxDepth` and the resulting Cypher is well-formed. | +| **OOM regression** | Phase A/B/C OOM fix has a CI perf-gate at 300 MB on fixture-multi-lang. Real ~/projects/-scale runs at 2 GiB — no automated test. | Inference: hard to test cheaply in CI. Manual benchmark on the user's tree before each release is the current bar. | +| **Ollama integration** | Tested with a mock HTTP server (Inference — verify in [`internal/review/`](../internal/review/) tests) but a real Ollama API contract change would slip through. | Pin the Ollama OpenAPI to a known-good shape; add a contract test that runs against `ollama serve` when an env flag is set. | +| **Tree-sitter grammar version drift** | If a grammar bumps and emits different node types, detectors that pattern-match on AST node names silently break. | Snapshot tests on a few canonical source files per language, comparing emitted detector results. | +| **Cross-version Kuzu upgrade** | On-disk format may change across minor versions. Today's tests use `t.TempDir()` so they always start fresh. | Inference: hard to test until a version-bump scenario. Document the "graph is rebuildable from cache" assumption (already done in [06-data-model.md](06-data-model.md)). | + +## Conventions + +- Determinism is a **non-negotiable** requirement. Every detector emits its results in sorted order; the GraphBuilder snapshots deterministically. Adding a non-deterministic test (e.g. timing-dependent) is grounds for rejection. +- Tests use only stdlib. If you find yourself reaching for testify, ask whether the project actually needs it (so far the answer has been no). +- Race detector must pass on every PR. If you add goroutines, `-race` will catch races; if it doesn't, your test isn't exercising the concurrency it claims to. +- Always `-count=1` after CGO-relevant changes (Kuzu / SQLite / tree-sitter version bumps). +- Test fixtures are **content** — `testdata/fixture-minimal/README.md` and `testdata/fixture-multi-lang/services/*/README.md` are part of the fixture, not project documentation. diff --git a/docs/09-build-deploy-release.md b/docs/09-build-deploy-release.md new file mode 100644 index 00000000..a65c5b5a --- /dev/null +++ b/docs/09-build-deploy-release.md @@ -0,0 +1,175 @@ +# 09 — Build, deploy, release + +## Build + +### Local + +```bash +CGO_ENABLED=1 go build -o codeiq ./cmd/codeiq +``` + +Result: ~25 MB static binary. CGO links Kuzu (libstdc++), SQLite, tree-sitter grammars. + +### With version info + +```bash +CGO_ENABLED=1 go build \ + -ldflags "-X 'github.com/randomcodespace/codeiq/internal/buildinfo.Version=v0.4.0' \ + -X 'github.com/randomcodespace/codeiq/internal/buildinfo.Commit=$(git rev-parse --short HEAD)' \ + -X 'github.com/randomcodespace/codeiq/internal/buildinfo.Date=$(date -u +%Y-%m-%dT%H:%M:%SZ)'" \ + -o codeiq ./cmd/codeiq +``` + +Or just let the BuildInfo fallback handle it ([`internal/buildinfo/buildinfo.go`](../internal/buildinfo/buildinfo.go)) — `go install …@v0.4.0` produces a binary that reads `runtime/debug.BuildInfo` and reports `v0.4.0` without any `-ldflags`. + +### Cross-compile + +CGO complicates cross-compile. The release pipeline uses a separate macOS runner for darwin/arm64 rather than cross-compile, and linux/arm64 uses `gcc-aarch64-linux-gnu` on the linux runner. + +Local cross-compile to linux/arm64: +```bash +sudo apt-get install gcc-aarch64-linux-gnu +CGO_ENABLED=1 GOOS=linux GOARCH=arm64 CC=aarch64-linux-gnu-gcc go build ./cmd/codeiq +``` + +Cross-compile to darwin from linux: **not supported** without OSXCross or a macOS host (Kuzu's CGO won't link cleanly). + +## Packaging + +Single static-binary distribution. Release artifacts are tarballs of `codeiq` + `LICENSE` (+ `README.md` / `CHANGELOG.md` when those files are present in the repo — globbed optionally so the release pipeline survives doc-wipe states). + +| Artifact | Contents | +|---|---| +| `codeiq__linux_amd64.tar.gz` | binary + LICENSE [+ README + CHANGELOG] | +| `codeiq__linux_arm64.tar.gz` | same | +| `codeiq__darwin_arm64.tar.gz` | same | +| `*.sbom.spdx.json` | Syft-generated SPDX SBOM per archive | +| `*.cosign.bundle` | Cosign keyless signature bundle per archive (darwin only) + `checksums.sha256.cosign.bundle` (linux global) | +| `checksums.sha256` | SHA-256 manifest for all tarballs | + +## Container usage + +**No Docker image is built or published.** The binary is the unit of distribution. There is a `.dockerignore` in the repo (Inference: leftover from the Java era; doesn't matter for the Go-only build). + +If you want a container, build one locally: + +```dockerfile +# Inference — not a project-supported pattern +FROM gcr.io/distroless/cc-debian12 +COPY codeiq /codeiq +ENTRYPOINT ["/codeiq"] +``` + +Note CGO ties the binary to glibc (not musl/Alpine). + +## CI/CD + +### Workflows + +| Workflow | Trigger | Purpose | +|---|---|---| +| [`go-ci.yml`](../.github/workflows/go-ci.yml) | push (main), PR | vet + test -race + staticcheck + gosec + govulncheck | +| [`perf-gate.yml`](../.github/workflows/perf-gate.yml) | push, PR, workflow_dispatch | index + enrich on `fixture-multi-lang`; assert wall ≤ 8s / nodes ≥ 40 / phantom-drop ≤ 50% / peak RSS ≤ 300 MB | +| [`release-go.yml`](../.github/workflows/release-go.yml) | tag `v*.*.*` push, workflow_dispatch | Goreleaser build for linux/amd64 + linux/arm64 + SBOMs + cosign keyless | +| [`release-darwin.yml`](../.github/workflows/release-darwin.yml) | tag `v*.*.*` push, workflow_dispatch | macOS runner builds darwin/arm64 + uploads to the Release created by release-go | +| [`security.yml`](../.github/workflows/security.yml) | push, PR, weekly cron | OSV-Scanner, Trivy, Semgrep, Gitleaks, jscpd, SBOM upload | +| [`scorecard.yml`](../.github/workflows/scorecard.yml) | push (main), weekly cron | OpenSSF Scorecard SARIF upload (best-effort) | + +`go-ci.yml` and `security.yml` are required status checks; merging is blocked until they're green. Scorecard is informational. + +### Release flow end-to-end + +``` +git tag v0.4.X && git push origin v0.4.X + │ + ├──► release-go.yml (ubuntu-latest) + │ 1. checkout, setup-go 1.25.10, install gcc-aarch64-linux-gnu + │ 2. install syft + cosign + │ 3. goreleaser release --clean + │ ├─ before.hooks: go mod download + go test ./... + │ ├─ build linux/amd64 (CC=gcc) + linux/arm64 (CC=aarch64-linux-gnu-gcc) + │ ├─ ldflags: -X buildinfo.{Version,Commit,Date,Dirty} + │ ├─ archive tar.gz (files glob: LICENSE*, README.md*, CHANGELOG.md*) + │ ├─ syft per archive → spdx-json + │ ├─ cosign sign-blob (keyless via id-token: write) on checksums.sha256 + │ └─ create draft GitHub Release with all artifacts + │ 4. actions/attest-build-provenance@v4 on dist/*.tar.gz + │ + └──► release-darwin.yml (macos-14) + 1. checkout, setup-go 1.25.10 + 2. go build CGO native (darwin/arm64) with ldflags + 3. tar.gz the archive + 4. syft → spdx-json + 5. cosign sign-blob --bundle archive.cosign.bundle + 6. POLL `gh release view $TAG` for up to 15 min, early-bail + if release-go.yml concluded failure/cancelled/timed_out + 7. gh release upload --clobber the darwin artifacts to the + same Release release-go created + 8. actions/attest-build-provenance on the darwin tarball + +After both succeed: + - Release exists as DRAFT (per .goreleaser.yml `draft: true`) + - Maintainer reviews + runs `gh release edit $TAG --draft=false` to publish +``` + +The 15-minute poll budget + early-bail check (PR #165) lets release-darwin tolerate the typical 4–8 minute Goreleaser pipeline without manual intervention. + +### Cosign verification (release consumer) + +```bash +cosign verify-blob \ + --bundle checksums.sha256.cosign.bundle \ + --certificate-identity-regexp 'https://github.com/RandomCodeSpace/codeiq/.github/workflows/release-go.yml@.*' \ + --certificate-oidc-issuer https://token.actions.githubusercontent.com \ + checksums.sha256 +``` + +For the darwin tarball directly: + +```bash +cosign verify-blob \ + --bundle codeiq_0.4.0_darwin_arm64.tar.gz.cosign.bundle \ + --certificate-identity-regexp 'https://github.com/RandomCodeSpace/codeiq/.github/workflows/release-darwin.yml@.*' \ + --certificate-oidc-issuer https://token.actions.githubusercontent.com \ + codeiq_0.4.0_darwin_arm64.tar.gz +``` + +## Deployment assumptions + +codeiq is a **developer-host tool**. It runs: + +- Locally as a CLI. +- Locally as a stdio MCP server (Claude Code / Cursor configs spawn it). +- Inside CI as a one-shot scanner (`codeiq index ` on a checked-out branch). + +It is **not** designed for: + +- Long-running services on shared infrastructure. +- Multi-tenant deployment. +- Publicly-exposed HTTP endpoints. (There are none — the MCP server is stdio-only.) + +Per the security policy at the time of the original SECURITY.md (deleted in #168, will be rewritten): the public-internet attack surface is zero by design. + +## Rollback notes + +| Scenario | Action | +|---|---| +| Bad release tag | Don't delete the tag — proxy.golang.org caches version content immutably. Cut a fresh `v0.4.X+1` with the rollback applied. | +| Bad merge on main | Standard `git revert ` → re-merge to main → ship next tag. | +| Cache / graph corruption on a user host | `rm -rf /.codeiq/` and re-run `index` + `enrich`. Both stores are regenerable. | +| Kuzu on-disk format change | If Kuzu bumps a minor version, the binary's embedded Kuzu can't read old `.codeiq/graph/`. Same fix: nuke + re-enrich. The cache is SQLite (stable across versions). | + +## Version history (post-reset) + +| Tag | Date | Notes | +|---|---|---| +| `v0.4.0` | 2026-05-14 | First release after the v0.0.x–v0.3.0 history reset. Includes OOM-fix saga + Kuzu 0.7.1 → 0.11.3 + native FTS + module hoist + 5 enrich correctness fixes. | +| `v0.4.1` | 2026-05-14 | CI/dependency hygiene. release-darwin race fix + Dependabot bumps. | + +Earlier tags (`v0.0.x`, `v0.1.x`, `v0.2.x`, `v0.3.0`, `v1.0.0`) were deleted because `proxy.golang.org` permanently caches each version's zip — reusing a deleted tag would serve stale (often Python-prototype-era) content. The commits remain on `main`; only the tags and GitHub Releases are gone. + +## Known release issues + +- **Goreleaser `draft: true`** — every release needs a manual `gh release edit --draft=false`. Defensible default; keeps you from broadcasting a half-baked release. To change, edit [`.goreleaser.yml`](../.goreleaser.yml) release block. +- **`files: README.md` was a literal-file match** — bare filenames in goreleaser's archive `files:` block hard-fail when missing. After the doc-wipe (#168), v0.4.2 release failed for this reason. Fixed by switching to `README.md*` glob (PR #169 — open at time of writing). Confirm landed before next tag push. +- **release-darwin race against release-go** — release-darwin polls for the Release created by release-go. Old 90s poll budget timed out; new 15-min budget + early-bail (PR #165) handles the worst-case Goreleaser pipeline. diff --git a/docs/10-known-risks-and-todos.md b/docs/10-known-risks-and-todos.md new file mode 100644 index 00000000..31547fe9 --- /dev/null +++ b/docs/10-known-risks-and-todos.md @@ -0,0 +1,100 @@ +# 10 — Known risks and TODOs + +## TODO / FIXME / HACK markers in code + +**Zero.** A repo-wide grep for `TODO`, `FIXME`, `HACK`, `XXX` in `internal/`, `cmd/`, and `parity/` returns **0 occurrences**. + +This is the result of a deliberate "no TODOs in main" discipline — incomplete work either ships behind a clear interface or is captured in plan files (now wiped). The flip side: there's no in-code roadmap for "things known to need work". The list below substitutes for that, drawn from comments, deleted plan files, and observed bugs. + +## Risks by surface + +### Index / enrich pipeline + +| Risk | Severity | Why | Where | +|---|---|---|---| +| **Detector registration choke point** | High | One forgotten blank import in [`detectors_register.go`](../internal/cli/detectors_register.go) ships an empty registry for that language family. Phase 4 caught the bug: 15 language families silently produced 0 nodes. | [`internal/cli/detectors_register.go`](../internal/cli/detectors_register.go) — 18 leaf packages today. | +| **Phantom edge drop** | Medium | Edges whose endpoint isn't in the node set get dropped at `Snapshot()`. Detectors must emit anchor nodes via [`base.EnsureFileAnchor`](../internal/detector/base/) / `EnsureExternalAnchor` for imports to survive. New detectors authored without this pattern silently lose edges. | [`internal/analyzer/graph_builder.go`](../internal/analyzer/graph_builder.go), [`internal/detector/base/imports_helpers.go`](../internal/detector/base/) | +| **CSV field-delimiter collisions** | Medium (mitigated) | Three production bugs in series: commas in JSON properties (#150), Istio cluster names with `\|` (#153), TOML quoted keys (#152). Current state: `DELIM='\|', QUOTE='"', ESCAPE='"'` with RFC-4180 round-trip + per-field tests. Any new structured-output channel must re-verify quoting end-to-end. | [`internal/graph/bulk.go`](../internal/graph/bulk.go) | +| **Service-ID PK collisions** | Mitigated | Two modules sharing a folder name used to emit `service:` twice → Kuzu COPY aborts. Now path-qualified `service::` (#151). Any future "container" node type (e.g. workspace, layer) needs the same discipline. | [`internal/analyzer/service_detector.go:168`](../internal/analyzer/service_detector.go) | +| **OOM at scale** | Mitigated | Pre-v0.4.0: enrich OOM-killed on 49k-file inputs. Fixed via three landed PRs (#145 parse-once-per-file + bounded extractor pool + Kuzu BufferPool cap; #146 TreeCursor refactor; #147 CLI knobs). Current ceiling: ~2 GiB peak RSS at ~/projects/-scale. Larger inputs (e.g. 200k+ files) untested. | [`internal/intelligence/extractor/enricher.go`](../internal/intelligence/extractor/enricher.go), [`internal/graph/store.go`](../internal/graph/store.go) | +| **Non-deterministic detector** | High if introduced | Same input must produce byte-identical output. Map iteration without `sort.Strings` of keys, or goroutine ordering without explicit sort, is the typical bug. Every detector ships a determinism test for exactly this reason. | All 100 detectors | +| **Tree-sitter grammar drift** | Medium | Detectors pattern-match on AST node names. A grammar bump that renames nodes silently regresses fidelity. No automated regression for this. | [`internal/parser/`](../internal/parser/) | + +### MCP server + +| Risk | Severity | Why | Where | +|---|---|---|---| +| **Mutation gate bypass via formatting tricks** | Medium | The gate is regex over the literal query string, comment-stripped. Adversarial inputs embedding `CREATE` inside a string literal, or weird whitespace, could in principle slip through to Kuzu. Belt-and-braces is `OpenReadOnly` at the Kuzu level. | [`internal/graph/mutation.go`](../internal/graph/mutation.go) — `MutationKeyword` | +| **`run_cypher` opens the full surface** | Medium | Even read-only Cypher can exhaust resources (huge result sets, recursive patterns). `--max-results 500` + `--query-timeout 30s` cap the worst case, but a CALL-procedure on the allow-list could theoretically be slow. | [`internal/mcp/tools_graph.go`](../internal/mcp/tools_graph.go) | +| **`read_file` path traversal** | Medium | Sandboxed to indexed root via `filepath.Rel` checks (Inference — verify in [`mcp/read_file.go`](../internal/mcp/)). If the rel-check is missing or wrong, file content outside the root could leak. Worth a focused fuzz test. | [`internal/mcp/tools_graph.go`](../internal/mcp/tools_graph.go) — `toolReadFile` | +| **Consolidated-tool arg-name dispatch** | Mitigated | 7 modes used to permanently return `INVALID_INPUT` because the 6 consolidated tools passed wrong arg names to the underlying narrow handlers (#149). Now covered by parity tests. New mode additions need a parity-test entry. | [`internal/mcp/tools_consolidated_parity_test.go`](../internal/mcp/tools_consolidated_parity_test.go) | + +### Build / release + +| Risk | Severity | Why | Where | +|---|---|---|---| +| **Poisoned proxy cache for deleted tags** | High (latent) | `proxy.golang.org` caches every version's zip immutably. The pre-v0.4.0 tags were deleted from GitHub but their zips live on at the proxy. Anyone running `go install …@v0.3.0` gets the old, broken layout. | History; mitigated by always picking a new version on releases. | +| **goreleaser `files:` hard-fail on missing literal** | Mitigated | `files: README.md` was a literal-file match — release blew up after the doc-wipe. Fixed by switching to globs `README.md*`. | [`.goreleaser.yml`](../.goreleaser.yml) (PR #169) | +| **release-darwin race** | Mitigated | Polls for the Release created by release-go. Old 90s window timed out. Now 15 min + early-bail on upstream failure (PR #165). | [`.github/workflows/release-darwin.yml`](../.github/workflows/release-darwin.yml) | +| **Goreleaser `draft: true`** | Operational | Every release lands as a draft. Maintainer must `gh release edit --draft=false` to publish. Forgetting it means users can't download. | [`.goreleaser.yml`](../.goreleaser.yml) | +| **No automatic CHANGELOG cut** | Operational | Each release tag should be preceded by a CHANGELOG PR that cuts `[Unreleased]` → `[vX.Y.Z]`. Today this is manual. Easy to forget. | [`docs/09-build-deploy-release.md`](09-build-deploy-release.md) | + +### Supply chain / security + +| Risk | Severity | Why | Where | +|---|---|---|---| +| **CGO + glibc lock-in** | Medium | The binary statically links Kuzu/SQLite/tree-sitter C++ code against glibc. Won't run on Alpine / musl without rebuilding upstream Kuzu. | All CGO deps. | +| **No SLSA build attestation visibility on the consumer side** | Low | `actions/attest-build-provenance` posts to GitHub's attestations store, but the README doesn't tell users how to verify it (`gh attestation verify codeiq.tar.gz --owner RandomCodeSpace`). | [`.github/workflows/release-go.yml`](../.github/workflows/release-go.yml) | +| **Long-running CGO third-party code** | Medium | Kuzu is C++. A use-after-free or heap-overflow in Kuzu's query engine becomes our problem. Mitigated by pinning to a stable Kuzu version + cosign-verified release archives. No automated fuzzing today. | [`go.mod`](../go.mod) | +| **Detector regex DoS** | Low | Per-file regex execution under a worker pool. A pathological input (e.g. very long backtracking-prone regex match) could stall a worker. Go's RE2 doesn't backtrack, so the risk is bounded — but a 1 GB single-line file is still 1 GB of byte scans. The default 2 MB tree-sitter buffer + `DefaultExcludeDirs` filter limit exposure. | All 100 regex detectors | + +### Code quality / debt + +| Item | Where | Note | +|---|---|---| +| **`label_lower` / `fqn_lower` columns** | [`internal/graph/schema.go`](../internal/graph/schema.go) | Redundant with FTS indexes since Kuzu 0.11. Kept for CONTAINS fallback. Removing would require a schema migration + dropping the fallback path. | +| **Hand-rolled structured parsers for YAML/JSON/TOML/INI/properties** | [`internal/parser/structured.go`](../internal/parser/structured.go) | Sufficient for detector use today. Misses TOML edge cases (array-of-tables, inline tables, multi-line strings — documented in code). Replace with a real TOML library if a bug surfaces. | +| **No central error type** | Across packages | Errors are stringly-typed. Operator scripting around stderr is fragile. | +| **No structured logging** | Across packages | Plain `fmt.Fprintln(os.Stderr, ...)` at `-v` levels. Fine for a CLI; would be limiting in a long-running service (and there isn't one, so it's fine). | +| **`config ` subcommand not implemented** | [`internal/cli/`](../internal/cli/) — no `config.go` | Older docs referenced this. Today only the root `--config` flag exists. | +| **Java detector `.java` test fixtures** | [`testdata/fixture-minimal/`](../testdata/fixture-minimal/) | The fixture has `User.java` + `UserController.java` to exercise the Java detector. They are content, not project Java code — easy to confuse with stale Java-era artifacts. | +| **`parity/` harness mostly idle** | [`parity/`](../parity/) | Build-tag `parity`. Used during the Java → Go port; now sits behind the tag waiting for someone to wake it up or delete it. | + +### Incomplete features + +| Feature | State | +|---|---| +| `codeiq config ` | Mentioned in older docs; never implemented. Root `--config` flag works. | +| Incremental enrich | Not implemented. Today `enrich` does a full re-bulk-load of every cached row. | +| Multi-host distributed enrich | Not implemented; not currently a need. | +| Real-time / watch mode | Not implemented. Each run is operator-initiated. | +| Kuzu version-upgrade path | Manual `rm -rf .codeiq/graph/` + re-enrich. No tooling. | + +### Security-sensitive areas + +| Surface | Why sensitive | +|---|---| +| [`internal/graph/mutation.go`](../internal/graph/mutation.go) — `MutationKeyword` | Last-line gate for the `run_cypher` MCP tool. A bypass = arbitrary writes against the read-only store. | +| [`internal/mcp/tools_graph.go`](../internal/mcp/tools_graph.go) — `toolReadFile` | Path sandboxing. A bypass = arbitrary local file disclosure. | +| [`internal/review/`](../internal/review/) | Outbound HTTP to Ollama. Only network-touching surface in the binary. SSRF / injection in `model` / `messages` would land in the LLM, not in our process — but a malicious Ollama endpoint could feed back crafted responses. | +| Release pipeline (workflows + goreleaser) | Cosign keyless signing depends on GitHub OIDC. A workflow-modify PR that changes the signing identity would produce an unverifiable signature — branch protection on `main` is the mitigation. | +| GitHub Actions third-party actions | All third-party actions are pinned by SHA (per Scorecard guidance). Updating one requires re-pinning. | + +## Migration risks + +| Bump | Risk | +|---|---| +| Kuzu major / minor | On-disk format may break. The graph is rebuildable from cache; the cache is SQLite (stable). Plan: enrich-rebuild on every Kuzu bump. Test by running enrich against `testdata/fixture-multi-lang` post-bump. | +| Go toolchain | Module pins `go 1.25.0` (clamped by `modelcontextprotocol/go-sdk`). Bumping past 1.26 requires waiting for the SDK to update. | +| tree-sitter grammars | A grammar bump that renames AST nodes silently regresses detectors that pattern-match on those node names. No automated regression. | +| `mattn/go-sqlite3` | Stable; minor bumps have shipped without issue (1.14.22 → 1.14.44 in PR #157). | +| MCP SDK | Pins the Go module minimum. A major bump may rework the `Server.AddTool` / `Tool` signature; the wrapper in [`internal/mcp/tool.go`](../internal/mcp/tool.go) is the absorption layer. | + +## Recommended next sweeps + +1. **Fuzz the mutation gate** with adversarial Cypher (comment smuggling, string literals containing blocked keywords, unicode whitespace tricks). +2. **Property fuzz the bulk-load CSV writer** with random byte sequences in node/edge properties. +3. **Snapshot tests for tree-sitter grammar output** on canonical files per language; pin grammar versions in `go.mod` rather than the wildcard tag patterns currently in use. +4. **`gh attestation verify` documentation** for release consumers. +5. **Consider extracting an `extra_files` block in `.goreleaser.yml`** with `match: optional` rather than the `*`-glob hack for README.md. +6. **Delete `parity/` or revive it** — Phase 5 of the Java port is over; the harness is in limbo. diff --git a/docs/11-agent-handoff.md b/docs/11-agent-handoff.md new file mode 100644 index 00000000..65f41d66 --- /dev/null +++ b/docs/11-agent-handoff.md @@ -0,0 +1,172 @@ +# 11 — Agent handoff + +> **One-stop brief for the next AI coding agent (Claude / Cursor / Cline / etc.) that lands in this repo.** Read this once, plus `CLAUDE.md`, before doing anything else. + +## Project in 20 lines + +1. **codeiq** is a Go CLI + stdio MCP server that builds a deterministic code-knowledge graph from source trees. +2. Single static binary (`cmd/codeiq/main.go`), CGO-mandatory (Kuzu + SQLite + tree-sitter all link C/C++). +3. Module path is `github.com/randomcodespace/codeiq` (hoisted from `/go/` in PR #162 — paths at repo root, not `go/...`). +4. Pipeline: `index` (files → SQLite cache) → `enrich` (cache → Kuzu graph + FTS indexes) → `mcp`/`stats`/`query`/etc. +5. **Zero LLM in the index/enrich pipeline.** Same input ⇒ same output, byte-for-byte. The only LLM touch is the opt-in `codeiq review` subcommand against Ollama. +6. 100 detectors across 35+ languages live in `internal/detector//.go`, each implementing the `detector.Detector` interface. +7. **Critical:** every detector category must be blank-imported in `internal/cli/detectors_register.go` — forget it and the binary ships dead for that family. +8. Storage: SQLite cache at `/.codeiq/cache/codeiq.sqlite`, Kuzu graph at `/.codeiq/graph/codeiq.kuzu/`. Both gitignored. +9. Kuzu 0.11.3 — bundled FTS extension (`CALL CREATE_FTS_INDEX` / `QUERY_FTS_INDEX`, BM25 ranked). Native parameterized `LIMIT $lim`. Mutation gate in `internal/graph/mutation.go` for `OpenReadOnly`. +10. MCP server exposes 10 tools (6 mode-driven + `run_cypher` + `read_file` + `generate_flow` + `review_changes`). Read-only — no write tools. +11. Determinism is non-negotiable. Don't iterate Go maps without sorting keys. Every detector has a determinism test. +12. Tests: `CGO_ENABLED=1 go test ./... -count=1` — ~880+ pass; CI runs `-race` too. All tests use `t.TempDir()`; no test writes outside its tempdir. +13. CI: `go-ci.yml` (vet/test/staticcheck/gosec/govulncheck), `perf-gate.yml` (RSS ≤ 300 MB on fixture-multi-lang), `security.yml` (OSV+Trivy+Semgrep+Gitleaks+jscpd+SBOM), Scorecard, release-go/darwin. +14. Releases: tag `vX.Y.Z` push → Goreleaser + Cosign keyless via GitHub OIDC + Syft SBOMs + draft Release → `gh release edit --draft=false` to publish. +15. Current version: **v0.4.1** (2026-05-14). All earlier tags were deleted from GitHub because `proxy.golang.org` permanently caches version content; reusing a deleted tag name serves stale Python-prototype-era zips. +16. **Never re-use a deleted version number.** Always tag forward (v0.4.X+1). +17. There's no REST API, no web UI, no telemetry, no auto-update, no Docker image. Operator-driven CLI + stdio MCP only. +18. Java reference implementation was deleted at v0.3.0 cutover (PR #132). Will not return. +19. `parity/` directory is a build-tag-gated harness (`-tags parity`) from the Java→Go port; mostly idle, can be revived or deleted. +20. Documentation lives entirely under `docs/` + `README.md` + `CLAUDE.md`. Wiped + rebuilt in this handoff (PR #168 + the doc-rewrite this file is part of). + +## Top 20 files to read first + +In order — each one is the entry point for one concept. Reading all 20 takes ~30 minutes and covers ~90% of the codebase's surface. + +| Order | File | What you learn | +|---|---|---| +| 1 | [`README.md`](../README.md) | User-facing overview + install paths + MCP wiring example | +| 2 | [`CLAUDE.md`](../CLAUDE.md) | Repo-specific instructions for AI agents (this file's sibling) | +| 3 | [`docs/02-architecture.md`](02-architecture.md) | The component map | +| 4 | [`docs/04-main-flows.md`](04-main-flows.md) | Per-flow entry points + failure modes | +| 5 | [`docs/06-data-model.md`](06-data-model.md) | Kuzu + SQLite schemas + canonical taxonomy | +| 6 | [`cmd/codeiq/main.go`](../cmd/codeiq/main.go) | Entry point (5 lines) | +| 7 | [`internal/cli/root.go`](../internal/cli/root.go) | Cobra root command + global flags | +| 8 | [`internal/cli/detectors_register.go`](../internal/cli/detectors_register.go) | **The choke point** — every detector family blank-imported here | +| 9 | [`internal/analyzer/analyzer.go`](../internal/analyzer/analyzer.go) | Index pipeline orchestration | +| 10 | [`internal/analyzer/enrich.go`](../internal/analyzer/enrich.go) | Enrich pipeline orchestration | +| 11 | [`internal/analyzer/graph_builder.go`](../internal/analyzer/graph_builder.go) | Dedup + phantom-edge drop | +| 12 | [`internal/detector/detector.go`](../internal/detector/detector.go) | `Detector` interface + `Default` registry | +| 13 | [`internal/detector/jvm/java/spring_rest.go`](../internal/detector/jvm/java/spring_rest.go) | A canonical detector implementation | +| 14 | [`internal/graph/schema.go`](../internal/graph/schema.go) | Kuzu schema definition | +| 15 | [`internal/graph/bulk.go`](../internal/graph/bulk.go) | CSV staging + Kuzu COPY FROM (pipe-delim + RFC-4180) | +| 16 | [`internal/graph/mutation.go`](../internal/graph/mutation.go) | Read-only mutation gate | +| 17 | [`internal/graph/indexes.go`](../internal/graph/indexes.go) | FTS index management + search helpers | +| 18 | [`internal/mcp/server.go`](../internal/mcp/server.go) | MCP server bootstrap | +| 19 | [`internal/mcp/tools_consolidated.go`](../internal/mcp/tools_consolidated.go) | The 6 mode-driven tools + delegation pattern | +| 20 | [`.goreleaser.yml`](../.goreleaser.yml) + [`.github/workflows/release-go.yml`](../.github/workflows/release-go.yml) | Release pipeline | + +## Commands future agents should use + +```bash +# Build + smoke +CGO_ENABLED=1 go build -o /tmp/codeiq ./cmd/codeiq +/tmp/codeiq --version +/tmp/codeiq index testdata/fixture-minimal && /tmp/codeiq enrich testdata/fixture-minimal && /tmp/codeiq stats testdata/fixture-minimal + +# Test +CGO_ENABLED=1 go test ./... -count=1 # all 880+ tests +CGO_ENABLED=1 go test ./... -race -count=1 # CI-equivalent +CGO_ENABLED=1 go test ./internal//... -count=1 # single package +CGO_ENABLED=1 go test ./... -count=1 -run TestFooBar # single test + +# Static analysis (mirrors CI) +go vet ./... +go install honnef.co/go/tools/cmd/staticcheck@2025.1.1 && "$(go env GOPATH)/bin/staticcheck" ./... +go install github.com/securego/gosec/v2/cmd/gosec@v2.22.0 && "$(go env GOPATH)/bin/gosec" -exclude=G104,G115,G202,G204,G301,G304,G306,G401,G404,G501 ./... +go install golang.org/x/vuln/cmd/govulncheck@latest && "$(go env GOPATH)/bin/govulncheck" ./... + +# Manual perf check (mirrors perf-gate.yml) +/usr/bin/time -v /tmp/codeiq enrich testdata/fixture-multi-lang # expect <8s wall, <300MB RSS + +# Inspect releases / tags +gh release list +git tag --list +gh pr list --state open --json number,title + +# Run / verify go install +CGO_ENABLED=1 go install github.com/randomcodespace/codeiq/cmd/codeiq@v0.4.1 +codeiq --version +``` + +## Commands future agents should AVOID + +| Command | Why | +|---|---| +| `git push --force` to `main` | Branch protection blocks it; bypassing rewrites shared history. | +| `git tag --force vX.Y.Z` to reuse a deleted version | proxy.golang.org caches version content immutably. The reused tag won't refresh; users get the stale zip. | +| `git tag v0.1.0`, `v0.2.0`, `v0.3.0`, `v1.0.0` | All deleted from GitHub but poisoned at proxy.golang.org. Use a never-previously-used version (next is **v0.4.2** unless deleted; **v0.5.0** is safer). | +| `gh release delete --cleanup-tag` on a published version | Same poison risk — once a tag has had `go install …@` run against it, the proxy has cached the zip. | +| `rm -rf .codeiq/` mid-pipeline | OK between `index` and `enrich`, but never between an `enrich` and a running `mcp` — the server will lose its read store. | +| `goreleaser release` locally without `--snapshot` | Will try to create a real GitHub Release. Use `goreleaser release --snapshot --clean` for local dry-runs. | +| `CGO_ENABLED=0 go build` | Will fail at link time. Kuzu + SQLite + tree-sitter all require CGO. | +| `cd go && go build` | The `go/` subdir was hoisted to root in PR #162. Stale instructions from older docs. | +| `gh pr merge` with `--admin` to bypass CI | go-ci.yml + security.yml are required for a reason. Don't bypass. | + +## Architectural rules + +1. **MCP server is strictly read-only.** No tool may mutate the Kuzu store. `run_cypher` enforces this via [`MutationKeyword`](../internal/graph/mutation.go); `read_file` enforces path sandboxing. +2. **Index/enrich are CLI-only.** The MCP layer never triggers them. Operator runs them. +3. **Detectors are stateless.** No mutable struct fields. The single shared instance per detector type registers once at `init()` time and is called concurrently from the worker pool. +4. **Determinism over micro-optimization.** Never iterate a map without sorting keys. Linker outputs go through `.Sorted()` at the call site. `Snapshot()` sorts. +5. **Confidence ladder is monotonic.** `LEXICAL (0.6) < SYNTACTIC (0.8) < RESOLVED (0.95)`. Merges keep the higher-confidence node; donor only fills missing properties. +6. **Phantom edges drop at Snapshot.** Detectors that emit edges to "external" or "file-anchor" nodes must explicitly create those nodes via `base.EnsureFileAnchor` / `EnsureExternalAnchor`. +7. **ID prefixes are stable.** `:file:`, `:external:`, `service::`, `topic:`. The GraphBuilder dedup map keys off these. +8. **No telemetry. No auto-update. No outbound network during index/enrich/mcp.** Only `codeiq review` reaches the Ollama endpoint. +9. **CGO is required everywhere.** This is not negotiable for as long as we use Kuzu + SQLite + tree-sitter. +10. **One CodeNode table for all NodeKinds.** Schema simplicity. Don't add per-label tables. + +## Coding conventions + +- **`gofmt`-clean.** CI verifies via `go vet`. +- **No `interface{}` unless needed.** Prefer concrete types or generics where possible (Go 1.25 has them). +- **Errors via `fmt.Errorf("layer: ...: %w", err)`.** The "layer:" prefix tells you which package failed. +- **No third-party assertion libraries in tests.** Use stdlib `testing` + plain `t.Fatalf`. +- **Receiver names: 1–3 letters, lowercase.** Match the struct's first letter (`s *Store`, `n *CodeNode`). +- **Detector files: `snake_case.go`.** Test files: `_test.go`. +- **Detector type names: `PascalCaseDetector`** (e.g. `SpringRestDetector`). +- **Detector `Name()` returns `snake_case`** (e.g. `"spring_rest"`). +- **Confidence is set by the base helper** (`base.RegexDetectorDefaultConfidence` or `base.StructuredDetectorDefaultConfidence`), not stamped in each detector. +- **Conventional commits** for PRs: `feat:`, `fix:`, `chore:`, `refactor:`, `test:`, `docs:`, `perf:`. + +## Common traps + +| Trap | Symptom | Fix | +|---|---|---| +| Forgot to blank-import a new detector category | `codeiq plugins list` doesn't show the detector; emissions silently 0 | Add the leaf package to [`internal/cli/detectors_register.go`](../internal/cli/detectors_register.go) | +| Detector emits an edge but not the target node | Edge silently dropped at `Snapshot()`; `analyzer.Stats.DroppedEdges` ticks up | Use [`base.EnsureFileAnchor`](../internal/detector/base/imports_helpers.go) / `EnsureExternalAnchor` | +| Detector emits the same node from two callbacks with different framework values | Higher-confidence framework wins; lower-confidence value is dropped | Intended behavior — confidence-aware merge. If you want the lower-confidence value to survive, raise its confidence at the source. | +| Re-running `go test` after a Kuzu bump and seeing cached results | Stale test results | `-count=1` flag | +| `go install` returns the wrong version | proxy.golang.org cache poisoning from a deleted tag | Use an explicit `@v0.4.1` (or higher); avoid `@latest` | +| Tag pushed but no release artifacts | release-darwin race timed out | (Fixed in PR #165) — but if it recurs, `gh run rerun --failed ` on the darwin job | +| `gh release view` returns "release not found" but `git tag` shows it | The Release is in DRAFT state and `gh release view` doesn't display drafts to non-maintainers | `gh release view --json isDraft` to confirm; `gh release edit --draft=false` to publish | +| `--max-buffer-pool=8589934592` to use 8 GiB | Bytes, not GiB. 2 GiB = `2147483648`, 8 GiB = `8589934592`. | Use the byte value. | +| Adding `TODO`/`FIXME` to code | Project policy is no TODOs in `main`. Capture in a doc or issue instead. | Edit a doc under `docs/` or open a tracked issue. | + +## Current unfinished work (at the time of this handoff) + +| Item | Status | +|---|---| +| PR #169 — goreleaser `files:` glob fix | **OPEN, CI green**. Merge before next tag push so v0.4.2+ works. | +| `v0.4.2` tag | Created then deleted because the v0.4.2 release failed on the goreleaser literal-file pattern. Once #169 lands, re-tag as v0.4.2. | +| CHANGELOG `[Unreleased]` section | Will need to be cut to `[v0.4.2]` when the next release ships. See [`docs/09-build-deploy-release.md`](09-build-deploy-release.md). | +| New reference docs | This is the deliverable. After this PR lands, the repo will have a clean `docs/` tree (the user wiped the prior set in PR #168). | +| `parity/` harness | Build-tag idle since the Java port. Revive or delete. | +| `config ` subcommand | Mentioned in older docs; never implemented. Root `--config` flag works. Implement or remove the mention. | + +## Recommended next tasks (priority order) + +1. **Merge PR #169** (goreleaser glob fix) → tag v0.4.2 → publish the release. +2. **Wire the new reference docs into `go-ci.yml` or `security.yml` link-check** — broken markdown links would be the most likely doc-bitrot vector. +3. **Add a `gh attestation verify` example to the README** — the binaries ship with build provenance but it's invisible to consumers. +4. **Decide on `parity/`** — keep + document, or delete. +5. **Fuzz [`MutationKeyword`](../internal/graph/mutation.go)** — adversarial Cypher with comment / string smuggling. +6. **Property-fuzz the CSV bulk-load writer** — random byte sequences in node/edge properties (catches the next #150/#152/#153-style bug). +7. **Snapshot tests for tree-sitter grammar outputs** — pin grammar versions; alert on AST node-name drift. +8. **Implement `codeiq config ` or remove every mention** of it from the codebase (it's already deleted from docs). +9. **Add a `go-ci.yml` step that runs `find testdata -name '*.md'` to confirm fixtures stay intact** — easy to accidentally `git rm` `testdata/fixture-minimal/README.md` thinking it's a stale doc. +10. **Consider a structured-logging layer** if a long-running mode (e.g. a watch / re-index daemon) is ever added. + +## When in doubt + +- Check `git log -p --since="1 month"` for what changed. +- Look for a determinism test on the package you're modifying — if there isn't one, add one. +- Read [`docs/10-known-risks-and-todos.md`](10-known-risks-and-todos.md) before making big changes. +- The user values terse direct communication. Skip preamble; show the change + the verification command. +- Never commit without explicit user request. Never push without explicit user request. diff --git a/docs/adr/0001-current-architecture.md b/docs/adr/0001-current-architecture.md new file mode 100644 index 00000000..89cec9f6 --- /dev/null +++ b/docs/adr/0001-current-architecture.md @@ -0,0 +1,169 @@ +# ADR 0001 — Current architecture + +> **Status:** Accepted. **Date:** 2026-05-14. **Deciders:** project maintainer + AI pairing. +> +> This ADR is retrospective — it documents the architecture as it exists at v0.4.1 and the reasoning behind the choices that produced it. Subsequent ADRs (0002+) should document *changes* relative to this baseline. + +## Context + +codeiq started as a Java/Spring Boot reference implementation with a Neo4j embedded graph, a React SPA, and a REST API. Phase 6 cutover (PR #132 in the now-deleted history) replaced the entire Java side with a single static Go binary backed by Kuzu, deleted the React SPA, and dropped the REST API in favor of a stdio MCP server. The Go module then lived at `/go/` inside the repo for one release cycle (v0.3.0), and was hoisted to the repo root in v0.4.0 (PR #162). + +This ADR captures the major decisions that produced today's shape: + +1. CLI binary + stdio MCP server (no REST, no UI, no daemon) +2. Kuzu embedded graph (CGO) for storage +3. SQLite cache as an intermediate, content-hash-keyed store +4. Detector pattern with a single `Detector` interface + 100 implementations +5. Read-only MCP surface with a mutation gate +6. Goreleaser + Cosign keyless via GitHub OIDC for releases +7. No telemetry, no auto-update, no outbound network during core flows +8. Go module at repo root, post-hoist + +## Decision + +### 1. CLI + stdio MCP, no daemon + +**Choice:** Ship a single static Go binary. Surface to humans via `codeiq `; surface to LLM agents via the stdio MCP protocol (`codeiq mcp`). No long-running server, no port to bind, no auth layer. + +**Why:** +- AI agents (Claude Code, Cursor) integrate via stdio MCP — that's the contract. +- Humans get the same Cypher / graph queries as the AI through the CLI. +- No daemon means no auth surface to harden, no port to expose, no service to monitor. +- A second process for a developer-host tool would be friction operators don't want. + +**Tradeoff:** State management is per-invocation. A long-running `mcp` server holds an open Kuzu read-only handle; CLI subcommands open + close per invocation. Concurrent invocations on the same `.codeiq/graph/` are not coordinated — the operator gets to enforce sequencing. + +### 2. Kuzu embedded graph, CGO + +**Choice:** Use Kuzu 0.11.3 via `github.com/kuzudb/go-kuzu`. CGO links the Kuzu C++ engine into the binary; data lives on disk at `.codeiq/graph/codeiq.kuzu/`. + +**Why:** +- Property-graph queries (Cypher) are the natural fit for the questions agents ask ("what calls X", "what depends on Y", "blast radius of changing Z"). +- Embedded means no daemon, no separate install. CGO + a static binary is the price. +- Kuzu 0.11.3 bundles the FTS extension — no network install needed, no air-gap concerns. +- Native BM25 ranking via `CALL QUERY_FTS_INDEX` replaces the substring-CONTAINS hacks from Kuzu 0.7.1. + +**Alternatives considered:** +- **Neo4j embedded** — was the Java-era choice. Heavier (JVM), and the Java-side process is gone. +- **DuckDB** — relational, not graph; recursive CTEs work but the Cypher ergonomics for blast-radius queries are better. +- **SQLite with adjacency tables** — would work; loses the FTS index ergonomics and recursive-pattern performance. + +**Tradeoff:** Kuzu's on-disk format may change across minor versions. The graph is rebuildable from the SQLite cache (which is stable), so we treat Kuzu as rebuildable and the cache as durable. + +### 3. SQLite cache as content-hash intermediary + +**Choice:** Run a two-phase pipeline: `index` writes to SQLite cache (content-hash keyed), `enrich` loads cache → Kuzu graph. Both stores live under `.codeiq/`. + +**Why:** +- The cache lets a re-run skip files whose content hash hasn't changed. +- It separates "detection" (lossy regex/AST work) from "graph assembly" (deterministic dedup/snapshot) — each phase has clean invariants. +- SQLite is the most reliable embedded store in the ecosystem and stable across versions. +- The cache also serves the `codeiq cache inspect` debugging surface. + +**Tradeoff:** Two storage layers + a transient. Operators sometimes wonder which to nuke. Answer: `rm -rf .codeiq/` rebuilds both. The cache is in WAL mode so we don't lose performance to the indirection. + +### 4. Detector pattern: one interface, 100 impls + +**Choice:** Every emitter implements: +```go +type Detector interface { + Name() string + SupportedLanguages() []string + DefaultConfidence() model.Confidence + Detect(ctx *Context) *Result +} +``` + +Each detector registers itself with `detector.RegisterDefault(NewMyDetector())` in `init()`. The package leaf must be blank-imported in `internal/cli/detectors_register.go`. + +**Why:** +- Adding a new framework / detector is a single file. The interface is tight. +- Detectors are stateless → safe to call concurrently from the worker pool. +- Confidence is monotonic; the `DefaultConfidence()` stamp is enforced at the registration boundary. +- The blank-import gate keeps the Go linker from dropping packages we want present. + +**Tradeoff:** The blank-import is a footgun. Forgetting it ships an empty registry silently. We test it indirectly via `codeiq plugins list` + the determinism tests per detector. A static `go vet`-style check that compares the leaf packages under `internal/detector/` to the blank-import list would be a worthwhile addition. + +### 5. Read-only MCP + mutation gate + +**Choice:** The MCP server (`codeiq mcp`) opens Kuzu with `OpenReadOnly`. The `run_cypher` tool runs every query through a regex-based [`MutationKeyword`](../../internal/graph/mutation.go) check that rejects CREATE/DELETE/SET/REMOVE/MERGE/DROP/FOREACH/LOAD CSV/COPY/DETACH and any CALL outside an allow-list (`db.*`, `show_*`, `table_*`, `current_setting`, `table_info`, `query_fts_index`). + +**Why:** +- AI agents should not mutate the graph. Mutations come from `codeiq index/enrich` only. +- Belt-and-braces: the Kuzu `OpenReadOnly` flag is the engine-level enforcement; the keyword gate is a fast-fail at the binding layer. +- The allow-list for CALL is narrow enough that a typo lets `db.show_indexes` through (read) but a typo on `db.delete` doesn't (no matching prefix). + +**Tradeoff:** Regex keyword matching is not a Cypher parser. Adversarial inputs (comments smuggling keywords, unicode whitespace) might in principle bypass the gate. The recommended mitigation is fuzzing the gate (see [`docs/10-known-risks-and-todos.md`](../10-known-risks-and-todos.md)). + +### 6. Goreleaser + Cosign keyless via GitHub OIDC + +**Choice:** Releases ship as `vX.Y.Z` tag pushes. The CI workflow (`release-go.yml` for linux/amd64+arm64; `release-darwin.yml` for darwin/arm64) builds, generates SBOMs (Syft), signs the checksums file via cosign keyless (Fulcio ephemeral cert from GitHub OIDC; Sigstore Rekor transparency log), and creates a draft GitHub Release. The maintainer publishes manually. + +**Why:** +- No long-lived signing key reduces blast radius. +- SBOM + cosign signatures meet SLSA-level expectations for a supply-chain-conscious tool. +- The `draft: true` default forces a human review before broadcast. + +**Tradeoff:** Two workflows on one tag push race on the Release object. The early `release-darwin` poll budget was tight (90s) and timed out frequently. Fixed in PR #165 (15-min budget + early-bail on upstream failure). + +### 7. No telemetry, no auto-update, no outbound network + +**Choice:** The `index` / `enrich` / `mcp` / `stats` / `find` / `query` / `cypher` / `flow` / `graph` / `topology` / `cache` / `plugins` / `version` subcommands make zero outbound HTTP calls. The only network touch in the entire binary is the `codeiq review` subcommand against Ollama. + +**Why:** +- Air-gap and security-sensitive environments need this. The repo positions itself as deployable behind a corporate firewall. +- Telemetry would create a privacy expectation we can't (and won't) maintain. +- Auto-update would add a verification surface we don't currently sign for runtime use. + +**Tradeoff:** Operators have to track new releases manually. Mitigated by Dependabot keeping the project's own deps current and the `version` subcommand printing build info clearly. + +### 8. Module at repo root (post-hoist) + +**Choice:** As of v0.4.0 (PR #162), the Go module lives at the repository root. Module path: `github.com/randomcodespace/codeiq`. Before v0.4.0, it lived at `/go/`; that subdirectory is gone. + +**Why:** +- `go install github.com/randomcodespace/codeiq/cmd/codeiq@vX.Y.Z` resolves directly to the matching tag, no subdir hop. +- Tags are clean (`vX.Y.Z`, not `go/vX.Y.Z`). +- Standard Go project layout, easier onboarding. + +**Tradeoff:** `git blame` paths changed for every Go file. The historical commits still exist; `git log --follow ` works. One-time disruption for a permanent UX win. + +## Consequences + +**Positive:** +- Single static binary, ~25 MB, no daemons, no external services in core flows. +- Deterministic output. Same input ⇒ same Kuzu graph, byte-for-byte. +- Read-only MCP surface that's safe to wire into any agent. +- Supply-chain story: cosign keyless via GitHub OIDC, Sigstore Rekor, Syft SBOMs, OpenSSF Scorecard. +- Pre-1.0 but production-grade in the surfaces that matter (884+ tests, perf-gate CI, 6 security scanners in CI). + +**Negative:** +- CGO is mandatory. Can't cross-compile darwin from linux without OSXCross or a macOS host. +- Detector registration is a choke-point footgun. +- Two storage layers (SQLite + Kuzu) means two paths to nuke when state corrupts. +- Mutation gate is regex-based — not a full Cypher parser. +- `proxy.golang.org` immutability means deleted tags poison their version numbers permanently. + +**Neutral:** +- No telemetry → no usage data to improve UX from. +- No auto-update → users may run stale binaries. +- Operator-driven (no daemon, no scheduler) → simple but requires operator discipline. + +## Open follow-ups + +- Fuzz the mutation gate. +- Property-fuzz the CSV bulk-load writer. +- Snapshot tests for tree-sitter grammar outputs. +- Document `gh attestation verify` for release consumers. +- Decide on `parity/` — keep or delete. +- Implement `codeiq config ` or formally remove the historical reference. + +## References + +- [`docs/02-architecture.md`](../02-architecture.md) — current component map +- [`docs/04-main-flows.md`](../04-main-flows.md) — per-flow entry points +- [`docs/06-data-model.md`](../06-data-model.md) — Kuzu + SQLite schemas +- [`docs/10-known-risks-and-todos.md`](../10-known-risks-and-todos.md) — open follow-ups +- [`internal/detector/detector.go`](../../internal/detector/detector.go) — Detector interface +- [`internal/graph/mutation.go`](../../internal/graph/mutation.go) — mutation gate +- [`.goreleaser.yml`](../../.goreleaser.yml) — release pipeline config