perf(graph): batch BulkLoad CSV / COPY FROM at 50k rows #143

Merged: aksOps merged 1 commit into main from perf/enrich-bulk-load-batching on May 13, 2026

@aksOps aksOps commented May 13, 2026

Summary

`BulkLoadNodes` and `copyEdgeGroup` previously staged a single CSV containing every row and issued one Kuzu `COPY FROM`. Kuzu buffers the full CSV in process memory during ingest, so on polyglot targets with hundreds of thousands of nodes the COPY-side resident set grew with the total row count, with no upper bound.

This PR chunks the work into batches of `bulkLoadBatchSize` rows (default 50k, overridable via the `CODEIQ_BULK_BATCH_SIZE` environment variable). Each batch is staged, ingested, and cleaned up before the next batch starts, so neither the on-disk CSV nor Kuzu's ingest buffer ever holds more than `batchSize` rows.
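
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the batching loop. It is not the code in this PR: `batchSizeFromEnv` and `forEachBatch` are hypothetical names, the fallback-to-default behaviour on a bad env value is an assumption, and the `load` callback stands in for the stage / `COPY FROM` / cleanup sequence.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultBulkLoadBatchSize mirrors the 50k default named in the PR description.
const defaultBulkLoadBatchSize = 50_000

// batchSizeFromEnv is a hypothetical helper. It reads CODEIQ_BULK_BATCH_SIZE
// and falls back to the default when the value is unset, non-numeric, or not
// positive (the fallback policy is an assumption, not taken from the PR).
func batchSizeFromEnv() int {
	n, err := strconv.Atoi(os.Getenv("CODEIQ_BULK_BATCH_SIZE"))
	if err != nil || n <= 0 {
		return defaultBulkLoadBatchSize
	}
	return n
}

// forEachBatch calls load once per contiguous chunk of at most batchSize rows
// and stops at the first error, so no chunk is processed after a failure.
func forEachBatch[T any](rows []T, batchSize int, load func(batch []T) error) error {
	if batchSize <= 0 {
		return fmt.Errorf("batch size must be positive, got %d", batchSize)
	}
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		if err := load(rows[start:end]); err != nil {
			return fmt.Errorf("batch starting at row %d: %w", start, err)
		}
	}
	return nil
}

func main() {
	rows := make([]int, 120_000) // placeholder rows; real rows would be nodes or edges
	var sizes []int
	err := forEachBatch(rows, batchSizeFromEnv(), func(batch []int) error {
		// Real code would stage a CSV for this batch, run COPY FROM,
		// then delete the CSV before returning.
		sizes = append(sizes, len(batch))
		return nil
	})
	fmt.Println("batch sizes:", sizes, "err:", err)
}
```

Keeping staging and cleanup inside the callback is what bounds the on-disk CSV to one batch at a time.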

Caveat — partial fix

This is production hygiene, not a complete OOM fix at `~/projects/` scale (49k files / 434k nodes). At that scale the enrich pipeline OOMs earlier than BulkLoad — likely in the GraphBuilder / linker / classifier passes that materialise all nodes in Go memory before BulkLoad runs.

Reproducer:
```bash
codeiq index ~/projects/
codeiq enrich ~/projects/ # OOM at exit 137, ~46s in (before BulkLoad fires)
```

Streaming the upstream enrich stages is a separate, larger refactor — tracked as a follow-up.

Correctness

Uniqueness constraints are still enforced across batches (each Kuzu ingest commits before the next `COPY` starts), so a duplicate primary key surfaces the same Copy exception with or without batching.

Test plan

  • `go test ./... -count=1` — 875 pass
  • `fixture-minimal` index → enrich → stats: same 45-node / 68-edge output as pre-PR
  • `go vet ./...` clean

@aksOps aksOps merged commit 4136977 into main May 13, 2026
13 checks passed
@aksOps aksOps deleted the perf/enrich-bulk-load-batching branch May 13, 2026 12:49