perf(graph): batch BulkLoad CSV / COPY FROM at 50k rows by aksOps · Pull Request #143 · RandomCodeSpace/codeiq

aksOps · 2026-05-13T11:11:33Z

Summary

`BulkLoadNodes` and `copyEdgeGroup` previously staged a single CSV containing every row and issued one Kuzu `COPY FROM`. Kuzu buffers the full CSV in process memory during ingest, so on polyglot targets with hundreds of thousands of nodes the COPY-side resident set grew with the total row count, with no upper bound.

This PR chunks the work into batches of `bulkLoadBatchSize` rows (default 50k, overridable via the `CODEIQ_BULK_BATCH_SIZE` environment variable). Each batch is staged, ingested, and cleaned up before the next one starts, so neither the on-disk CSV nor Kuzu's ingest buffer ever holds more than `batchSize` rows.
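The batching described above can be sketched roughly as follows. This is an illustrative outline, not the code from this PR: `parseBatchSize` and `forEachBatch` are hypothetical helper names, rows are modelled as plain strings, and the fallback to the default on an invalid or non-positive override is an assumption. The callback stands in for the stage, `COPY FROM`, and cleanup steps.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultBatchSize mirrors the 50k default described in the summary.
const defaultBatchSize = 50_000

// parseBatchSize returns the override when raw is a positive integer and
// the default otherwise. Hypothetical helper: the PR's actual parsing
// rules are not shown in the description.
func parseBatchSize(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultBatchSize
	}
	return n
}

// forEachBatch calls ingest on consecutive slices of at most size rows and
// stops at the first error, so a later batch never starts after a failure.
func forEachBatch(rows []string, size int, ingest func(batch []string) error) error {
	if size <= 0 {
		size = defaultBatchSize
	}
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		if err := ingest(rows[start:end]); err != nil {
			return fmt.Errorf("batch starting at row %d: %w", start, err)
		}
	}
	return nil
}

func main() {
	size := parseBatchSize(os.Getenv("CODEIQ_BULK_BATCH_SIZE"))
	rows := make([]string, 120_000) // stand-in for staged node rows

	err := forEachBatch(rows, size, func(batch []string) error {
		// A real implementation would write the batch to a temporary CSV,
		// run COPY FROM on it, and delete the file before returning.
		fmt.Printf("ingesting %d rows\n", len(batch))
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Because each callback returns before the next slice is taken, at most one batch of rows is in flight at a time, which is the property the summary relies on.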

Caveat — partial fix

This is production hygiene, not a complete OOM fix at `~/projects/` scale (49k files / 434k nodes). At that scale the enrich pipeline runs out of memory before BulkLoad is reached, likely in the GraphBuilder / linker / classifier passes, which materialise all nodes in Go memory before BulkLoad runs.

Reproducer:
```bash
codeiq index ~/projects/
codeiq enrich ~/projects/ # OOM at exit 137, ~46s in (before BulkLoad fires)
```

Streaming the upstream enrich stages is a separate, larger refactor, tracked as a follow-up.

Correctness

Primary-key uniqueness is still enforced across batches (each Kuzu ingest commits before the next COPY starts), so a duplicate primary key surfaces the same Copy exception whether the two rows fall in one batch or in different batches.

Test plan

- `go test ./... -count=1`: 875 pass
- `fixture-minimal` index → enrich → stats: same 45-node / 68-edge output as pre-PR
- `go vet ./...`: clean


aksOps merged commit 4136977 into main May 13, 2026
13 checks passed

aksOps deleted the perf/enrich-bulk-load-batching branch May 13, 2026 12:49

aksOps merged 1 commit into main from perf/enrich-bulk-load-batching
