feat: reddit seed data#59
Conversation
✅ Deploy Preview for intodev ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Code Review
This pull request refactors the seeding process to use real Reddit post data from compressed JSONL files and introduces a standalone script for applying embeddings. Key changes include updating the reseed endpoint to support different dataset sizes and implementing batch insertion for posts. Feedback focuses on potential memory issues when loading large datasets in SeedService and performance bottlenecks in the embedding update script due to joining on jsonb columns.
| async function loadRedditPosts(limit?: number): Promise<RedditPost[]> { | ||
| const result: RedditPost[] = []; | ||
| const fileStream = fs.createReadStream(REDDIT_POSTS_PATH); | ||
| const gunzip = zlib.createGunzip(); | ||
| const rl = readline.createInterface({ | ||
| input: fileStream.pipe(gunzip), | ||
| crlfDelay: Infinity, | ||
| }); | ||
|
|
||
| for await (const line of rl) { | ||
| const trimmed = line.trim(); | ||
| if (!trimmed) continue; | ||
| result.push(JSON.parse(trimmed) as RedditPost); | ||
| if (limit !== undefined && result.length >= limit) { | ||
| rl.close(); | ||
| fileStream.destroy(); | ||
| gunzip.destroy(); | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| } |
There was a problem hiding this comment.
The loadRedditPosts function currently loads the entire dataset into memory when no limit is provided (which happens when size === "big"). This poses a significant risk of Out-Of-Memory (OOM) errors if the Reddit dataset is large.
Additionally, the seedPostsBatched method (lines 170-180) performs a .map() over the entire array, doubling the memory pressure.
It is highly recommended to refactor this to use an AsyncGenerator to stream the data from the file and process it in chunks, similar to the implementation in apply-reddit-embeddings.ts.
| const result = await db.execute(sql` | ||
| WITH data(content, embedding) AS (VALUES ${values}) | ||
| UPDATE ${posts} | ||
| SET embedding = data.embedding | ||
| FROM data | ||
| WHERE ${posts.content} = data.content | ||
| `); |
There was a problem hiding this comment.
The bulk update join using WHERE posts.content = data.content will be extremely inefficient on large datasets. Since posts.content is a jsonb column, Postgres cannot use a standard B-tree index for this comparison, likely resulting in a sequential scan of the posts table for every batch (or a very expensive hash join).
To make the "big" seed viable, consider:
- Adding a temporary GIN index on the
contentcolumn before running the script and dropping it afterwards. - Increasing the
UPDATE_BATCH_SIZE(currently 50) to a larger value like 500 or 1000 to reduce the number of round-trips and table scans. - If the dataset allows, using a more efficient unique identifier (like a hash of the content) for matching.
|
|
||
| const EMBEDDED_POSTS_FILE_PATTERN = | ||
| /^reddit-posts\.embeddings\.part-\d+\.jsonl\.gz$/; | ||
| const UPDATE_BATCH_SIZE = 50; |
There was a problem hiding this comment.
📝 WalkthroughWalkthroughThis PR introduces Reddit dataset seeding infrastructure with embeddings support, refactoring the backend seed system to load compressed post data instead of synthesizing content, and adding a configurable seed size parameter for different data volumes. ChangesReddit Dataset and Embedding Seeding
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
apps/backend/src/seed/apply-reddit-embeddings.ts (1)
85-91: ⚖️ Poor tradeoffContent-based matching may be slow without an index and risks updating duplicates.
The
WHERE posts.content = data.contentrelies on JSONB equality. On large datasets this could be slow without an index onposts.content. Additionally, if two posts have identical content, both will be updated to the same embedding.Consider adding a unique constraint or index on content, or matching by a more reliable identifier if available.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/src/seed/apply-reddit-embeddings.ts` around lines 85 - 91, The UPDATE currently matches rows by JSONB equality via WHERE posts.content = data.content which is slow and may update multiple rows with identical content; change the ingestion so the VALUES include a reliable identifier (e.g., post id) and update by that id (modify the VALUES construction and the CTE to be WITH data(id, content, embedding) and use WHERE ${posts.id} = data.id in the db.execute sql), and/or add a unique constraint or GIN/index on ${posts.content} if ids aren't available to speed equality checks; update any code paths that build values (the values variable) and the db.execute call to use the id column instead of or in addition to content.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/src/seed/seed.controller.ts`:
- Around line 10-14: The reseed endpoint currently decorated with
`@AllowAnonymous`() (method reseed in SeedController calling seedService.reseed)
allows unauthenticated destructive access; remove or replace `@AllowAnonymous`()
and protect the endpoint by adding an environment or role guard: either add a
NonProdGuard/RequireAdmin guard via `@UseGuards`(...) that blocks access when
NODE_ENV === 'production' or add an immediate runtime check in reseed (if
process.env.NODE_ENV === 'production' throw Forbidden) so the destructive
seedService.reseed can only run in non-production contexts.
In `@apps/backend/src/seed/seed.service.ts`:
- Around line 44-66: The loadRedditPosts function currently only closes
rl/fileStream/gunzip when the loop breaks early; wrap the readline consumption
in a try...finally (or ensure a finally block) so that after the for-await loop
finishes naturally you always call rl.close(), fileStream.destroy(), and
gunzip.destroy(); reference the symbols rl, fileStream, and gunzip in your
finally to ensure timely resource release regardless of whether limit caused an
early break.
---
Nitpick comments:
In `@apps/backend/src/seed/apply-reddit-embeddings.ts`:
- Around line 85-91: The UPDATE currently matches rows by JSONB equality via
WHERE posts.content = data.content which is slow and may update multiple rows
with identical content; change the ingestion so the VALUES include a reliable
identifier (e.g., post id) and update by that id (modify the VALUES construction
and the CTE to be WITH data(id, content, embedding) and use WHERE ${posts.id} =
data.id in the db.execute sql), and/or add a unique constraint or GIN/index on
${posts.content} if ids aren't available to speed equality checks; update any
code paths that build values (the values variable) and the db.execute call to
use the id column instead of or in addition to content.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: fa171972-91e9-4398-b4d8-259028fe68f3
⛔ Files ignored due to path filters (6)
apps/backend/src/seed/reddit-posts.embeddings.part-0001.jsonl.gzis excluded by!**/*.gzapps/backend/src/seed/reddit-posts.embeddings.part-0002.jsonl.gzis excluded by!**/*.gzapps/backend/src/seed/reddit-posts.embeddings.part-0003.jsonl.gzis excluded by!**/*.gzapps/backend/src/seed/reddit-posts.embeddings.part-0004.jsonl.gzis excluded by!**/*.gzapps/backend/src/seed/reddit-posts.embeddings.part-0005.jsonl.gzis excluded by!**/*.gzapps/backend/src/seed/reddit-posts.jsonl.gzis excluded by!**/*.gz
📒 Files selected for processing (6)
apps/backend/nest-cli.jsonapps/backend/package.jsonapps/backend/src/seed/apply-reddit-embeddings.tsapps/backend/src/seed/seed.controller.tsapps/backend/src/seed/seed.module.tsapps/backend/src/seed/seed.service.ts
💤 Files with no reviewable changes (1)
- apps/backend/src/seed/seed.module.ts
| @AllowAnonymous() | ||
| @Post() | ||
| async reseed() { | ||
| return this.seedService.reseed(); | ||
| async reseed(@Query("size") size?: string) { | ||
| const seedSize = size === "big" ? "big" : "small"; | ||
| return this.seedService.reseed(seedSize); |
There was a problem hiding this comment.
Anonymous access to destructive seed endpoint is risky.
The @AllowAnonymous() decorator allows unauthenticated users to trigger reseed(), which truncates all data. If this endpoint is accidentally exposed in production, it could cause total data loss.
Consider protecting this endpoint with environment checks or removing it from production builds entirely.
🛡️ Proposed guard to restrict to non-production environments
+import { Controller, Post, Query, ForbiddenException } from "@nestjs/common";
-import { Controller, Post, Query } from "@nestjs/common";
import { AllowAnonymous } from "@thallesp/nestjs-better-auth";
import { SeedService } from "./seed.service";
`@Controller`("seed")
export class SeedController {
constructor(private readonly seedService: SeedService) {}
`@AllowAnonymous`()
`@Post`()
async reseed(`@Query`("size") size?: string) {
+ if (process.env.NODE_ENV === "production") {
+ throw new ForbiddenException("Seeding is disabled in production");
+ }
const seedSize = size === "big" ? "big" : "small";
return this.seedService.reseed(seedSize);
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @AllowAnonymous() | |
| @Post() | |
| async reseed() { | |
| return this.seedService.reseed(); | |
| async reseed(@Query("size") size?: string) { | |
| const seedSize = size === "big" ? "big" : "small"; | |
| return this.seedService.reseed(seedSize); | |
| import { Controller, Post, Query, ForbiddenException } from "@nestjs/common"; | |
| import { AllowAnonymous } from "@thallesp/nestjs-better-auth"; | |
| import { SeedService } from "./seed.service"; | |
| `@Controller`("seed") | |
| export class SeedController { | |
| constructor(private readonly seedService: SeedService) {} | |
| `@AllowAnonymous`() | |
| `@Post`() | |
| async reseed(`@Query`("size") size?: string) { | |
| if (process.env.NODE_ENV === "production") { | |
| throw new ForbiddenException("Seeding is disabled in production"); | |
| } | |
| const seedSize = size === "big" ? "big" : "small"; | |
| return this.seedService.reseed(seedSize); | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/backend/src/seed/seed.controller.ts` around lines 10 - 14, The reseed
endpoint currently decorated with `@AllowAnonymous`() (method reseed in
SeedController calling seedService.reseed) allows unauthenticated destructive
access; remove or replace `@AllowAnonymous`() and protect the endpoint by adding
an environment or role guard: either add a NonProdGuard/RequireAdmin guard via
`@UseGuards`(...) that blocks access when NODE_ENV === 'production' or add an
immediate runtime check in reseed (if process.env.NODE_ENV === 'production'
throw Forbidden) so the destructive seedService.reseed can only run in
non-production contexts.
| async function loadRedditPosts(limit?: number): Promise<RedditPost[]> { | ||
| const result: RedditPost[] = []; | ||
| const fileStream = fs.createReadStream(REDDIT_POSTS_PATH); | ||
| const gunzip = zlib.createGunzip(); | ||
| const rl = readline.createInterface({ | ||
| input: fileStream.pipe(gunzip), | ||
| crlfDelay: Infinity, | ||
| }); | ||
|
|
||
| for await (const line of rl) { | ||
| const trimmed = line.trim(); | ||
| if (!trimmed) continue; | ||
| result.push(JSON.parse(trimmed) as RedditPost); | ||
| if (limit !== undefined && result.length >= limit) { | ||
| rl.close(); | ||
| fileStream.destroy(); | ||
| gunzip.destroy(); | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| } |
There was a problem hiding this comment.
Stream resources not closed on natural file completion.
When limit is undefined or greater than file line count, the loop completes naturally without explicitly closing rl, fileStream, and gunzip. While Node.js will eventually garbage-collect these, explicitly closing ensures timely resource release.
🛠️ Proposed fix to ensure cleanup
async function loadRedditPosts(limit?: number): Promise<RedditPost[]> {
const result: RedditPost[] = [];
const fileStream = fs.createReadStream(REDDIT_POSTS_PATH);
const gunzip = zlib.createGunzip();
const rl = readline.createInterface({
input: fileStream.pipe(gunzip),
crlfDelay: Infinity,
});
- for await (const line of rl) {
- const trimmed = line.trim();
- if (!trimmed) continue;
- result.push(JSON.parse(trimmed) as RedditPost);
- if (limit !== undefined && result.length >= limit) {
+ try {
+ for await (const line of rl) {
+ const trimmed = line.trim();
+ if (!trimmed) continue;
+ result.push(JSON.parse(trimmed) as RedditPost);
+ if (limit !== undefined && result.length >= limit) {
+ break;
+ }
+ }
+ } finally {
- rl.close();
- fileStream.destroy();
- gunzip.destroy();
- break;
- }
+ rl.close();
+ fileStream.destroy();
+ gunzip.destroy();
}
return result;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async function loadRedditPosts(limit?: number): Promise<RedditPost[]> { | |
| const result: RedditPost[] = []; | |
| const fileStream = fs.createReadStream(REDDIT_POSTS_PATH); | |
| const gunzip = zlib.createGunzip(); | |
| const rl = readline.createInterface({ | |
| input: fileStream.pipe(gunzip), | |
| crlfDelay: Infinity, | |
| }); | |
| for await (const line of rl) { | |
| const trimmed = line.trim(); | |
| if (!trimmed) continue; | |
| result.push(JSON.parse(trimmed) as RedditPost); | |
| if (limit !== undefined && result.length >= limit) { | |
| rl.close(); | |
| fileStream.destroy(); | |
| gunzip.destroy(); | |
| break; | |
| } | |
| } | |
| return result; | |
| } | |
| async function loadRedditPosts(limit?: number): Promise<RedditPost[]> { | |
| const result: RedditPost[] = []; | |
| const fileStream = fs.createReadStream(REDDIT_POSTS_PATH); | |
| const gunzip = zlib.createGunzip(); | |
| const rl = readline.createInterface({ | |
| input: fileStream.pipe(gunzip), | |
| crlfDelay: Infinity, | |
| }); | |
| try { | |
| for await (const line of rl) { | |
| const trimmed = line.trim(); | |
| if (!trimmed) continue; | |
| result.push(JSON.parse(trimmed) as RedditPost); | |
| if (limit !== undefined && result.length >= limit) { | |
| break; | |
| } | |
| } | |
| } finally { | |
| rl.close(); | |
| fileStream.destroy(); | |
| gunzip.destroy(); | |
| } | |
| return result; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/backend/src/seed/seed.service.ts` around lines 44 - 66, The
loadRedditPosts function currently only closes rl/fileStream/gunzip when the
loop breaks early; wrap the readline consumption in a try...finally (or ensure a
finally block) so that after the for-await loop finishes naturally you always
call rl.close(), fileStream.destroy(), and gunzip.destroy(); reference the
symbols rl, fileStream, and gunzip in your finally to ensure timely resource
release regardless of whether limit caused an early break.
Summary by CodeRabbit
New Features
seed:embeddingscommand.Chores