feat: reddit seed data #59

Open
012e wants to merge 1 commit into main from reddit-seed

Conversation

012e (Owner) commented May 13, 2026

Summary by CodeRabbit

  • New Features

    • Seeding system now uses real Reddit data with configurable size options (small/big).
    • Added ability to apply embeddings to seeded posts via new seed:embeddings command.
  • Chores

    • Updated build configuration to include gzipped seed dataset files.
    • Removed unused module dependencies.

netlify Bot commented May 13, 2026

Deploy Preview for intodev ready!

Name Link
🔨 Latest commit 57750a2
🔍 Latest deploy log https://app.netlify.com/projects/intodev/deploys/6a049da8d54a8d00085c8457
😎 Deploy Preview https://deploy-preview-59--intodev.netlify.app

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request refactors the seeding process to use real Reddit post data from compressed JSONL files and introduces a standalone script for applying embeddings. Key changes include updating the reseed endpoint to support different dataset sizes and implementing batch insertion for posts. Feedback focuses on potential memory issues when loading large datasets in SeedService and performance bottlenecks in the embedding update script due to joining on jsonb columns.

Comment on lines +44 to +66
async function loadRedditPosts(limit?: number): Promise<RedditPost[]> {
  const result: RedditPost[] = [];
  const fileStream = fs.createReadStream(REDDIT_POSTS_PATH);
  const gunzip = zlib.createGunzip();
  const rl = readline.createInterface({
    input: fileStream.pipe(gunzip),
    crlfDelay: Infinity,
  });

  for await (const line of rl) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    result.push(JSON.parse(trimmed) as RedditPost);
    if (limit !== undefined && result.length >= limit) {
      rl.close();
      fileStream.destroy();
      gunzip.destroy();
      break;
    }
  }

  return result;
}

Severity: high

The loadRedditPosts function currently loads the entire dataset into memory when no limit is provided (which happens when size === "big"). This poses a significant risk of Out-Of-Memory (OOM) errors if the Reddit dataset is large.

Additionally, the seedPostsBatched method (lines 170-180) performs a .map() over the entire array, doubling the memory pressure.

It is highly recommended to refactor this to use an AsyncGenerator to stream the data from the file and process it in chunks, similar to the implementation in apply-reddit-embeddings.ts.
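
A minimal sketch of the streaming approach recommended above, assuming the same gzipped JSONL layout as reddit-posts.jsonl.gz. The helper name `readJsonlGzBatches` and its `batchSize` parameter are hypothetical and do not come from the PR; the point is only that one batch at a time is held in memory.

```typescript
import * as fs from "node:fs";
import * as readline from "node:readline";
import * as zlib from "node:zlib";

// Hypothetical helper (not from the PR): yields parsed JSONL records from a
// gzipped file in fixed-size batches, so the full dataset is never in memory.
export async function* readJsonlGzBatches<T>(
  filePath: string,
  batchSize: number,
): AsyncGenerator<T[]> {
  const fileStream = fs.createReadStream(filePath);
  const gunzip = zlib.createGunzip();
  const rl = readline.createInterface({
    input: fileStream.pipe(gunzip),
    crlfDelay: Infinity,
  });

  let batch: T[] = [];
  try {
    for await (const line of rl) {
      const trimmed = line.trim();
      if (!trimmed) continue; // skip blank lines
      batch.push(JSON.parse(trimmed) as T);
      if (batch.length >= batchSize) {
        yield batch;
        batch = [];
      }
    }
    // Flush the final, possibly shorter, batch.
    if (batch.length > 0) yield batch;
  } finally {
    // Runs on normal completion, on errors, and when the consumer stops early.
    rl.close();
    fileStream.destroy();
    gunzip.destroy();
  }
}
```

A caller could then write `for await (const batch of readJsonlGzBatches<RedditPost>(REDDIT_POSTS_PATH, 500)) { ... }` and insert each batch before the next one is read.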

Comment on lines +85 to +91
const result = await db.execute(sql`
  WITH data(content, embedding) AS (VALUES ${values})
  UPDATE ${posts}
  SET embedding = data.embedding
  FROM data
  WHERE ${posts.content} = data.content
`);

Severity: high

The bulk update join using WHERE posts.content = data.content will be very inefficient on large datasets. posts.content is a jsonb column, and unless it carries a B-tree or hash index (jsonb supports both for whole-document equality), Postgres has to scan the posts table sequentially, or build an expensive hash join, for every batch.

To make the "big" seed viable, consider:

  1. Adding a temporary hash index on the content column before running the script and dropping it afterwards. A GIN index would not help here, because it serves jsonb containment and existence operators such as @> rather than =.
  2. Increasing the UPDATE_BATCH_SIZE (currently 50) to a larger value like 500 or 1000 to reduce the number of round-trips and table scans.
  3. If the dataset allows, using a more efficient unique identifier (like a hash of the content) for matching.
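
As an illustration of the content-hash option, here is a small sketch that derives a stable match key from a JSON document. The function names and the idea of storing the result in a separate indexed column are assumptions for illustration, not part of the PR. Keys are sorted before hashing because jsonb equality ignores object key order, so two documents Postgres treats as equal should hash identically.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so key order does not affect the result.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const keys = Object.keys(value).sort();
  const entries = keys.map(
    (k) => `${JSON.stringify(k)}:${canonicalize(value[k] as Json)}`,
  );
  return `{${entries.join(",")}}`;
}

// Hypothetical match key: hex-encoded SHA-256 of the canonical form.
export function contentHash(content: Json): string {
  return createHash("sha256").update(canonicalize(content)).digest("hex");
}
```

Matching on a fixed-length text key like this allows an ordinary B-tree index, at the cost of computing the hash both when seeding posts and when applying embeddings. Posts with identical content would still collide, so this addresses speed rather than the duplicate-content concern.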


const EMBEDDED_POSTS_FILE_PATTERN =
  /^reddit-posts\.embeddings\.part-\d+\.jsonl\.gz$/;
const UPDATE_BATCH_SIZE = 50;

Severity: medium

Increasing the batch size will significantly reduce the number of sequential scans and round-trips required to update the embeddings, especially given the current join strategy on a jsonb column.

Suggested change
-const UPDATE_BATCH_SIZE = 50;
+const UPDATE_BATCH_SIZE = 500;
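
To put the batch-size suggestion in numbers, the sketch below counts the UPDATE statements needed for a given row count. The figure of 100,000 rows is a hypothetical chosen for illustration; the PR does not state the dataset size.

```typescript
// Number of UPDATE statements needed to apply `rows` embeddings
// when each statement carries at most `batchSize` rows.
export function updateRoundTrips(rows: number, batchSize: number): number {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  return Math.ceil(rows / batchSize);
}

// 100_000 is a hypothetical row count, not a figure from the PR.
console.log(updateRoundTrips(100_000, 50)); // 2000
console.log(updateRoundTrips(100_000, 500)); // 200
```

If each statement triggers a scan of the posts table, a tenfold larger batch means roughly a tenfold reduction in scans, although each statement also carries a larger VALUES list.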

coderabbitai Bot (Contributor) commented May 13, 2026

Walkthrough

This PR introduces Reddit dataset seeding infrastructure with embeddings support, refactoring the backend seed system to load compressed post data instead of synthesizing content, and adding a configurable seed size parameter for different data volumes.

Changes

Reddit Dataset and Embedding Seeding

| Layer / File(s) | Summary |
|---|---|
| Build and Script Configuration (apps/backend/nest-cli.json, apps/backend/package.json) | Nest CLI now copies compressed seed files (seed/*.jsonl.gz) to the distribution directory, and a new seed:embeddings npm script runs the embeddings application tool. |
| Embedding Application Script (apps/backend/src/seed/apply-reddit-embeddings.ts) | Standalone Node script streams gzipped Reddit embedding shards line-by-line, decodes base64 vectors into PostgreSQL format, and batch-updates the posts table's embedding column by matching post content, with progress tracking and proper resource cleanup. |
| Controller and Module Wiring (apps/backend/src/seed/seed.controller.ts, apps/backend/src/seed/seed.module.ts) | SeedController's reseed endpoint now accepts an optional size query parameter (mapping to "small" or "big" seed presets), and the PollsModule dependency was removed from SeedModule. |
| SeedService Core Refactoring (apps/backend/src/seed/seed.service.ts) | SeedService now loads compressed Reddit posts from disk via streaming JSONL, bulk-inserts them across users in batches, distributes post IDs for per-user comment seeding, and replaces synthesized content helpers with sentence-based text generation. Supports configurable seed sizing via the new SeedSize type. |
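
The walkthrough mentions decoding base64 vectors into PostgreSQL format. As a rough sketch of what such a step can look like, the function below assumes each vector is the base64 encoding of little-endian float32 values and that the target column accepts a pgvector-style text literal such as `[1,2,3]`. Both points are assumptions, since the PR page does not show the shard format or the column type, and the function name is hypothetical.

```typescript
// Assumption: the input is base64 of packed little-endian float32 values.
// Assumption: the database accepts a bracketed, comma-separated literal.
export function base64ToVectorLiteral(b64: string): string {
  const bytes = Buffer.from(b64, "base64");
  if (bytes.length % 4 !== 0) {
    throw new Error("decoded length is not a multiple of 4 bytes");
  }
  const values: number[] = [];
  for (let offset = 0; offset < bytes.length; offset += 4) {
    values.push(bytes.readFloatLE(offset));
  }
  return `[${values.join(",")}]`;
}
```

Shipping vectors as base64 keeps the shard files smaller than decimal text would; the conversion to a text literal then happens once, right before the batch UPDATE.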

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hops through datasets, seeds compress and expand,
Embeddings dance from file to database land,
Where Reddit posts bloom in batches so grand,
And sizing controls guide with a tweet and a hand,
Our thesis takes shape, oh so wonderfully planned! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: reddit seed data' directly matches the main change: adding Reddit dataset seeding functionality with new script, configuration, and service refactoring. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
apps/backend/src/seed/apply-reddit-embeddings.ts (1)

85-91: ⚖️ Poor tradeoff

Content-based matching may be slow without an index and risks updating duplicates.

The WHERE posts.content = data.content relies on JSONB equality. On large datasets this could be slow without an index on posts.content. Additionally, if two posts have identical content, both will be updated to the same embedding.

Consider adding a unique constraint or index on content, or matching by a more reliable identifier if available.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/src/seed/apply-reddit-embeddings.ts` around lines 85 - 91, The
UPDATE currently matches rows by JSONB equality via WHERE posts.content =
data.content which is slow and may update multiple rows with identical content;
change the ingestion so the VALUES include a reliable identifier (e.g., post id)
and update by that id (modify the VALUES construction and the CTE to be WITH
data(id, content, embedding) and use WHERE ${posts.id} = data.id in the
db.execute sql), and/or add a unique constraint or GIN/index on ${posts.content}
if ids aren't available to speed equality checks; update any code paths that
build values (the values variable) and the db.execute call to use the id column
instead of or in addition to content.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fa171972-91e9-4398-b4d8-259028fe68f3

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf28a5 and 57750a2.

⛔ Files ignored due to path filters (6)
  • apps/backend/src/seed/reddit-posts.embeddings.part-0001.jsonl.gz is excluded by !**/*.gz
  • apps/backend/src/seed/reddit-posts.embeddings.part-0002.jsonl.gz is excluded by !**/*.gz
  • apps/backend/src/seed/reddit-posts.embeddings.part-0003.jsonl.gz is excluded by !**/*.gz
  • apps/backend/src/seed/reddit-posts.embeddings.part-0004.jsonl.gz is excluded by !**/*.gz
  • apps/backend/src/seed/reddit-posts.embeddings.part-0005.jsonl.gz is excluded by !**/*.gz
  • apps/backend/src/seed/reddit-posts.jsonl.gz is excluded by !**/*.gz
📒 Files selected for processing (6)
  • apps/backend/nest-cli.json
  • apps/backend/package.json
  • apps/backend/src/seed/apply-reddit-embeddings.ts
  • apps/backend/src/seed/seed.controller.ts
  • apps/backend/src/seed/seed.module.ts
  • apps/backend/src/seed/seed.service.ts
💤 Files with no reviewable changes (1)
  • apps/backend/src/seed/seed.module.ts

Comment on lines 10 to +14
 @AllowAnonymous()
 @Post()
-async reseed() {
-  return this.seedService.reseed();
+async reseed(@Query("size") size?: string) {
+  const seedSize = size === "big" ? "big" : "small";
+  return this.seedService.reseed(seedSize);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Anonymous access to destructive seed endpoint is risky.

The @AllowAnonymous() decorator allows unauthenticated users to trigger reseed(), which truncates all data. If this endpoint is accidentally exposed in production, it could cause total data loss.

Consider protecting this endpoint with environment checks or removing it from production builds entirely.

🛡️ Proposed guard to restrict to non-production environments
-import { Controller, Post, Query } from "@nestjs/common";
+import { Controller, Post, Query, ForbiddenException } from "@nestjs/common";
 import { AllowAnonymous } from "@thallesp/nestjs-better-auth";

 import { SeedService } from "./seed.service";

 @Controller("seed")
 export class SeedController {
   constructor(private readonly seedService: SeedService) {}

   @AllowAnonymous()
   @Post()
   async reseed(@Query("size") size?: string) {
+    if (process.env.NODE_ENV === "production") {
+      throw new ForbiddenException("Seeding is disabled in production");
+    }
     const seedSize = size === "big" ? "big" : "small";
     return this.seedService.reseed(seedSize);
   }
 }
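
The NODE_ENV check in the proposed guard only helps when the variable is actually set to "production" on the deployed host. A stricter, deny-by-default variant might look like the sketch below. The `ALLOW_SEED` flag and the `isSeedingAllowed` helper are hypothetical names introduced for illustration and are not part of the PR or of NestJS.

```typescript
// Hypothetical opt-in flag; not part of the PR.
const SEED_FLAG = "ALLOW_SEED";

// Deny by default: seeding is allowed only outside production AND only
// when the opt-in flag is explicitly set to "true".
export function isSeedingAllowed(
  env: Record<string, string | undefined>,
): boolean {
  if (env["NODE_ENV"] === "production") return false;
  return env[SEED_FLAG] === "true";
}
```

In the controller this could replace the inline check, for example `if (!isSeedingAllowed(process.env)) throw new ForbiddenException(...)`, so that a host with NODE_ENV left unset is still protected.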
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/src/seed/seed.controller.ts` around lines 10 - 14, The reseed
endpoint currently decorated with `@AllowAnonymous`() (method reseed in
SeedController calling seedService.reseed) allows unauthenticated destructive
access; remove or replace `@AllowAnonymous`() and protect the endpoint by adding
an environment or role guard: either add a NonProdGuard/RequireAdmin guard via
`@UseGuards`(...) that blocks access when NODE_ENV === 'production' or add an
immediate runtime check in reseed (if process.env.NODE_ENV === 'production'
throw Forbidden) so the destructive seedService.reseed can only run in
non-production contexts.

Comment on lines +44 to +66
async function loadRedditPosts(limit?: number): Promise<RedditPost[]> {
  const result: RedditPost[] = [];
  const fileStream = fs.createReadStream(REDDIT_POSTS_PATH);
  const gunzip = zlib.createGunzip();
  const rl = readline.createInterface({
    input: fileStream.pipe(gunzip),
    crlfDelay: Infinity,
  });

  for await (const line of rl) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    result.push(JSON.parse(trimmed) as RedditPost);
    if (limit !== undefined && result.length >= limit) {
      rl.close();
      fileStream.destroy();
      gunzip.destroy();
      break;
    }
  }

  return result;
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stream resources not closed on natural file completion.

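When limit is undefined or greater than the file's line count, the loop completes naturally without explicitly closing rl, fileStream, and gunzip. On a clean end-of-file Node.js normally closes these itself (the read stream has autoClose enabled by default, and readline closes when its input ends), but if the loop exits through an exception, for example when JSON.parse throws on a malformed line, the underlying file and gunzip streams are not destroyed. Closing all three explicitly ensures timely resource release on every path.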

🛠️ Proposed fix to ensure cleanup
 async function loadRedditPosts(limit?: number): Promise<RedditPost[]> {
   const result: RedditPost[] = [];
   const fileStream = fs.createReadStream(REDDIT_POSTS_PATH);
   const gunzip = zlib.createGunzip();
   const rl = readline.createInterface({
     input: fileStream.pipe(gunzip),
     crlfDelay: Infinity,
   });

-  for await (const line of rl) {
-    const trimmed = line.trim();
-    if (!trimmed) continue;
-    result.push(JSON.parse(trimmed) as RedditPost);
-    if (limit !== undefined && result.length >= limit) {
+  try {
+    for await (const line of rl) {
+      const trimmed = line.trim();
+      if (!trimmed) continue;
+      result.push(JSON.parse(trimmed) as RedditPost);
+      if (limit !== undefined && result.length >= limit) {
+        break;
+      }
+    }
+  } finally {
-      rl.close();
-      fileStream.destroy();
-      gunzip.destroy();
-      break;
-    }
+    rl.close();
+    fileStream.destroy();
+    gunzip.destroy();
   }

   return result;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/src/seed/seed.service.ts` around lines 44 - 66, The
loadRedditPosts function currently only closes rl/fileStream/gunzip when the
loop breaks early; wrap the readline consumption in a try...finally (or ensure a
finally block) so that after the for-await loop finishes naturally you always
call rl.close(), fileStream.destroy(), and gunzip.destroy(); reference the
symbols rl, fileStream, and gunzip in your finally to ensure timely resource
release regardless of whether limit caused an early break.
