feat: reddit seed data by 012e · Pull Request #59 · 012e/thesis

012e · 2026-05-13T15:47:28Z

Summary by CodeRabbit

New Features
- Seeding system now uses real Reddit data with configurable size options (small/big).
- Added ability to apply embeddings to seeded posts via new seed:embeddings command.
Chores
- Updated build configuration to include gzipped seed dataset files.
- Removed unused module dependencies.

netlify · 2026-05-13T15:47:37Z

✅ Deploy Preview for intodev ready!

Name	Link
🔨 Latest commit	`57750a2`
🔍 Latest deploy log	https://app.netlify.com/projects/intodev/deploys/6a049da8d54a8d00085c8457
😎 Deploy Preview	https://deploy-preview-59--intodev.netlify.app

gemini-code-assist

Code Review

This pull request refactors the seeding process to use real Reddit post data from compressed JSONL files and introduces a standalone script for applying embeddings. Key changes include updating the reseed endpoint to support different dataset sizes and implementing batch insertion for posts. Feedback focuses on potential memory issues when loading large datasets in SeedService and performance bottlenecks in the embedding update script due to joining on jsonb columns.

gemini-code-assist · 2026-05-13T15:49:04Z

+async function loadRedditPosts(limit?: number): Promise<RedditPost[]> {
+  const result: RedditPost[] = [];
+  const fileStream = fs.createReadStream(REDDIT_POSTS_PATH);
+  const gunzip = zlib.createGunzip();
+  const rl = readline.createInterface({
+    input: fileStream.pipe(gunzip),
+    crlfDelay: Infinity,
+  });
+
+  for await (const line of rl) {
+    const trimmed = line.trim();
+    if (!trimmed) continue;
+    result.push(JSON.parse(trimmed) as RedditPost);
+    if (limit !== undefined && result.length >= limit) {
+      rl.close();
+      fileStream.destroy();
+      gunzip.destroy();
+      break;
+    }
+  }
+
+  return result;
+}


The loadRedditPosts function currently loads the entire dataset into memory when no limit is provided (which happens when size === "big"). This poses a significant risk of Out-Of-Memory (OOM) errors if the Reddit dataset is large.

Additionally, the seedPostsBatched method (lines 170-180) performs a .map() over the entire array, doubling the memory pressure.

It is highly recommended to refactor this to use an AsyncGenerator to stream the data from the file and process it in chunks, similar to the implementation in apply-reddit-embeddings.ts.
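The streaming refactor suggested above could look roughly like the following. This is a minimal sketch, not the code in apply-reddit-embeddings.ts: the name `streamRedditPosts`, the `chunkSize` parameter, and the loose `RedditPost` type are assumptions made for illustration.

```typescript
import * as fs from "node:fs";
import * as readline from "node:readline";
import * as zlib from "node:zlib";

// Assumed shape for illustration; the real RedditPost type lives in seed.service.ts.
type RedditPost = Record<string, unknown>;

/**
 * Yields posts from a gzipped JSONL file in arrays of at most `chunkSize`,
 * stopping after `limit` posts when a limit is given.
 */
export async function* streamRedditPosts(
  path: string,
  chunkSize: number,
  limit?: number,
): AsyncGenerator<RedditPost[]> {
  if (limit !== undefined && limit <= 0) return;

  const fileStream = fs.createReadStream(path);
  const gunzip = zlib.createGunzip();
  const rl = readline.createInterface({
    input: fileStream.pipe(gunzip),
    crlfDelay: Infinity,
  });

  let chunk: RedditPost[] = [];
  let seen = 0;
  try {
    for await (const line of rl) {
      const trimmed = line.trim();
      if (!trimmed) continue; // skip blank lines
      chunk.push(JSON.parse(trimmed) as RedditPost);
      seen += 1;
      const done = limit !== undefined && seen >= limit;
      if (chunk.length >= chunkSize || done) {
        yield chunk;
        chunk = [];
      }
      if (done) break;
    }
    if (chunk.length > 0) yield chunk; // flush the final partial chunk
  } finally {
    // Runs on normal completion, early break, and parse errors alike.
    rl.close();
    fileStream.destroy();
    gunzip.destroy();
  }
}
```

A caller would insert each yielded chunk before requesting the next one, so the caller holds one chunk at a time rather than the whole dataset. How much readline buffers ahead of the consumer depends on the Node.js version, so this reduces memory pressure without guaranteeing a strict bound.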

gemini-code-assist · 2026-05-13T15:49:04Z

+  const result = await db.execute(sql`
+    WITH data(content, embedding) AS (VALUES ${values})
+    UPDATE ${posts}
+    SET embedding = data.embedding
+    FROM data
+    WHERE ${posts.content} = data.content
+  `);


The bulk update join using WHERE posts.content = data.content will be extremely inefficient on large datasets. posts.content is a jsonb column, so the comparison is whole-document equality. Unless the column carries a B-tree or hash index (and large documents can exceed the B-tree entry size limit), Postgres will likely run a sequential scan of the posts table for every batch (or a very expensive hash join).

To make the "big" seed viable, consider:

- Adding a temporary index on the content column before running the script and dropping it afterwards. For whole-document equality a hash index is the usual fit; the GIN operator classes for jsonb cover containment and key-existence operators rather than =.
- Increasing the UPDATE_BATCH_SIZE (currently 50) to a larger value like 500 or 1000 to reduce the number of round-trips and table scans.
- If the dataset allows, using a more efficient unique identifier (like a hash of the content) for matching.
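As an illustration of the last option, a content hash could be computed when posts are inserted and again when embeddings are applied, then matched on a plain text column. The sketch below is an assumption-laden example: `contentHash` and `canonicalize` are hypothetical helpers, and no hash column exists in the PR. Keys are sorted before hashing because jsonb equality ignores key order, while `JSON.stringify` does not.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialises a JSON value with object keys in sorted order, so two values that
// differ only in key order produce the same string.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`).join(",")}}`;
}

// Hypothetical helper: a hex SHA-256 digest of the canonical form.
export function contentHash(content: Json): string {
  return createHash("sha256").update(canonicalize(content)).digest("hex");
}
```

Both sides of the match have to be hashed by the same code path. Postgres normalises jsonb on storage, so hashing a value read back from the database may not agree with hashing the original file line in every case (number formatting is one place to check).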

gemini-code-assist · 2026-05-13T15:49:04Z

+
+const EMBEDDED_POSTS_FILE_PATTERN =
+  /^reddit-posts\.embeddings\.part-\d+\.jsonl\.gz$/;
+const UPDATE_BATCH_SIZE = 50;


Increasing the batch size will significantly reduce the number of sequential scans and round-trips required to update the embeddings, especially given the current join strategy on a jsonb column.

Suggested change

-const UPDATE_BATCH_SIZE = 50;
+const UPDATE_BATCH_SIZE = 500;
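To make the round-trip arithmetic concrete, here is a small sketch using a generic `chunk` helper. The helper and the row count of 10,000 are hypothetical; the actual number of embedded rows is not stated in this PR.

```typescript
// Splits `items` into consecutive arrays of at most `size` elements.
export function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Hypothetical workload: 10,000 rows waiting for an embedding update.
const rows = Array.from({ length: 10_000 }, (_, i) => i);
export const statementsAt50 = chunk(rows, 50).length; // one UPDATE per batch
export const statementsAt500 = chunk(rows, 500).length;
```

Under that assumption the script issues 200 UPDATE statements at a batch size of 50 and 20 at a batch size of 500. Each statement still pays for the content join, so a larger batch reduces the number of scans rather than the cost of a single scan.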

coderabbitai · 2026-05-13T15:53:07Z

Walkthrough

This PR introduces Reddit dataset seeding infrastructure with embeddings support, refactoring the backend seed system to load compressed post data instead of synthesizing content, and adding a configurable seed size parameter for different data volumes.

Changes

Reddit Dataset and Embedding Seeding

Layer / File(s)	Summary
Build and Script Configuration `apps/backend/nest-cli.json`, `apps/backend/package.json`	Nest CLI now copies compressed seed files (`seed/*.jsonl.gz`) to the distribution directory, and a new `seed:embeddings` npm script runs the embeddings application tool.
Embedding Application Script `apps/backend/src/seed/apply-reddit-embeddings.ts`	Standalone Node script streams gzipped Reddit embedding shards line-by-line, decodes base64 vectors into PostgreSQL format, and batch-updates the `posts` table's `embedding` column by matching post content, with progress tracking and proper resource cleanup.
Controller and Module Wiring `apps/backend/src/seed/seed.controller.ts`, `apps/backend/src/seed/seed.module.ts`	SeedController's reseed endpoint now accepts an optional `size` query parameter (mapping to "small" or "big" seed presets), and PollsModule dependency was removed from SeedModule.
SeedService Core Refactoring `apps/backend/src/seed/seed.service.ts`	SeedService now loads compressed Reddit posts from disk via streaming JSONL, bulk-inserts them across users in batches, distributes post IDs for per-user comment seeding, and replaced synthesized content helpers with sentence-based text generation. Supports configurable seed sizing via the new `SeedSize` type.
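
The embedding script row above mentions decoding base64 vectors into PostgreSQL format. The sketch below shows one way such a step could work. It assumes the shards store little-endian float32 values and that the target column accepts the bracketed text form used by pgvector; neither detail is confirmed by this PR, and `base64ToVectorLiteral` is a hypothetical name.

```typescript
// Decodes a base64 string of packed little-endian float32 values (an assumed
// encoding) into a bracketed vector literal such as "[0.5,1,2]".
export function base64ToVectorLiteral(b64: string): string {
  const bytes = Buffer.from(b64, "base64");
  if (bytes.byteLength % 4 !== 0) {
    throw new Error("byte length is not a multiple of 4");
  }
  // Respect byteOffset: small Buffers may share a pooled ArrayBuffer.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const values: number[] = [];
  for (let offset = 0; offset < bytes.byteLength; offset += 4) {
    values.push(view.getFloat32(offset, true)); // true = little-endian
  }
  return `[${values.join(",")}]`;
}
```

The resulting string would be passed as a query parameter and cast to the column type by the database, which keeps the decoding step independent of any particular driver.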

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hops through datasets, seeds compress and expand,
Embeddings dance from file to database land,
Where Reddit posts bloom in batches so grand,
And sizing controls guide with a tweet and a hand,
Our thesis takes shape, oh so wonderfully planned! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: reddit seed data' directly matches the main change: adding Reddit dataset seeding functionality with new script, configuration, and service refactoring.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

apps/backend/src/seed/apply-reddit-embeddings.ts (1)
85-91: ⚖️ Poor tradeoff

Content-based matching may be slow without an index and risks updating duplicates.

The WHERE posts.content = data.content relies on JSONB equality. On large datasets this could be slow without an index on posts.content. Additionally, if two posts have identical content, both will be updated to the same embedding.

Consider adding a unique constraint or index on content, or matching by a more reliable identifier if available.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/src/seed/apply-reddit-embeddings.ts` around lines 85 - 91, The
UPDATE currently matches rows by JSONB equality via WHERE posts.content =
data.content which is slow and may update multiple rows with identical content;
change the ingestion so the VALUES include a reliable identifier (e.g., post id)
and update by that id (modify the VALUES construction and the CTE to be WITH
data(id, content, embedding) and use WHERE ${posts.id} = data.id in the
db.execute sql), and/or add a unique constraint or GIN/index on ${posts.content}
if ids aren't available to speed equality checks; update any code paths that
build values (the values variable) and the db.execute call to use the id column
instead of or in addition to content.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/src/seed/seed.controller.ts`:
- Around line 10-14: The reseed endpoint currently decorated with
`@AllowAnonymous`() (method reseed in SeedController calling seedService.reseed)
allows unauthenticated destructive access; remove or replace `@AllowAnonymous`()
and protect the endpoint by adding an environment or role guard: either add a
NonProdGuard/RequireAdmin guard via `@UseGuards`(...) that blocks access when
NODE_ENV === 'production' or add an immediate runtime check in reseed (if
process.env.NODE_ENV === 'production' throw Forbidden) so the destructive
seedService.reseed can only run in non-production contexts.

In `@apps/backend/src/seed/seed.service.ts`:
- Around line 44-66: The loadRedditPosts function currently only closes
rl/fileStream/gunzip when the loop breaks early; wrap the readline consumption
in a try...finally (or ensure a finally block) so that after the for-await loop
finishes naturally you always call rl.close(), fileStream.destroy(), and
gunzip.destroy(); reference the symbols rl, fileStream, and gunzip in your
finally to ensure timely resource release regardless of whether limit caused an
early break.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fa171972-91e9-4398-b4d8-259028fe68f3

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf28a5 and 57750a2.

⛔ Files ignored due to path filters (6)

apps/backend/src/seed/reddit-posts.embeddings.part-0001.jsonl.gz is excluded by !**/*.gz
apps/backend/src/seed/reddit-posts.embeddings.part-0002.jsonl.gz is excluded by !**/*.gz
apps/backend/src/seed/reddit-posts.embeddings.part-0003.jsonl.gz is excluded by !**/*.gz
apps/backend/src/seed/reddit-posts.embeddings.part-0004.jsonl.gz is excluded by !**/*.gz
apps/backend/src/seed/reddit-posts.embeddings.part-0005.jsonl.gz is excluded by !**/*.gz
apps/backend/src/seed/reddit-posts.jsonl.gz is excluded by !**/*.gz

📒 Files selected for processing (6)

apps/backend/nest-cli.json
apps/backend/package.json
apps/backend/src/seed/apply-reddit-embeddings.ts
apps/backend/src/seed/seed.controller.ts
apps/backend/src/seed/seed.module.ts
apps/backend/src/seed/seed.service.ts

💤 Files with no reviewable changes (1)

apps/backend/src/seed/seed.module.ts

coderabbitai · 2026-05-13T15:58:39Z

  @AllowAnonymous()
  @Post()
-  async reseed() {
-    return this.seedService.reseed();
+  async reseed(@Query("size") size?: string) {
+    const seedSize = size === "big" ? "big" : "small";
+    return this.seedService.reseed(seedSize);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Anonymous access to destructive seed endpoint is risky.

The @AllowAnonymous() decorator allows unauthenticated users to trigger reseed(), which truncates all data. If this endpoint is accidentally exposed in production, it could cause total data loss.

Consider protecting this endpoint with environment checks or removing it from production builds entirely.

🛡️ Proposed guard to restrict to non-production environments

-import { Controller, Post, Query } from "@nestjs/common";
+import { Controller, Post, Query, ForbiddenException } from "@nestjs/common";
 import { AllowAnonymous } from "@thallesp/nestjs-better-auth";
 import { SeedService } from "./seed.service";

 @Controller("seed")
 export class SeedController {
   constructor(private readonly seedService: SeedService) {}

   @AllowAnonymous()
   @Post()
   async reseed(@Query("size") size?: string) {
+    if (process.env.NODE_ENV === "production") {
+      throw new ForbiddenException("Seeding is disabled in production");
+    }
     const seedSize = size === "big" ? "big" : "small";
     return this.seedService.reseed(seedSize);
   }
 }


coderabbitai · 2026-05-13T15:58:39Z

+async function loadRedditPosts(limit?: number): Promise<RedditPost[]> {
+  const result: RedditPost[] = [];
+  const fileStream = fs.createReadStream(REDDIT_POSTS_PATH);
+  const gunzip = zlib.createGunzip();
+  const rl = readline.createInterface({
+    input: fileStream.pipe(gunzip),
+    crlfDelay: Infinity,
+  });
+
+  for await (const line of rl) {
+    const trimmed = line.trim();
+    if (!trimmed) continue;
+    result.push(JSON.parse(trimmed) as RedditPost);
+    if (limit !== undefined && result.length >= limit) {
+      rl.close();
+      fileStream.destroy();
+      gunzip.destroy();
+      break;
+    }
+  }
+
+  return result;
+}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stream resources not closed on natural file completion.

When limit is undefined or greater than file line count, the loop completes naturally without explicitly closing rl, fileStream, and gunzip. While Node.js will eventually garbage-collect these, explicitly closing ensures timely resource release.

🛠️ Proposed fix to ensure cleanup

 async function loadRedditPosts(limit?: number): Promise<RedditPost[]> {
   const result: RedditPost[] = [];
   const fileStream = fs.createReadStream(REDDIT_POSTS_PATH);
   const gunzip = zlib.createGunzip();
   const rl = readline.createInterface({
     input: fileStream.pipe(gunzip),
     crlfDelay: Infinity,
   });

-  for await (const line of rl) {
-    const trimmed = line.trim();
-    if (!trimmed) continue;
-    result.push(JSON.parse(trimmed) as RedditPost);
-    if (limit !== undefined && result.length >= limit) {
-      rl.close();
-      fileStream.destroy();
-      gunzip.destroy();
-      break;
-    }
-  }
+  try {
+    for await (const line of rl) {
+      const trimmed = line.trim();
+      if (!trimmed) continue;
+      result.push(JSON.parse(trimmed) as RedditPost);
+      if (limit !== undefined && result.length >= limit) {
+        break;
+      }
+    }
+  } finally {
+    rl.close();
+    fileStream.destroy();
+    gunzip.destroy();
+  }

   return result;
 }


feat: reddit seed data

57750a2

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

012e force-pushed the reddit-seed branch from 233a293 to 57750a2 Compare May 13, 2026 15:49

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

feat: reddit seed data #59
012e wants to merge 1 commit into main from reddit-seed
