Status: paused. Active development has stopped. The CLI prototype works as documented below; the project did not advance into Phase 1 (Next.js + Postgres). Code is preserved here as a reference and starting point if work resumes.
A pipeline that ingests podcast interviews, transcribes them with speaker diarization, and builds AI-generated profiles of the people who appear — entirely from their own words across all their appearances.
The product is the person page: a single research surface where someone's worldview, conviction-ranked positions, taste clusters, and deep-dive topics are synthesized across every episode they've been on. Episodes are internal data units, not destinations.
Every fact on a person's page comes from that person's own transcribed words. No Wikipedia, no LinkedIn, no external bios. The only outside data is the person's name and basic deduplication metadata. This constraint is non-negotiable.
Phase 0 + 0.5: complete (CLI prototype). The Andrew Huberman profile across 12 podcast appearances is the primary test case — 81 themes, 31 convictions, 130 tools, full timestamp-linked attribution.
What works today (run as TypeScript CLI scripts):
- 5-stage episode pipeline — transcribe → correct → identify speakers → extract → registry update
- 4-pass extraction by default (segmentation + entities in parallel via Haiku, theme synthesis from summaries via Sonnet, targeted quote selection via Haiku)
- Post-extraction quote correction against raw transcript utterances (no API calls)
- Lex Fridman fast path that scrapes pre-made transcripts (skips Deepgram)
- Person aggregation — semantic theme merging, conviction extraction & ranking, worldview synthesis, deep-on badge identification, taste clustering
- Static HTML profile pages with collapsible sections and timestamp-linked quotes
- Per-step cost tracking (`costs.json` ledger per episode)
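The 4-pass extraction in the list above (segmentation and entities in parallel, then theme synthesis from summaries, then targeted quote selection) can be pictured as a small orchestration function. This is a minimal sketch, not the code in `extract-4pass.ts`: every type and function name is hypothetical, the model calls are injected as plain functions, and "targeted" is assumed to mean that each theme only sees the segments it references.

```typescript
// Hypothetical shapes; the project's real schemas may differ.
interface Segment { id: number; summary: string; }
interface Theme { name: string; segmentIds: number[]; }
interface Quote { theme: string; text: string; }

// Each pass is injected so the orchestration can run without API calls.
interface Passes {
  segment(transcript: string): Promise<Segment[]>;                   // pass 1 (Haiku per the list above)
  extractEntities(transcript: string): Promise<string[]>;            // pass 2 (Haiku), independent of pass 1
  synthesizeThemes(summaries: string[]): Promise<Theme[]>;           // pass 3 (Sonnet), sees summaries only
  selectQuotes(theme: Theme, segments: Segment[]): Promise<Quote[]>; // pass 4 (Haiku)
}

interface Extraction {
  segments: Segment[];
  entities: string[];
  themes: Theme[];
  quotes: Quote[];
}

async function extractFourPass(transcript: string, passes: Passes): Promise<Extraction> {
  // Passes 1 and 2 do not depend on each other, so they run concurrently.
  const [segments, entities] = await Promise.all([
    passes.segment(transcript),
    passes.extractEntities(transcript),
  ]);

  // Pass 3 works from the segment summaries rather than the full transcript.
  const themes = await passes.synthesizeThemes(segments.map((s) => s.summary));

  // Pass 4 (assumption): each theme is given only the segments it references.
  const quoteLists = await Promise.all(
    themes.map((theme) =>
      passes.selectQuotes(
        theme,
        segments.filter((s) => theme.segmentIds.includes(s.id)),
      ),
    ),
  );

  return { segments, entities, themes, quotes: quoteLists.flat() };
}
```

Injecting the passes keeps the ordering logic testable with stubs, which matters when every real call costs money.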
Originally planned next: profile a second person, add more podcast feeds, then start Phase 1 (Next.js + PostgreSQL + Prisma + BullMQ). Not in progress — see status banner above.
- TypeScript + `tsx` (no build step in the prototype)
- Deepgram Nova-3: transcription with speaker diarization
- Anthropic Claude — extraction (Haiku + Sonnet, multi-pass), aggregation (Sonnet + Opus where nuance matters)
- Google Gemini — A/B comparison for extraction quality
- Zod — runtime schema validation with retry-on-failure
- yt-dlp — YouTube audio download
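The "runtime schema validation with retry-on-failure" item in the stack list describes a common pattern: parse the model's output, and if it fails validation, ask again with the error attached. The sketch below is one way to express that, assuming a validator that throws on bad input (a Zod schema's `parse` method behaves this way). It does not import Zod, so it stays self-contained, and `generateValidated`, `Generate`, and `Parser` are hypothetical names, not the project's.

```typescript
// Any validator that returns the typed value or throws; a Zod schema's `parse` fits this shape.
type Parser<T> = (input: unknown) => T;

// `generate` stands in for a model call. It receives the previous validation
// error (if any) so a retry prompt can mention what went wrong.
type Generate = (previousError?: string) => Promise<string>;

async function generateValidated<T>(
  generate: Generate,
  parse: Parser<T>,
  maxAttempts = 3, // assumed default, not taken from the project
): Promise<T> {
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(lastError);
    try {
      // JSON.parse throws on malformed JSON; parse throws on a schema mismatch.
      return parse(JSON.parse(raw));
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`Validation failed after ${maxAttempts} attempts: ${lastError}`);
}
```

With Zod installed, the `parse` argument could be `(x) => MySchema.parse(x)`, where `MySchema` is whatever schema the extraction step expects.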
Phase 1 was planned to add Next.js (App Router), PostgreSQL, Prisma, and BullMQ + Redis, but work stopped before it began.
git clone https://github.com/dstrunin/PodGraph.git
cd PodGraph
npm install
cp .env.example .env
# Edit .env and set DEEPGRAM_API_KEY and ANTHROPIC_API_KEY
npm run test-keys # verify both keys authenticate
End-to-end on a single episode:
# Register a podcast feed (uses iTunes Search API)
npm run add-podcast -- "Lex Fridman"
# Find appearances by a person
npm run discover -- "Andrew Huberman"
# Run the full pipeline on a YouTube or direct audio URL
npm run pipeline -- "https://www.youtube.com/watch?v=VIDEO_ID"
# After processing 1+ episodes for someone, build their profile
npm run aggregate -- "Andrew Huberman"
npm run build-profile -- "Andrew Huberman" # → output/andrew-huberman.html
For the full command reference, prompts, data file layout, and operational tips, see PODGRAPH_PIPELINE_GUIDE.md.
PodGraph/
├── scripts/ # CLI pipeline (current implementation)
│ ├── pipeline.ts # End-to-end episode pipeline
│ ├── transcribe.ts # Deepgram Nova-3 + diarization
│ ├── correct-transcript.ts # Claude proper-noun correction
│ ├── identify-speakers.ts # Claude maps speakers to real names
│ ├── extract-4pass.ts # 4-pass extraction (default)
│ ├── extract-multipass.ts # 2-pass extraction (legacy)
│ ├── correct-quotes.ts # Programmatic quote verification
│ ├── validate-extraction.ts # Cross-reference + accuracy checks
│ ├── aggregate.ts # Person aggregation pipeline
│ ├── build-profile.ts # Static HTML profile page
│ ├── lex/ # Lex Fridman fast path
│ └── lib/ # Shared schemas, manifest, dirs, cost ledger
├── prompts/ # All Claude prompts (one file each)
├── data/
│ ├── episodes/ # Per-episode artifacts (gitignored)
│ ├── profiles/ # Aggregated person profiles
│ ├── entities.json # Global entity registry
│ ├── corrections-global.json
│ └── manifest.json # Processed-episode index
└── .env.example
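The `correct-quotes.ts` entry above is described as programmatic quote verification against raw transcript utterances, with no API calls. How the script actually matches quotes is not stated here, so the following is only one plausible approach: score each utterance by token overlap with the extracted quote and accept the best match above a cutoff. The `Utterance` shape, the function names, and the 0.8 threshold are all assumptions.

```typescript
// Hypothetical utterance shape; the real transcript schema may differ.
interface Utterance { speaker: string; start: number; text: string; }

interface QuoteMatch { utterance: Utterance; score: number; }

// Lowercase and drop punctuation so differences in casing or commas still match.
function tokens(text: string): string[] {
  return text.toLowerCase().replace(/[^a-z0-9\s']/g, " ").split(/\s+/).filter(Boolean);
}

// Fraction of the quote's tokens that also occur in the utterance (0 to 1).
function overlap(quote: string, utterance: string): number {
  const quoteTokens = tokens(quote);
  if (quoteTokens.length === 0) return 0;
  const pool = new Set(tokens(utterance));
  const hits = quoteTokens.filter((t) => pool.has(t)).length;
  return hits / quoteTokens.length;
}

// Returns the best-matching utterance by the given speaker,
// or null if nothing clears the threshold.
function findQuoteSource(
  quote: string,
  speaker: string,
  utterances: Utterance[],
  threshold = 0.8, // assumed cutoff, not taken from the project
): QuoteMatch | null {
  let best: QuoteMatch | null = null;
  for (const utterance of utterances) {
    if (utterance.speaker !== speaker) continue;
    const score = overlap(quote, utterance.text);
    if (score >= threshold && (best === null || score > best.score)) {
      best = { utterance, score };
    }
  }
  return best;
}
```

A match supplies both the verbatim transcript text and its `start` time, which is the kind of information timestamp-linked quotes need. A `null` result flags a quote that has no support in the speaker's own words.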
Per episode (rough): $0.80–$2.40 depending on length, mostly Deepgram + multi-pass Claude. Aggregation is ~$0.12 per person across 5 Claude calls.
`npm run costs` aggregates spending across every episode and profile from the per-episode `costs.json` ledgers.
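The roll-up across ledgers amounts to a fold over cost entries. The sketch below shows that idea on in-memory data. The `CostEntry` fields (`step`, `usd`) are hypothetical, since the actual `costs.json` layout is not shown in this README, and reading the files from disk is left out.

```typescript
// Hypothetical ledger entry; the actual costs.json layout is not shown here.
interface CostEntry { step: string; usd: number; }

interface CostSummary { totalUsd: number; byStep: Record<string, number>; }

// Round to four decimal places so sub-cent API charges are not lost.
const round4 = (n: number): number => Math.round(n * 1e4) / 1e4;

// Sums entries from any number of ledgers into a grand total and a per-step breakdown.
function summarizeCosts(ledgers: CostEntry[][]): CostSummary {
  const byStep: Record<string, number> = {};
  let total = 0;
  for (const ledger of ledgers) {
    for (const { step, usd } of ledger) {
      byStep[step] = (byStep[step] ?? 0) + usd;
      total += usd;
    }
  }
  for (const step of Object.keys(byStep)) {
    byStep[step] = round4(byStep[step] ?? 0);
  }
  return { totalUsd: round4(total), byStep };
}
```

Grouping by step makes it easy to check which stage dominates the per-episode figure quoted above.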
| File | What it covers |
|---|---|
| PODGRAPH_PIPELINE_GUIDE.md | Full CLI reference, prompts, data files, operational tips |
| podgraph-roadmap-revised.md | Source-of-truth architecture and full implementation plan |
| TODO.md | Current task list across all phases |
| CASE_STUDY.md | Portfolio narrative — problem, architecture, tradeoffs, challenges |
| CLAUDE.md | Project instructions for Claude / AI agents |
License: ISC.