A fully offline chat app for local LLMs via Ollama, with structure-aware document indexing, retrieval-augmented generation, runtime model switching, and structured reasoning modes. Drop a PDF, EPUB, DOCX, or text file into the window, attach it to a conversation, and query it with the model that best fits your hardware. No cloud, no telemetry, no internet required after setup.
Cortex is a single-file FastAPI app with an embedded HTML UI. It runs a chat interface against any Ollama model, persists conversations to local SQLite, and includes a complete document indexing pipeline so you can ground answers in your own books, papers, and notes.
The name reflects what the app does: it acts as an external cortex — memory (your indexed documents) and reasoning (a local LLM) brought together so you can think through complex material without anything leaving the machine.
Cortex works best with well-structured documents — books, textbooks, dissertations, peer-reviewed papers, and other content that has explicit chapter and section structure. At index time, Cortex reads each document's authored structure (PDF outlines, table-of-contents pages, EPUB chapters) and uses it to produce hierarchical summaries that the model sees alongside specific excerpts during retrieval. The result is answers that are grounded in both detail (verbatim passages) and context (what section the passage comes from and what that section is about). Unstructured documents still work, but the depth advantage of Cortex over naive RAG diminishes the less structure the source has.
- Fully offline — once Ollama and the models are installed, no internet is needed.
- Three model tiers, one executable — switch between 7B (fast), 14B (balanced), and 32B Q4_K_L (research-grade) at runtime via the sidebar dropdown. Your choice persists across launches.
- Structure-aware indexing — at index time, Cortex reads each document's authored structure (PDF outline, TOC, EPUB chapters) and produces section-level summaries that the model sees alongside specific excerpts during retrieval. Falls back to fixed-size sections when no structure is recoverable.
- Hierarchical retrieval context — retrieved excerpts arrive with their section summary attached, giving the model both detail and surrounding context.
- Reasoning modes — five structured prompt scaffolds (Default, Compare, Process, Cross-source, Critique) that force the model to produce organized intermediate output before its prose answer. Modes auto-disable on smaller models that can't handle them well.
- Multi-source RAG with guaranteed coverage — when multiple documents are attached, retrieval reserves slots per source so no book gets ignored, and the prompt explicitly instructs the model to use all attached sources.
- Built-in document indexer — drag and drop PDF, EPUB, DOCX, TXT, or Markdown into the window.
- Citation-aware prompting — the model is instructed to cite `[Title, p. N]` and not to invent citations.
- Streaming responses — token-by-token output via Server-Sent Events.
- Persistent history — conversations and attachments saved to local SQLite.
The indexing and retrieval pipeline:
- Document arrives via upload or drag-and-drop.
- Format-specific extractor pulls text with location metadata (PDF page numbers, EPUB chapter index, DOCX/text pseudo-pages of ~3000 characters).
- Structure extraction attempts to recover the document's authored structure: PDF outline → TOC page parsing → fall back to fixed-size sections.
- Chunking splits text into overlapping ~1000-character chunks with 200-character overlap.
- Embedding of each chunk via `nomic-embed-text` (768-dim).
- Hierarchical summarization runs the active LLM over each section to produce a concise summary stored alongside the embedding cache.
- Retrieval at query time: the question is embedded and cosine similarity ranks all chunks across all attached documents. Per-book minimum slots are reserved when multiple sources are attached (see the sketch after this list).
- Context assembly packs retrieved excerpts with their section summaries into a structured prompt; the model receives both detail and context.
- Generation with whatever reasoning mode is active produces the final response with citations.
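The chunking and retrieval steps are small enough to sketch. Below is a minimal illustration of overlapping chunking and cosine-similarity ranking with per-source reserved slots; the function names, data layout, and exact slot policy are illustrative, not Cortex's actual internals:

```python
import numpy as np

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split extracted text into overlapping ~`size`-character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(query_vec: np.ndarray, chunks: list[dict],
             top_k: int = 6, min_per_source: int = 1) -> list[dict]:
    """Rank chunks by cosine similarity against the query embedding.

    Each chunk dict holds 'embedding', 'source' (book id), and 'text'.
    With multiple sources attached, the best chunk(s) from every source
    are reserved first so no book gets ignored.
    """
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for i, c in enumerate(chunks):
        e = c["embedding"]
        scored.append((float(np.dot(q, e / np.linalg.norm(e))), i))
    scored.sort(reverse=True)  # best score first

    picked: list[int] = []
    sources = {c["source"] for c in chunks}
    if len(sources) > 1:
        for src in sources:  # guaranteed per-book coverage
            in_src = [i for _, i in scored if chunks[i]["source"] == src]
            picked.extend(in_src[:min_per_source])
    for _, i in scored:  # fill remaining slots with the globally best chunks
        if len(picked) >= top_k:
            break
        if i not in picked:
            picked.append(i)
    return [chunks[i] for i in picked[:top_k]]
```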
- Ollama (install)
- VRAM, depending on which tier you want to use:
- 8 GB for the Lite tier (7B model)
- 12 GB for the Standard tier (14B model)
- 24 GB for the Research tier (32B Q4_K_L)
- Lower VRAM still works via Ollama's CPU/GPU split, but expect slow generation
- ~5 GB disk for Lite-only setup; ~14 GB for Lite + Standard; ~34 GB for all three tiers
- Python 3.10+ (only required if running from source)
- Download `Cortex.exe` (Windows) from the Releases page.
- Install Ollama if you haven't already.
- Pull the embedder and at least one model tier:

  ```bash
  ollama pull nomic-embed-text   # required for retrieval (~270 MB)

  # Then pull whichever tier(s) you want — you can install all three:
  ollama pull qwen2.5:7b    # Lite tier (~4.7 GB)
  ollama pull qwen2.5:14b   # Standard tier (~9 GB)

  # For the Research tier, pull the high-fidelity Q4_K_L variant:
  ollama pull hf.co/bartowski/Qwen2.5-32B-Instruct-GGUF:Qwen2.5-32B-Instruct-Q4_K_L.gguf
  ```

- Double-click `Cortex.exe`. Cortex starts a local server and opens your default browser to the chat UI.
- Click the model name in the top-left of the sidebar to open the tier dropdown. Pick whichever model you want to use.
A log file at cortex.log next to the executable captures any errors — useful when filing bug reports.
| Tier | Model | Min VRAM | Best for |
|---|---|---|---|
| Lite | `qwen2.5:7b` | 8 GB | Daily use, quick lookups, fast response. RAG-grounded queries are strong; pure reasoning is the weakest of the three. |
| Standard | `qwen2.5:14b` | 12 GB | The balanced sweet spot. Strong reasoning at usable speed. Recommended default if you have the VRAM. |
| Research | `qwen2.5:32b` Q4_K_L | 24 GB (or 32+ GB system RAM) | Best precision, especially on technical and academic content. The Q4_K_L quantization preserves higher precision (Q6_K) on token embeddings and the output projection, giving sharper handling of specialized terminology and rare tokens. |
The tier dropdown shows install status for each one (✓ installed or ⚠ run: ollama pull ...).
```bash
git clone https://github.com/hamii31/Cortex.git
cd Cortex
ollama pull nomic-embed-text
ollama pull qwen2.5:7b   # or another tier
pip install fastapi uvicorn ollama numpy python-multipart \
    pypdf ebooklib beautifulsoup4 python-docx
python cortex.py
```

Open http://localhost:8000 if it doesn't open automatically.
```bash
pip install pyinstaller
python build_executable.py
```

The result lands in `dist/Cortex.exe` (Windows), `dist/Cortex` (Linux), or `dist/Cortex.app` (macOS). PyInstaller doesn't cross-compile, so build on the target platform.
Click the model name in the top-left of the sidebar to open the tier dropdown. Pick any tier with ✓ installed to switch to it. Cortex updates the active model immediately and refreshes the available reasoning modes.
Above the chat input is a row of mode pills. Click one to set the mode for your next message.
| Mode | What it does | Best for | Min tier |
|---|---|---|---|
| Default | No scaffold — direct answer | Simple lookups, factual questions | Lite |
| Compare | Forces a markdown comparison table before prose | "A vs B", tradeoffs, "best approach" questions | Lite |
| Process | Forces explicit state/step layout before prose | "How does X work", pathways, algorithms, system dynamics | Standard |
| Cross-source | Forces a cross-reference table across attached documents | Multi-document queries where you want all sources considered | Standard |
| Critique | Forces structured strengths/weaknesses analysis | Reviewing a plan, paper, code design, or proposal | Standard |
When multiple documents are attached and the active model is Standard or Research, Cortex silently promotes Default-mode queries to Cross-source mode. Modes that require strong instruction-following are hidden on the Lite tier — the 7B can't reliably produce the structured scaffolds those modes need.
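That selection rule is small enough to express directly. A hypothetical sketch, assuming the tier and mode names from the tables above; none of this is Cortex's actual code:

```python
TIER_RANK = {"lite": 0, "standard": 1, "research": 2}
# Minimum tier per mode, per the table above (names are illustrative).
MODE_MIN_TIER = {"default": "lite", "compare": "lite", "process": "standard",
                 "cross_source": "standard", "critique": "standard"}

def effective_mode(requested: str, tier: str, attached_docs: int) -> str:
    # Modes the active tier can't handle reliably fall back to Default.
    if TIER_RANK[tier] < TIER_RANK[MODE_MIN_TIER[requested]]:
        return "default"
    # Default is silently promoted to Cross-source on multi-document chats.
    if (requested == "default" and attached_docs > 1
            and TIER_RANK[tier] >= TIER_RANK["standard"]):
        return "cross_source"
    return requested
```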
Drag any supported file into the Cortex window. A progress bar appears in the sidebar showing extraction → chunking → embedding → summarizing → caching. When finished, the document slides into your library and is immediately queryable.
Cortex prefers documents with explicit structure. During indexing, the system attempts to recover each document's authored chapter and section layout in this order:
- PDF outline (bookmarks tree) — most modern PDF books and papers have this. Best results; see the sketch after this list.
- Table-of-contents page parsing — for PDFs without outlines, Cortex parses common TOC formats from text.
- Fixed-size fallback — for documents with no recoverable structure, sections are formed by grouping ~15 chunks at page boundaries.
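For the first strategy, pypdf (already in the dependency list) exposes the bookmark tree directly. A minimal sketch of flattening it into (depth, title, page) entries, with error handling and the TOC/fallback strategies omitted; the function name is illustrative:

```python
from pypdf import PdfReader

def pdf_outline(path: str) -> list[tuple[int, str, int]]:
    """Flatten a PDF's bookmark tree into (depth, title, page) tuples.

    An empty result is the cue to fall back to TOC-page parsing,
    and from there to fixed-size sections.
    """
    reader = PdfReader(path)

    def walk(items, depth=0):
        for item in items:
            if isinstance(item, list):  # nested subsections
                yield from walk(item, depth + 1)
            else:                       # a Destination entry
                yield (depth, item.title, reader.get_destination_page_number(item))

    return list(walk(reader.outline or []))
```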
| Format | Page semantics | Structure source |
|---|---|---|
| `.pdf` | Real PDF page numbers | Outline → TOC page → fallback |
| `.epub` | Chapter index | EPUB spine (chapters used directly as section units) |
| `.docx` | Pseudo-pages of ~3000 characters | Fallback only |
| `.txt`, `.md` | Pseudo-pages of ~3000 characters | Fallback only |
The log file (cortex.log) reports which strategy was used for each book — useful when you want to verify a textbook was properly structured. Documents with chapter-level structure (well-made PDF books, modern academic papers) produce noticeably better retrieval context than unstructured material.
- Click + new chat in the sidebar (or just start typing).
- Click ⊕ next to one or more books in the LIBRARY section to attach them.
- Pick a reasoning mode if appropriate.
- Type your question and press Enter.
When sources are attached, Cortex retrieves the top relevant excerpts and shows them in a small RAG panel above the AI's response. If a retrieved chunk belongs to a section with a summary, the section summary is shown to the model as context before the verbatim excerpts. The model is instructed to cite specific pages (not the summary) and to say "the source doesn't cover this" rather than fabricating.
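The assembled context is plain structured text. A rough sketch of its shape, where the field names and exact wording are assumptions; the real template lives in cortex.py:

```python
def build_context(excerpts: list[dict], question: str) -> str:
    """Pack section summaries and verbatim excerpts into one prompt block."""
    parts = []
    seen_sections = set()
    for ex in excerpts:
        key = (ex["title"], ex.get("section"))
        if ex.get("section_summary") and key not in seen_sections:
            seen_sections.add(key)
            parts.append(f"[{ex['title']}: {ex['section']}]\n"
                         f"Section summary: {ex['section_summary']}")
        parts.append(f"Excerpt ({ex['title']}, p. {ex['page']}):\n{ex['text']}")
    parts.append("Cite specific pages as [Title, p. N]. If the sources don't "
                 "cover the question, say so instead of inventing an answer.")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```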
If no documents are attached, Cortex behaves as a normal offline chat — no retrieval runs, no excerpts are injected.
Cortex originated as an extension of SmartReader, and earlier versions auto-detected SmartReader's cache directory at launch. This was removed in 1.2. SmartReader caches no longer appear in Cortex's library unless you explicitly opt in.
The reason: SmartReader caches were produced before Cortex's structure-aware indexing existed and lack the hierarchical section summaries that make Cortex queries work well. Surfacing them by default created duplicates with Cortex-native caches of the same books, with strictly inferior retrieval context.
Recommended migration: re-index any books you care about in Cortex by dragging the original files into the window. The new caches will use structure-aware sections and produce better answers.
Opt-in compatibility: if you still want SmartReader cache access (for example, you have indexed books that you don't have the original files for), set the environment variable:
```
CORTEX_SMARTREADER_CACHE=/path/to/SmartReader/cache
```

On Windows the default SmartReader location is `%APPDATA%\SmartReader\cache`. When set, SmartReader-indexed books appear in the library tagged `sr` and are read-only — Cortex won't modify them.
Configure via environment variables before launching:
| Variable | Default | Notes |
|---|---|---|
| `CORTEX_DEFAULT_TIER` | `lite` | Which tier is active on first launch (`lite`, `standard`, or `research`). Subsequent launches honor your last selection in the UI. |
| `CORTEX_MODEL` | (unset) | Power-user override: set to any Ollama model name to bypass the tier system entirely. The UI selector becomes disabled when this is set. |
| `CORTEX_EMBED_MODEL` | `nomic-embed-text` | Embedding model used for both indexing and retrieval. |
| `CORTEX_HOST` | `127.0.0.1` | Set to `0.0.0.0` to expose to your local network (no auth — be careful). |
| `CORTEX_PORT` | `8000` | HTTP port. |
| `CORTEX_TOP_K` | tier-dependent (4 for Lite, 6 for Standard/Research) | Number of chunks retrieved per query. |
| `CORTEX_LIBRARY` | platform-specific | Override the library cache directory. |
| `CORTEX_SMARTREADER_CACHE` | (unset, opt-in) | Path to a SmartReader cache directory to expose alongside Cortex's library. |
| `CORTEX_SKIP_SUMMARIES` | (unset) | Set to `1` to skip hierarchical summarization at index time. Speeds indexing but loses section context at query time. |
| OS | Library directory |
|---|---|
| Linux | ~/.config/cortex/library/ |
| macOS | ~/Library/Application Support/cortex/library/ |
| Windows | %APPDATA%\cortex\library\ |
conversations.db, cortex_state.json, and cortex.log live in the parent of the library directory.
Cortex exposes a small REST API. Use it from scripts, other tools, or to integrate Cortex's library into your own pipelines.
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/api/model` | Active model and config info |
| `GET` | `/api/model/tiers` | List available tiers and which is active |
| `POST` | `/api/model/switch` | Switch active tier (body: `{"tier": "lite"}`) |
| `GET` | `/api/modes` | List available reasoning modes (filtered by current tier) |
| `GET` | `/api/ollama/installed` | List models currently installed in Ollama |
| `GET` | `/api/library` | List indexed documents |
| `POST` | `/api/library/upload` | Upload a document for indexing |
| `POST` | `/api/library/{book_id}/summarize` | Retroactively generate hierarchical summaries for an existing cache |
| `GET` | `/api/library/jobs` | All current/recent indexing jobs and their status |
| `DELETE` | `/api/library/{book_id}` | Remove a document from the library (Cortex-managed only) |
| `GET` | `/api/conversations` | List recent conversations |
| `GET` | `/api/conversations/{cid}` | Get conversation messages and attachments |
| `DELETE` | `/api/conversations/{cid}` | Delete conversation |
| `POST` | `/api/conversations/{cid}/attach` | Attach a book |
| `DELETE` | `/api/conversations/{cid}/attach/{book_id}` | Detach a book |
| `POST` | `/api/chat` | Send a message; returns SSE stream of tokens |
| `POST` | `/api/heartbeat` | Browser keepalive |
| `POST` | `/api/shutdown` | Explicit clean shutdown |
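For example, a short client session with `requests`. The endpoints and the tier-switch body come from the table above; the `/api/chat` request fields and SSE payload format are assumptions, so check the handlers in cortex.py:

```python
import requests

BASE = "http://127.0.0.1:8000"

# Switch to the Standard tier and list the indexed library.
requests.post(f"{BASE}/api/model/switch", json={"tier": "standard"}).raise_for_status()
print(requests.get(f"{BASE}/api/library").json())

# Stream a chat response over SSE. The body fields here are assumed.
with requests.post(f"{BASE}/api/chat",
                   json={"conversation_id": None, "message": "Hello"},
                   stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            print(line[6:].decode(), end="", flush=True)
```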
Normal for unsigned executables. Click More info → Run anyway.
PyInstaller-packed executables sometimes trip antivirus heuristics. False positive — whitelist Cortex.exe or build from source.
Run from a terminal so you can see the error, or check cortex.log next to the executable. The most common cause is Ollama not being installed or not running.
Ollama isn't running. Start it: ollama serve.
The error message lists what is currently installed. Pull the missing tier with the command shown.
For a 900-page textbook, indexing takes ~20-40 minutes total: embedding the chunks (15-30 min) plus summarizing sections (5-10 min on Lite, longer on Research). The time is dominated by Ollama's per-call inference speed. Index in the background and avoid running large chat queries at the same time. To skip the section-summary step entirely, set `CORTEX_SKIP_SUMMARIES=1` before launching.
The model is too big for your GPU. Switch to a smaller tier in the dropdown. If you've recently crashed the Ollama runner, restart it (right-click the tray icon → Quit, then start again) — the GPU context can stay in a bad state until Ollama is fully restarted.
Process, Cross-source, and Critique modes require the Standard or Research tier. Switch to a higher tier in the dropdown to access them.
Re-index the book by dragging the original file in again, or call POST /api/library/{book_id}/summarize to generate summaries retroactively (without re-embedding). Note: retroactive summarization can't recover document structure (PDF outline, TOC), so the summaries will use fixed-size sections.
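From a script, that's two calls against the local API. A sketch, assuming the library listing exposes each book's id (the field name is an assumption):

```python
import requests

BASE = "http://127.0.0.1:8000"
book = requests.get(f"{BASE}/api/library").json()[0]  # pick the first book
requests.post(f"{BASE}/api/library/{book['id']}/summarize")  # 'id' field assumed
```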
The default embedding truncates each chunk to 500 characters. For technical documents with key info past the first 500 characters of a chunk, change `EMBED_TRUNCATE = 500` to `EMBED_TRUNCATE = 2000` near the top of the file and re-index.
The process should auto-shutdown ~30 seconds after you close the browser tab. If it doesn't:
```
# Linux
sudo ss -ltnp '( sport = :8000 )'
sudo kill <PID>

# Windows (PowerShell)
Get-NetTCPConnection -LocalPort 8000 | Select-Object OwningProcess
Stop-Process -Id <PID> -Force
```

- All processing is local. No data is sent to any external service.
- Ollama runs on `localhost`. Verify with `ss -tlnp | grep 11434` (Linux) or by disconnecting from the network and confirming queries still work.
- Uploaded documents live temporarily in `<library>/_uploads/` during indexing and are deleted after the cache is written. Only the embedded `.pkl` cache and the `.summaries.json` companion persist.
- Conversation history is plain SQLite — readable, exportable, deletable.
- The default bind is `127.0.0.1`, not exposed to your local network. If you set `CORTEX_HOST=0.0.0.0`, anyone on your LAN can hit the API; there is no authentication.
MIT.
- Ollama for the local LLM runtime.
- SmartReader — the predecessor project that established the offline RAG pattern Cortex extends.
- The Qwen team for open-weight models that make this practical.
- bartowski for high-quality GGUF quantizations.
- nomic-embed-text for embeddings.
