A fully offline chat app for local LLMs via Ollama, with structure-aware document indexing, retrieval-augmented generation, runtime model switching, and structured reasoning modes. Drop a PDF, EPUB, DOCX, or text file into the window, attach it to a conversation, and query it with the model that best fits your hardware. No cloud, no telemetry, no internet required after setup.
Cortex is a single-file FastAPI app with an embedded HTML UI. It runs a chat interface against any Ollama model, persists conversations to local SQLite, and includes a complete document indexing pipeline so you can ground answers in your own books, papers, and notes.
The name reflects what the app does: it acts as an external cortex — memory (your indexed documents) and reasoning (a local LLM) brought together so you can think through complex material without anything leaving the machine.
Cortex works best with well-structured documents — books, textbooks, dissertations, peer-reviewed papers, and other content that has explicit chapter and section structure. At index time, Cortex reads each document's authored structure (PDF outlines, table-of-contents pages, EPUB chapters) and uses it to produce hierarchical summaries that the model sees alongside specific excerpts during retrieval. The result is answers that are grounded in both detail (verbatim passages) and context (what section the passage comes from and what that section is about). Unstructured documents still work, but the depth advantage of Cortex over naive RAG diminishes the less structure the source has.
- Fully offline — once Ollama and the models are installed, no internet is needed.
- Three model tiers, one executable — switch between 7B (fast), 14B (balanced), and 32B Q4_K_L (research-grade) at runtime via the sidebar dropdown. Your choice persists across launches.
- Structure-aware indexing — at index time, Cortex reads each document's authored structure (PDF outline, TOC, EPUB chapters) and produces section-level summaries that the model sees alongside specific excerpts during retrieval. Falls back to fixed-size sections when no structure is recoverable.
- Hierarchical retrieval context — retrieved excerpts arrive with their section summary attached, giving the model both detail and surrounding context.
- Reasoning modes — five structured prompt scaffolds (Default, Compare, Process, Cross-source, Critique) that force the model to produce organized intermediate output before its prose answer. Modes auto-disable on smaller models that can't handle them well.
- Multi-source RAG with guaranteed coverage — when multiple documents are attached, retrieval reserves slots per source so no book gets ignored, and the prompt explicitly instructs the model to use all attached sources.
- Built-in document indexer — drag and drop PDF, EPUB, DOCX, TXT, or Markdown into the window.
- Citation-aware prompting — the model is instructed to cite `[Title, p. N]` and not to invent citations.
- Streaming responses — token-by-token output via Server-Sent Events.
- Persistent history — conversations and attachments saved to local SQLite.
The indexing and retrieval pipeline:
- Document arrives via upload or drag-and-drop.
- Format-specific extractor pulls text with location metadata (PDF page numbers, EPUB chapter index, DOCX/text pseudo-pages of ~3000 characters).
- Structure extraction attempts to recover the document's authored structure: PDF outline → TOC page parsing → fall back to fixed-size sections.
- Chunking splits text into overlapping ~1000-character chunks with 200-character overlap.
- Embedding of each chunk via `nomic-embed-text` (768-dim).
- Hierarchical summarization runs the active LLM over each section to produce a concise summary stored alongside the embedding cache.
- Retrieval at query time: the question is embedded and cosine similarity ranks all chunks across all attached documents. Per-book minimum slots are reserved when multiple sources are attached (see the sketch after this list).
- Context assembly packs retrieved excerpts with their section summaries into a structured prompt; the model receives both detail and context.
- Generation with whatever reasoning mode is active produces the final response with citations.
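The chunking and retrieval steps are small enough to sketch. Below is a minimal illustration of overlapping chunking and cosine-similarity ranking with per-source reserved slots; the function names, data layout, and exact slot policy are illustrative, not Cortex's actual internals:

```python
import numpy as np

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split extracted text into overlapping ~`size`-character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(query_vec: np.ndarray, chunks: list[dict],
             top_k: int = 6, min_per_source: int = 1) -> list[dict]:
    """Rank chunks by cosine similarity against the query embedding.

    Each chunk dict holds 'embedding', 'source' (book id), and 'text'.
    With multiple sources attached, the best chunk(s) from every source
    are reserved first so no book gets ignored.
    """
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for i, c in enumerate(chunks):
        e = c["embedding"]
        scored.append((float(np.dot(q, e / np.linalg.norm(e))), i))
    scored.sort(reverse=True)  # best score first

    picked: list[int] = []
    sources = {c["source"] for c in chunks}
    if len(sources) > 1:
        for src in sources:  # guaranteed per-book coverage
            in_src = [i for _, i in scored if chunks[i]["source"] == src]
            picked.extend(in_src[:min_per_source])
    for _, i in scored:  # fill remaining slots with the globally best chunks
        if len(picked) >= top_k:
            break
        if i not in picked:
            picked.append(i)
    return [chunks[i] for i in picked[:top_k]]
```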
- Ollama (install)
- VRAM, depending on which tier you want to use:
- 8 GB for the Lite tier (7B model)
- 12 GB for the Standard tier (14B model)
- 24 GB for the Research tier (32B Q4_K_L)
- Lower VRAM still works via Ollama's CPU/GPU split, but expect slow generation
- ~5 GB disk for Lite-only setup; ~14 GB for Lite + Standard; ~34 GB for all three tiers
- Python 3.10+ (only required if running from source)
- Download `Cortex.exe` (Windows) from the Releases page.
- Install Ollama if you haven't already.
- Pull the embedder and at least one model tier:

  ```bash
  ollama pull nomic-embed-text   # required for retrieval (~270 MB)

  # Then pull whichever tier(s) you want — you can install all three:
  ollama pull qwen2.5:7b    # Lite tier (~4.7 GB)
  ollama pull qwen2.5:14b   # Standard tier (~9 GB)

  # For the Research tier, pull the high-fidelity Q4_K_L variant:
  ollama pull hf.co/bartowski/Qwen2.5-32B-Instruct-GGUF:Qwen2.5-32B-Instruct-Q4_K_L.gguf
  ```

- Double-click `Cortex.exe`. Cortex starts a local server and opens your default browser to the chat UI.
- Click the model name in the top-left of the sidebar to open the tier dropdown. Pick whichever model you want to use.
A log file at cortex.log next to the executable captures any errors — useful when filing bug reports.
| Tier | Model | Min VRAM | Best for |
|---|---|---|---|
| Lite | `qwen2.5:7b` | 8 GB | Daily use, quick lookups, fast response. RAG-grounded queries are strong; pure reasoning is the weakest of the three. |
| Standard | `qwen2.5:14b` | 12 GB | The balanced sweet spot. Strong reasoning at usable speed. Recommended default if you have the VRAM. |
| Research | `qwen2.5:32b` Q4_K_L | 24 GB (or 32+ GB system RAM) | Best precision, especially on technical and academic content. The Q4_K_L quantization preserves higher precision (Q6_K) on token embeddings and the output projection, giving sharper handling of specialized terminology and rare tokens. |
The tier dropdown shows install status for each one (✓ installed or ⚠ run: ollama pull ...).
```bash
git clone https://github.com/hamii31/Cortex.git
cd Cortex
ollama pull nomic-embed-text
ollama pull qwen2.5:7b   # or another tier
pip install fastapi uvicorn ollama numpy python-multipart \
    pypdf ebooklib beautifulsoup4 python-docx
python cortex.py
```

Open http://localhost:8000 if it doesn't open automatically.
```bash
pip install pyinstaller
python build_executable.py
```

The result lands in `dist/Cortex.exe` (Windows), `dist/Cortex` (Linux), or `dist/Cortex.app` (macOS). PyInstaller doesn't cross-compile, so build on the target platform.
Click the model name in the top-left of the sidebar to open the tier dropdown. Pick any tier with ✓ installed to switch to it. Cortex updates the active model immediately and refreshes the available reasoning modes.
Above the chat input is a row of mode pills. Click one to set the mode for your next message.
| Mode | What it does | Best for | Min tier |
|---|---|---|---|
| Default | No scaffold — direct answer | Simple lookups, factual questions | Lite |
| Compare | Forces a markdown comparison table before prose | "A vs B", tradeoffs, "best approach" questions | Lite |
| Process | Forces explicit state/step layout before prose | "How does X work", pathways, algorithms, system dynamics | Standard |
| Cross-source | Forces a cross-reference table across attached documents | Multi-document queries where you want all sources considered | Standard |
| Critique | Forces structured strengths/weaknesses analysis | Reviewing a plan, paper, code design, or proposal | Standard |
When multiple documents are attached and the active model is Standard or Research, Cortex silently promotes Default-mode queries to Cross-source mode. Modes that require strong instruction-following are hidden on the Lite tier — the 7B can't reliably produce the structured scaffolds those modes need.
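That selection rule is small enough to express directly. A hypothetical sketch, assuming the tier and mode names from the tables above; none of this is Cortex's actual code:

```python
TIER_RANK = {"lite": 0, "standard": 1, "research": 2}
# Minimum tier per mode, per the table above (names are illustrative).
MODE_MIN_TIER = {"default": "lite", "compare": "lite", "process": "standard",
                 "cross_source": "standard", "critique": "standard"}

def effective_mode(requested: str, tier: str, attached_docs: int) -> str:
    # Modes the active tier can't handle reliably fall back to Default.
    if TIER_RANK[tier] < TIER_RANK[MODE_MIN_TIER[requested]]:
        return "default"
    # Default is silently promoted to Cross-source on multi-document chats.
    if (requested == "default" and attached_docs > 1
            and TIER_RANK[tier] >= TIER_RANK["standard"]):
        return "cross_source"
    return requested
```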
Drag any supported file into the Cortex window. A progress bar appears in the sidebar showing extraction → chunking → embedding → summarizing → caching. When finished, the document slides into your library and is immediately queryable.
Cortex prefers documents with explicit structure. During indexing, the system attempts to recover each document's authored chapter and section layout in this order:
- PDF outline (bookmarks tree) — most modern PDF books and papers have this. Best results; see the sketch after this list.
- Table-of-contents page parsing — for PDFs without outlines, Cortex parses common TOC formats from text.
- Fixed-size fallback — for documents with no recoverable structure, sections are formed by grouping ~15 chunks at page boundaries.
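For the first strategy, pypdf (already in the dependency list) exposes the bookmark tree directly. A minimal sketch of flattening it into (depth, title, page) entries, with error handling and the TOC/fallback strategies omitted; the function name is illustrative:

```python
from pypdf import PdfReader

def pdf_outline(path: str) -> list[tuple[int, str, int]]:
    """Flatten a PDF's bookmark tree into (depth, title, page) tuples.

    An empty result is the cue to fall back to TOC-page parsing,
    and from there to fixed-size sections.
    """
    reader = PdfReader(path)

    def walk(items, depth=0):
        for item in items:
            if isinstance(item, list):  # nested subsections
                yield from walk(item, depth + 1)
            else:                       # a Destination entry
                yield (depth, item.title, reader.get_destination_page_number(item))

    return list(walk(reader.outline or []))
```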
| Format | Page semantics | Structure source |
|---|---|---|
| `.pdf` | Real PDF page numbers | Outline → TOC page → fallback |
| `.epub` | Chapter index | EPUB spine (chapters used directly as section units) |
| `.docx` | Pseudo-pages of ~3000 characters | Fallback only |
| `.txt`, `.md` | Pseudo-pages of ~3000 characters | Fallback only |
The log file (cortex.log) reports which strategy was used for each book — useful when you want to verify a textbook was properly structured. Documents with chapter-level structure (well-made PDF books, modern academic papers) produce noticeably better retrieval context than unstructured material.
- Click + new chat in the sidebar (or just start typing).
- Click ⊕ next to one or more books in the LIBRARY section to attach them.
- Pick a reasoning mode if appropriate.
- Type your question and press Enter.
When sources are attached, Cortex retrieves the top relevant excerpts and shows them in a small RAG panel above the AI's response. If a retrieved chunk belongs to a section with a summary, the section summary is shown to the model as context before the verbatim excerpts. The model is instructed to cite specific pages (not the summary) and to say "the source doesn't cover this" rather than fabricating.
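The assembled context is plain structured text. A rough sketch of its shape, where the field names and exact wording are assumptions; the real template lives in cortex.py:

```python
def build_context(excerpts: list[dict], question: str) -> str:
    """Pack section summaries and verbatim excerpts into one prompt block."""
    parts = []
    seen_sections = set()
    for ex in excerpts:
        key = (ex["title"], ex.get("section"))
        if ex.get("section_summary") and key not in seen_sections:
            seen_sections.add(key)
            parts.append(f"[{ex['title']}: {ex['section']}]\n"
                         f"Section summary: {ex['section_summary']}")
        parts.append(f"Excerpt ({ex['title']}, p. {ex['page']}):\n{ex['text']}")
    parts.append("Cite specific pages as [Title, p. N]. If the sources don't "
                 "cover the question, say so instead of inventing an answer.")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```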
If no documents are attached, Cortex behaves as a normal offline chat — no retrieval runs, no excerpts are injected.
Cortex originated as an extension of SmartReader, and earlier versions auto-detected SmartReader's cache directory at launch. This was removed in 1.2. SmartReader caches no longer appear in Cortex's library unless you explicitly opt in.
The reason: SmartReader caches were produced before Cortex's structure-aware indexing existed and lack the hierarchical section summaries that make Cortex queries work well. Surfacing them by default created duplicates with Cortex-native caches of the same books, with strictly inferior retrieval context.
Recommended migration: re-index any books you care about in Cortex by dragging the original files into the window. The new caches will use structure-aware sections and produce better answers.
Opt-in compatibility: if you still want SmartReader cache access (for example, you have indexed books that you don't have the original files for), set the environment variable:
```
CORTEX_SMARTREADER_CACHE=/path/to/SmartReader/cache
```

On Windows the default SmartReader location is `%APPDATA%\SmartReader\cache`. When set, SmartReader-indexed books appear in the library tagged `sr` and are read-only — Cortex won't modify them.
Configure via environment variables before launching:
| Variable | Default | Notes |
|---|---|---|
| `CORTEX_DEFAULT_TIER` | `lite` | Which tier is active on first launch (`lite`, `standard`, or `research`). Subsequent launches honor your last selection in the UI. |
| `CORTEX_MODEL` | (unset) | Power-user override: set to any Ollama model name to bypass the tier system entirely. The UI selector becomes disabled when this is set. |
| `CORTEX_EMBED_MODEL` | `nomic-embed-text` | Embedding model used for both indexing and retrieval. |
| `CORTEX_HOST` | `127.0.0.1` | Set to `0.0.0.0` to expose to your local network (no auth — be careful). |
| `CORTEX_PORT` | `8000` | HTTP port. |
| `CORTEX_TOP_K` | tier-dependent (4 for Lite, 6 for Standard/Research) | Number of chunks retrieved per query. |
| `CORTEX_LIBRARY` | platform-specific | Override the library cache directory. |
| `CORTEX_SMARTREADER_CACHE` | (unset, opt-in) | Path to a SmartReader cache directory to expose alongside Cortex's library. |
| `CORTEX_SKIP_SUMMARIES` | (unset) | Set to `1` to skip hierarchical summarization at index time. Speeds indexing but loses section context at query time. |
| OS | Library directory |
|---|---|
| Linux | ~/.config/cortex/library/ |
| macOS | ~/Library/Application Support/cortex/library/ |
| Windows | %APPDATA%\cortex\library\ |
conversations.db, cortex_state.json, and cortex.log live in the parent of the library directory.
Cortex exposes a small REST API. Use it from scripts, other tools, or to integrate Cortex's library into your own pipelines.
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/api/model` | Active model and config info |
| `GET` | `/api/model/tiers` | List available tiers and which is active |
| `POST` | `/api/model/switch` | Switch active tier (body: `{"tier": "lite"}`) |
| `GET` | `/api/modes` | List available reasoning modes (filtered by current tier) |
| `GET` | `/api/ollama/installed` | List models currently installed in Ollama |
| `GET` | `/api/library` | List indexed documents |
| `POST` | `/api/library/upload` | Upload a document for indexing |
| `POST` | `/api/library/{book_id}/summarize` | Retroactively generate hierarchical summaries for an existing cache |
| `GET` | `/api/library/jobs` | All current/recent indexing jobs and their status |
| `DELETE` | `/api/library/{book_id}` | Remove a document from the library (Cortex-managed only) |
| `GET` | `/api/conversations` | List recent conversations |
| `GET` | `/api/conversations/{cid}` | Get conversation messages and attachments |
| `DELETE` | `/api/conversations/{cid}` | Delete conversation |
| `POST` | `/api/conversations/{cid}/attach` | Attach a book |
| `DELETE` | `/api/conversations/{cid}/attach/{book_id}` | Detach a book |
| `POST` | `/api/chat` | Send a message; returns SSE stream of tokens |
| `POST` | `/api/heartbeat` | Browser keepalive |
| `POST` | `/api/shutdown` | Explicit clean shutdown |
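For example, a short client session with `requests`. The endpoints and the tier-switch body come from the table above; the `/api/chat` request fields and SSE payload format are assumptions, so check the handlers in cortex.py:

```python
import requests

BASE = "http://127.0.0.1:8000"

# Switch to the Standard tier and list the indexed library.
requests.post(f"{BASE}/api/model/switch", json={"tier": "standard"}).raise_for_status()
print(requests.get(f"{BASE}/api/library").json())

# Stream a chat response over SSE. The body fields here are assumed.
with requests.post(f"{BASE}/api/chat",
                   json={"conversation_id": None, "message": "Hello"},
                   stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            print(line[6:].decode(), end="", flush=True)
```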
Normal for unsigned executables. Click More info → Run anyway.
PyInstaller-packed executables sometimes trip antivirus heuristics. False positive — whitelist Cortex.exe or build from source.
Run from a terminal so you can see the error, or check cortex.log next to the executable. The most common cause is Ollama not being installed or not running.
Ollama isn't running. Start it: ollama serve.
The error message lists what is currently installed. Pull the missing tier with the command shown.
For a 900-page textbook, indexing takes ~20-40 minutes total: embedding the chunks (15-30 min) plus summarizing sections (5-10 min on Lite, longer on Research). The time is dominated by Ollama's per-call inference speed. Index in the background and avoid running large chat queries at the same time. To skip the section-summary step entirely, set `CORTEX_SKIP_SUMMARIES=1` before launching.
The model is too big for your GPU. Switch to a smaller tier in the dropdown. If you've recently crashed the Ollama runner, restart it (right-click the tray icon → Quit, then start again) — the GPU context can stay in a bad state until Ollama is fully restarted.
Process, Cross-source, and Critique modes require the Standard or Research tier. Switch to a higher tier in the dropdown to access them.
Re-index the book by dragging the original file in again, or call POST /api/library/{book_id}/summarize to generate summaries retroactively (without re-embedding). Note: retroactive summarization can't recover document structure (PDF outline, TOC), so the summaries will use fixed-size sections.
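From a script, that's two calls against the local API. A sketch, assuming the library listing exposes each book's id (the field name is an assumption):

```python
import requests

BASE = "http://127.0.0.1:8000"
book = requests.get(f"{BASE}/api/library").json()[0]  # pick the first book
requests.post(f"{BASE}/api/library/{book['id']}/summarize")  # 'id' field assumed
```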
The default embedding truncates each chunk to 500 characters. For technical documents with key info past the first 500 characters of a chunk, change `EMBED_TRUNCATE = 500` to `EMBED_TRUNCATE = 2000` near the top of the file and re-index.
The process should auto-shutdown ~30 seconds after you close the browser tab. If it doesn't:
```
# Linux
sudo ss -ltnp '( sport = :8000 )'
sudo kill <PID>

# Windows (PowerShell)
Get-NetTCPConnection -LocalPort 8000 | Select-Object OwningProcess
Stop-Process -Id <PID> -Force
```

- All processing is local. No data is sent to any external service.
- Ollama runs on `localhost`. Verify with `ss -tlnp | grep 11434` (Linux) or by disconnecting from the network and confirming queries still work.
- Uploaded documents live temporarily in `<library>/_uploads/` during indexing and are deleted after the cache is written. Only the embedded `.pkl` cache and the `.summaries.json` companion persist.
- Conversation history is plain SQLite — readable, exportable, deletable.
- The default bind is `127.0.0.1`, not exposed to your local network. If you set `CORTEX_HOST=0.0.0.0`, anyone on your LAN can hit the API; there is no authentication.
MIT.
- Ollama for the local LLM runtime.
- SmartReader — the predecessor project that established the offline RAG pattern Cortex extends.
- The Qwen team for open-weight models that make this practical.
- bartowski for high-quality GGUF quantizations.
- nomic-embed-text for embeddings.
