Minimal RAG (Retrieval-Augmented Generation) for the console. Uses Ollama for local LLM and embeddings, and pgvector for vector storage.
Licensed under GNU GPL v3: free for learning and open source projects. Commercial use is permitted, but any derivative work you distribute must also be released under GPL v3.
Most organizations have knowledge trapped in documents — manuals, contracts, reports, policies — that nobody reads because finding the right answer takes too long. miniRAG lets you talk to those documents in plain language and get direct answers, without sending your data to any external service.
Concrete use cases:
- HR & internal policies — "How many vacation days do I get after my first year?" instead of scrolling through a 40-page handbook
- Legal & contracts — "What are the termination clauses in this agreement?" without reading the full document
- Technical manuals — "What does error code E-401 mean and how do I fix it?" across hundreds of pages of documentation
- Financial reports — "What was the operating margin in Q3 and what explains the variance?" from dense PDF reports
- Customer support — Answer repetitive questions automatically from your own product documentation
- Compliance & audits — "Does our policy cover this scenario?" with traceable answers pointing to the source fragment
Why local? Everything runs on your own machine. Documents never leave your infrastructure — no OpenAI, no cloud APIs, no data exposure.
- LLM & Embeddings — Ollama (local)
- Vector DB — PostgreSQL + pgvector (HNSW cosine index)
- ORM — SQLAlchemy
- PDF parsing — pdfplumber
- Package manager — uv
# 1. Pull the models in Ollama
ollama pull qwen2.5:14b
ollama pull nomic-embed-text
# 2. Start pgvector
docker compose up -d
# 3. Install dependencies
uv sync
# 4. Configure environment
cp .env.example .env
# edit .env as needed

# Index a file (.txt or .pdf)
uv run minirag -f document.pdf
# Ask a question
uv run minirag -q "What is the vacation policy?"
# Index and query in one command
uv run minirag -f document.pdf -q "What is the vacation policy?"
# Run diagnostics on a question
uv run minirag -d "What is the vacation policy?"
# Suppress pipeline output
uv run minirag -q "What is the vacation policy?" --no-verbose

All settings are controlled via .env:
| Variable | Default | Description |
|---|---|---|
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL |
| OLLAMA_LLM_MODEL | qwen2.5:14b | Model for generation |
| OLLAMA_EMBEDDING_MODEL | nomic-embed-text | Model for embeddings |
| EMBEDDING_DIM | 768 | Embedding dimensions |
| TEMPERATURE | 0.1 | LLM temperature |
| LANGUAGE | es | Prompt language (en or es) |
| N_RESULTS | 6 | Number of fragments to retrieve |
| MIN_SIMILARITY | 0.5 | Minimum cosine similarity threshold |
| CHUNK_SIZE | 500 | Characters per chunk |
| CHUNK_OVERLAP | 50 | Overlap between chunks (characters) |
| RETRIEVE_MODE | simple | Retrieval strategy (simple, hyde, multi) |
| MULTI_N | 3 | Number of query variants for multi mode |
| RERANKER_ENABLED | false | Enable reranker after retrieval |
| RERANKER_MODEL | qllama/bge-reranker-v2-m3:f16 | Reranker model via Ollama |
| RERANKER_TOP_N | 6 | Final fragments after reranking |
| RERANKER_CANDIDATES | 20 | Candidates fetched before reranking |
| POSTGRES_HOST | localhost | Database host |
| POSTGRES_PORT | 5432 | Database port |
| POSTGRES_DB | minirag | Database name |
| POSTGRES_USER | minirag | Database user |
| POSTGRES_PASSWORD | minirag | Database password |
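
As a rough illustration of how a few of these settings could be read, the sketch below pulls them from the environment and falls back to the defaults in the table. It assumes the values in .env have already been exported into the process environment. The `Settings` dataclass and `load_settings` helper are hypothetical names for this example, not miniRAG's actual Config code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the .env settings."""
    ollama_base_url: str
    n_results: int
    min_similarity: float
    chunk_size: int
    chunk_overlap: int
    reranker_enabled: bool


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default).

    Missing keys fall back to the defaults documented in the table.
    """
    env = os.environ if env is None else env
    return Settings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        n_results=int(env.get("N_RESULTS", "6")),
        min_similarity=float(env.get("MIN_SIMILARITY", "0.5")),
        chunk_size=int(env.get("CHUNK_SIZE", "500")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "50")),
        # Booleans arrive as strings; only "true" (any case) enables the flag.
        reranker_enabled=env.get("RERANKER_ENABLED", "false").strip().lower() == "true",
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check, for example `load_settings({"CHUNK_SIZE": "800"}).chunk_size`.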
miniRAG/
├── main.py # CLI entry point
├── utils.py # All core logic
│ ├── Config # Env vars and constants
│ ├── ORM # SQLAlchemy Doc model + pgvector schema
│ ├── Database # index(), _query_vector(), get_collection()
│ ├── Ollama # embedded(), call_llm(), rerank()
│ ├── Prompt # build_prompt(), _load_templates()
│ ├── RAG pipeline # rag(), retrieve(), diagnose_rag()
│ └── File ingestion # load_file(), _chunk_text()
├── prompt_templates.json # Prompt templates (en/es)
├── docker-compose.yml
├── pyproject.toml
└── .env
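
To make the roles of CHUNK_SIZE and CHUNK_OVERLAP concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an assumption about what a helper like `_chunk_text()` might do, not a copy of the code in utils.py; the function name `chunk_text` and its parameters are illustrative.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a chunk boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With the defaults, a 1000-character text yields three chunks starting at offsets 0, 450 and 900.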
SIMPLE — best for: technical docs, FAQs, manuals, direct questions
question ──► embed(question) ──► pgvector search ──► build_prompt ──► LLM ──► answer
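
The search step in simple mode can be pictured with the sketch below. In miniRAG the search runs inside pgvector; this pure-Python stand-in only illustrates what a cosine search limited by N_RESULTS and MIN_SIMILARITY amounts to. The `search` function and the in-memory `fragments` list are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, fragments, n_results=6, min_similarity=0.5):
    """Return up to n_results (text, score) pairs, best first.

    fragments is a list of (text, vector) pairs standing in for the rows
    stored in the vector table.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in fragments]
    kept = [item for item in scored if item[1] >= min_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:n_results]
```

Fragments scoring below the threshold are dropped before the top-N cut, so a query with no good match can return fewer than N_RESULTS fragments, or none.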
HYDE — best for: narrative documents, open-ended questions, literary texts
                  ┌─► LLM (hypothetical answer) ──► embed(answer) ──┐
question ─────────┤                                                 ├──► pgvector search ──► build_prompt ──► LLM ──► answer
                  └─────────────────────────────────────────────────┘
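
A minimal sketch of the HyDE retrieval step follows, reading the diagram as "embed the hypothetical answer instead of the question". The three callables are injected so the example stays self-contained; in a real setup they would wrap the LLM call, the embedding call and the vector search. The name `hyde_retrieve` and the prompt wording are assumptions, not miniRAG's code.

```python
from typing import Callable, Sequence


def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], list],
) -> list:
    """Retrieve fragments using the embedding of a hypothetical answer."""
    # 1. Ask the LLM to draft an answer; it may be wrong, it only needs to
    #    resemble the kind of passage that would contain the real answer.
    hypothetical = generate(
        f"Write a short passage that answers the question: {question}"
    )
    # 2. Embed the draft and search with that vector.
    return search(embed(hypothetical))
```

The idea is that a drafted answer tends to sit closer in embedding space to the stored passages than a short question does, which is why this mode suits narrative or open-ended material.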
MULTI — best for: ambiguous questions, mixed documents, when simple/hyde miss results
                  ┌─► variant 1 ──► embed ──► pgvector search ──┐
question ──► LLM ─┼─► variant 2 ──► embed ──► pgvector search ──┼──► merge & dedupe ──► build_prompt ──► LLM ──► answer
                  └─► variant N ──► embed ──► pgvector search ──┘
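
The "merge & dedupe" step could look like the sketch below. The diagram does not say how duplicates are resolved, so keeping the highest score per fragment is an assumption, as are the `merge_results` name and the `(fragment_id, text, score)` tuple shape.

```python
def merge_results(result_lists, n_results=6):
    """Merge per-variant results into one ranked list without duplicates.

    Each inner list holds (fragment_id, text, score) tuples. When the same
    fragment is returned for several variants, its best score is kept.
    """
    best = {}
    for results in result_lists:
        for frag_id, text, score in results:
            if frag_id not in best or score > best[frag_id][2]:
                best[frag_id] = (frag_id, text, score)
    ranked = sorted(best.values(), key=lambda item: item[2], reverse=True)
    return ranked[:n_results]
```

For example, if fragment 2 scores 0.6 for one variant and 0.8 for another, it appears once in the merged list with score 0.8.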
+ RERANKER (optional, any mode) — best for: noisy retrieval results or low similarity scores
retrieve() ──► top RERANKER_CANDIDATES ──► bge-reranker cosine ──► top RERANKER_TOP_N ──► build_prompt ──► LLM ──► answer
The reranker uses qllama/bge-reranker-v2-m3 via Ollama as a cosine similarity scorer.
This is not a true reranker — the model's classification head is lost in the GGUF conversion.
It does however produce better relevance scores than nomic-embed-text because its vector
space is trained specifically for query-document relevance.
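
The cosine-based rescoring described above can be sketched as follows. The `embed` callable stands in for a call to the reranker model's embedding output; `rerank_by_cosine` is a hypothetical name and is not claimed to match the signature of `rerank()` in utils.py.

```python
import math
from typing import Callable, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_by_cosine(
    question: str,
    candidates: list[str],
    embed: Callable[[str], Sequence[float]],
    top_n: int = 6,
) -> list[tuple[str, float]]:
    """Re-score candidate fragments against the question and keep the top_n.

    Both the question and every candidate are embedded with the same model,
    then ordered by cosine similarity, best first.
    """
    q_vec = embed(question)
    scored = [(text, _cosine(q_vec, embed(text))) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

In this scheme the first-stage search would fetch RERANKER_CANDIDATES fragments and the rescoring would trim them to RERANKER_TOP_N.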
To enable it, pull the model and set RERANKER_ENABLED=true in .env:
ollama pull qllama/bge-reranker-v2-m3:f16

For production-grade reranking, replace the body of rerank() in utils.py with the
HuggingFace version documented in the function's docstring. It uses the classification head
directly and produces a true relevance score per (question, fragment) pair.
Requirements:
uv add transformers torch

The function signature is identical (same inputs, same output), so it is a drop-in replacement. No other code changes are needed.