A modular multimodal Retrieval-Augmented Generation (RAG) system built for scalable real-world AI applications.
Supports:
- local knowledge retrieval
- OCR ingestion
- audio transcription
- live web retrieval
- hybrid multimodal querying
RAG Engine has evolved from a text-only prototype into a hybrid multimodal retrieval system capable of combining:
- uploaded documents
- OCR image extraction
- audio transcription
- live internet retrieval
through a unified semantic retrieval pipeline.
Retrieval:
- Semantic similarity search
- FAISS vector database
- Dynamic Top-K retrieval
- Local + web retrieval routing
- Structured retrieval context assembly
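
As a rough illustration of semantic similarity search with a dynamic Top-K, the sketch below ranks chunk vectors by cosine similarity in plain Python. It is a stand-in for the FAISS index the project uses, not the project's retriever; the function names `cosine` and `top_k` and the toy 3-dimensional vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k):
    """Return (index, score) pairs for the k most similar chunks."""
    k = max(1, min(k, len(chunk_vecs)))  # clamp K to the index size
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384 dimensions.
chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
hits = top_k([1.0, 0.1, 0.0], chunks, k=2)
print([i for i, _ in hits])  # -> [0, 2]
```

Clamping `k` keeps a request for more results than the index holds from failing, which is one plausible reading of "dynamic" Top-K.
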
Document uploads:
- Runtime document uploads
- Automatic chunking
- Embedding generation
- Dynamic vector index rebuilding
- Upload observability
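
Automatic chunking can be sketched as a sliding window over words with a small overlap, so that sentences cut at a boundary still appear whole in one chunk. The function name `chunk_text` and the default sizes are assumptions for illustration; the project's `chunking.py` may split differently.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and added to the vector index, which is why an upload triggers an index rebuild.
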
OCR ingestion supports:
- .png
- .jpg
- .jpeg
Powered by:
- pytesseract
- Pillow
Capabilities:
- scanned text extraction
- screenshot ingestion
- printed English OCR
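
A minimal sketch of the OCR step, assuming pytesseract, Pillow, and the Tesseract binary are installed. The function name `extract_text_from_image` and the whitespace normalization are illustrative choices, not necessarily what `ocr_extractor.py` does.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}

def extract_text_from_image(path):
    """Run English OCR on an image and return whitespace-normalized text."""
    if Path(path).suffix.lower() not in IMAGE_EXTENSIONS:
        raise ValueError(f"unsupported image type: {path}")
    # Imported lazily so this module loads even where OCR is not installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        raw = pytesseract.image_to_string(image, lang="eng")
    # Collapse the ragged line breaks OCR output tends to contain.
    return " ".join(raw.split())
```

Because whitespace is collapsed, the page layout is discarded, and only the recognized text reaches the chunking step.
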
Audio ingestion supports:
- .mp3
- .wav
- .m4a
Powered by:
- Whisper (tiny)
- FFmpeg
Capabilities:
- speech-to-text ingestion
- transcript indexing
- audio-based retrieval
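
Speech-to-text ingestion might look like the sketch below, assuming the `openai-whisper` package and FFmpeg are installed. The function name `transcribe_audio` and the up-front checks are hypothetical additions for illustration.

```python
import shutil
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}

def transcribe_audio(path, model_name="tiny"):
    """Transcribe an audio file with Whisper and return the transcript text."""
    if Path(path).suffix.lower() not in AUDIO_EXTENSIONS:
        raise ValueError(f"unsupported audio type: {path}")
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("FFmpeg not found on PATH; Whisper needs it to decode audio")
    # Imported lazily: loading Whisper is slow and pulls in heavy dependencies.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(str(path))
    return result["text"].strip()
```

This sketch reloads the model on every call for simplicity; a long-running service would more likely load it once and reuse it. The returned transcript can be chunked and indexed like any other text.
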
Live web retrieval runs as a real-time internet augmentation pipeline:
User Query
↓
Web Search
↓
URL Extraction
↓
Webpage Fetching
↓
Content Extraction
↓
Chunking
↓
Structured Web Context
↓
LLM Response
Powered by:
- DDGS
- Requests
- Trafilatura
Features:
- live internet retrieval
- semantic webpage extraction
- structured web context
- web source tracking
- retrieval observability
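
The web pipeline above can be sketched in three small functions. This assumes the `ddgs`, `requests`, and `trafilatura` packages are installed; the function names are hypothetical, and simple truncation stands in for the chunking step to keep the example short.

```python
def search_urls(query, max_results=3):
    """Return result URLs for a query via DDGS (network call)."""
    from ddgs import DDGS  # assumption: the `ddgs` package is the one in use
    return [hit["href"] for hit in DDGS().text(query, max_results=max_results)]

def fetch_page_text(url, timeout=10):
    """Download a page and return its main text (None if nothing is extracted)."""
    import requests
    import trafilatura
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return trafilatura.extract(response.text)

def build_web_context(pages, max_chars=500):
    """Format (url, text) pairs as numbered, source-tagged context blocks."""
    kept = [(url, text) for url, text in pages if text]  # drop empty extractions
    blocks = [
        f"[Web {number}] {url}\n{text[:max_chars]}"
        for number, (url, text) in enumerate(kept, start=1)
    ]
    return "\n\n".join(blocks)
```

A caller would chain them, for example `build_web_context([(u, fetch_page_text(u)) for u in search_urls(question)])`. Keeping the URL beside each block is what makes web source tracking possible in the final answer.
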
The frontend was redesigned from a developer utility UI into a modern AI assistant interface.
Features:
- chat-style interaction
- modality-aware UI
- upload progress tracking
- web search configuration dialog
- retrieval observability panels
- source inspection system
- interactive modality badges
- slash command support
FILE / WEB QUERY
↓
Extractor Router
├── TXT Extractor
├── OCR Extractor
├── Audio Extractor
└── Web Retriever
↓
Normalized Text
↓
Chunking
↓
Embedding Generation
↓
FAISS Indexing
↓
Retriever Engine
↓
Structured Context
↓
LLM / Local Response
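
The extractor router at the top of this pipeline can be pictured as a lookup from file extension to extractor function. The sketch below is an assumption about its shape, not the contents of `extractor_router.py`; the OCR and audio extractors are placeholders, and web queries would bypass this table and go to the web retriever.

```python
from pathlib import Path

def extract_txt(path):
    """Read a plain-text or Markdown file as UTF-8."""
    return Path(path).read_text(encoding="utf-8")

def extract_ocr(path):
    raise NotImplementedError("placeholder for the OCR extractor")

def extract_audio(path):
    raise NotImplementedError("placeholder for the audio extractor")

# Extension -> extractor; every extractor returns normalized text.
EXTRACTORS = {
    ".txt": extract_txt, ".md": extract_txt,
    ".png": extract_ocr, ".jpg": extract_ocr, ".jpeg": extract_ocr,
    ".mp3": extract_audio, ".wav": extract_audio, ".m4a": extract_audio,
}

def route(path):
    """Pick the extractor for a file based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or path}") from None
```

Because every extractor returns plain text, the stages after the router (chunking, embedding, indexing) do not need to know which modality a chunk came from.
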
RAG/
│
├── app/
│   │
│   ├── core/
│   │   ├── chunking.py
│   │   ├── embeddings.py
│   │   ├── retriever.py
│   │   └── retriever_engine.py
│   │
│   ├── ingestion/
│   │   ├── extractors/
│   │   │   ├── txt_extractor.py
│   │   │   ├── ocr_extractor.py
│   │   │   ├── audio_extractor.py
│   │   │   └── extractor_router.py
│   │   │
│   │   └── loader.py
│   │
│   ├── retrieval/
│   │   ├── web_search.py
│   │   ├── web_scraper.py
│   │   └── web_context_builder.py
│   │
│   ├── llm/
│   │   ├── base.py
│   │   └── gemini.py
│   │
│   ├── services/
│   │   ├── answer_engine.py
│   │   └── rag_pipeline.py
│   │
│   └── storage/
│       └── faiss_store.py
│
├── uploads/
├── data/
├── model/
│
├── index.html
├── main.py
├── requirements.txt
├── README.md
└── .gitignore
Clone the repository:
git clone https://github.com/your-username/rag-engine.git
cd rag-engine
Install dependencies:
pip install -r requirements.txt
Download the embedding model:
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Place inside:
model/all-MiniLM-L6-v2/
Set the Gemini API key:
GEMINI_API_KEY=your_api_key
Install FFmpeg (required for audio transcription) and add it to the system PATH.
Run the server:
python main.py
Then open:
http://127.0.0.1:8000
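
Before starting the server, a small preflight check can confirm that the setup steps took effect. This is a hypothetical helper, not part of the project. It assumes the API key is visible in the process environment; if the project loads it from a file instead, run the check after that load.

```python
import os
import shutil

def check_environment(env=None, which=shutil.which,
                      model_dir=os.path.join("model", "all-MiniLM-L6-v2")):
    """Return a list of setup problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    if which("ffmpeg") is None:
        problems.append("ffmpeg is not on PATH (needed for audio transcription)")
    if not os.path.isdir(model_dir):
        problems.append(f"embedding model folder not found: {model_dir}")
    return problems

if __name__ == "__main__":
    for problem in check_environment():
        print("setup issue:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.
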
Supported inputs:
- Text: .txt, .md
- Images (OCR): .png, .jpg, .jpeg
- Audio: .mp3, .wav, .m4a
- Web: live web retrieval, semantic webpage extraction
Current retrieval sources:
- uploaded documents
- OCR-extracted text
- audio transcripts
- live internet knowledge
Planned:
- metadata-aware retrieval
- multilingual audio
- reranking
- confidence scoring
- retrieval thresholding
- caching
- async retrieval
- observability expansion
OCR limitations:
- no handwriting support
- multilingual OCR still experimental
- no layout preservation
Audio limitations:
- English-only
- no speaker diarization
- no multilingual transcription
Web retrieval limitations:
- no reranking
- no caching
- temporary retrieval context only
Tech stack:
- Python
- FastAPI
- SentenceTransformers
- FAISS
- Gemini API
- Whisper
- Pytesseract
- Trafilatura
- HTML/CSS/JavaScript
MIT License
© 2026 HARDIK BASU