AI-powered document classification that herds files into organized folders
Drover uses LLMs to analyze documents and suggest consistent, policy-compliant filesystem paths and filenames. Named after herding dogs that drove livestock, Drover herds your scattered files into an organized folder structure.
- Multi-Provider AI — Works with Ollama (local), OpenAI, Anthropic, and OpenRouter
- Intelligent Classification — Categorizes documents by domain, category, and document type
- Smart Sampling — Adaptive page sampling for efficient processing of large documents
- Taxonomy System — Extensible controlled vocabularies with strict or fallback modes
- NARA-Compliant Naming — Generates standardized filenames: {doctype}-{vendor}-{subject}-{entity}-{date}.pdf. The entity slot (pet, patient, performer, brand) is optional and is dropped when empty, when it would duplicate the vendor, or for privacy-sensitive domains.
- macOS Tagging — Apply classification as native filesystem tags
- Batch Processing — Classify multiple documents with JSONL output
- Evaluation Framework — Measure accuracy against ground truth datasets
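The naming pattern in the feature list can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Drover's implementation: `build_filename` and `PRIVACY_SENSITIVE_DOMAINS` are hypothetical names, and treating `medical` as a privacy-sensitive domain is an assumption made only for the example.

```python
# Hypothetical sketch of the {doctype}-{vendor}-{subject}-{entity}-{date}.pdf pattern.
# The set below is an assumption for illustration; Drover's real list may differ.
PRIVACY_SENSITIVE_DOMAINS = {"medical"}


def build_filename(doctype, vendor, subject, date, entity="", domain=""):
    """Join the naming slots with hyphens, dropping the optional entity slot
    when it is empty, duplicates the vendor, or the domain is sensitive."""
    keep_entity = (
        bool(entity)
        and entity.lower() != vendor.lower()
        and domain not in PRIVACY_SENSITIVE_DOMAINS
    )
    parts = [doctype, vendor, subject]
    if keep_entity:
        parts.append(entity)
    parts.append(date)
    return "-".join(parts) + ".pdf"
```

With the slots `statement`, `chase`, `checking`, and `20240115` and no entity, the sketch yields `statement-chase-checking-20240115.pdf`.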
- Python 3.14.x
- uv (package and environment manager)
- Ollama (for local inference) or API keys for cloud providers
To install drover as a global CLI on your system, see INSTALL.md. The steps below set up a development checkout for working on drover itself.
# Clone and sync the project (creates .venv and installs dependencies)
git clone https://github.com/ckrough/drover.git
cd drover
uv sync --extra docling
# Download Docling models (one-time, ~500 MB to ~/.cache/docling/models)
uv run docling-tools models download
Run the CLI through uv run drover ..., or activate the environment with source .venv/bin/activate to call drover directly.
Drover uses Docling as the sole document loader, with full-page OCR enabled on PDFs so vendor names carried in logos and embedded images reach the classifier. The [docling] install extra and the one-time model download above are required. If you skip the download, Docling's first run fetches models from Hugging Face on demand (a few hundred MB, internet required); subsequent runs are fully offline. Rationale and the format-coverage policy live in ADR-005 and ADR-006.
# Using local Ollama (default)
drover classify document.pdf
# Using OpenAI
export OPENAI_API_KEY="sk-..."
drover classify document.pdf --ai-provider openai --ai-model gpt-4o
Analyze documents and output suggested file paths:
drover classify invoice.pdf
drover classify *.pdf --batch # Multiple files, JSONL output
drover classify doc.pdf --metrics # Include AI metrics
drover classify doc.pdf --log-level verbose # Detailed logging
Classify and apply native filesystem tags:
drover tag document.pdf --dry-run # Preview tags
drover tag document.pdf --tag-fields domain,vendor
drover tag --tag-mode replace document.pdf # Replace existing tags
Classify, optionally tag, and move files into a destination tree. SRC may be a single file or a directory; --dest is required.
# Move every supported file under ~/Inbox into ~/Documents/filed.
drover organize ~/Inbox --dest ~/Documents/filed
# Dry-run preview to stdout (one JSONL record per file).
drover organize ~/Inbox --dest ~/Documents/filed --dry-run --report -
# Single-file invocation suitable for Hazel rules or macOS Folder Actions.
drover organize ~/Inbox/scan.pdf --dest ~/Documents/filed \
  --tag-fields category,doctype \
  --report ~/Library/Logs/drover/run-$(date +%Y%m%d-%H%M%S).jsonl
# Copy instead of move; source preserved.
drover organize ~/Inbox --dest ~/Documents/filed --copy
Behavior:
- Default conflict policy is skip: if {DEST}/{suggested_path} already exists, the source is left in place and the record is skipped_exists. Re-running on the same source is idempotent.
- --tag-fields is validated against domain, category, doctype, vendor, date, subject at parse time. Tags are written only to the destination Drover produces; the source file is never tagged.
- The JSONL report uses the same record format in --dry-run and live mode (status values are prefixed would_ in dry-run). Records carry original_path, suggested_path, final_destination, status, tags_applied, and error.
- Stream discipline: log chatter goes to stderr (gated by DROVER_LOG_LEVEL); the JSONL report goes to the file you pass, or to stdout when --report -. Unsupported extensions surface as skipped_unsupported records and do not raise the exit code.
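Because the report is one JSON record per line, it is straightforward to post-process. The sketch below tallies records by status and folds the dry-run `would_` prefix into the live status name. `summarize_report` is a hypothetical helper, not part of Drover, and it relies only on the `status` field named above.

```python
import json
from collections import Counter


def summarize_report(lines):
    """Count records per status, folding dry-run `would_` statuses into
    their live equivalents so both modes summarize the same way."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        status = record["status"].removeprefix("would_")
        counts[status] += 1
    return counts
```

An open file object can be passed directly, for example `summarize_report(open(report_path))`, where `report_path` is whatever was given to --report.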
A Hazel rule that watches ~/Inbox/, runs drover organize on every newly added file, and appends to a daily JSONL log:
drover organize "$1" \
  --dest ~/Documents/filed \
  --tag-fields category,doctype \
  --report ~/Library/Logs/drover/$(date +%Y-%m-%d).jsonl
Equivalent macOS Folder Action shell:
for f in "$@"; do
  /usr/local/bin/drover organize "$f" \
    --dest ~/Documents/filed \
    --tag-fields category,doctype \
    --report ~/Library/Logs/drover/$(date +%Y-%m-%d).jsonl
done
Measure classification accuracy against ground truth:
drover evaluate eval/ground_truth/synthetic.jsonl
drover evaluate eval/ground_truth/synthetic.jsonl --output-format json
{
  "original": "scan001.pdf",
  "suggested_path": "financial/banking/statement/statement-chase-checking-20240115.pdf",
  "domain": "financial",
  "category": "banking",
  "doctype": "statement",
  "vendor": "chase",
  "date": "20240115",
  "subject": "checking"
}
| Variable | Description | Default |
|---|---|---|
| DROVER_AI_PROVIDER | AI provider (ollama, openai, anthropic, openrouter) | ollama |
| DROVER_AI_MODEL | Model name | gemma4:latest |
| DROVER_TAXONOMY | Classification taxonomy | household |
| DROVER_NAMING_STYLE | Filename policy | nara |
| DROVER_SAMPLE_STRATEGY | Page sampling (full, first_n, bookends, adaptive) | adaptive |
| DROVER_LOG_LEVEL | Logging verbosity (quiet, verbose, debug) | quiet |
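To see which values would apply in a given shell, the variables and defaults from the table can be mirrored in a few lines of Python. This is only an illustrative sketch: `effective_settings` is a hypothetical helper, and it does not model how Drover itself merges environment variables with a config file or CLI flags.

```python
import os

# Variable names and defaults copied from the table above.
DEFAULTS = {
    "DROVER_AI_PROVIDER": "ollama",
    "DROVER_AI_MODEL": "gemma4:latest",
    "DROVER_TAXONOMY": "household",
    "DROVER_NAMING_STYLE": "nara",
    "DROVER_SAMPLE_STRATEGY": "adaptive",
    "DROVER_LOG_LEVEL": "quiet",
}


def effective_settings(environ=None):
    """Return each documented variable with its default applied when unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}
```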
Drover searches for configuration in order: --config PATH → drover.yaml → ~/.config/drover/config.yaml
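The search order can be pictured as a first-match lookup. The sketch below is an approximation, not Drover's code: `resolve_config` is a hypothetical name, resolving drover.yaml against the current working directory is an assumption, and so is the choice to fall through (rather than fail) when the --config path does not exist.

```python
from pathlib import Path


def resolve_config(cli_path=None, cwd=None, home=None):
    """Return the first existing config file in the documented search order,
    or None when no candidate exists."""
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)
    candidates = []
    if cli_path is not None:
        candidates.append(Path(cli_path))  # --config PATH
    candidates.append(cwd / "drover.yaml")  # assumed relative to the working directory
    candidates.append(home / ".config" / "drover" / "config.yaml")
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```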
# drover.yaml
ai:
  provider: openai
  model: gpt-4o
  temperature: 0.0
taxonomy: household
taxonomy_mode: fallback
naming_style: nara
concurrency: 4
| Provider | API Key Variable | Example Model |
|---|---|---|
| Ollama | — (local) | gemma4:latest |
| OpenAI | OPENAI_API_KEY | gpt-4o |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-20250514 |
| OpenRouter | OPENROUTER_API_KEY | anthropic/claude-sonnet-4 |
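A quick pre-flight check can confirm that the key for the chosen provider is set before a run starts. The sketch below encodes the provider table as a dictionary; `missing_api_key` is a hypothetical helper and says nothing about how Drover itself reports a missing key.

```python
import os

# Provider-to-variable mapping taken from the table above; Ollama runs locally
# and needs no key.
API_KEY_VARS = {
    "ollama": None,
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def missing_api_key(provider, environ=None):
    """Return the name of the unset key variable for `provider`, or None
    when nothing is missing. Raises KeyError for an unknown provider."""
    env = os.environ if environ is None else environ
    var = API_KEY_VARS[provider]
    if var is None or env.get(var):
        return None
    return var
```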
The loader is Docling, so the supported set matches Docling's officially-supported formats per docs/usage/supported_formats. See ADR-006 for the audit.
| Category | Extensions |
|---|---|
| PDF | .pdf |
| Office (Open XML) | .docx, .xlsx, .pptx |
| Markup | .txt, .md, .html, .htm |
| Data | .csv |
| Images | .png, .jpg, .jpeg, .tiff, .tif, .bmp |
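To pre-filter an inbox before handing it to Drover, the extension table can be expressed as a set lookup. This is a sketch only: `is_supported` is a hypothetical helper, and matching extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions copied from the table above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx",
    ".txt", ".md", ".html", ".htm",
    ".csv",
    ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp",
}


def is_supported(path):
    """Return True when the file's final extension is in the supported set.
    Lowercasing the suffix is an assumption made for this sketch."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```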
Drover follows a pipeline architecture with extensible plugin systems:
[Document] → [Loader] → [Classifier] → [PathBuilder] → [Output]
                ↓            ↓               ↓
            [Sampling]  [Taxonomy]    [NamingPolicy]
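The diagram can be read as three stages chained in order. The sketch below models each stage as a plain callable to show the data flow; `Classification`, `run_pipeline`, and the stage signatures are hypothetical and do not reflect Drover's actual classes. The field names follow the example JSON output earlier in this README.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Classification:
    domain: str
    category: str
    doctype: str
    vendor: str
    subject: str
    date: str


def run_pipeline(
    document: str,
    load: Callable[[str], str],
    classify: Callable[[str], Classification],
    build_path: Callable[[Classification], str],
) -> dict:
    """Chain the stages from the diagram: load text, classify it, build a path."""
    text = load(document)      # Loader (page sampling would happen here)
    result = classify(text)    # Classifier (constrained by the taxonomy)
    return {"original": document, "suggested_path": build_path(result)}
```

Keeping each stage behind a callable is one way to make loaders, taxonomies, and naming policies swappable, which is the kind of extensibility the diagram suggests.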
Tech Stack:
- CLI: Click
- LLM: LangChain with structured output
- Config: Pydantic
- Logging: structlog
# Install with dev dependencies
uv sync --all-extras
# Run tests
uv run pytest
# Run end-to-end smoke suite (LLM tests need Ollama; emits JSON report under smoke/reports/)
uv run python smoke/run.py # full suite (~2 min)
uv run python smoke/run.py --skip-llm # CLI + error-path tests only (~10s)
# See smoke/README.md for the test catalog and report schema.
# Lint and format
uv run ruff check src/ --fix && uv run ruff format src/
# Security scan
uv run bandit -r src/ -c pyproject.toml
See CONTRIBUTING.md for the full development workflow.
- Contributing Guide — Development setup, architecture, and extension guides
- ADR-001: Chain-of-Thought Prompting — 7-step reasoning for accurate classification
- ADR-002: Privacy-First Design — Local-first, zero telemetry approach
- ADR-003: NLI Classifier Roadmap — Zero-shot NLI exploration (superseded by ADR-004)
- ADR-004: Local LLM as Primary Local Path — Ollama gemma4 as the default local classifier
- ADR-005: Docling with Full-Page OCR as the Default PDF Loader — Structure-aware loading with OCR over logos and embedded images for accurate folder placement
GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later). See LICENSE for the full text.
Copyright (C) 2025 Backchain LLC
