Drover

AI-powered document classification that herds files into organized folders


Drover uses LLMs to analyze documents and suggest consistent, policy-compliant filesystem paths and filenames. Named after herding dogs that drove livestock, Drover herds your scattered files into an organized folder structure.

Features

  • Multi-Provider AI — Works with Ollama (local), OpenAI, Anthropic, and OpenRouter
  • Intelligent Classification — Categorizes documents by domain, category, and document type
  • Smart Sampling — Adaptive page sampling for efficient processing of large documents
  • Taxonomy System — Extensible controlled vocabularies with strict or fallback modes
  • NARA-Compliant Naming — Generates standardized filenames: {doctype}-{vendor}-{subject}-{entity}-{date}.pdf. The entity slot (pet, patient, performer, brand) is optional and is dropped when empty, when it would duplicate the vendor, or for privacy-sensitive domains.
  • macOS Tagging — Apply classification as native filesystem tags
  • Batch Processing — Classify multiple documents with JSONL output
  • Evaluation Framework — Measure accuracy against ground truth datasets
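
The naming rule in the NARA-Compliant Naming bullet can be sketched in Python as follows. This is a minimal illustration, not Drover's implementation: build_filename, PRIVATE_DOMAINS, and the "medical" entry are hypothetical names chosen for the example. The source only states that the entity slot is dropped when empty, when it duplicates the vendor, or for privacy-sensitive domains.

```python
from typing import Optional

# Hypothetical set: the README does not list which domains count as
# privacy-sensitive, so "medical" is only a placeholder.
PRIVATE_DOMAINS = frozenset({"medical"})


def build_filename(
    doctype: str,
    vendor: str,
    subject: str,
    date: str,
    entity: Optional[str] = None,
    domain: Optional[str] = None,
    ext: str = "pdf",
) -> str:
    """Join the slots as {doctype}-{vendor}-{subject}-{entity}-{date}.{ext}."""
    parts = [doctype, vendor, subject]
    # Keep the entity only when it is non-empty, differs from the vendor,
    # and the domain is not treated as privacy-sensitive.
    keep_entity = (
        bool(entity)
        and entity != vendor
        and domain not in PRIVATE_DOMAINS
    )
    if keep_entity:
        parts.append(entity)
    parts.append(date)
    return "-".join(parts) + "." + ext


print(build_filename("statement", "chase", "checking", "20240115"))
```

With no entity supplied, the call reproduces the filename shown in the Output Format example, statement-chase-checking-20240115.pdf.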

Quick Start

Prerequisites

  • Python 3.14.x
  • uv (package and environment manager)
  • Ollama (for local inference) or API keys for cloud providers

Installation

To install drover as a global CLI on your system, see INSTALL.md. The steps below set up a development checkout for working on drover itself.

# Clone and sync the project (creates .venv and installs dependencies)
git clone https://github.com/ckrough/drover.git
cd drover
uv sync --extra docling

# Download Docling models (one-time, ~500 MB to ~/.cache/docling/models)
uv run docling-tools models download

Run the CLI through uv run drover ..., or activate the environment with source .venv/bin/activate to call drover directly.

About the Docling loader

Drover uses Docling as the sole document loader, with full-page OCR enabled on PDFs so vendor names carried in logos and embedded images reach the classifier. The [docling] install extra and the one-time model download above are required. If you skip the download, Docling's first run fetches models from Hugging Face on demand (a few hundred MB, internet required); subsequent runs are fully offline. Rationale and the format-coverage policy live in ADR-005 and ADR-006.

Classify Your First Document

# Using local Ollama (default)
drover classify document.pdf

# Using OpenAI
export OPENAI_API_KEY="sk-..."
drover classify document.pdf --ai-provider openai --ai-model gpt-4o

Usage

Classify Command

Analyze documents and output suggested file paths:

drover classify invoice.pdf
drover classify *.pdf --batch                    # Multiple files, JSONL output
drover classify doc.pdf --metrics                # Include AI metrics
drover classify doc.pdf --log-level verbose      # Detailed logging

Tag Command (macOS)

Classify and apply native filesystem tags:

drover tag document.pdf --dry-run                # Preview tags
drover tag document.pdf --tag-fields domain,vendor
drover tag --tag-mode replace document.pdf       # Replace existing tags

Organize Command

Classify, optionally tag, and move files into a destination tree. SRC may be a single file or a directory; --dest is required.

# Move every supported file under ~/Inbox into ~/Documents/filed.
drover organize ~/Inbox --dest ~/Documents/filed

# Dry-run preview to stdout (one JSONL record per file).
drover organize ~/Inbox --dest ~/Documents/filed --dry-run --report -

# Single-file invocation suitable for Hazel rules or macOS Folder Actions.
drover organize ~/Inbox/scan.pdf --dest ~/Documents/filed \
  --tag-fields category,doctype \
  --report ~/Library/Logs/drover/run-$(date +%Y%m%d-%H%M%S).jsonl

# Copy instead of move; source preserved.
drover organize ~/Inbox --dest ~/Documents/filed --copy

Behavior:

  • Default conflict policy is skip: if {DEST}/{suggested_path} already exists, the source is left in place and the record's status is skipped_exists. Re-running on the same source is idempotent.
  • --tag-fields is validated against domain, category, doctype, vendor, date, subject at parse time. Tags are written only to the destination Drover produces; the source file is never tagged.
  • The JSONL report has the same schema in --dry-run and live mode; in dry-run, status values are prefixed with would_. Records carry original_path, suggested_path, final_destination, status, tags_applied, and error.
  • Stream discipline: log chatter goes to stderr (gated by DROVER_LOG_LEVEL); the JSONL report goes to the file you pass, or to stdout when you pass --report -. Unsupported extensions surface as skipped_unsupported records and do not raise the exit code.
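
Because the report is one JSON record per line, it is straightforward to post-process. The sketch below tallies records by status and folds dry-run statuses (the would_ prefix) into their live names. summarize_report and the SAMPLE records are hypothetical; the field names come from the list above, but the sample values and their types (for example, tags_applied as a list) are assumptions.

```python
import json
from collections import Counter
from typing import Iterable

# Made-up sample records; only the field names and the two status values
# (skipped_exists, skipped_unsupported) are taken from the README.
SAMPLE = [
    '{"original_path": "a.pdf", "suggested_path": "x/a.pdf", '
    '"final_destination": null, "status": "skipped_exists", '
    '"tags_applied": [], "error": null}',
    '{"original_path": "b.pdf", "suggested_path": "x/b.pdf", '
    '"final_destination": null, "status": "would_skipped_exists", '
    '"tags_applied": [], "error": null}',
    '{"original_path": "c.xyz", "suggested_path": null, '
    '"final_destination": null, "status": "skipped_unsupported", '
    '"tags_applied": [], "error": null}',
]


def summarize_report(lines: Iterable[str]) -> Counter:
    """Count report records by status, ignoring the dry-run would_ prefix."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in appended daily logs
        record = json.loads(line)
        counts[record["status"].removeprefix("would_")] += 1
    return counts


for status, count in sorted(summarize_report(SAMPLE).items()):
    print(status, count)
```

To run it against a real report, pass an open file handle instead of SAMPLE, for example summarize_report(open(path, encoding="utf-8")).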

Hazel / Folder Action recipe

A Hazel rule that watches ~/Inbox/, runs drover organize on every newly added file, and appends to a daily JSONL log:

drover organize "$1" \
  --dest ~/Documents/filed \
  --tag-fields category,doctype \
  --report ~/Library/Logs/drover/$(date +%Y-%m-%d).jsonl

Equivalent macOS Folder Action shell:

for f in "$@"; do
  /usr/local/bin/drover organize "$f" \
    --dest ~/Documents/filed \
    --tag-fields category,doctype \
    --report ~/Library/Logs/drover/$(date +%Y-%m-%d).jsonl
done

Evaluate Command

Measure classification accuracy against ground truth:

drover evaluate eval/ground_truth/synthetic.jsonl
drover evaluate eval/ground_truth/synthetic.jsonl --output-format json

Output Format

{
  "original": "scan001.pdf",
  "suggested_path": "financial/banking/statement/statement-chase-checking-20240115.pdf",
  "domain": "financial",
  "category": "banking",
  "doctype": "statement",
  "vendor": "chase",
  "date": "20240115",
  "subject": "checking"
}
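
In the record above, suggested_path appears to be laid out as domain/category/doctype/filename, with the filename starting with the doctype and ending with the date. The following sketch checks that relationship. The layout is inferred from this single example rather than from a documented guarantee, and path_matches_fields is a hypothetical helper.

```python
import json
from pathlib import PurePosixPath

# The example record from the Output Format section.
RECORD = json.loads("""
{
  "original": "scan001.pdf",
  "suggested_path": "financial/banking/statement/statement-chase-checking-20240115.pdf",
  "domain": "financial",
  "category": "banking",
  "doctype": "statement",
  "vendor": "chase",
  "date": "20240115",
  "subject": "checking"
}
""")


def path_matches_fields(record: dict) -> bool:
    """Check that the folders and filename agree with the classification fields."""
    path = PurePosixPath(record["suggested_path"])
    folders = path.parts[:-1]  # every path component except the filename
    expected = (record["domain"], record["category"], record["doctype"])
    stem = path.stem  # filename without the extension
    return (
        folders == expected
        and stem.startswith(record["doctype"] + "-")
        and stem.endswith("-" + record["date"])
    )


print(path_matches_fields(RECORD))
```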

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| DROVER_AI_PROVIDER | AI provider (ollama, openai, anthropic, openrouter) | ollama |
| DROVER_AI_MODEL | Model name | gemma4:latest |
| DROVER_TAXONOMY | Classification taxonomy | household |
| DROVER_NAMING_STYLE | Filename policy | nara |
| DROVER_SAMPLE_STRATEGY | Page sampling (full, first_n, bookends, adaptive) | adaptive |
| DROVER_LOG_LEVEL | Logging verbosity (quiet, verbose, debug) | quiet |
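
A wrapper script can mirror these variables and their defaults. The sketch below is a hedged illustration of how the documented defaults apply when a variable is unset; effective_settings is a hypothetical helper and does not reflect how Drover itself loads its settings.

```python
import os
from typing import Mapping

# Variable names and defaults copied from the Environment Variables table.
DEFAULTS = {
    "DROVER_AI_PROVIDER": "ollama",
    "DROVER_AI_MODEL": "gemma4:latest",
    "DROVER_TAXONOMY": "household",
    "DROVER_NAMING_STYLE": "nara",
    "DROVER_SAMPLE_STRATEGY": "adaptive",
    "DROVER_LOG_LEVEL": "quiet",
}


def effective_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return each variable's value from env, falling back to its default."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


for name, value in effective_settings().items():
    print(f"{name}={value}")
```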

Config File

Drover searches for configuration in this order: --config PATH, then drover.yaml, then ~/.config/drover/config.yaml.

# drover.yaml
ai:
  provider: openai
  model: gpt-4o
  temperature: 0.0

taxonomy: household
taxonomy_mode: fallback
naming_style: nara
concurrency: 4
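
The search order described in this section can be sketched as a first-match lookup. find_config is a hypothetical helper, and two details are assumptions: drover.yaml is taken to mean the current working directory, and a missing --config path falls through to the next candidate (the real CLI may report an error instead).

```python
from pathlib import Path
from typing import Optional


def find_config(explicit: Optional[Path], cwd: Path, home: Path) -> Optional[Path]:
    """Return the first existing config file in the documented search order."""
    candidates = []
    if explicit is not None:
        candidates.append(explicit)  # --config PATH
    candidates.append(cwd / "drover.yaml")  # assumed to be the working directory
    candidates.append(home / ".config" / "drover" / "config.yaml")
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # no file found; defaults and environment variables apply


print(find_config(None, Path.cwd(), Path.home()))
```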

AI Providers

| Provider | API Key Variable | Example Model |
| --- | --- | --- |
| Ollama | — (local) | gemma4:latest |
| OpenAI | OPENAI_API_KEY | gpt-4o |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-20250514 |
| OpenRouter | OPENROUTER_API_KEY | anthropic/claude-sonnet-4 |

Supported File Formats

The loader is Docling, so the supported set matches Docling's officially supported formats, as listed in docs/usage/supported_formats. See ADR-006 for the audit.

| Category | Extensions |
| --- | --- |
| PDF | .pdf |
| Office (Open XML) | .docx, .xlsx, .pptx |
| Markup | .txt, .md, .html, .htm |
| Data | .csv |
| Images | .png, .jpg, .jpeg, .tiff, .tif, .bmp |

Architecture

Drover follows a pipeline architecture with extensible plugin systems:

[Document] → [Loader] → [Classifier] → [PathBuilder] → [Output]
                ↓             ↓              ↓
           [Sampling]   [Taxonomy]    [NamingPolicy]

Tech Stack:

  • CLI: Click
  • LLM: LangChain with structured output
  • Config: Pydantic
  • Logging: structlog

Development

# Install with dev dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Run end-to-end smoke suite (LLM tests need Ollama; emits JSON report under smoke/reports/)
uv run python smoke/run.py             # full suite (~2 min)
uv run python smoke/run.py --skip-llm  # CLI + error-path tests only (~10s)
# See smoke/README.md for the test catalog and report schema.

# Lint and format
uv run ruff check src/ --fix && uv run ruff format src/

# Security scan
uv run bandit -r src/ -c pyproject.toml

See CONTRIBUTING.md for the full development workflow.

License

GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later). See LICENSE for the full text.

Copyright (C) 2025 Backchain LLC
