Fine-tune LLMs to sound like you.
Customain learns your writing style from your own real text content, and conversations, and fine-tunes (large) language models to mimic your tone, voice, and communication patterns. The result is your custom AI that does not sound generic, but just like you.
Your emails → Extract & clean → Fine-tune → A model that writes like you
- Connect a content source (Gmail today, more coming)
- Process your text into high-quality, anonymized training pairs
- Fine-tune OpenAI models on your writing style
- Evaluate how well the model captures your tone — with both classical metrics and a trained authorship classifier
| Source | Status |
|---|---|
| Gmail | ✅ Available |
| Outlook | 🔜 Planned |
| Slack | 🔜 Planned |
| Notion | 🔜 Planned |
| Google Docs | 🔜 Planned |
| Provider | Models | Methods | Status |
|---|---|---|---|
| OpenAI | GPT-4.1, 4.1-mini, 4.1-nano | SFT, DPO | ✅ Available |
| Together AI | Llama, Mixtral, Qwen + any HF model | -- | 🔜 Planned |
- Python 3.11+
- uv package manager
- OpenAI API key
- Gmail OAuth credentials (for Gmail source)
git clone https://github.com/user/customain.git
cd customain
uv syncCreate .secrets/api_keps.json:
{
"openai_api_key": "sk-...",
"wandb_api_key": "optional-for-tracking"
}For Gmail, you'll also need OAuth credentials — see Google's guide.
Run the full Gmail preprocessing pipeline:
uv run python -m gmail_preprocessing_pipeline.run_pipelineEach run now creates a versioned dataset under data/gmail/<timestamp>/, where the
timestamp is taken from the exported mbox filename. Inside each version you get
separate folders per dataset type, for example:
data/gmail/20260505_101530/
_intermediate/
sft/
train.jsonl
test.jsonl
train_mock.jsonl
test_mock.jsonl
dpo/
train.jsonl
test.jsonl
train_mock.jsonl
test_mock.jsonl
authorship/
train.jsonl
val.jsonl
manifest.json
Build only the dataset types you need:
uv run python -m gmail_preprocessing_pipeline.run_pipeline --targets sft
uv run python -m gmail_preprocessing_pipeline.run_pipeline --targets sft authorshipLimit how much mail is exported:
# Only recent mail
uv run python -m gmail_preprocessing_pipeline.run_pipeline --newer-than-days 30
# Cap the export size
uv run python -m gmail_preprocessing_pipeline.run_pipeline --max-threads 250
# Add an extra Gmail query filter
uv run python -m gmail_preprocessing_pipeline.run_pipeline \
--gmail-query "label:important -category:promotions"Or skip steps you've already completed:
# Already exported Gmail — start from extract
uv run python -m gmail_preprocessing_pipeline.run_pipeline --start-from 2
# Re-run dataset building from existing processed pairs
uv run python -m gmail_preprocessing_pipeline.run_pipeline --start-from 4 --targets sft dpoThe pipeline runs 4 steps:
- Export Gmail threads to mbox
- Extract email-reply pairs
- Transform pairs in one LLM pass: clean + filter + anonymize
- Build the selected dataset folders (
sft,dpo,authorship)
Configure which models and hyperparameters to try in ft/training_configs.py, then run the full pipeline:
uv run python -m ft.run_pipeline \
--data-dir dataBy default this uses the latest dataset version under data/gmail/.
Or run a quick test with a small subset first:
uv run python -m ft.run_pipeline \
--data-dir data \
--test-runYou can also skip steps you've already completed:
# Skip data upload and job launch, just evaluate
uv run python -m ft.run_pipeline \
--data-dir data \
--skip 1 2The pipeline will:
- Upload data and launch fine-tuning jobs across your configured model/hyperparameter combinations
- Poll until all jobs complete
- Run each fine-tuned model on the test set
- Evaluate results and log metrics to Weights & Biases
Customain includes a pluggable evaluation framework. Evaluators are auto-discovered. You can just drop a new one into ft/evaluation/evaluators/. It can be ml-based, statistical, or any other form you prefer. Take a look at the existing ml-based and metric/statistical evaluators already implemented:
| Evaluator | What it measures |
|---|---|
authorship_classifier |
CNN-based authorship probability score |
tone_judge |
LLM-as-judge scoring tone & style fidelity |
bleu |
N-gram overlap (BLEU score) |
meteor |
Token-level alignment (METEOR score) |
semantic_similarity |
Embedding cosine similarity |
Configure which evaluators to skip in ft/training_configs.py:
skip_evaluators = ["bleu", "meteor"] # Only run tone_judge and semantic_similarityA character-level CNN text classifier trained to distinguish the author's writing from other people's writings. Unlike LLM-as-judge evaluators, this learns style patterns directly from data, hence it does not suffer from the LLM-as-a-judge performance issues. Its current best performance is 91% precision.
# Prepare training data from the latest versioned SFT dataset
uv run python -m classifiers.authorship.prepare_data
# Train (logs to W&B under customain-classifiers)
uv run python -m classifiers.authorship.train \
--train-data data/gmail/<timestamp>/authorship/train.jsonl \
--val-data data/gmail/<timestamp>/authorship/val.jsonl
# The authorship_classifier evaluator auto-registers and uses the trained checkpointThis project is licensed under the GNU Affero General Public License v3.0 (AGPLv3).