GitHub - owenawsong/Inflect: Inflect is a lightweight, high‑quality text‑to‑speech model designed to deliver surprisingly natural audio with a minimal footprint. It’s an active work‑in‑progress focused on fast iteration, strong voice quality, and simple integration for developers experimenting with modern TTS systems.

Inflect-Nano is a local-first English TTS research project built around expressive voice cloning, stable long-form speech, and honest evaluation.

Preview · Architecture · Evaluation · Roadmap · Publishing

Inflect-Nano

Small TTS models usually make you choose one:

clean audio, but flat delivery
expressive speech, but unstable pacing
impressive short demos, but long-prompt collapse

Inflect-Nano is the attempt to not accept that tradeoff. The goal is a compact local English TTS model that can clone a voice from a short reference, speak naturally across longer text, and eventually expose direct style controls without needing a giant hosted model.

This repository is the public research scaffold: scripts, benchmark tools, architecture notes, release planning, and preview assets. Model weights and a real sample gallery will be linked when the system earns them.

Preview

Launch Film

The final launch film is planned as a short product-style demo: reference voice, generated speech, long prompt stability, and raw-vs-enhanced audio.

Sample Deck

These are placeholder slots for the public sample gallery. The audio files are intentionally tiny placeholder tones until real model samples are ready.

Moment	What It Will Prove	Placeholder
Voice clone	Short reference in, new speech out, same speaker identity.	voice-clone-placeholder.wav
Long prompt	A paragraph that keeps pacing, coverage, and voice consistent.	long-prompt-placeholder.wav
Inflect-Enhance	Raw generated output versus optional restoration pass.	enhanced-placeholder.wav
Style control	Future text tags such as `<soft>`, `<breath>`, `<laugh>`, `<emphasis>`.	Planned

Preview Claims

The public release should make these claims only when the benchmark suite supports them:

Voice cloning that stays recognizable.
Pacing that does not drift into slow motion.
Long prompts that do not skip words.
Emotion that helps the sentence instead of hijacking it.
Audio polish that is disclosed when an enhancer is used.

Architecture

Inflect is designed as a stack, not a single magic checkpoint.

flowchart LR
  V["Reference voice<br/>audio + transcript"] --> C["Voice conditioning"]
  T["Text prompt<br/>punctuation + future tags"] --> P["Prompt planner"]
  P --> G["Inflect-Nano<br/>TTS core"]
  C --> G
  G --> S["Runtime stabilizer<br/>duration, silence, retry"]
  S --> E["Optional Inflect-Enhance<br/>detail restoration"]
  E --> O["Final speech"]

The core model

Inflect-Nano is the compact TTS core. Its job is to generate the right words in the right voice with plausible rhythm and emotion. If the core skips text or loses the speaker, the enhancer is not allowed to hide that failure.

The stabilizer

The stabilizer is the reliability layer around inference. It is meant to catch problems that small TTS models often have:

slow-motion generations
leading clicks
long silences
loudness jumps
long-prompt collapse
text coverage failures

Inflect-Enhance

Inflect-Enhance is an optional post-generation restoration module. It can improve texture, detail, and artifact cleanup, but it counts as part of the enhanced pipeline. Public demos should label whether enhancement is used.

Research Direction

The project is currently comparing small and medium TTS families, with Lux / ZipVoice-style models still the most interesting tiny direction by expressiveness.

Path	Strength	Risk
Lux / ZipVoice lineage	Expressive, fast, compact	Inconsistent pacing and occasional glitches.
Kyutai Pocket TTS	Stable and small	Too flat without major expressiveness work.
Kokoro-style systems	Clean audio	Less suitable for emotional zero-shot cloning.
Chatterbox-style teachers	Useful style/tag data	Better as teacher data than final tiny model.
0.3B-0.8B models	Higher ceiling	Less aligned with the smallest local target.

The current rule is simple: do not trust training loss until listening and objective checks agree.

Evaluation

Inflect is evaluated with both human listening and objective checks.

flowchart TD
  A["Generate fixed prompt set"] --> B["Blind A/B listening"]
  A --> C["Objective audio checks"]
  B --> D["Accept, reject, or retry experiment"]
  C --> D
  D --> E["Document result"]

We track:

speaker similarity
pacing consistency
long-prompt stability
skipped or invented words
leading click / silence
internal silence
RMS jumps
real-time factor
enhancer impact

See docs/EVALUATION.md.

Repository Map

Path	Purpose
`inflect/`	Inflect-native modules and extension experiments.
`scripts/`	Benchmarking, dataset generation, fine-tuning, and release tooling.
`inflect_asr/`	Side project for small ASR and teacher-label pipelines.
`voice-encoder/`	Voice conditioning and paralinguistic research.
`docs/`	Architecture, roadmap, evaluation, media kit, and release checklist.
`examples/`	Lightweight public examples and placeholder sample assets.
`assets/`	README and media-kit visuals.

Local generated outputs, checkpoints, reference voices, and third-party checkouts are intentionally excluded from GitHub.

Quickstart

This repository is still a research preview, not a polished pip package.

git clone https://github.com/owenawsong/Inflect.git
cd Inflect

Run a local blind A/B server after generating a benchmark directory:

.\.venv-voxcpm\Scripts\python.exe blind_ab_server.py `
  --bench-root outputs\zipvoice_bench\YOUR_BENCH `
  --state-dir .blind_ab_state_YOUR_BENCH `
  --port 18132

Read first:

Honest Status

Inflect-Nano is not released yet. Some experiments improve one property while damaging another. That is why this repo is structured around proof: repeatable prompts, blind A/B, objective checks, and documentation of failures.

The target is not a cherry-picked demo. The target is a model that survives real prompts.

License

Repository code and documentation are licensed under Apache 2.0 unless otherwise noted.

Generated datasets, reference voices, model checkpoints, and third-party components may have separate terms. See LICENSE, PUBLISHING.md, and SECURITY.md.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
CLAUDE_READ		CLAUDE_READ
assets		assets
docs		docs
examples		examples
inflect		inflect
patches		patches
scripts		scripts
voice-encoder		voice-encoder
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
PROJECT_STRUCTURE.md		PROJECT_STRUCTURE.md
PUBLISHING.md		PUBLISHING.md
README.md		README.md
SECURITY.md		SECURITY.md
compare_server.py		compare_server.py
zipvoice_server.py		zipvoice_server.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Inflect-Nano

Preview

Launch Film

Sample Deck

Preview Claims

Architecture

The core model

The stabilizer

Inflect-Enhance

Research Direction

Evaluation

Repository Map

Quickstart

Honest Status

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Inflect-Nano

Preview

Launch Film

Sample Deck

Preview Claims

Architecture

The core model

The stabilizer

Inflect-Enhance

Research Direction

Evaluation

Repository Map

Quickstart

Honest Status

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages