Avalonia 12 desktop app that captures Windows audio output (any app's playback), streams it to OpenAI's realtime translation model (gpt-realtime-translate), renders the translated audio + dual-language transcript live in a desktop-lyric overlay, and optionally records the whole session to WAV + SRT for later replay. Windows-only today; the audio layer is being abstracted for a macOS port.
- .NET 9 + Avalonia 12 (Fluent theme, Inter font fallback)
- NAudio: `WasapiLoopbackCapture` / Process Loopback for input, `WasapiOut` / `WaveOutEvent` for playback, `WdlResamplingSampleProvider` for high-quality 48 kHz → 24 kHz resampling, `WaveFileWriter` for recording
- MessageBox.Avalonia: dialog replacement (Avalonia has no built-in `MessageBox`)
- `System.Net.WebSockets.ClientWebSocket` (built-in) for the realtime API
- `System.Text.Json` (built-in) for protocol serialization
Data flow:

    [Any Windows app] ──► WasapiLoopbackCapture ──► downmix ──► WDL resample to 24 kHz ──► PCM16
                                                                                             │
                                                                                             ▼
                                                                             ClientWebSocket → OpenAI Realtime
                                                                                             │
                             ┌───────────────────────────────────────────────────────────────┴───┐
                             ▼                                                                   ▼
                 translated audio (PCM16)                                             dual transcript deltas
                             │                                                                   │
                             ▼                                                                   ▼
                     WasapiOut device                                         Avalonia TextBox + LyricWindow overlay

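
The capture side of this pipeline (downmix, resample to 24 kHz, convert to PCM16) can be illustrated with a minimal Python sketch. It is not Babelive's C# code: it assumes the captured frames are floats in [-1.0, 1.0], and it swaps the WDL resampler for a crude pair-averaging step that only handles the exact 48 kHz → 24 kHz case. All function names are hypothetical.

```python
import struct

def downmix_to_mono(frames):
    """Average the channels of each frame: [(L, R), ...] -> [m, ...]."""
    return [sum(frame) / len(frame) for frame in frames]

def halve_sample_rate(samples):
    """Crude 48 kHz -> 24 kHz step: average each adjacent pair of samples.
    This is a stand-in for a real resampler such as WDL, not an equivalent."""
    return [(samples[i] + samples[i + 1]) / 2
            for i in range(0, len(samples) - 1, 2)]

def to_pcm16(samples):
    """Clamp floats to [-1.0, 1.0] and pack them as little-endian signed 16-bit."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

# Four stereo frames at 48 kHz (the last one is deliberately out of range).
stereo_48k = [(0.5, 0.5), (0.5, -0.5), (1.0, 1.0), (2.0, 2.0)]
mono_48k = downmix_to_mono(stereo_48k)
mono_24k = halve_sample_rate(mono_48k)
pcm = to_pcm16(mono_24k)  # two samples, four bytes, ready to send
```

Clamping before the integer conversion matters: without it, an over-range sample would overflow the 16-bit range instead of saturating.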
- Windows 10 / 11
- .NET 9 SDK
- An OpenAI API key with access to `gpt-realtime-translate`
To build and run from source:

    dotnet restore
    dotnet run

On first launch, click the API… button in the settings panel and paste your sk-… key. The key is stored locally at %APPDATA%\Babelive\settings.json (plain JSON, never transmitted anywhere except to the configured API endpoint).

For a self-contained release build:

    dotnet publish -c Release

This produces a single Babelive.exe at bin\Release\net9.0-windows\win-x64\publish\ (~80–90 MB; it bundles the .NET 9 runtime and the Avalonia/Skia/HarfBuzz native libraries, single-file compressed). Just ship that one file.

- Pick a target language.
- Pick a Capture source. The recommended choice is All system audio (no echo), which uses Process Loopback (Windows 10 build 20348+) to exclude Babelive's own playback. Per-app entries (Teams, Chrome, Spotify, …) and legacy device loopbacks are also available.
- Pick a Playback device for the translated audio. Read the feedback warning below.
- Optional checkboxes:
  - Transcript only: silence the translation audio and keep just the on-screen subtitles.
  - Alt endpoint: fall back to the non-translations endpoint if your account doesn't have access to the dedicated one.
  - Echo suppress: pause API input while translation plays (prevents feedback at the cost of occasional model stalls).
  - Mute source: physically mute every speaker except Playback so you only hear the translation. Loopback still captures the source for the API (the mute is downstream of the engine tap).
- Optional sliders (both support mouse-wheel adjustment on hover):
  - Source volume: the level source apps are ducked to during translation playback (5–100%, default 10%). At 100% the ducker is fully disabled.
  - Translation volume: PCM-level gain applied to the translated audio (0–200%). 100% is unity; the OpenAI TTS is quieter than typical system audio, so you'll often want 130–180%. Moving this slider does not change Babelive's session volume in the Windows Volume Mixer.
- Click Start (or the red ▶ on the lyric overlay), then play any video / call / song.
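
As an illustration of what "PCM-level gain" means for the Translation volume slider, here is a minimal Python sketch that scales 16-bit samples directly and saturates instead of wrapping. It is not Babelive's implementation; the function name and the rounding choice are assumptions.

```python
import struct

def apply_gain_pcm16(pcm: bytes, percent: float) -> bytes:
    """Scale little-endian PCM16 samples by percent/100, saturating at the
    16-bit limits. percent is clamped to the slider range of 0-200."""
    gain = max(0.0, min(200.0, percent)) / 100.0
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    scaled = [max(-32768, min(32767, round(s * gain))) for s in samples]
    return struct.pack(f"<{count}h", *scaled)

chunk = struct.pack("<3h", 1000, -20000, 30000)
louder = apply_gain_pcm16(chunk, 150)    # the last sample saturates at 32767
unchanged = apply_gain_pcm16(chunk, 100) # unity gain returns identical bytes
```

Because the gain is applied to the samples themselves, the operating system's per-session volume is never touched, which matches the note that the Volume Mixer entry does not move.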
The settings window is hide-on-close: closing it leaves the lyric overlay and tray icon running. To quit completely, choose Exit from the tray menu (right-click the 译 icon).
A transparent always-on-top desktop-lyric panel docks bottom-center on first launch. Hover to fade in the toolbar:
| Button | Action |
|---|---|
| ▶ Start / ■ Stop | Toggle translation |
| ● Record / ■ Stop | Toggle recording (auto-starts translation if not running) |
| A− / A+ | Decrease / increase translation font size |
| 🔉 / 🔊 | Step translation volume by ±10% |
| ⚙ | Open settings window |
| ✕ | Hide overlay (re-open from tray menu) |
Drag the panel by any non-button area to move it. Double-click the top empty strip to snap between top / bottom of the screen. Drag the bottom-right grip to resize.
Press ● Record (lyric overlay or main window) to start saving. Press again to stop. If translation isn't running yet, recording auto-starts it.
Each Record → Stop cycle creates a fresh timestamped folder under %APPDATA%\Babelive\Recordings\{yyyy-MM-dd_HHmmss}\ containing:

    source.wav              ← original captured audio (24 kHz mono PCM16)
    source.srt              ← source-language transcript with timecodes
    translation.<lang>.wav  ← model's translated audio (same format)
    translation.<lang>.srt  ← target-language transcript with timecodes

The path is shown next to the Record button on the main window — click it to open the folder in Explorer.
SRT files share their base name with the matching WAV, so any standard player auto-loads the subtitles when the WAV is opened. The transcript splits cues at sentence terminators (., ?, !, 。, ？, ！) plus a delta-arrival-gap heuristic (~800 ms) that catches sentence boundaries the model didn't punctuate.
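
The cue-splitting rule can be sketched in Python as follows. This is a minimal illustration of the two triggers described above (terminator or an arrival gap over 800 ms), not Babelive's actual code; the `(arrival_ms, text)` delta shape, the function names, and the choice of cue start/end times are assumptions.

```python
TERMINATORS = (".", "?", "!", "。", "？", "！")
GAP_MS = 800  # approximate arrival gap treated as a sentence boundary

def split_cues(deltas):
    """deltas: iterable of (arrival_ms, text). Returns [(start_ms, end_ms, text)].
    A cue spans from its first delta to its last delta (an assumption)."""
    cues, buf, start, last = [], "", None, None

    def flush(end):
        nonlocal buf, start
        if buf.strip():
            cues.append((start, end, buf.strip()))
        buf, start = "", None

    for t, text in deltas:
        # Gap heuristic: close the pending cue if this delta arrived late.
        if buf and last is not None and t - last > GAP_MS:
            flush(last)
        if start is None:
            start = t
        buf += text
        last = t
        # Terminator rule: close the cue once it ends a sentence.
        if buf.rstrip().endswith(TERMINATORS):
            flush(t)
    if buf:
        flush(last)
    return cues

def timecode(ms):
    """Format milliseconds as an SRT timecode, HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(cues):
    """Render cues as numbered SRT blocks separated by blank lines."""
    return "\n".join(
        f"{i}\n{timecode(a)} --> {timecode(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(cues, 1)
    )

deltas = [(0, "Hello "), (200, "world."), (400, "No punctuation"),
          (1500, "next one"), (1600, "?")]
cues = split_cues(deltas)
```

In this example the second cue is closed by the gap rule alone: "No punctuation" has no terminator, but the next delta arrives 1100 ms later. A real writer would also want a minimum cue duration, since a cue built from a single delta has equal start and end times here.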
Windows' built-in audio players don't render external SRT for .wav files — they treat audio as audio, no subtitle track. Two ways around it:
- VLC: open `source.wav`, then Audio → Visualizations → Spectrum. The visualizer activates VLC's video output surface, which the subtitle renderer needs. Subtitles appear immediately.
- mpv: `mpv source.wav` shows subtitles without configuration. mpv is the most reliable choice for audio + external SRT on Windows.
Teams and Skype set AUDCLNT_STREAMFLAGS_PREVENT_LOOPBACK_CAPTURE on their call audio for privacy, so Windows' Process Loopback API returns silence for them. Babelive auto-detects this and, if VB-CABLE is installed, redirects the Teams/Skype process tree to CABLE Input via IAudioPolicyConfig per-app routing, then loopback-captures from the cable. No manual Teams/Skype audio config needed.
Without VB-CABLE installed, Teams/Skype audio cannot be captured — this is a Windows DRM-style restriction, not a Babelive bug.
Zoom / Discord / Google Meet / WebEx / Slack use WebRTC and don't set the flag — they work via plain Process Loopback.
If translated audio plays through the same speakers you're capturing, the loopback re-translates it forever. Three fixes:
- Use headphones for playback (different physical device than the captured speakers).
- Install VB-CABLE, a free virtual audio cable. Send the source app's output to CABLE Input; Babelive can then loopback-capture the cable while playing the translation through your real speakers / headphones.
- Tick "Transcript only": only the on-screen text appears and no audio is played back.
The recommended All system audio (no echo) capture mode also fixes this — it uses Process Loopback to exclude Babelive's own playback from the captured stream, so even with Translation playing on the same device the API never re-hears it.
The realtime translation API is new. The exact event/field names in Translation/RealtimeTranslatorClient.cs are best-effort based on https://developers.openai.com/api/docs/guides/realtime-translation plus the standard /v1/realtime event conventions. If your account sees errors:
- Endpoint: defaults to `wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate`. Tick "Alt endpoint" in the UI to fall back to `wss://api.openai.com/v1/realtime?model=gpt-realtime-translate`.
- Session config: `RealtimeTranslatorClient.SendSessionUpdateAsync` sends `session.update` with `input_audio_format=pcm16`, `output_audio_format=pcm16`, and `translation.target_language=<code>`. Adjust if the official schema differs.
- Event names: `Dispatch` matches both the `output_*.delta` and `response.output_*.delta` shapes. If transcripts/audio don't arrive, log every incoming event and adjust.
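
To make the session config and the two accepted event shapes concrete, here is a small Python sketch. It is only an illustration: the three settings and the two wildcard patterns come from the notes above, but the nesting under a `session` object, the function names, and the example event names are assumptions and may not match the official schema or the C# client.

```python
import json
from fnmatch import fnmatchcase

def build_session_update(target_language: str) -> str:
    """Serialize a session.update message. The "session" wrapper and the
    nested "translation" object are assumed, not confirmed by the API docs."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "translation": {"target_language": target_language},
        },
    })

# Both shapes the dispatcher is described as accepting.
DELTA_PATTERNS = ("output_*.delta", "response.output_*.delta")

def is_delta_event(event_type: str) -> bool:
    return any(fnmatchcase(event_type, p) for p in DELTA_PATTERNS)

def dispatch(raw: str, on_delta, on_other):
    """Route one incoming JSON event. Unmatched events go to on_other so
    they can be logged while adjusting names against a live account."""
    event = json.loads(raw)
    kind = event.get("type", "")
    if is_delta_event(kind):
        on_delta(kind, event)
    else:
        on_other(kind, event)

seen, unknown = [], []
for raw in ('{"type": "output_audio.delta"}',                      # hypothetical name
            '{"type": "response.output_audio_transcript.delta"}',  # hypothetical name
            '{"type": "session.created"}'):
    dispatch(raw, lambda k, e: seen.append(k), lambda k, e: unknown.append(k))
```

Routing everything unmatched to a logging callback is the practical part: if transcripts or audio never arrive, the logged event types show which names the server is actually sending.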
Open YouTube in any non-target language, hit Start, and the translation should start streaming into the lyric overlay (and the settings window's transcript panes) within a second or two of the source audio playing.