Jerry is a Discord voice bot that listens in voice channels, transcribes speech, generates replies via a local LLM, and speaks back using Google's Gemini TTS. Built for AMD GPUs via DirectML.
| Component | Technology |
|---|---|
| STT | OpenAI Whisper (ONNX, runs on AMD GPU via DirectML) |
| LLM | Ollama — local model (default: mistral) |
| TTS | Google Gemini gemini-2.5-flash-preview-tts (REST API, voice: Charon) |
| Audio playback | FFmpeg via discord.py |
- Python 3.10+
- FFmpeg installed and on PATH (or set `FFMPEG_PATH` in `.env`)
- Ollama running locally with your chosen model pulled
- AMD GPU (RX 6000/7000 series recommended) with up-to-date drivers for DirectML
- A Google Gemini API key (the free tier works; check Google's current rate limits for the TTS model)
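
The first two requirements can be checked before launching the bot. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` and its messages are hypothetical, and it only covers the Python version and the FFmpeg lookup.

```python
import shutil
import sys


def check_prerequisites(ffmpeg_path: str = "ffmpeg", min_python=(3, 10)) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Compare the running interpreter against the minimum version from the list above.
    if sys.version_info < min_python:
        problems.append(
            "Python %d.%d+ is required, found %d.%d"
            % (min_python[0], min_python[1], sys.version_info[0], sys.version_info[1])
        )

    # shutil.which resolves a bare name via PATH and also accepts a full path.
    if shutil.which(ffmpeg_path) is None:
        problems.append("FFmpeg not found: %r is not on PATH or is not executable" % ffmpeg_path)

    return problems
```

Calling `check_prerequisites()` at startup and printing any returned problems gives a clearer failure than a playback error later on.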

1. Clone the repository:

       git clone https://github.com/YOUR_USERNAME/jerry-discord-bot.git
       cd jerry-discord-bot

2. Create and activate a virtual environment:

       python -m venv venv
       # Windows
       venv\Scripts\activate
       # Linux/macOS
       source venv/bin/activate

3. Install the dependencies:

       pip install -r requirements.txt

4. Export the Whisper model to ONNX. The bot expects a `whisper-small-onnx/` directory (or whichever size you configure). Export it using Optimum:

       optimum-cli export onnx --model openai/whisper-small --task automatic-speech-recognition ./whisper-small-onnx

   Change `small` to `tiny`, `base`, `medium`, etc. if you want a different model size. Update `WHISPER_MODEL_SIZE` in `.env` to match.

5. Pull the Ollama model:

       ollama pull mistral

6. Copy the example and fill in your values:

       cp .env.example .env

   The resulting `.env` should look like this:

       DISCORD_TOKEN=your_discord_bot_token_here
       GEMINI_API_KEY=your_gemini_api_key_here
       OLLAMA_MODEL=mistral
       WHISPER_MODEL_SIZE=small
       FFMPEG_PATH=ffmpeg

7. Run the bot:

       python bot.py

| Command | Description |
|---|---|
| `!join` | Jerry joins your current voice channel and starts listening |
| `!leave` | Jerry leaves the voice channel |
| `!clear` | Clears Jerry's conversation history for this server |

- Use `!join` in a text channel while you're in a voice channel
- Say "Jerry" (or a phonetic variant, see below) followed by your request
- Jerry transcribes your speech, generates a reply via Ollama, and speaks it back using Gemini TTS
Wake word variants supported (handles different accents):
jerry, gerry, gary, jeri, gerri, sherry, terry, jury, and many more phonetic variants.
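
One way to match these variants is a whole-word, case-insensitive regular expression over the transcript. The sketch below is an assumption about how such matching could work, not the bot's actual detector: `WAKE_WORDS` holds only the variants listed above, and `extract_request` is a hypothetical helper name.

```python
import re

# Only the variants named in this README; the real list is longer.
WAKE_WORDS = ("jerry", "gerry", "gary", "jeri", "gerri", "sherry", "terry", "jury")

# Match a wake word as a whole word, then swallow trailing punctuation and spaces.
_WAKE_RE = re.compile(
    r"\b(" + "|".join(WAKE_WORDS) + r")\b[\s,.!?:;-]*",
    re.IGNORECASE,
)


def extract_request(transcript: str):
    """Return the text after the first wake word, or None if no wake word is present."""
    match = _WAKE_RE.search(transcript)
    if match is None:
        return None
    return transcript[match.end():].strip()
```

For example, `extract_request("Jerry, mute John")` returns `"mute John"`, while a transcript without any variant returns `None` and can be ignored.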
Jerry has full administrator capabilities in your Discord server. Speak naturally — for example:
- "Jerry, how are you doing?" — casual chat
- "Jerry, send a message in general saying game night is at 9"
- "Jerry, move Ari to the AFK channel"
- "Jerry, mute John"
- "Jerry, kick that guy"
- "Jerry, create a voice channel called chill zone"
- "Jerry, create an event called Tetris Night on Friday at 10pm"
- "Jerry, give Dave the Moderator role"
| Action | What it does |
|---|---|
| `send_message` | Send a text message to a channel |
| `send_dm` | Send a DM to a user |
| `move_user` | Move a user to a voice channel |
| `disconnect_user` | Disconnect a user from voice |
| `kick_user` / `ban_user` | Kick or ban a user |
| `mute_user` / `unmute_user` | Server mute/unmute |
| `deafen_user` / `undeafen_user` | Server deafen/undeafen |
| `rename_channel` | Rename a text or voice channel |
| `create_text_channel` / `create_voice_channel` | Create channels |
| `delete_channel` | Delete a channel |
| `create_role` / `delete_role` | Manage roles |
| `assign_role` / `remove_role` | Assign/remove roles from users |
| `rename_server` | Rename the server |
| `rename_user` | Change a user's nickname |
| `create_event` / `delete_event` | Manage scheduled events |
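
The LLM returns JSON with `reply`, `action`, and `params` fields, and the action name selects one of the entries above. The following sketch shows one way to parse that response and route it to a handler. It is illustrative only: `parse_llm_response`, `dispatch`, and the `handlers` mapping are hypothetical names, the handlers are synchronous for simplicity (real Discord calls would be coroutines), and `ALLOWED_ACTIONS` lists only a few of the actions in the table.

```python
import json

# A subset of the action names from the table above, for illustration.
ALLOWED_ACTIONS = {"send_message", "move_user", "mute_user", "unmute_user", "kick_user"}


def parse_llm_response(raw: str) -> dict:
    """Normalize the model output into {reply, action, params}."""
    fallback = {"reply": raw.strip(), "action": None, "params": {}}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Treat non-JSON output as a plain spoken reply with no action.
        return fallback
    if not isinstance(data, dict):
        return fallback
    params = data.get("params")
    return {
        "reply": str(data.get("reply") or ""),
        "action": data.get("action") or None,
        "params": params if isinstance(params, dict) else {},
    }


def dispatch(response: dict, handlers: dict):
    """Call the handler registered for response['action'], if there is one."""
    action = response["action"]
    if action is None:
        return None
    if action not in ALLOWED_ACTIONS or action not in handlers:
        raise ValueError("Unknown or disallowed action: %r" % action)
    return handlers[action](**response["params"])
```

Rejecting anything outside an explicit allow-list matters here because the bot holds administrator-level permissions and the action name comes from model output.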
| Variable | Default | Description |
|---|---|---|
| `DISCORD_TOKEN` | (required) | Your Discord bot token |
| `GEMINI_API_KEY` | (required) | Google Gemini API key |
| `OLLAMA_MODEL` | `mistral` | Any model pulled in Ollama (e.g. `llama3`, `phi3`) |
| `WHISPER_MODEL_SIZE` | `small` | Whisper model size: `tiny`, `base`, `small`, `medium` |
| `FFMPEG_PATH` | `ffmpeg` | Full path to ffmpeg binary if not on PATH |
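
The table can be read as two required settings plus three with defaults. The sketch below applies those rules to a mapping of environment variables. `load_config` is a hypothetical helper, and it assumes the `.env` file has already been loaded into the environment (for example with python-dotenv's `load_dotenv()`); how the bot itself reads its settings is not shown in this README.

```python
import os

# Defaults and required names taken from the table above.
DEFAULTS = {
    "OLLAMA_MODEL": "mistral",
    "WHISPER_MODEL_SIZE": "small",
    "FFMPEG_PATH": "ffmpeg",
}
REQUIRED = ("DISCORD_TOKEN", "GEMINI_API_KEY")


def load_config(env=None) -> dict:
    """Build the settings dict, failing early if a required value is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        # An empty value falls back to the default, same as an unset one.
        config[name] = env.get(name) or default
    return config
```

Passing a plain dict instead of `os.environ` makes the function easy to test without touching the real environment.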
- Go to Discord Developer Portal
- New Application → give it a name → go to Bot tab
- Copy the token → paste into `.env` as `DISCORD_TOKEN`
- Under Bot: enable Server Members Intent and Message Content Intent
- Under OAuth2 → URL Generator: select scopes `bot` + `applications.commands`
- Bot permissions needed: `Send Messages`, `Connect`, `Speak`, `Mute Members`, `Deafen Members`, `Move Members`, `Kick Members`, `Ban Members`, `Manage Channels`, `Manage Roles`, `Manage Nicknames`, `Manage Events`
- Use the generated URL to invite the bot to your server
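
The URL Generator combines the selected permissions into a single integer bitfield. As a rough sketch of what it produces, the code below rebuilds that integer and the invite URL by hand. The bit positions are taken from Discord's published permission flags as best understood here, so treat them as an assumption and prefer the portal's generated URL; `permissions_value` and `invite_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Bit positions for the permissions listed above (assumed from Discord's permission flags).
PERMISSION_BITS = {
    "Kick Members": 1,
    "Ban Members": 2,
    "Manage Channels": 4,
    "Send Messages": 11,
    "Connect": 20,
    "Speak": 21,
    "Mute Members": 22,
    "Deafen Members": 23,
    "Move Members": 24,
    "Manage Nicknames": 27,
    "Manage Roles": 28,
    "Manage Events": 33,
}


def permissions_value(names=None) -> int:
    """OR together the bits for the given permission names (all of them by default)."""
    names = PERMISSION_BITS.keys() if names is None else names
    value = 0
    for name in names:
        value |= 1 << PERMISSION_BITS[name]
    return value


def invite_url(client_id: str) -> str:
    """Build an OAuth2 invite URL with the bot and applications.commands scopes."""
    query = urlencode({
        "client_id": client_id,
        "permissions": permissions_value(),
        "scope": "bot applications.commands",
    })
    return "https://discord.com/oauth2/authorize?" + query
```

The `client_id` is the application ID shown on the General Information page of the Developer Portal.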

    User speaks in VC
        │
        ▼
    AssistantSink (discord-ext-voice-recv)
      - buffers PCM audio per user
      - detects silence (0.8s timeout)
        │
        ▼
    Whisper ONNX (DirectML / AMD GPU)
      - transcribes audio to text
        │
        ▼
    Wake word detection ("Jerry" + variants)
        │
        ▼
    Ollama (local LLM, streaming)  ←── streams tokens as they arrive
      - generates JSON: {reply, action, params}
        │
        ├── sentences → Gemini TTS (gemini-2.5-flash-preview-tts)
        │                 - synthesizes each sentence as it arrives
        │                 - plays via FFmpeg in Discord VC
        │
        └── actions → execute_action()
                        - Discord API calls (mute, move, create channel, etc.)

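
The "sentences" branch depends on cutting the token stream into complete sentences as soon as they are available. The sketch below shows one simple way to do that. It assumes the reply text has already been extracted from the streamed JSON, and `iter_sentences` is a hypothetical name rather than a function from the bot.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(tokens):
    """Yield complete sentences from an iterable of streamed text fragments."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is known to be a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Because this is a generator, the first sentence can be handed to TTS while later tokens are still arriving. Abbreviations such as "e.g." would be split early by this simple rule.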
Concurrency model:
- `_whisper_executor`: single-threaded, DirectML GPU (not thread-safe)
- `_ollama_executor`: single-threaded, Ollama streaming
- `_tts_executor`: 2 threads, Gemini TTS REST calls (allows prefetch of the next sentence)
- Ollama streaming and TTS playback run concurrently: Jerry starts speaking the first sentence while still generating the rest
Bot joins but never responds:
- Check that Ollama is running: `ollama serve`
- Check your `GEMINI_API_KEY` is set correctly
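
The Ollama check can also be done from Python. The sketch below assumes Ollama's default address `http://localhost:11434` and uses its `/api/tags` endpoint, which lists the locally pulled models; `ollama_is_running` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url, False otherwise."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a malformed URL.
        return False
```

If this returns `False`, start the server with `ollama serve` and confirm the model has been pulled.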
TTS audio sounds wrong / garbled:
- The Gemini TTS model returns 24kHz mono 16-bit PCM. If audio sounds off, check that FFmpeg is up to date.
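
To rule out a format mismatch, the raw PCM can be wrapped in a WAV header and played in any audio player. This sketch uses only the standard library and the format stated above (24 kHz, mono, 16-bit); `pcm_to_wav` is a hypothetical helper, not part of the bot.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)       # mono
        wav_file.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)    # 24 kHz
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

If the resulting file plays at the wrong speed or pitch, the sample rate or channel count passed to FFmpeg probably does not match the data.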
Wake word not detected:
- Try speaking "Jerry" clearly — the bot supports many phonetic variants
- Check the console for `[username] Said: ...` to see what Whisper transcribed
Whisper model not found:
- Make sure `whisper-small-onnx/` (or your configured size) exists in the project directory
- Re-run the `optimum-cli export` command from step 4
AMD GPU not used for Whisper:
- Ensure `onnxruntime-directml` is installed (not just `onnxruntime`)
- Update your AMD drivers
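
ONNX Runtime reports which execution providers the installed package supports, which shows whether DirectML is available at all. The sketch below wraps `onnxruntime.get_available_providers()`; the helper names `available_providers` and `pick_provider` are hypothetical, and falling back to the CPU provider is an assumption about desired behaviour, not something the bot is documented to do.

```python
def available_providers() -> list:
    """Return the execution providers reported by ONNX Runtime, or [] if it is not installed."""
    try:
        import onnxruntime
    except ImportError:
        return []
    return list(onnxruntime.get_available_providers())


def pick_provider(available) -> str:
    """Prefer DirectML when present, otherwise fall back to the CPU provider."""
    for name in ("DmlExecutionProvider", "CPUExecutionProvider"):
        if name in available:
            return name
    raise RuntimeError("No usable ONNX Runtime execution provider found")
```

If `DmlExecutionProvider` is missing from `available_providers()`, the plain `onnxruntime` package is most likely installed instead of `onnxruntime-directml`.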
DISCORD_TOKEN=
GEMINI_API_KEY=
OLLAMA_MODEL=mistral
WHISPER_MODEL_SIZE=small
FFMPEG_PATH=ffmpeg