Jerry — Discord Voice Assistant Bot

Jerry is a Discord voice bot that listens in voice channels, transcribes speech, generates replies via a local LLM, and speaks back using Google's Gemini TTS. Built for AMD GPUs via DirectML.

Stack

Component	Technology
STT	OpenAI Whisper (ONNX, runs on AMD GPU via DirectML)
LLM	Ollama — local model (default: `mistral`)
TTS	Google Gemini `gemini-2.5-flash-preview-tts` (REST API, voice: Charon)
Audio playback	FFmpeg via `discord.py`

Prerequisites

Python 3.10+
FFmpeg installed and on PATH (or set FFMPEG_PATH in .env)
Ollama running locally with your chosen model pulled
AMD GPU (RX 6000/7000 series recommended) with up-to-date drivers for DirectML
A Google Gemini API key (the free tier works, subject to whatever rate limits Google currently applies to the TTS model)

Setup

1. Clone the repo

git clone https://github.com/YOUR_USERNAME/jerry-discord-bot.git
cd jerry-discord-bot

2. Create and activate a virtual environment

python -m venv venv

# Windows
venv\Scripts\activate

# Linux/macOS
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Download the Whisper ONNX model

The bot expects a whisper-small-onnx/ directory (or whichever size you configure). Export it using Optimum:

optimum-cli export onnx --model openai/whisper-small --task automatic-speech-recognition ./whisper-small-onnx

Change small to tiny, base, medium, etc. if you want a different model size. Update WHISPER_MODEL_SIZE in .env to match.

5. Pull your Ollama model

ollama pull mistral

6. Create `.env`

Copy the example and fill in your values:

cp .env.example .env

DISCORD_TOKEN=your_discord_bot_token_here
GEMINI_API_KEY=your_gemini_api_key_here
OLLAMA_MODEL=mistral
WHISPER_MODEL_SIZE=small
FFMPEG_PATH=ffmpeg

7. Run the bot

python bot.py

Bot Commands

Command	Description
`!join`	Jerry joins your current voice channel and starts listening
`!leave`	Jerry leaves the voice channel
`!clear`	Clears Jerry's conversation history for this server

Usage

1. Run !join in a text channel while you are in a voice channel
2. Say "Jerry" (or a phonetic variant, see below) followed by your request
3. Jerry transcribes your speech, generates a reply via Ollama, and speaks it back using Gemini TTS

Supported wake word variants (to cope with accents and transcription errors) include jerry, gerry, gary, jeri, gerri, sherry, terry, and jury, among other phonetic variants.
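
As a rough illustration of how wake word matching on a transcript could work, here is a minimal sketch. It is not taken from bot.py: the function name `extract_request` and the regex approach are assumptions, and the variant list contains only the eight variants named above.

```python
import re

# Only the variants listed in this README; the real bot reportedly supports more.
WAKE_WORDS = ("jerry", "gerry", "gary", "jeri", "gerri", "sherry", "terry", "jury")

# Match a whole-word variant, then swallow any punctuation/space that follows it.
_WAKE_RE = re.compile(
    r"\b(" + "|".join(WAKE_WORDS) + r")\b[\s,.!?:;-]*",
    re.IGNORECASE,
)

def extract_request(transcript):
    """Return the text after the first wake word, or None if none was heard."""
    match = _WAKE_RE.search(transcript)
    if match is None:
        return None
    return transcript[match.end():].strip()
```

Matching on word boundaries keeps words such as "jerrycan" from triggering the bot. A fuzzy matcher would tolerate more transcription errors at the cost of more false positives.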

What Jerry Can Do

Jerry can carry out a wide range of moderation and server-management actions by voice, limited to the permissions you grant the bot (see Discord Bot Setup). Speak naturally, for example:

"Jerry, how are you doing?" — casual chat
"Jerry, send a message in general saying game night is at 9"
"Jerry, move Ari to the AFK channel"
"Jerry, mute John"
"Jerry, kick that guy"
"Jerry, create a voice channel called chill zone"
"Jerry, create an event called Tetris Night on Friday at 10pm"
"Jerry, give Dave the Moderator role"

Available Actions

Action	What it does
`send_message`	Send a text message to a channel
`send_dm`	Send a DM to a user
`move_user`	Move a user to a voice channel
`disconnect_user`	Disconnect a user from voice
`kick_user` / `ban_user`	Kick or ban a user
`mute_user` / `unmute_user`	Server mute/unmute
`deafen_user` / `undeafen_user`	Server deafen/undeafen
`rename_channel`	Rename a text or voice channel
`create_text_channel` / `create_voice_channel`	Create channels
`delete_channel`	Delete a channel
`create_role` / `delete_role`	Manage roles
`assign_role` / `remove_role`	Assign/remove roles from users
`rename_server`	Rename the server
`rename_user`	Change a user's nickname
`create_event` / `delete_event`	Manage scheduled events
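
The Architecture section describes the LLM as producing JSON of the form {reply, action, params}. The sketch below shows one way such output might be validated against the action names in the table before anything is executed. The function name `parse_llm_response` and the fallback behaviour are assumptions for illustration, not the bot's actual code.

```python
import json

# Action names copied from the table above.
ALLOWED_ACTIONS = {
    "send_message", "send_dm", "move_user", "disconnect_user",
    "kick_user", "ban_user", "mute_user", "unmute_user",
    "deafen_user", "undeafen_user", "rename_channel",
    "create_text_channel", "create_voice_channel", "delete_channel",
    "create_role", "delete_role", "assign_role", "remove_role",
    "rename_server", "rename_user", "create_event", "delete_event",
}

def parse_llm_response(raw):
    """Return (reply, action, params); action is None if absent or unknown."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: treat non-JSON output as a plain spoken reply.
        return raw.strip(), None, {}
    if not isinstance(data, dict):
        return raw.strip(), None, {}

    reply = str(data.get("reply", ""))
    action = data.get("action")
    params = data.get("params")

    # Never act on an action name the bot does not know about.
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        action = None
    if not isinstance(params, dict):
        params = {}
    return reply, action, params
```

Rejecting unknown action names matters here because the model's output drives destructive operations such as kick_user and delete_channel.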

Configuration Reference (`.env`)

Variable	Default	Description
`DISCORD_TOKEN`	(required)	Your Discord bot token
`GEMINI_API_KEY`	(required)	Google Gemini API key
`OLLAMA_MODEL`	`mistral`	Any model pulled in Ollama (e.g. `llama3`, `phi3`)
`WHISPER_MODEL_SIZE`	`small`	Whisper model size: `tiny`, `base`, `small`, `medium`
`FFMPEG_PATH`	`ffmpeg`	Full path to ffmpeg binary if not on PATH
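
The table above could be loaded roughly as follows. This is a sketch under assumptions: the `Config` class and `load_config` function are hypothetical names, and it reads from the process environment only. A `.env` file would first have to be loaded into the environment (commonly done with python-dotenv, though whether bot.py does so is not confirmed here).

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    discord_token: str
    gemini_api_key: str
    ollama_model: str = "mistral"
    whisper_model_size: str = "small"
    ffmpeg_path: str = "ffmpeg"

def load_config(env=None):
    """Build a Config from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # The two required settings have no sensible default, so fail early.
    missing = [k for k in ("DISCORD_TOKEN", "GEMINI_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    return Config(
        discord_token=env["DISCORD_TOKEN"],
        gemini_api_key=env["GEMINI_API_KEY"],
        # Empty values fall back to the documented defaults.
        ollama_model=env.get("OLLAMA_MODEL") or "mistral",
        whisper_model_size=env.get("WHISPER_MODEL_SIZE") or "small",
        ffmpeg_path=env.get("FFMPEG_PATH") or "ffmpeg",
    )
```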

Discord Bot Setup

1. Go to the Discord Developer Portal
2. New Application → give it a name → go to the Bot tab
3. Copy the token → paste it into .env as DISCORD_TOKEN
4. Under Bot: enable Server Members Intent and Message Content Intent
5. Under OAuth2 → URL Generator: select the scopes bot and applications.commands
6. Select the bot permissions needed: Send Messages, Connect, Speak, Mute Members, Deafen Members, Move Members, Kick Members, Ban Members, Manage Channels, Manage Roles, Manage Nicknames, Manage Events
7. Use the generated URL to invite the bot to your server
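
For reference, the permission list above can also be turned into an invite URL by hand. The sketch below is illustrative only: the bit positions are the ones Discord documents for its permission flags as far as I know, so verify them against the current Discord documentation before relying on them. The function names are hypothetical and `client_id` is your application's ID.

```python
from urllib.parse import urlencode

# Bit positions per Discord's permission flag table (verify before use).
PERMISSION_BITS = {
    "kick_members": 1 << 1,
    "ban_members": 1 << 2,
    "manage_channels": 1 << 4,
    "send_messages": 1 << 11,
    "connect": 1 << 20,
    "speak": 1 << 21,
    "mute_members": 1 << 22,
    "deafen_members": 1 << 23,
    "move_members": 1 << 24,
    "manage_nicknames": 1 << 27,
    "manage_roles": 1 << 28,
    "manage_events": 1 << 33,
}

def required_permissions():
    """OR all required permission bits into one integer."""
    value = 0
    for bit in PERMISSION_BITS.values():
        value |= bit
    return value

def invite_url(client_id):
    """Build an OAuth2 invite URL with the bot and applications.commands scopes."""
    query = urlencode({
        "client_id": client_id,
        "permissions": required_permissions(),
        "scope": "bot applications.commands",
    })
    return "https://discord.com/oauth2/authorize?" + query
```

The URL Generator in the Developer Portal produces the same kind of link, so this is mainly useful for checking what a generated URL actually grants.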

Architecture

User speaks in VC
      │
      ▼
AssistantSink (discord-ext-voice-recv)
  — buffers PCM audio per user
  — detects silence (0.8s timeout)
      │
      ▼
Whisper ONNX (DirectML / AMD GPU)
  — transcribes audio to text
      │
      ▼
Wake word detection ("Jerry" + variants)
      │
      ▼
Ollama (local LLM, streaming)           ←── streams tokens as they arrive
  — generates JSON: {reply, action, params}
      │
      ├── sentences → Gemini TTS (gemini-2.5-flash-preview-tts)
      │                  — synthesizes each sentence as it arrives
      │                  — plays via FFmpeg in Discord VC
      │
      └── actions → execute_action()
                      — Discord API calls (mute, move, create channel, etc.)
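
The "synthesizes each sentence as it arrives" step implies cutting the token stream into sentences on the fly. A minimal sketch of that idea follows. It assumes the reply text has already been separated from the surrounding JSON, and the name `iter_sentences` and the punctuation-based splitting rule are illustrative choices, not the bot's confirmed logic.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def iter_sentences(tokens):
    """Yield complete sentences as soon as the streamed tokens contain them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

A rule this simple will split on abbreviations such as "e.g. ", which is usually acceptable for speech output but worth knowing about.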

Concurrency model:

_whisper_executor — single-threaded, DirectML GPU (not thread-safe)
_ollama_executor — single-threaded, Ollama streaming
_tts_executor — 2 threads, Gemini TTS REST calls (allows prefetch of next sentence)
Ollama streaming and TTS playback run concurrently — Jerry starts speaking the first sentence while still generating the rest
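
The executor layout above might look roughly like this. The three executor names come from the list; `synthesize`, `speak_all`, and the `play` callback are hypothetical stand-ins for the blocking Gemini TTS call and Discord playback, so treat this as a sketch of the prefetch pattern rather than the bot's code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker each where the underlying resource must not be used concurrently.
_whisper_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="whisper")
_ollama_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="ollama")
# Two workers so the next sentence can be synthesized during playback.
_tts_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="tts")

def synthesize(sentence):
    """Placeholder for a blocking TTS request; returns fake audio bytes."""
    return sentence.encode("utf-8")

async def speak_all(sentences, play):
    """Play sentences in order, prefetching audio for the next one."""
    loop = asyncio.get_running_loop()
    sentences = list(sentences)
    if not sentences:
        return
    pending = loop.run_in_executor(_tts_executor, synthesize, sentences[0])
    for upcoming in sentences[1:] + [None]:
        audio = await pending
        if upcoming is not None:
            # Start synthesizing the next sentence before playing this one.
            pending = loop.run_in_executor(_tts_executor, synthesize, upcoming)
        await play(audio)
```

Running blocking calls through `run_in_executor` keeps the Discord event loop responsive while Whisper, Ollama, or TTS requests are in flight.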

Troubleshooting

Bot joins but never responds:

Check that Ollama is running (start it with ollama serve if it is not)
Check your GEMINI_API_KEY is set correctly
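
To check the Ollama side from Python, something like the following can help. It assumes Ollama is listening on its default address, http://localhost:11434, and uses the /api/tags endpoint, which lists locally pulled models. The function name `ollama_models` is made up for this example.

```python
import json
import urllib.request

def ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the model names Ollama reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return None
    if not isinstance(data, dict):
        return None
    return [m.get("name", "") for m in data.get("models", [])]

if __name__ == "__main__":
    models = ollama_models()
    if models is None:
        print("Ollama is not reachable; try running: ollama serve")
    else:
        print("Ollama is up. Models:", ", ".join(models) or "(none pulled)")
```

If the model named in OLLAMA_MODEL is not in the printed list, pull it with ollama pull.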

TTS audio sounds wrong / garbled:

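The Gemini TTS model returns 24 kHz mono 16-bit PCM, while Discord voice playback runs at 48 kHz stereo, so FFmpeg has to resample. If the audio plays at the wrong speed or comes out as static, first check that FFmpeg is being given the correct input format (24 kHz, 1 channel, signed 16-bit little-endian), then check that FFmpeg is up to date.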
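
Raw PCM carries no header, so FFmpeg cannot guess its sample rate or channel count. One way to rule that out as a cause is to wrap the PCM in a WAV container before playback. The helper below is a sketch using Python's standard `wave` module; the name `pcm_to_wav` is hypothetical and the defaults mirror the format stated above.

```python
import io
import wave

def pcm_to_wav(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw little-endian PCM bytes in a WAV header and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

The alternative is to keep the data raw and pass FFmpeg the input options -f s16le -ar 24000 -ac 1 before the input. Which of the two approaches bot.py takes is not stated here.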

Wake word not detected:

Try speaking "Jerry" clearly — the bot supports many phonetic variants
Check the console for [username] Said: ... to see what Whisper transcribed

Whisper model not found:

Make sure whisper-small-onnx/ (or your configured size) exists in the project directory
Re-run the optimum-cli export command from step 4
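
A quick way to confirm the export landed where the bot looks for it is sketched below. It assumes the directory follows the whisper-<size>-onnx pattern implied by step 4, relative to the project root. The function names are illustrative.

```python
from pathlib import Path

def whisper_model_dir(size="small", root="."):
    """Directory the export from step 4 is expected to create."""
    return Path(root) / f"whisper-{size}-onnx"

def find_onnx_files(size="small", root="."):
    """Return the names of .onnx files in the model directory (empty if missing)."""
    model_dir = whisper_model_dir(size, root)
    if not model_dir.is_dir():
        return []
    return sorted(p.name for p in model_dir.glob("*.onnx"))

if __name__ == "__main__":
    files = find_onnx_files("small")
    print(files if files else "No ONNX files found; re-run the export from step 4")
```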

AMD GPU not used for Whisper:

Ensure onnxruntime-directml is installed (not just onnxruntime)
Update your AMD drivers
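
To see which execution providers your ONNX Runtime install exposes, a short check like this can be run inside the virtual environment. `DmlExecutionProvider` is the provider name ONNX Runtime uses for DirectML; the helper `pick_providers` is a hypothetical name, and the preference order is an assumption.

```python
def pick_providers(available):
    """Prefer DirectML, fall back to CPU, keeping only what is installed."""
    preferred = ("DmlExecutionProvider", "CPUExecutionProvider")
    return [p for p in preferred if p in available]

try:
    import onnxruntime as ort
except ImportError:
    ort = None  # onnxruntime (or onnxruntime-directml) is not installed

if ort is not None:
    available = ort.get_available_providers()
    print("Available providers:", available)
    print("Would use:", pick_providers(available))
```

If `DmlExecutionProvider` is missing from the output, the plain onnxruntime package is probably installed instead of onnxruntime-directml.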

`.env.example`

DISCORD_TOKEN=
GEMINI_API_KEY=
OLLAMA_MODEL=mistral
WHISPER_MODEL_SIZE=small
FFMPEG_PATH=ffmpeg
