A voice agent for devices, powered by Google Gemini Live API with a local wake-word detector and long-term memory.
Capabilities:
- Realtime Speech-to-Speech dialogue via Gemini Live API
- Fresh information from the internet via Google Search (grounding)
- Budget-friendly: offline mode with local wake-word (Vosk) and auto-shutdown timer
- Managed long-term memory based on JSON files: people, places, facts, goals, experience, episodes, reflections and persona
- Tool calling in live mode
Technical highlights:
- Transactional memory isolation: backup + rollback on errors, single-use guard
- LLM Surgeon: automatic memory conflict resolution via LLM
- Auto-save all sessions to daily markdown
- Automatic recovery of missed sessions
- Graceful shutdown with configurable timeouts
- Latency diagnostics for all LLM calls in logs
In development:
- Photo and video stream processing
- Smart long-term memory cleanup
- Integration with ROS2 modules for robot control
- Other integrations and improvements
- Local wake-word detection via Vosk (no LLM costs)
- Instant transition to live mode on the wake word
- Speech-to-Speech dialogue with minimal latency
- Automatic extraction: facts, goals, experience, episodes, reflections and persona after every session
- LLM Surgeon: memory conflict resolution (UPDATE / MERGE / APPEND / IGNORE) via a separate LLM call
- Rebuild of `active_context.json` for upcoming dialogues
- Automatic recovery of unprocessed sessions
- Name, gender and communication style are set in `agent_instructions.md`
- `SOUL.md`: philosophical persona manifesto (attitude toward the world, inclinations, shadow, meta-reflection), written by the agent itself after hours of test conversations
- Automatic extraction of reflections and persona traits from dialogues into `reflections.json`
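The four LLM Surgeon outcomes listed above (UPDATE / MERGE / APPEND / IGNORE) can be pictured as a small dispatch over an existing record and an incoming one. This is only a sketch: `apply_resolution`, the record shape (an `id` plus arbitrary fields) and the exact meaning given to each action are assumptions made for illustration, not the project's actual logic.

```python
from typing import Any

Record = dict[str, Any]


def apply_resolution(records: list[Record], incoming: Record, action: str) -> list[Record]:
    """Return a new record list after applying one resolution action.

    Assumed semantics (hypothetical, for illustration only):
      UPDATE - replace the stored record that has the same id
      MERGE  - keep stored fields, add fields the stored record lacks
      APPEND - add the incoming record as a new entry
      IGNORE - leave memory unchanged
    """
    if action == "IGNORE":
        return list(records)
    if action == "APPEND":
        return [*records, incoming]
    if action not in ("UPDATE", "MERGE"):
        raise ValueError(f"unknown action: {action}")

    result: list[Record] = []
    matched = False
    for rec in records:
        if rec.get("id") != incoming.get("id"):
            result.append(rec)
            continue
        matched = True
        if action == "UPDATE":
            result.append(dict(incoming))
        else:  # MERGE: stored values win, incoming fills the gaps
            result.append({**incoming, **rec})
    if not matched:
        raise KeyError(f"no record with id {incoming.get('id')!r}")
    return result
```

Returning a new list instead of mutating in place keeps the old state available if a later step fails.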
The agent is offline; the microphone is monitored locally via Vosk, so there are no LLM costs.
After the wake word, audio switches to the Gemini Live API and the dialogue runs in realtime.
After a session ends:
- Save transcript to `memory_engine/daily/YYYY-MM-DD.md`
- Call `process_missing_sessions(day_date)` to process all unprocessed sessions of the day
- Call `build_memory_context()` to rebuild the active context
Every session is saved to `memory_engine/daily/YYYY-MM-DD.md` with metadata:
- `session_id`
- `started_at` and `ended_at` (ISO 8601 with timezone offset)
- Dialogue transcript

This file is the source of truth for post-session processing.
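A minimal sketch of how one session could be appended to the daily file. The Markdown layout below (a heading followed by a small metadata list) and the helper name `append_session` are assumptions; only the file naming and the three metadata fields come from the description above.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def append_session(daily_dir: Path, session_id: str, started_at: datetime,
                   ended_at: datetime, transcript: str) -> Path:
    """Append one session block to daily/YYYY-MM-DD.md and return the file path."""
    # Naive datetimes would lose the timezone offset the metadata requires.
    if started_at.tzinfo is None or ended_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    daily_dir.mkdir(parents=True, exist_ok=True)
    path = daily_dir / f"{started_at.date().isoformat()}.md"
    block = (
        f"## Session {session_id}\n"          # assumed layout, not the real one
        f"- session_id: {session_id}\n"
        f"- started_at: {started_at.isoformat()}\n"
        f"- ended_at: {ended_at.isoformat()}\n\n"
        f"{transcript.strip()}\n\n"
    )
    with path.open("a", encoding="utf-8") as fh:
        fh.write(block)
    return path
```

`datetime.isoformat()` on a timezone-aware value yields ISO 8601 with the offset included, for example `2025-01-02T10:00:00+02:00`.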
File memory_engine/active_context.json:
{
  "last_context": "...",
  "summary_yesterday": "...",
  "summary_today": "...",
  "reply_count_today": 0,
  "summary_reply_count": 0,
  "long_term_injections": []
}

Used as the working context for upcoming live sessions.
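As an illustration of consuming this file, the sketch below loads it and fills in any missing keys from defaults. The function name `load_active_context` and the default values are assumptions; the key names are taken from the structure shown above.

```python
import copy
import json
from pathlib import Path
from typing import Any

# Defaults mirror the keys of active_context.json; the empty values are assumed.
DEFAULT_CONTEXT: dict[str, Any] = {
    "last_context": "",
    "summary_yesterday": "",
    "summary_today": "",
    "reply_count_today": 0,
    "summary_reply_count": 0,
    "long_term_injections": [],
}


def load_active_context(path: Path) -> dict[str, Any]:
    """Read the context file; a missing file or missing keys fall back to defaults."""
    context = copy.deepcopy(DEFAULT_CONTEXT)
    if not path.exists():
        return context
    # Corrupted JSON raises json.JSONDecodeError here instead of being hidden.
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path} must contain a JSON object")
    context.update(data)
    return context
```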
Directory `memory_engine/memory/`:
- `people.json`: information about people
- `places.json`: places and locations
- `facts.json`: facts and knowledge
- `goals.json`: goals and tasks
- `experience.json`: experience and skills
- `reflections.json`: reflections and insights
- `episodes.log.jsonl`: episode chronology
- `processed_sessions.json`: registry of processed sessions (prevents reprocessing the same session)
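The registry idea behind `processed_sessions.json` can be sketched as a persisted set of session IDs that is consulted before processing. The on-disk shape (a plain JSON list of IDs) and the helper names are assumptions made for this example.

```python
import json
from pathlib import Path


def load_processed(registry_path: Path) -> set[str]:
    """Return the set of already processed session IDs (empty if no registry yet)."""
    if not registry_path.exists():
        return set()
    return set(json.loads(registry_path.read_text(encoding="utf-8")))


def pending_sessions(session_ids: list[str], registry_path: Path) -> list[str]:
    """Keep only sessions that are not in the registry, preserving input order."""
    done = load_processed(registry_path)
    return [sid for sid in session_ids if sid not in done]


def mark_processed(session_id: str, registry_path: Path) -> None:
    """Record a session as processed; marking it twice is harmless."""
    done = load_processed(registry_path)
    done.add(session_id)
    registry_path.write_text(json.dumps(sorted(done), indent=2), encoding="utf-8")
```

Because the registry is checked on every run, a recovery pass over a whole day can be repeated without extracting the same session twice.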
- Google AI Studio API Key in `.env`
- Key parameters in `config.py`
- Memory settings in `memory_config.py`
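A fail-fast check for the API key might look like the sketch below. It assumes the key is exposed as the `GOOGLE_API_KEY` environment variable (exported directly or loaded from `.env` by whatever mechanism the project uses); `require_api_key` is a hypothetical helper, not part of `config.py`.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key or stop early with a clear message."""
    source = os.environ if env is None else env
    key = source.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env or export it")
    return key
```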
MadMax/
├── main.py                              # Entry point
├── config.py                            # Configuration
├── agent_instructions.md                # Agent system prompt (Max)
├── SOUL.md                              # Agent persona and philosophy
├── core/
│   ├── orchestrator.py                  # Agent lifecycle (sleep / live / post-session)
│   ├── audio_io.py                      # Audio I/O and Vosk wake-word
│   ├── gemini_client.py                 # Gemini Live API client
│   ├── agent_tools.py                   # Memory tools for live mode
│   ├── session_transcript_logger.py     # Session transcript persistence
│   ├── errors.py                        # Exceptions
│   └── state.py                         # Session state
├── memory_engine/
│   ├── active_context_builder.py        # active_context.json builder
│   ├── long_memory_extractor_agent.py   # Memory extraction from transcripts
│   ├── long_memory_apply.py             # Memory operations + LLM Surgeon
│   ├── long_memory_normalize.py         # Normalization and fuzzy matching
│   ├── long_memory_ops.py               # Operation schemas and validation
│   ├── long_memory_query_service.py     # Long-term memory search
│   ├── summarize_context_agent.py       # Day summarization
│   ├── llm_client_utils.py              # LLM timeout and diagnostics
│   ├── memory_config.py                 # Memory paths and constants
│   ├── entity_policies.py               # Entity link policies
│   ├── time_policy.py                   # Timestamp policy
│   ├── daily/                           # Daily markdown logs
│   └── memory/                          # Long-memory JSON files + backups
└── live_api_docs/                       # Gemini Live API documentation
# 1. Clone and enter directory
cd MadMax
# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download wake-word model (~40 MB)
./setup.sh
# 5. Configure environment
# Create .env file (or export variables):
# GOOGLE_API_KEY=your_key_here
# 6. Run the agent
python main.py

Requirements:
- Linux
- Python 3.11+
- Microphone and speakers
- Google AI Studio API Key
- Function Calling for memory: the live agent calls `memory_lookup_person`, `memory_lookup_goal`, `memory_lookup_experience`, `memory_recent_episodes` during dialogue
- Google Search (grounding): the agent receives fresh information from the internet in realtime
- Transactional memory isolation: backup before write, rollback on errors, single-use guard for `apply_payload`
- LLM Surgeon: automatic memory conflict resolution via a separate LLM call with batching
- Fail-fast error handling: explicit logs on corrupted JSON, graceful `CancelledError` handling, latency diagnostics
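The backup, rollback and single-use guard described in the list above can be sketched with a small helper class. `MemoryTransaction` and the `.bak` naming are hypothetical and only illustrate the pattern; the project's real `apply_payload` is not shown here and may work differently.

```python
import json
import shutil
from pathlib import Path
from typing import Any, Callable


class MemoryTransaction:
    """Back up a JSON file, apply one change, restore the backup on failure."""

    def __init__(self, path: Path) -> None:
        self.path = path  # assumed to exist already
        self.backup = path.with_name(path.name + ".bak")  # assumed naming
        self._used = False

    def apply(self, mutate: Callable[[Any], Any]) -> None:
        # Single-use guard: one transaction object performs one write.
        if self._used:
            raise RuntimeError("transaction already used")
        self._used = True
        shutil.copy2(self.path, self.backup)  # backup before write
        try:
            data = json.loads(self.path.read_text(encoding="utf-8"))
            new_data = mutate(data)
            self.path.write_text(
                json.dumps(new_data, ensure_ascii=False, indent=2),
                encoding="utf-8",
            )
        except Exception:
            shutil.copy2(self.backup, self.path)  # rollback to the backup
            raise
```

The guard makes an accidental second `apply` on the same object an explicit error rather than a silent double write.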
Goal: Automatic removal of stale or irrelevant data from memory.
Planned logic:
- Fact prioritization: relevance score based on access frequency and freshness
- Old episode archival: move rarely used episodes to cold storage
- Automatic duplicate merging: find and merge similar facts/goals
- Temporary goal expiration: auto-complete or archive goals with expired deadlines
- Configurable retention rules: set data lifetime for different categories
Result: Memory stays relevant, does not grow uncontrollably, and is not cluttered with duplicates and outdated information.
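Since the cleanup logic is still planned, the following is only one possible shape for a relevance score that combines access frequency with freshness. The formula (logarithmic frequency multiplied by exponential decay) and the 30-day half-life are assumptions, not a design the project has committed to.

```python
import math
from datetime import datetime


def relevance_score(access_count: int, last_access: datetime, now: datetime,
                    half_life_days: float = 30.0) -> float:
    """Score a memory item: frequent and recently used items rank higher."""
    age_days = max((now - last_access).total_seconds() / 86400.0, 0.0)
    # Freshness halves every `half_life_days` since the last access.
    freshness = 0.5 ** (age_days / half_life_days)
    # log1p dampens very large access counts so they do not dominate.
    frequency = math.log1p(max(access_count, 0))
    return frequency * freshness
```

Items whose score falls below a configurable threshold would then be candidates for archival rather than immediate deletion.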
- Pydantic for structured payloads instead of `dict[str, Any]`
Consciously accepted trade-offs that are known and documented:
| Issue | Impact | Why we kept it |
|---|---|---|
| `Any` instead of Pydantic for operation payloads | No type safety, IDE does not suggest fields | It works, changing it requires rewriting 5+ modules |
| Tight coupling: `GeminiLiveClient` imports `AudioIO` | Hard to test, risk of circular dependency | No DI container, Protocols require refactoring |
| No CI/CD | No automatic type checking and tests | Project is developed locally, pytest is run manually |
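For the coupling row, this is roughly what a Protocol-based seam could look like if the refactoring were done. All names here (`AudioSource`, `LiveClient`, `FakeAudio`, `read_chunk`) are hypothetical stand-ins and do not reflect the real `GeminiLiveClient` or `AudioIO` interfaces.

```python
from typing import Protocol


class AudioSource(Protocol):
    """Hypothetical interface: anything that can hand over one audio chunk."""

    def read_chunk(self) -> bytes: ...


class LiveClient:
    """Stand-in client that depends on the interface, not on a concrete class."""

    def __init__(self, audio: AudioSource) -> None:
        self._audio = audio

    def next_payload(self) -> bytes:
        return self._audio.read_chunk()


class FakeAudio:
    """Test double; it satisfies AudioSource structurally, without inheritance."""

    def read_chunk(self) -> bytes:
        return b"\x00\x01"
```

With such a seam the client can be tested using `LiveClient(FakeAudio())` and no microphone, and the client module no longer needs to import the audio module.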
Important: A significant part of this project was written using Agentic Engineering in pair-programming mode.
The project consists of three stable loops:
- Live conversation loop: realtime Speech-to-Speech dialogue with the user (Google Search, tool calling, Vosk wake-word)
- Post-session memory loop: automatic extraction, deduplication and knowledge persistence (people, places, facts, goals, experience, episodes, reflections, persona)
- Reliability loop: transactional isolation (backup + rollback), graceful shutdown, LLM timeouts, recovery of missed sessions
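The LLM timeout and latency logging in the reliability loop can be sketched with standard `asyncio` tools. `call_with_timeout` and the logger name are assumptions for this example; the project's own helpers in `llm_client_utils.py` are not reproduced here.

```python
import asyncio
import logging
import time
from typing import Awaitable, TypeVar

T = TypeVar("T")
log = logging.getLogger("llm")  # assumed logger name


async def call_with_timeout(label: str, call: Awaitable[T], timeout_s: float) -> T:
    """Await an LLM call with a deadline and log how long it took."""
    started = time.perf_counter()
    try:
        return await asyncio.wait_for(call, timeout=timeout_s)
    except asyncio.TimeoutError:
        log.error("%s timed out after %.1fs", label, timeout_s)
        raise
    finally:
        # Runs on success, timeout and cancellation alike.
        log.info("%s took %.3fs", label, time.perf_counter() - started)
```

Re-raising the timeout instead of swallowing it leaves the decision (retry, skip, or abort the post-session step) to the caller.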
The voice agent is ready for daily use as-is. The main constraints are architectural debt (see section above), not functional issues.