pipedude/MadMax

😎 MadMax Live Agent

A voice agent for devices, powered by Google Gemini Live API with a local wake-word detector and long-term memory.

Capabilities:

  • 🗣️ Realtime Speech-to-Speech dialogue via Gemini Live API
  • 🔍 Fresh information from the internet via Google Search (grounding)
  • 💰 Budget-friendly: offline mode with local wake-word (Vosk) and auto-shutdown timer
  • 🧠 Managed long-term memory based on JSON files: people, places, facts, goals, experience, episodes, reflections and persona
  • 🛠️ Tool calling in live mode

Technical highlights:

  • 🔒 Transactional memory isolation: backup + rollback on errors, single-use guard
  • 🧪 LLM Surgeon: automatic memory conflict resolution via LLM
  • 📝 Auto-save all sessions to daily markdown
  • 🔄 Automatic recovery of missed sessions
  • ⏱️ Graceful shutdown with configurable timeouts
  • 📊 Latency diagnostics for all LLM calls in logs
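
The timeout and latency-diagnostics items above can be pictured as one small wrapper around each LLM call. The following is a minimal sketch under assumptions, not the code in llm_client_utils.py: `call_with_timeout`, `fake_llm_call` and the log format are hypothetical names chosen for illustration.

```python
import asyncio
import logging
import time
from typing import Awaitable, TypeVar

logger = logging.getLogger("llm_diagnostics")
T = TypeVar("T")


async def call_with_timeout(label: str, call: Awaitable[T], timeout_s: float) -> T:
    """Await an LLM call with a hard timeout and log how long it took."""
    started = time.perf_counter()
    try:
        return await asyncio.wait_for(call, timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.error("%s timed out after %.1f s", label, timeout_s)
        raise
    finally:
        # Latency is logged for successful and failed calls alike.
        elapsed_ms = (time.perf_counter() - started) * 1000
        logger.info("%s latency: %.0f ms", label, elapsed_ms)


async def fake_llm_call() -> str:
    # Hypothetical stand-in for a real model request.
    await asyncio.sleep(0.01)
    return "ok"


result = asyncio.run(call_with_timeout("summarize", fake_llm_call(), timeout_s=1.0))
```

Logging in the `finally` branch means a timed-out call still leaves a latency record, which is usually the case worth diagnosing.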

In development:

  • 📷 Photo and video stream processing
  • 🧹 Smart long-term memory cleanup
  • 🤖 Integration with ROS2 modules for robot control
  • 🔧 Other integrations and improvements

🚀 Key Features

💬 Realtime Voice Loop

  • Local wake-word detection via Vosk (no LLM costs)
  • Instant transition to live mode on the wake word
  • Speech-to-Speech dialogue with minimal latency
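
To make the wake-word step concrete: Vosk's `KaldiRecognizer.Result()` returns a JSON string with a `text` field, so the sleep-mode loop only has to inspect that field. The sketch below shows just that check. The wake word `max` and the helper name `contains_wake_word` are assumptions for illustration; the actual logic in core/audio_io.py is not shown here.

```python
import json


def contains_wake_word(vosk_result_json: str, wake_words: tuple[str, ...]) -> bool:
    """Return True if any wake word appears as a whole token in a Vosk result.

    Vosk's KaldiRecognizer.Result() returns a JSON string such as
    '{"text": "hey max"}'; only the "text" field is inspected here.
    """
    text = json.loads(vosk_result_json).get("text", "")
    tokens = text.lower().split()
    return any(word.lower() in tokens for word in wake_words)


# Hypothetical wake word; the real one is defined by the project's configuration.
WAKE_WORDS = ("max",)

heard = contains_wake_word('{"text": "hey max are you there"}', WAKE_WORDS)
not_heard = contains_wake_word('{"text": "maximum volume"}', WAKE_WORDS)
```

Matching on whole tokens rather than substrings avoids waking the agent on words that merely contain the wake word. In a real loop the JSON would come from `rec.Result()` once `rec.AcceptWaveform(chunk)` reports a completed utterance.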

🧠 Post-Session Memory Pipeline

  • Automatic extraction: facts, goals, experience, episodes, reflections and persona after every session
  • LLM Surgeon: memory conflict resolution (UPDATE / MERGE / APPEND / IGNORE) via a separate LLM call
  • Rebuild of active_context.json for upcoming dialogues
  • Automatic recovery of unprocessed sessions

🎭 Agent Persona (Max)

  • Name, gender and communication style are set in agent_instructions.md
  • SOUL.md: a philosophical persona manifesto covering attitude toward the world, inclinations, shadow and meta-reflection, written by the agent itself after hours of test conversations
  • Automatic extraction of reflections and persona traits from dialogues into reflections.json

🏗️ Architecture

1️⃣ Sleep Mode

Agent is offline; microphone is monitored locally via Vosk. No LLM costs.

2️⃣ Active Session

After the wake word, audio switches to the Gemini Live API. Dialogue runs in realtime.

3️⃣ Post-Session Processing

After a session ends:

  1. Save transcript to memory_engine/daily/YYYY-MM-DD.md
  2. Call process_missing_sessions(day_date) to process all unprocessed sessions of the day
  3. Call build_memory_context() to rebuild the active context

💾 Memory Structure

📅 Daily Markdown Logs

Every session is saved to memory_engine/daily/YYYY-MM-DD.md with metadata:

  • session_id
  • started_at and ended_at (ISO 8601 with timezone offset)
  • Dialogue transcript

Source of truth for post-session processing.
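
As an illustration of the metadata listed above, the sketch below formats one session entry and derives the daily file name from the session start. The markdown layout, the function names and the sample dialogue are assumptions; only the three metadata fields and the `daily/YYYY-MM-DD.md` path pattern come from this README.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def format_session_entry(
    session_id: str,
    started_at: datetime,
    ended_at: datetime,
    transcript: list[tuple[str, str]],
) -> str:
    """Render one session as a markdown block (the layout is an assumption)."""
    if started_at.tzinfo is None or ended_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    lines = [
        f"## Session {session_id}",
        f"- session_id: {session_id}",
        f"- started_at: {started_at.isoformat()}",  # ISO 8601 with offset
        f"- ended_at: {ended_at.isoformat()}",
        "",
    ]
    lines += [f"**{speaker}:** {text}" for speaker, text in transcript]
    return "\n".join(lines) + "\n\n"


def daily_log_path(root: Path, started_at: datetime) -> Path:
    """Map a session start time to <root>/daily/YYYY-MM-DD.md."""
    return root / "daily" / f"{started_at.date().isoformat()}.md"


tz = timezone(timedelta(hours=2))
start = datetime(2025, 1, 15, 9, 30, 0, tzinfo=tz)
end = start + timedelta(minutes=5)
entry = format_session_entry("s-001", start, end, [("user", "Hi"), ("Max", "Hello")])
log_path = daily_log_path(Path("memory_engine"), start)
```

Rejecting naive datetimes up front guarantees that every stored timestamp carries its offset, so sessions can be ordered correctly across timezone changes.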

🧩 Active Context

File memory_engine/active_context.json:

{
  "last_context": "...",
  "summary_yesterday": "...",
  "summary_today": "...",
  "reply_count_today": 0,
  "summary_reply_count": 0,
  "long_term_injections": []
}

Used as the working context for upcoming live sessions.
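
A reader for this file might fill in missing keys so a live session always receives the full shape shown above. This is a minimal sketch, assuming that a missing file should yield empty defaults; `load_active_context` and `DEFAULT_CONTEXT` are hypothetical names, not the API of active_context_builder.py.

```python
import copy
import json
from pathlib import Path
from typing import Any

DEFAULT_CONTEXT: dict[str, Any] = {
    "last_context": "",
    "summary_yesterday": "",
    "summary_today": "",
    "reply_count_today": 0,
    "summary_reply_count": 0,
    "long_term_injections": [],
}


def load_active_context(path: Path) -> dict[str, Any]:
    """Read active_context.json, filling any missing key with its default.

    Corrupted JSON is not swallowed: json.JSONDecodeError propagates to the caller.
    """
    context = copy.deepcopy(DEFAULT_CONTEXT)
    if path.exists():
        loaded = json.loads(path.read_text(encoding="utf-8"))
        # Unknown keys are ignored so the returned shape stays predictable.
        context.update({k: v for k, v in loaded.items() if k in DEFAULT_CONTEXT})
    return context


context = load_active_context(Path("memory_engine/active_context.json"))
```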

🗄️ Long-Memory Storage

Directory memory_engine/memory/:

  • people.json: information about people
  • places.json: places and locations
  • facts.json: facts and knowledge
  • goals.json: goals and tasks
  • experience.json: experience and skills
  • reflections.json: reflections and insights
  • episodes.log.jsonl: episode chronology
  • processed_sessions.json: registry of processed sessions (prevents reprocessing the same session)
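
The registry idea can be sketched as a set of session IDs consulted before any processing. The on-disk format assumed here (a JSON array of IDs) and the helper names are illustrative guesses, not the project's actual schema.

```python
import json
from pathlib import Path


def load_processed(path: Path) -> set[str]:
    """Load the registry; a missing file means nothing was processed yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def pending_sessions(session_ids: list[str], processed: set[str]) -> list[str]:
    """Keep order, drop duplicates and anything already in the registry."""
    seen = set(processed)
    pending: list[str] = []
    for session_id in session_ids:
        if session_id not in seen:
            seen.add(session_id)
            pending.append(session_id)
    return pending


def mark_processed(path: Path, processed: set[str], session_id: str) -> None:
    """Record a session as done and persist the registry."""
    processed.add(session_id)
    path.write_text(json.dumps(sorted(processed), indent=2), encoding="utf-8")


todo = pending_sessions(["a", "b", "a", "c"], processed={"b"})
```

Marking a session only after its memory operations succeed would let an interrupted run be picked up again by the recovery step.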

⚙️ Configuration

  • Google AI Studio API Key in .env
  • Key parameters in config.py
  • Memory settings in memory_config.py

📂 Project Structure

MadMax/
├── main.py                          # Entry point
├── config.py                        # Configuration
├── agent_instructions.md            # Agent system prompt (Max)
├── SOUL.md                          # Agent persona and philosophy
├── core/
│   ├── orchestrator.py              # Agent lifecycle (sleep / live / post-session)
│   ├── audio_io.py                  # Audio I/O and Vosk wake-word
│   ├── gemini_client.py             # Gemini Live API client
│   ├── agent_tools.py               # Memory tools for live mode
│   ├── session_transcript_logger.py # Session transcript persistence
│   ├── errors.py                    # Exceptions
│   └── state.py                     # Session state
├── memory_engine/
│   ├── active_context_builder.py    # active_context.json builder
│   ├── long_memory_extractor_agent.py  # Memory extraction from transcripts
│   ├── long_memory_apply.py         # Memory operations + LLM Surgeon
│   ├── long_memory_normalize.py     # Normalization and fuzzy matching
│   ├── long_memory_ops.py           # Operation schemas and validation
│   ├── long_memory_query_service.py # Long-term memory search
│   ├── summarize_context_agent.py   # Day summarization
│   ├── llm_client_utils.py          # LLM timeout and diagnostics
│   ├── memory_config.py             # Memory paths and constants
│   ├── entity_policies.py           # Entity link policies
│   ├── time_policy.py               # Timestamp policy
│   ├── daily/                       # Daily markdown logs
│   └── memory/                      # Long-memory JSON files + backups
└── live_api_docs/                   # Gemini Live API documentation

🚀 Quick Start

# 1. Clone the repository, then enter its directory
cd MadMax

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Download wake-word model (~40 MB)
./setup.sh

# 5. Configure environment
# Create .env file (or export variables):
# GOOGLE_API_KEY=your_key_here

# 6. Run the agent
python main.py

Requirements:

  • Linux
  • Python 3.11+
  • Microphone and speakers
  • Google AI Studio API Key

🛠️ Roadmap

✅ Already implemented

  • Function Calling for memory: the live agent calls memory_lookup_person, memory_lookup_goal, memory_lookup_experience, memory_recent_episodes during dialogue
  • Google Search (grounding): the agent receives fresh information from the internet in realtime
  • Transactional memory isolation: backup before write, rollback on errors, single-use guard for apply_payload
  • LLM Surgeon: automatic memory conflict resolution via a separate LLM call with batching
  • Fail-fast error handling: explicit logs on corrupted JSON, graceful CancelledError, latency diagnostics
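
The backup, rollback and single-use guard pattern named above can be sketched in a few lines. This is an illustration under assumptions, not the implementation behind apply_payload: the class name `MemoryTransaction`, the `.bak` suffix and the sample facts are hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Any, Callable


class MemoryTransaction:
    """Back up a JSON file, apply one change, and roll back on any error."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.backup = path.with_name(path.name + ".bak")
        self._used = False

    def apply(self, mutate: Callable[[Any], Any]) -> None:
        if self._used:  # single-use guard: one transaction, one write
            raise RuntimeError("transaction already used")
        self._used = True
        shutil.copy2(self.path, self.backup)  # backup before write
        try:
            data = json.loads(self.path.read_text(encoding="utf-8"))
            self.path.write_text(json.dumps(mutate(data), indent=2), encoding="utf-8")
        except Exception:
            shutil.copy2(self.backup, self.path)  # rollback to the backup
            raise


def broken(items: list[str]) -> list[str]:
    raise ValueError("simulated failure")


with tempfile.TemporaryDirectory() as tmp:
    facts = Path(tmp) / "facts.json"
    facts.write_text('["sky is blue"]', encoding="utf-8")

    tx = MemoryTransaction(facts)
    tx.apply(lambda items: items + ["water is wet"])
    after_commit = json.loads(facts.read_text(encoding="utf-8"))

    try:
        tx.apply(lambda items: items)  # second use of the same transaction
        second_use_blocked = False
    except RuntimeError:
        second_use_blocked = True

    try:
        MemoryTransaction(facts).apply(broken)
    except ValueError:
        pass
    after_rollback = json.loads(facts.read_text(encoding="utf-8"))
```

After the simulated failure the file still holds the last committed state, and the reused transaction is rejected before it touches the disk.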

🎯 Planned

🧹 Smart long-term memory cleanup

Goal: Automatic removal of stale or irrelevant data from memory.

Planned logic:

  • Fact prioritization: relevance score based on access frequency and freshness
  • Old episode archival: move rarely used episodes to cold storage
  • Automatic duplicate merging: find and merge similar facts/goals
  • Temporary goal expiration: auto-complete or archive goals with expired deadlines
  • Configurable retention rules: set data lifetime for different categories

Result: Memory stays relevant, does not grow uncontrollably, and is not cluttered with duplicates and outdated information.
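
Since this feature is only planned, the following is purely a hypothetical sketch of what a relevance score combining access frequency and freshness could look like. The formula (logarithmic frequency times exponential decay with a half-life), the function name and the 30-day default are assumptions, not a design decision of the project.

```python
import math
from datetime import datetime, timedelta, timezone


def relevance_score(
    access_count: int,
    last_access: datetime,
    now: datetime,
    half_life_days: float = 30.0,
) -> float:
    """Score a memory record: frequent and recent access ranks higher."""
    age_days = max((now - last_access).total_seconds() / 86400, 0.0)
    freshness = 0.5 ** (age_days / half_life_days)  # halves every half-life
    return math.log1p(access_count) * freshness     # damped frequency term


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
fresh = relevance_score(3, now, now)
stale = relevance_score(3, now - timedelta(days=30), now)
```

With these assumptions a record untouched for one half-life scores exactly half of an equally popular fresh one, which gives retention rules a single tunable parameter per category.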

🔧 Refactoring & Type Safety

  • Pydantic for structured payloads instead of dict[str, Any]

🏗️ Technical Debt

Consciously accepted trade-offs that are known and documented:

| Issue | Impact | Why we kept it |
| --- | --- | --- |
| Any instead of Pydantic for operation payloads | No type safety, IDE does not suggest fields | It works, changing it requires rewriting 5+ modules |
| Tight coupling: GeminiLiveClient imports AudioIO | Hard to test, risk of circular dependency | No DI container, Protocols require refactoring |
| No CI/CD | No automatic type checking and tests | Project is developed locally, pytest is run manually |
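
For the coupling item, a `typing.Protocol` is one way the dependency could be inverted without a DI container. The sketch below is hypothetical: `AudioSource`, `read_chunk`, `LiveClient` and `FakeAudio` are illustrative stand-ins and do not reflect the real GeminiLiveClient or AudioIO interfaces.

```python
from typing import Protocol


class AudioSource(Protocol):
    """What a live client needs from audio input (hypothetical interface)."""

    def read_chunk(self) -> bytes: ...


class LiveClient:
    """Stand-in for a live client that depends only on the protocol."""

    def __init__(self, audio: AudioSource) -> None:
        self._audio = audio
        self.sent: list[bytes] = []

    def pump_once(self) -> None:
        # Forward one audio chunk; a real client would send it over the network.
        self.sent.append(self._audio.read_chunk())


class FakeAudio:
    """Test double: satisfies AudioSource structurally, no microphone needed."""

    def read_chunk(self) -> bytes:
        return b"\x00\x01"


client = LiveClient(FakeAudio())
client.pump_once()
```

Because the client no longer imports a concrete audio class, it can be unit-tested with a fake, and the circular-import risk disappears.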

🤖 Agentic Engineering

Important: A significant part of this project was written using Agentic Engineering in pair-programming mode.


📊 Current Status

The project consists of three stable loops:

  • Live conversation loop: realtime Speech-to-Speech dialogue with the user (Google Search, tool calling, Vosk wake-word)
  • Post-session memory loop: automatic extraction, deduplication and knowledge persistence (people, places, facts, goals, experience, episodes, reflections, persona)
  • Reliability loop: transactional isolation (backup + rollback), graceful shutdown, LLM timeouts, recovery of missed sessions

The voice agent is ready for daily use as-is. The main constraints are architectural debt (see section above), not functional issues.

About

Real-time voice AI agent with persistent memory, built on Gemini Live API and Python
