Real-time multi-agent LLM orchestration system with self-improving eval loop
Weave is a multi-agent orchestration system that decomposes complex queries across specialised AI agents, coordinates them via a LangGraph state machine with dynamic routing and token-budget enforcement, and streams every intermediate step to the client over SSE. It continuously improves its own agent prompts through a 6-dimensional evaluation harness that scores 15 test cases, identifies weak dimensions, and proposes targeted prompt rewrites, all under human-in-the-loop review.
+---------------------------+
| FastAPI (port 8000) |
+--------+ | POST /query (SSE stream) | +---------+
| Client | ----------> | GET /jobs/{id}/trace | ---------> | Celery |
+--------+ | POST /eval/run | | Worker |
| GET /eval/latest | +---------+
| POST /prompt-rewrites/review| |
| POST /eval/re-run-failed | |
+-------------+---------------+ |
| |
v |
+----------------------------+ |
| LangGraph Orchestrator | |
| (StateGraph + dynamic | |
| routing function) | |
+--+-----+-----+-----+----+--+ |
| | | | | |
+--------------+ | | | +----------+ |
v v | v v |
+--------------+ +------+ | +----------+ +-------------+ |
| Decomposition| | RAG | | | Critique | | Compression | |
| budget: 1200 | | 2000 | | | 1500 | | 800 | |
+--------------+ +--+---+ | +----------+ +-------------+ |
| | |
+--+--+ | |
|FAISS| v |
+-----+ +-----------+ |
| Synthesis | |
| 1500 | |
+-----------+ |
| |
+------------------------------v----------------------------+ |
| SharedContext (in-memory) | |
| sub_tasks | agent_outputs | contradictions | provenance | |
+------+----------+----------+----------+----------+-------+ |
| | | | | |
v v v v v v
+----------+ +-------+ +-----------+ +-------+ +----------+
|PostgreSQL| | Redis | |OpenRouter | | Tools | | Log UI |
| 5 tables | |Celery | | (LLM) | | x4 | | port 8080|
+----------+ +-------+ +-----------+ +-------+ +----------+
| Agent | Budget | Role | Writes to context |
|---|---|---|---|
| Decomposition | 1 200 tok | Breaks query into a SubTask dependency DAG | sub_tasks[] |
| RAG | 2 000 tok | Multi-hop FAISS retrieval (15 docs, 2-hop, min 4 chunks) | agent_outputs["rag"], citations[] |
| Critique | 1 500 tok | Per-claim confidence scoring, span-level flagging | contradictions[], flagged_spans[] |
| Synthesis | 1 500 tok | Resolves contradictions, builds provenance map | provenance_map, final answer |
| Compression | 800 tok | Triggered by NeedCompressionError; summarises context | Compressed agent_outputs content |
| Meta | 1 000 tok | Analyses eval failures, proposes prompt rewrites | PromptRewrite (pending in DB) |
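The budgets in the table can be read as hard per-agent ceilings, with `NeedCompressionError` as the signal that hands control to the Compression agent. The following is a minimal sketch of that idea, not Weave's actual `budget_manager.py`: the `BudgetTracker` class, its `charge` method, and the whitespace-based token estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Per-agent budgets in tokens, copied from the table above.
AGENT_BUDGETS = {
    "decomposition": 1200,
    "rag": 2000,
    "critique": 1500,
    "synthesis": 1500,
    "compression": 800,
    "meta": 1000,
}


class NeedCompressionError(Exception):
    """Raised when an agent would exceed its token budget."""


def estimate_tokens(text: str) -> int:
    # Assumption: a crude whitespace split; a real tokenizer would count differently.
    return len(text.split())


@dataclass
class BudgetTracker:  # hypothetical name, not taken from the Weave code base
    used: dict = field(default_factory=dict)

    def charge(self, agent: str, text: str) -> int:
        """Record usage for `agent` and return the tokens left in its budget."""
        total = self.used.get(agent, 0) + estimate_tokens(text)
        if total > AGENT_BUDGETS[agent]:
            raise NeedCompressionError(f"{agent}: {total} > {AGENT_BUDGETS[agent]}")
        self.used[agent] = total
        return AGENT_BUDGETS[agent] - total


tracker = BudgetTracker()
remaining = tracker.charge("compression", "word " * 100)  # 100 tokens used
try:
    tracker.charge("compression", "word " * 701)  # would reach 801 of 800
    overflowed = False
except NeedCompressionError:
    overflowed = True  # an orchestrator could route to Compression here
```

A failed `charge` leaves the recorded usage unchanged, so the caller can compress the context and retry the same agent.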
| Tool | Failure modes | Max retries |
|---|---|---|
| `web_search` | `timeout`, `empty`, `parse_error` | 2 |
| `sql_lookup` | `timeout`, `empty`, `parse_error` (blocked DDL/DML) | 2 |
| `code_sandbox` | `timeout`, `error` (blocked imports: `os`, `sys`, `subprocess`, `shutil`, `pathlib`) | 2 |
| `self_reflection` | `timeout`, `empty`, `error` | 2 |
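The failure modes and retry limit in the table suggest a shared wrapper around every tool call. Here is one way such a wrapper might look, using only the standard library. The `ToolResult` fields and the `run_with_retry` helper are hypothetical, and the sketch assumes that "max retries 2" means one initial attempt plus up to two retries.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

MAX_RETRIES = 2  # from the table above


@dataclass
class ToolResult:  # hypothetical shape, not Weave's actual schema
    ok: bool
    data: Optional[str] = None
    failure_mode: Optional[str] = None  # "timeout", "empty" or "error"
    attempts: int = 0


async def run_with_retry(
    call: Callable[[], Awaitable[str]],
    timeout_s: float = 1.0,
    max_retries: int = MAX_RETRIES,
) -> ToolResult:
    """Run `call`, classifying each failure and retrying up to `max_retries` times."""
    failure = None
    attempts = 0
    for attempt in range(1 + max_retries):
        attempts = attempt + 1
        try:
            data = await asyncio.wait_for(call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            failure = "timeout"
            continue
        except Exception:
            failure = "error"
            continue
        if not data:
            failure = "empty"
            continue
        return ToolResult(ok=True, data=data, attempts=attempts)
    return ToolResult(ok=False, failure_mode=failure, attempts=attempts)


async def _demo():
    calls = {"n": 0}

    async def flaky() -> str:
        # Returns nothing twice, then succeeds on the third attempt.
        calls["n"] += 1
        return "" if calls["n"] < 3 else "3 rows"

    async def always_empty() -> str:
        return ""

    return await run_with_retry(flaky), await run_with_retry(always_empty)


recovered, exhausted = asyncio.run(_demo())
```

In this sketch the last observed failure mode is the one reported, which keeps the result small; logging every attempt separately would be an equally reasonable choice.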
    git clone https://github.com/KhushneetSingh/Weave.git
    cd Weave
    cp .env.example .env
    # add your OPENROUTER_API_KEY to .env
    docker compose up

| Method | Path | Description |
|---|---|---|
| `POST` | `/query` | Run the multi-agent pipeline. Returns SSE stream. |
| `GET` | `/jobs/{job_id}/trace` | Ordered event trace (agent + tool logs) for a job. |
| `POST` | `/eval/run` | Start a 15-case evaluation run via Celery. |
| `GET` | `/eval/latest` | Latest eval results grouped by category + dimension. |
| `POST` | `/prompt-rewrites/{id}/review` | Approve or reject a pending prompt rewrite. |
| `POST` | `/eval/re-run-failed` | Re-run only previously failed eval cases. |
| `GET` | `/health` | Liveness probe; returns `{"status": "ok"}`. |
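Because `/query` answers with a Server-Sent Events stream, a client has to split the response into events before it can show intermediate steps. The parser below is a small sketch of the standard SSE framing (`event:` and `data:` fields, blank line as terminator). The event names and JSON payloads in the sample are invented for illustration; the source does not document what Weave actually emits.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line dispatches the pending event
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):  # comment line, often used as keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # The SSE format drops a single leading space after the colon.
            data.append(line[len("data:"):].removeprefix(" "))


# Hypothetical stream contents; real event names and fields may differ.
sample = [
    ": keep-alive\n",
    "event: agent_step\n",
    'data: {"agent": "rag", "tokens": 412}\n',
    "\n",
    "event: final\n",
    'data: {"answer": "..."}\n',
    "\n",
]
events = list(parse_sse(sample))
first_payload = json.loads(events[0]["data"])
```

With a real client, the same function can be fed the lines of a streaming HTTP response instead of the `sample` list.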
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENROUTER_API_KEY` | Yes | (none) | Your OpenRouter API key |
| `OPENROUTER_MODEL` | No | `meta-llama/llama-3.1-8b-instruct:free` | Primary LLM model |
| `OPENROUTER_FALLBACK_MODEL` | No | `mistralai/mistral-7b-instruct:free` | Fallback if primary fails |
| `POSTGRES_USER` | No | `megaai` | Postgres username |
| `POSTGRES_PASSWORD` | No | `megaai` | Postgres password |
| `POSTGRES_DB` | No | `megaai` | Postgres database name |
| `POSTGRES_HOST` | No | `db` | Postgres host (Docker service name) |
| `POSTGRES_PORT` | No | `5432` | Postgres port |
| `REDIS_URL` | No | `redis://redis:6379/0` | Redis URL for Celery broker |
| `MAX_CONTEXT_TOKENS` | No | `4000` | Max token budget per query |
| `LOG_LEVEL` | No | `INFO` | Logging level |
The evaluation harness runs 15 test cases through the full orchestration pipeline and scores each across 6 dimensions.
| Category | Cases | What's tested |
|---|---|---|
| Baseline | 5 | Known correct answers; factual recall |
| Ambiguous | 5 | Underspecified inputs; tests decomposition quality |
| Adversarial | 5 | Prompt injections, wrong premises, forced contradictions |
6 scoring dimensions:
- answer_correctness: does the final output match the expected answer?
- citation_accuracy: are RAG citations valid and grounded in retrieved chunks?
- contradiction_resolution: were flagged contradictions resolved by synthesis?
- tool_efficiency: were tool calls appropriate for the case type?
- budget_compliance: did agents stay within token budget?
- critique_agreement: did synthesis honour critique feedback?
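Per-case scores on these six dimensions have to be aggregated before a weak dimension can be identified. The sketch below averages scores per category and per dimension, then picks the lowest-scoring dimension. It assumes each score is a float between 0 and 1, which the source does not state; the function names and the two sample results are made up for illustration and are not real eval output.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = (
    "answer_correctness",
    "citation_accuracy",
    "contradiction_resolution",
    "tool_efficiency",
    "budget_compliance",
    "critique_agreement",
)


def summarise(results):
    """Return (mean per (category, dimension), mean per dimension)."""
    by_cat, by_dim = defaultdict(list), defaultdict(list)
    for result in results:
        for dim in DIMENSIONS:
            score = result["scores"][dim]
            by_cat[(result["category"], dim)].append(score)
            by_dim[dim].append(score)
    return (
        {key: mean(vals) for key, vals in by_cat.items()},
        {key: mean(vals) for key, vals in by_dim.items()},
    )


def worst_dimension(dim_means):
    """Name of the dimension with the lowest mean score."""
    return min(dim_means, key=dim_means.get)


# Illustrative inputs only: every dimension scores 1.0 unless overridden.
perfect = dict.fromkeys(DIMENSIONS, 1.0)
results = [
    {"category": "baseline", "scores": {**perfect, "citation_accuracy": 0.5}},
    {
        "category": "adversarial",
        "scores": {**perfect, "answer_correctness": 0.5, "citation_accuracy": 0.0},
    },
]
cat_means, dim_means = summarise(results)
weakest = worst_dimension(dim_means)
```

A plain mean treats every dimension and case equally; weighting them differently would only change the `summarise` step.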
1. `POST /eval/run` → runs all 15 cases through the pipeline
2. Each case is scored across 6 dimensions and stored as an `EvalRun` in Postgres
3. Meta-agent reads failures → finds the worst dimension → identifies the responsible agent
4. Meta-agent calls the LLM to propose a `PromptRewrite` with unified diff + justification
5. Human reviews → `POST /prompt-rewrites/{id}/review` with `approve` or `reject`
6. If approved → agent's `system_prompt` is patched in memory → targeted re-eval on failed cases only
7. Delta stored in DB for tracking improvement over time
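The proposal and review steps of this loop can be sketched with the standard library's `difflib`. Everything here is an assumption made for illustration: the `PromptRewrite` fields, the `review` helper, and the dimension-to-agent mapping are hypothetical, and the proposed prompt is hard-coded where Weave would call the LLM.

```python
import difflib
from dataclasses import dataclass

# Illustrative guess at which agent "owns" a dimension; the source does not list this.
RESPONSIBLE_AGENT = {
    "citation_accuracy": "rag",
    "contradiction_resolution": "synthesis",
    "critique_agreement": "synthesis",
}


@dataclass
class PromptRewrite:  # hypothetical fields, not the real ORM model
    agent: str
    old_prompt: str
    new_prompt: str
    justification: str
    status: str = "pending"

    @property
    def diff(self) -> str:
        """Unified diff between the current and the proposed prompt."""
        return "".join(
            difflib.unified_diff(
                self.old_prompt.splitlines(keepends=True),
                self.new_prompt.splitlines(keepends=True),
                fromfile=f"{self.agent}/system_prompt (current)",
                tofile=f"{self.agent}/system_prompt (proposed)",
            )
        )


def review(rewrite: PromptRewrite, decision: str, prompts: dict) -> PromptRewrite:
    """Apply a human decision; only approval patches the in-memory prompt."""
    if rewrite.status != "pending":
        raise ValueError("rewrite has already been reviewed")
    if decision not in ("approve", "reject"):
        raise ValueError("decision must be 'approve' or 'reject'")
    if decision == "approve":
        prompts[rewrite.agent] = rewrite.new_prompt
        rewrite.status = "approved"
    else:
        rewrite.status = "rejected"
    return rewrite


prompts = {"rag": "Answer using the retrieved chunks.\n"}
agent = RESPONSIBLE_AGENT["citation_accuracy"]
proposal = PromptRewrite(
    agent=agent,
    old_prompt=prompts[agent],
    new_prompt=prompts[agent] + "Cite a chunk id for every claim.\n",
    justification="citation_accuracy was the weakest dimension",
)
diff_text = proposal.diff
review(proposal, "approve", prompts)
```

Keeping the rewrite in a `pending` state until `review` is called is what makes the loop human-in-the-loop: nothing reaches the live prompt without an explicit decision.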
Known limitations:
- OpenRouter free-tier models (Llama 3.1 8B) are significantly weaker than GPT-4; citation quality and adversarial robustness suffer
- FAISS index is in-memory only; restarts lose the index, with no persistence to disk or pgvector
- Code sandbox is NOT truly isolated: it runs `subprocess.run(["python3", "-c", code])` with no container, no seccomp, no gVisor
- Eval scoring is heuristic for ambiguous/adversarial cases: keyword matching, not ground-truth comparison
- Meta-agent prompt rewrites are LLM-generated: plausible but not guaranteed to improve scores
- No authentication on any endpoint; all routes are publicly accessible
- `.env.example` defaults (`megaai`) don't match `config.py` defaults (`weave`), which can cause confusion on first setup without Docker
- No rate limiting on the API; a flood of `/query` requests will exhaust LLM budget
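The sandbox limitation is easier to see with a concrete check. The sketch below shows one plausible way to enforce the import blocklist from the tools table, using a static `ast` scan. How Weave actually implements the blocklist is not stated in the source, so `blocked_imports` is a hypothetical helper. The last call shows why a check like this is no substitute for isolation.

```python
import ast

# Blocklist taken from the tools table.
BLOCKED_IMPORTS = {"os", "sys", "subprocess", "shutil", "pathlib"}


def blocked_imports(code: str) -> set:
    """Return the blocked top-level modules that `code` imports statically."""
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_IMPORTS:
                found.add(root)
    return found


flagged = blocked_imports("import os.path\nfrom subprocess import run")
allowed = blocked_imports("import json")
# A dynamic import builds the module name at runtime, so a static scan misses it.
missed = blocked_imports("__import__('o' + 's').getcwd()")
```

Here `flagged` contains `os` and `subprocess`, while `missed` is empty even though the snippet would load `os` when executed.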
Planned improvements:
- Replace FAISS with pgvector for persistent vector storage across restarts
- Add a proper code sandbox via gVisor or Firecracker microVMs
- Build a web UI for reviewing prompt rewrites and browsing eval results
- Add a prompt injection detection layer before the orchestrator
- Implement weighted dimension scoring with configurable weights per use case
    Weave/
    ├── app/
    │   ├── __init__.py
    │   ├── config.py                         # Settings from env (pydantic-settings)
    │   ├── database.py                       # SQLAlchemy async engine + session + Base
    │   ├── main.py                           # FastAPI app: all endpoints + error handling
    │   ├── agents/
    │   │   ├── __init__.py                   # Re-exports all agents
    │   │   ├── base.py                       # BaseAgent ABC: budget, LLM, logging
    │   │   ├── decomposition.py              # Query → SubTask DAG
    │   │   ├── rag.py                        # Multi-hop FAISS retrieval + citations
    │   │   ├── critique.py                   # Per-claim confidence + span flagging
    │   │   ├── synthesis.py                  # Contradiction resolution + provenance
    │   │   ├── compression.py                # Context compression on budget overflow
    │   │   └── meta.py                       # Eval failure analysis → prompt rewrites
    │   ├── core/
    │   │   ├── __init__.py                   # Re-exports BudgetManager
    │   │   ├── budget_manager.py             # Token budget enforcement
    │   │   ├── llm.py                        # OpenRouter async client with fallback
    │   │   ├── logger.py                     # structlog JSON logging + DB persistence
    │   │   └── orchestrator.py               # LangGraph StateGraph with dynamic routing
    │   ├── eval/
    │   │   ├── __init__.py
    │   │   ├── harness.py                    # Runs test cases through pipeline + scores
    │   │   ├── scorer.py                     # 6-dimension hand-rolled scorer
    │   │   └── test_cases.py                 # 15 eval cases (baseline/ambiguous/adversarial)
    │   ├── models/
    │   │   ├── __init__.py                   # Re-exports all ORM models for Alembic
    │   │   ├── job.py                        # Job ORM model
    │   │   ├── agent_log.py                  # AgentLog ORM model
    │   │   ├── tool_log.py                   # ToolLog ORM model
    │   │   ├── eval_run.py                   # EvalRun ORM model
    │   │   └── prompt_rewrite.py             # PromptRewrite ORM model
    │   ├── schemas/
    │   │   ├── __init__.py                   # Re-exports all Pydantic schemas
    │   │   ├── context.py                    # SharedContext, AgentOutput, SubTask, etc.
    │   │   ├── eval.py                       # EvalScore, ScoredDimension, PromptRewrite
    │   │   └── tools.py                      # ToolResult schema
    │   ├── tools/
    │   │   ├── __init__.py                   # TOOL_REGISTRY + re-exports
    │   │   ├── base.py                       # BaseTool ABC: timeout, retry, logging
    │   │   ├── web_search.py                 # Simulated web search (fake results)
    │   │   ├── sql_lookup.py                 # NL → SQL via LLM → asyncpg execution
    │   │   ├── code_sandbox.py               # Python subprocess sandbox
    │   │   └── self_reflection.py            # Contradiction detection via LLM
    │   └── worker/
    │       ├── __init__.py                   # Celery app configuration
    │       └── tasks.py                      # Background tasks (eval, meta-agent)
    ├── alembic/
    │   ├── env.py                            # Async Alembic configuration
    │   ├── script.py.mako                    # Migration template
    │   └── versions/
    │       ├── 0001_initial.py               # Creates 5 core tables
    │       └── 0002_seed_products_orders.py  # products + orders for sql_lookup
    ├── log_ui/
    │   ├── __init__.py
    │   └── main.py                           # Standalone FastAPI log viewer (port 8080)
    ├── tests/
    │   ├── __init__.py
    │   └── test_budget_manager.py            # 10 tests for ContextBudgetManager
    ├── archon_viz.py                         # Terminal architecture visualizer (rich + networkx)
    ├── docker-compose.yml                    # 5 services: db, redis, api, worker, log_ui
    ├── Dockerfile                            # Python 3.11-slim
    ├── requirements.txt
    ├── alembic.ini
    ├── pytest.ini
    ├── .env.example
    └── .gitignore
Built with AI assistance. See AI_COLLABORATION.md for full attestation.