ksploitx/Weave

Weave

Real-time multi-agent LLM orchestration system with self-improving eval loop

Python FastAPI LangGraph PostgreSQL Docker OpenRouter License


What is Weave

Weave is a multi-agent orchestration system that decomposes complex queries across specialised AI agents, coordinates them via a LangGraph state machine with dynamic routing and token-budget enforcement, and streams every intermediate step to the client over SSE. It continuously improves its own agent prompts through a 6-dimensional evaluation harness that scores 15 test cases, identifies weak dimensions, and proposes targeted prompt rewrites, all under human-in-the-loop review.


Architecture

                          +---------------------------+
                          |    FastAPI  (port 8000)    |
  +--------+              |  POST /query (SSE stream)  |            +---------+
  | Client | ----------> |  GET  /jobs/{id}/trace      | ---------> | Celery  |
  +--------+              |  POST /eval/run             |            | Worker  |
                          |  GET  /eval/latest          |            +---------+
                          |  POST /prompt-rewrites/review|               |
                          |  POST /eval/re-run-failed   |               |
                          +-------------+---------------+               |
                                        |                               |
                                        v                               |
                          +----------------------------+                |
                          |   LangGraph Orchestrator    |                |
                          |  (StateGraph + dynamic      |                |
                          |   routing function)         |                |
                          +--+-----+-----+-----+----+--+                |
                             |     |     |     |    |                   |
              +--------------+     |     |     |    +----------+       |
              v                    v     |     v               v       |
       +--------------+  +------+ |  +----------+  +-------------+    |
       | Decomposition|  | RAG  | |  | Critique |  | Compression |    |
       | budget: 1200 |  | 2000 | |  |   1500   |  |    800      |    |
       +--------------+  +--+---+ |  +----------+  +-------------+    |
                            |     |                                    |
                         +--+--+  |                                    |
                         |FAISS|  v                                    |
                         +-----+  +-----------+                        |
                                  | Synthesis |                        |
                                  |   1500    |                        |
                                  +-----------+                        |
                                        |                              |
         +------------------------------v----------------------------+ |
         |                   SharedContext (in-memory)                | |
         |  sub_tasks | agent_outputs | contradictions | provenance  | |
         +------+----------+----------+----------+----------+-------+ |
                |          |          |          |          |           |
                v          v          v          v          v           v
          +----------+ +-------+ +-----------+ +-------+ +----------+
          |PostgreSQL| | Redis | |OpenRouter | | Tools | | Log UI   |
          | 5 tables | |Celery | |  (LLM)    | |  x4   | | port 8080|
          +----------+ +-------+ +-----------+ +-------+ +----------+

Agents

| Agent | Budget | Role | Writes to context |
| --- | --- | --- | --- |
| Decomposition | 1,200 tok | Breaks query into a SubTask dependency DAG | sub_tasks[] |
| RAG | 2,000 tok | Multi-hop FAISS retrieval (15 docs, 2-hop, min 4 chunks) | agent_outputs["rag"], citations[] |
| Critique | 1,500 tok | Per-claim confidence scoring, span-level flagging | contradictions[], flagged_spans[] |
| Synthesis | 1,500 tok | Resolves contradictions, builds provenance map | provenance_map, final answer |
| Compression | 800 tok | Triggered by NeedCompressionError; summarises context | Compressed agent_outputs content |
| Meta | 1,000 tok | Analyses eval failures, proposes prompt rewrites | PromptRewrite (pending in DB) |
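
The per-agent budgets in the table could be enforced along the following lines. This is a minimal sketch, not the code in budget_manager.py: the `BudgetTracker` class and its `charge` method are hypothetical names, and whitespace splitting is an assumed stand-in for a real tokenizer. Only the budget numbers and the `NeedCompressionError` name come from this README.

```python
from dataclasses import dataclass, field

# Per-agent budgets copied from the Agents table.
AGENT_BUDGETS = {
    "decomposition": 1200,
    "rag": 2000,
    "critique": 1500,
    "synthesis": 1500,
    "compression": 800,
    "meta": 1000,
}


class NeedCompressionError(Exception):
    """Raised when an agent would exceed its token budget."""


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting approximates the real tokenizer.
    return len(text.split())


@dataclass
class BudgetTracker:  # hypothetical name, for illustration only
    budgets: dict = field(default_factory=lambda: dict(AGENT_BUDGETS))
    used: dict = field(default_factory=dict)

    def charge(self, agent: str, text: str) -> int:
        """Record token usage for `agent` and return the tokens remaining."""
        total = self.used.get(agent, 0) + count_tokens(text)
        if total > self.budgets[agent]:
            # Usage is not recorded, so the caller can compress and retry.
            raise NeedCompressionError(
                f"{agent} needs {total} tokens, budget is {self.budgets[agent]}"
            )
        self.used[agent] = total
        return self.budgets[agent] - total
```

In this sketch an orchestrator would catch `NeedCompressionError`, route to the Compression agent, and then charge the shortened text instead.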

Tools

| Tool | Failure modes | Max retries |
| --- | --- | --- |
| web_search | timeout, empty, parse_error | 2 |
| sql_lookup | timeout, empty, parse_error (blocked DDL/DML) | 2 |
| code_sandbox | timeout, error (blocked imports: os, sys, subprocess, shutil, pathlib) | 2 |
| self_reflection | timeout, empty, error | 2 |
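
A timeout-and-retry wrapper that reports the failure modes listed above might look like the sketch below. It is an illustration under stated assumptions rather than the BaseTool implementation: `call_with_retry` is a hypothetical helper, the `ToolResult` fields are guessed, "max retries 2" is read as one initial attempt plus two retries, and mapping `ValueError` to `parse_error` is an arbitrary choice.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

MAX_RETRIES = 2  # value from the Tools table


@dataclass
class ToolResult:  # field names are assumptions
    ok: bool
    data: Optional[str] = None
    failure_mode: Optional[str] = None  # timeout | empty | parse_error | error
    attempts: int = 0


async def call_with_retry(
    fn: Callable[[], Awaitable[str]],
    timeout_s: float = 5.0,
    max_retries: int = MAX_RETRIES,
) -> ToolResult:
    """Run `fn` with a timeout, retrying on any failure mode."""
    failure = None
    attempts = 0
    for attempts in range(1, max_retries + 2):  # first try + retries
        try:
            data = await asyncio.wait_for(fn(), timeout=timeout_s)
        except asyncio.TimeoutError:
            failure = "timeout"
        except ValueError:
            failure = "parse_error"
        except Exception:
            failure = "error"
        else:
            if data:
                return ToolResult(True, data=data, attempts=attempts)
            failure = "empty"
    # All attempts failed: report the last failure mode seen.
    return ToolResult(False, failure_mode=failure, attempts=attempts)
```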

Quick start

git clone https://github.com/KhushneetSingh/Weave.git
cd Weave
cp .env.example .env
# add your OPENROUTER_API_KEY to .env
docker compose up

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /query | Run the multi-agent pipeline. Returns SSE stream. |
| GET | /jobs/{job_id}/trace | Ordered event trace (agent + tool logs) for a job. |
| POST | /eval/run | Start a 15-case evaluation run via Celery. |
| GET | /eval/latest | Latest eval results grouped by category + dimension. |
| POST | /prompt-rewrites/{id}/review | Approve or reject a pending prompt rewrite. |
| POST | /eval/re-run-failed | Re-run only previously failed eval cases. |
| GET | /health | Liveness probe; returns {"status": "ok"}. |
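
Because POST /query responds with a Server-Sent Events stream, a client has to split the response into events. The sketch below parses standard SSE framing (`event:` and `data:` fields, blank line as terminator) from any iterable of text lines. The event name `agent_step` and the JSON payload fields in the sample are invented for illustration; this README does not document the actual event schema.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one {"event", "data"} dict per SSE event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                 # blank line terminates an event
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):     # comment or keep-alive line
            continue
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):  # one leading space is not part of the value
                value = value[1:]
            if name == "event":
                event = value
            elif name == "data":
                data.append(value)


# Hypothetical stream contents, standing in for a real /query response.
sample = [
    "event: agent_step\n",
    'data: {"agent": "rag", "tokens": 412}\n',
    "\n",
    ": keep-alive\n",
    'data: {"final": true}\n',
    "\n",
]
events = list(iter_sse_events(sample))
first_payload = json.loads(events[0]["data"])
```

With a streaming HTTP client, the same function could be fed the response's line iterator instead of `sample`.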

Environment variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| OPENROUTER_API_KEY | Yes | (none) | Your OpenRouter API key |
| OPENROUTER_MODEL | No | meta-llama/llama-3.1-8b-instruct:free | Primary LLM model |
| OPENROUTER_FALLBACK_MODEL | No | mistralai/mistral-7b-instruct:free | Fallback if primary fails |
| POSTGRES_USER | No | megaai | Postgres username |
| POSTGRES_PASSWORD | No | megaai | Postgres password |
| POSTGRES_DB | No | megaai | Postgres database name |
| POSTGRES_HOST | No | db | Postgres host (Docker service name) |
| POSTGRES_PORT | No | 5432 | Postgres port |
| REDIS_URL | No | redis://redis:6379/0 | Redis URL for Celery broker |
| MAX_CONTEXT_TOKENS | No | 4000 | Max token budget per query |
| LOG_LEVEL | No | INFO | Logging level |
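
To show how these variables and defaults fit together, here is a standard-library sketch of a settings loader. The project's config.py uses pydantic-settings, so this is not its code: the `Settings` dataclass, `load_settings`, and the `postgresql+asyncpg://` URL form are assumptions, and the defaults are the ones from the table above (which, per the known limitations, differ from those in config.py).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:  # hypothetical stand-in for the pydantic-settings class
    openrouter_api_key: str
    openrouter_model: str
    openrouter_fallback_model: str
    postgres_user: str
    postgres_password: str
    postgres_db: str
    postgres_host: str
    postgres_port: int
    redis_url: str
    max_context_tokens: int
    log_level: str

    @property
    def database_url(self) -> str:
        # Assumption: SQLAlchemy-style async URL using the asyncpg driver.
        return (
            f"postgresql+asyncpg://{self.postgres_user}:{self.postgres_password}"
            f"@{self.postgres_host}:{self.postgres_port}/{self.postgres_db}"
        )


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "")
    if not api_key:
        # The only variable marked as required in the table.
        raise RuntimeError("OPENROUTER_API_KEY is required")
    return Settings(
        openrouter_api_key=api_key,
        openrouter_model=env.get(
            "OPENROUTER_MODEL", "meta-llama/llama-3.1-8b-instruct:free"
        ),
        openrouter_fallback_model=env.get(
            "OPENROUTER_FALLBACK_MODEL", "mistralai/mistral-7b-instruct:free"
        ),
        postgres_user=env.get("POSTGRES_USER", "megaai"),
        postgres_password=env.get("POSTGRES_PASSWORD", "megaai"),
        postgres_db=env.get("POSTGRES_DB", "megaai"),
        postgres_host=env.get("POSTGRES_HOST", "db"),
        postgres_port=int(env.get("POSTGRES_PORT", "5432")),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        max_context_tokens=int(env.get("MAX_CONTEXT_TOKENS", "4000")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```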

Eval pipeline

The evaluation harness runs 15 test cases through the full orchestration pipeline and scores each across 6 dimensions.

| Category | Cases | What's tested |
| --- | --- | --- |
| Baseline | 5 | Known correct answers (factual recall) |
| Ambiguous | 5 | Underspecified inputs (tests decomposition quality) |
| Adversarial | 5 | Prompt injections, wrong premises, forced contradictions |

6 scoring dimensions:

  • answer_correctness: does the final output match the expected answer?
  • citation_accuracy: are RAG citations valid and grounded in retrieved chunks?
  • contradiction_resolution: were flagged contradictions resolved by synthesis?
  • tool_efficiency: were tool calls appropriate for the case type?
  • budget_compliance: did agents stay within token budget?
  • critique_agreement: did synthesis honour critique feedback?
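
Identifying the weakest dimension across a run can be sketched as a per-dimension average followed by a minimum. The example assumes each case yields a score between 0 and 1 for every dimension and that dimensions are weighted equally; neither detail is stated in this README, and `worst_dimension` is a hypothetical helper rather than a function from scorer.py.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = (
    "answer_correctness",
    "citation_accuracy",
    "contradiction_resolution",
    "tool_efficiency",
    "budget_compliance",
    "critique_agreement",
)


def worst_dimension(runs: list) -> tuple:
    """Return (dimension, average score) for the lowest-scoring dimension.

    `runs` is a list of dicts mapping every dimension name to a score.
    """
    per_dim = defaultdict(list)
    for scores in runs:
        for dim in DIMENSIONS:
            per_dim[dim].append(scores[dim])
    averages = {dim: mean(vals) for dim, vals in per_dim.items()}
    worst = min(averages, key=averages.get)
    return worst, averages[worst]


# Two made-up cases: citations are weak in both, budget slips once.
perfect = {dim: 1.0 for dim in DIMENSIONS}
runs = [
    {**perfect, "citation_accuracy": 0.5},
    {**perfect, "citation_accuracy": 0.0, "budget_compliance": 0.5},
]
result = worst_dimension(runs)
```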

Self-improving loop

  1. POST /eval/run → runs all 15 cases through the pipeline
  2. Each case is scored across 6 dimensions and stored as an EvalRun in Postgres
  3. Meta-agent reads failures → finds the worst dimension → identifies the responsible agent
  4. Meta-agent calls the LLM to propose a PromptRewrite with unified diff + justification
  5. Human reviews → POST /prompt-rewrites/{id}/review with approve or reject
  6. If approved → agent's system_prompt is patched in memory → targeted re-eval on failed cases only
  7. Delta stored in DB for tracking improvement over time
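
Steps 4 to 6 can be illustrated with the standard library's `difflib`. This is a sketch under assumptions: the `PromptRewrite` fields, the status strings, and the `propose` and `review` helpers are hypothetical, and the prompts are invented. In the real system the new prompt text comes from the LLM and the record lives in Postgres.

```python
import difflib
from dataclasses import dataclass


@dataclass
class PromptRewrite:  # field names are assumptions
    agent: str
    new_prompt: str
    diff: str
    justification: str
    status: str = "pending"


def propose(agent: str, old: str, new: str, justification: str) -> PromptRewrite:
    """Package a proposed prompt change together with its unified diff."""
    diff = "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"{agent}/system_prompt (current)",
            tofile=f"{agent}/system_prompt (proposed)",
        )
    )
    return PromptRewrite(agent, new, diff, justification)


def review(rewrite: PromptRewrite, decision: str, prompts: dict) -> None:
    """Apply a human decision; patch the in-memory prompt only on approval."""
    if decision not in ("approve", "reject"):
        raise ValueError("decision must be 'approve' or 'reject'")
    if decision == "approve":
        prompts[rewrite.agent] = rewrite.new_prompt
        rewrite.status = "approved"
    else:
        rewrite.status = "rejected"


# Invented prompts for illustration.
prompts = {"rag": "You are the RAG agent.\nCite sources.\n"}
rewrite = propose(
    "rag",
    prompts["rag"],
    "You are the RAG agent.\nCite only retrieved chunks.\n",
    "citation_accuracy was the weakest dimension",
)
review(rewrite, "approve", prompts)
```

Keeping the diff alongside the full proposed prompt lets a reviewer read the change compactly while the approval step applies the complete text without patching.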

Known limitations

  • OpenRouter free-tier models (Llama 3.1 8B) are significantly weaker than GPT-4, so citation quality and adversarial robustness suffer
  • FAISS index is in-memory only: restarts lose the index, with no persistence to disk or pgvector
  • Code sandbox is NOT truly isolated: it runs subprocess.run(["python3", "-c", code]) with no container, no seccomp, no gVisor
  • Eval scoring is heuristic for ambiguous/adversarial cases: keyword matching, not ground-truth comparison
  • Meta-agent prompt rewrites are LLM-generated: plausible but not guaranteed to improve scores
  • No authentication on any endpoint: all routes are publicly accessible
  • .env.example defaults (megaai) don't match config.py defaults (weave), which can cause confusion on first setup without Docker
  • No rate limiting on the API: a flood of /query requests will exhaust LLM budget

What's next

  • Replace FAISS with pgvector for persistent vector storage across restarts
  • Add a proper code sandbox via gVisor or Firecracker microVMs
  • Build a web UI for reviewing prompt rewrites and browsing eval results
  • Add a prompt injection detection layer before the orchestrator
  • Implement weighted dimension scoring with configurable weights per use case

Project structure

Weave/
├── app/
│   ├── __init__.py
│   ├── config.py                # Settings from env (pydantic-settings)
│   ├── database.py              # SQLAlchemy async engine + session + Base
│   ├── main.py                  # FastAPI app: all endpoints + error handling
│   ├── agents/
│   │   ├── __init__.py          # Re-exports all agents
│   │   ├── base.py              # BaseAgent ABC: budget, LLM, logging
│   │   ├── decomposition.py     # Query → SubTask DAG
│   │   ├── rag.py               # Multi-hop FAISS retrieval + citations
│   │   ├── critique.py          # Per-claim confidence + span flagging
│   │   ├── synthesis.py         # Contradiction resolution + provenance
│   │   ├── compression.py       # Context compression on budget overflow
│   │   └── meta.py              # Eval failure analysis → prompt rewrites
│   ├── core/
│   │   ├── __init__.py          # Re-exports BudgetManager
│   │   ├── budget_manager.py    # Token budget enforcement
│   │   ├── llm.py               # OpenRouter async client with fallback
│   │   ├── logger.py            # structlog JSON logging + DB persistence
│   │   └── orchestrator.py      # LangGraph StateGraph with dynamic routing
│   ├── eval/
│   │   ├── __init__.py
│   │   ├── harness.py           # Runs test cases through pipeline + scores
│   │   ├── scorer.py            # 6-dimension hand-rolled scorer
│   │   └── test_cases.py        # 15 eval cases (baseline/ambiguous/adversarial)
│   ├── models/
│   │   ├── __init__.py          # Re-exports all ORM models for Alembic
│   │   ├── job.py               # Job ORM model
│   │   ├── agent_log.py         # AgentLog ORM model
│   │   ├── tool_log.py          # ToolLog ORM model
│   │   ├── eval_run.py          # EvalRun ORM model
│   │   └── prompt_rewrite.py    # PromptRewrite ORM model
│   ├── schemas/
│   │   ├── __init__.py          # Re-exports all Pydantic schemas
│   │   ├── context.py           # SharedContext, AgentOutput, SubTask, etc.
│   │   ├── eval.py              # EvalScore, ScoredDimension, PromptRewrite
│   │   └── tools.py             # ToolResult schema
│   ├── tools/
│   │   ├── __init__.py          # TOOL_REGISTRY + re-exports
│   │   ├── base.py              # BaseTool ABC: timeout, retry, logging
│   │   ├── web_search.py        # Simulated web search (fake results)
│   │   ├── sql_lookup.py        # NL → SQL via LLM → asyncpg execution
│   │   ├── code_sandbox.py      # Python subprocess sandbox
│   │   └── self_reflection.py   # Contradiction detection via LLM
│   └── worker/
│       ├── __init__.py          # Celery app configuration
│       └── tasks.py             # Background tasks (eval, meta-agent)
├── alembic/
│   ├── env.py                   # Async Alembic configuration
│   ├── script.py.mako           # Migration template
│   └── versions/
│       ├── 0001_initial.py      # Creates 5 core tables
│       └── 0002_seed_products_orders.py  # products + orders for sql_lookup
├── log_ui/
│   ├── __init__.py
│   └── main.py                  # Standalone FastAPI log viewer (port 8080)
├── tests/
│   ├── __init__.py
│   └── test_budget_manager.py   # 10 tests for ContextBudgetManager
├── archon_viz.py                # Terminal architecture visualizer (rich + networkx)
├── docker-compose.yml           # 5 services: db, redis, api, worker, log_ui
├── Dockerfile                   # Python 3.11-slim
├── requirements.txt
├── alembic.ini
├── pytest.ini
├── .env.example
└── .gitignore

Built with AI assistance. See AI_COLLABORATION.md for full attestation.

About

Weave is a high-performance, multi-agent LLM orchestration system designed for reliability and observability. Weave provides a robust framework for managing complex agentic workflows with strict token budget enforcement, detailed tool/agent execution logging, and automated evaluation tracking.
