- [May 07, 2025]: The full datasets of Version 1.0 are released (homepage, huggingface)!
Workspace-Bench is a benchmark for evaluating AI agents on workspace tasks with large-scale file dependencies. It is built to study a capability we call Workspace Learning: whether an agent can identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a real worker's workspace.
Rubric pass rates on Workspace-Bench-Lite across multiple combinations of agent harnesses and backbone LLMs. See Details.
Workspace-Bench contains:
- 5 realistic worker profiles: Operations Manager, Logistics Manager, AI Product Manager, Researcher, and Backend Developer
- 74 file types across heterogeneous workspace environments
- 20,476 files, with workspaces scaling up to 20GB
- 388 tasks, each paired with an explicit file dependency graph
- 7,399 fine-grained rubrics for evaluation
- Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation cost by about 70%
Follow these steps to download Workspace-Bench-Lite and run a one-task smoke evaluation. You will need:
- Docker
- Python 3
- API credentials for the agent you want to run
- An Anthropic-compatible API endpoint for the judge model
cd evaluation
cp .env.example .env
Fill .env before running an evaluation. For the default smoke command below,
set KIMIK25_BASE_URL and KIMIK25_API_KEY. For rubric judging, also set
JUDGE_BASE_URL, JUDGE_MODEL, and JUDGE_API_KEY; the judge endpoint must
be Anthropic-compatible because agent_as_a_judge.py runs the judge through
the ClaudeCode harness.
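Before starting a run, it can help to confirm that the variables named above are actually filled in. The following is a minimal sketch, not part of the repository: the helper names `parse_env_text` and `missing_keys` are hypothetical, and it assumes `.env` uses plain `KEY=VALUE` lines.

```python
from pathlib import Path

# Variable names come from the setup text above; the helpers are illustrative.
REQUIRED_KEYS = [
    "KIMIK25_BASE_URL",
    "KIMIK25_API_KEY",
    "JUDGE_BASE_URL",
    "JUDGE_MODEL",
    "JUDGE_API_KEY",
]


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or set to an empty value."""
    values = parse_env_text(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumes the script is run from evaluation/
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(text)
    print("missing:", ", ".join(missing) if missing else "none")
```

If you run a different model than the default smoke command, replace the `KIMIK25_*` entries with the variables that model expects.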
Download the Lite task set and workspace files:
python3 scripts/download_hf_assets.py --lite --workspaces
Build the Docker image and run the bootstrap script:
docker compose -f docker/docker-compose.yaml build
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/bootstrap.sh
Run a single-task smoke evaluation with Codex:
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness codex \
--model kimi-k2.5 \
--dataset smoke
Check the report:
python3 scripts/assert_agent_runner_report.py \
output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json
The expected output is:
[ok] output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json: 1/1 passed
This report only checks whether the agent run completed and produced the
expected output files. To score the task against its rubrics, run
agent_as_a_judge.py inside Docker:
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
python3 -u /workspace/Workspace-Bench/evaluation/src/agent_as_a_judge.py \
--task-dir /workspace/Workspace-Bench/evaluation/output/Codex--Kimi-K2.5--Smoke \
--eval-yaml /workspace/Workspace-Bench/evaluation/runs/judge.yaml \
--overwrite
Rubric judgments are written into each task directory as:
evaluation/output/Codex--Kimi-K2.5--Smoke/100/rubrics_judge--{JUDGE_MODEL}.json
Task outputs, logs, and judge artifacts are written to:
evaluation/output/Codex--Kimi-K2.5--Smoke/
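To see which tasks in a run still lack a rubric judgment, one option is to look for the `rubrics_judge--*.json` file in each task directory. This is a hedged sketch rather than a repository script: `unjudged_tasks` is a hypothetical helper, and it assumes every immediate subdirectory of the run directory is a task directory, which may not hold if the run also writes other folders there.

```python
from pathlib import Path


def unjudged_tasks(run_dir: str) -> list[str]:
    """Return names of task subdirectories with no rubrics_judge--*.json file."""
    root = Path(run_dir)
    pending = []
    # Assumption: each immediate subdirectory is one task (e.g. ".../Smoke/100").
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not any(task_dir.glob("rubrics_judge--*.json")):
            pending.append(task_dir.name)
    return pending


if __name__ == "__main__":
    run = Path("output/Codex--Kimi-K2.5--Smoke")  # path relative to evaluation/
    if run.is_dir():
        print("tasks without a judgment file:", unjudged_tasks(str(run)))
```

The check only tests for the file's presence; it does not read or validate the judgment contents.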
Run the 100-task Lite benchmark:
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness codex \
--model kimi-k2.5 \
--dataset lite
Then judge the completed Lite run:
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
python3 -u /workspace/Workspace-Bench/evaluation/src/agent_as_a_judge.py \
--task-dir /workspace/Workspace-Bench/evaluation/output/Codex--Kimi-K2.5--Lite \
--eval-yaml /workspace/Workspace-Bench/evaluation/runs/judge.yaml \
--parallel \
--workers 3
You can change the harness, model, and dataset split from the command line:
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness openclaw \
--model glm-5.1 \
--dataset lite
Supported harness values are codex, openclaw, deepagent, and
claudecode. Common model aliases include gpt-5.4, gemini-3.1-pro,
kimi-k2.5, glm-5.1, minimax-m2.7, grok-4.3, and qwen-3.6.
When using claudecode, the selected model endpoint must be compatible with
the Anthropic API.
For a custom provider model, add --model-id, --model-name, and
--env-prefix.
Completed run outputs are stored under evaluation/output/.
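When sweeping several harness and model combinations, the run command above can be assembled programmatically. The sketch below is illustrative only: `build_run_command` is a hypothetical helper, it validates only the harness (the one option the text above lists exhaustively), and it passes the model and dataset values through unchecked.

```python
import shlex

# Harness values as listed above; the script path is the one used in the commands above.
SUPPORTED_HARNESSES = ("codex", "openclaw", "deepagent", "claudecode")
RUN_SCRIPT = "/workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh"


def build_run_command(harness: str, model: str, dataset: str) -> list[str]:
    """Build the docker compose argv for one benchmark run."""
    if harness not in SUPPORTED_HARNESSES:
        raise ValueError(
            f"unsupported harness {harness!r}; expected one of {SUPPORTED_HARNESSES}"
        )
    return [
        "docker", "compose", "-f", "docker/docker-compose.yaml",
        "run", "--rm", "workspace-bench",
        "bash", RUN_SCRIPT,
        "--harness", harness,
        "--model", model,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    # Print the command instead of executing it; pass the list to
    # subprocess.run(..., check=True) from evaluation/ to actually launch it.
    print(shlex.join(build_run_command("openclaw", "glm-5.1", "lite")))
```

Returning an argument list rather than a single string avoids shell-quoting issues if a model alias ever contains unusual characters.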
Download the full task set:
python3 scripts/download_hf_assets.py --full
Then run the full benchmark:
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness codex \
--model kimi-k2.5 \
--dataset full
To browse runs and rubric judgments in the web dashboard (requires Node.js):
cd viz
npm install
npm run dev
The dashboard will be available at http://localhost:5173 and will automatically discover results under evaluation/output/.
@misc{tang2026workspacebench10benchmarkingai,
title={Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies},
author={Zirui Tang and Xuanhe Zhou and Yumou Liu and Linchun Li and Weizheng Wang and Hongzhang Huang and Jun Zhou and Jiachen Song and Shaoli Yu and Jinqi Wang and Zihang Zhou and Hongyi Zhou and Yuting Lv and Jinyang Li and Jiashuo Liu and Ruoyu Chen and Chunwei Liu and GuoLiang Li and Jihua Kang and Fan Wu},
year={2026},
eprint={2605.03596},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.03596}
}

