Workspace-Bench

Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

📰 News

[May 07, 2025]: The full Version 1.0 datasets are released (homepage, Hugging Face)!

👋 Overview

Workspace-Bench is a benchmark for evaluating AI agents on workspace tasks with large-scale file dependencies. It is built to study a capability we call Workspace Learning: whether an agent can identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a real worker's workspace.
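To make the idea of file dependencies concrete, here is a minimal sketch of a dependency graph over a toy workspace. The file names, the `DEPENDS_ON` mapping, and both helper functions are hypothetical and invented for illustration; the format in which the benchmark stores its dependency graphs is not described here and may differ.

```python
from collections import deque

# Hypothetical workspace: each file maps to the files it directly depends on.
# These names are invented for illustration and do not come from the dataset.
DEPENDS_ON = {
    "q3_report.docx": ["sales_summary.xlsx", "notes/meeting.md"],
    "sales_summary.xlsx": ["raw/orders.csv", "raw/refunds.csv"],
    "notes/meeting.md": [],
    "raw/orders.csv": [],
    "raw/refunds.csv": [],
}


def transitive_dependencies(graph, target):
    """Return every file reachable from `target` by following dependency edges."""
    seen = set()
    queue = deque(graph.get(target, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def dependents(graph, changed):
    """Return files that directly or indirectly depend on `changed`."""
    return {f for f in graph if changed in transitive_dependencies(graph, f)}


# Files an agent would need to read before editing the report.
print(sorted(transitive_dependencies(DEPENDS_ON, "q3_report.docx")))
# Files that may need an update after the raw orders change.
print(sorted(dependents(DEPENDS_ON, "raw/orders.csv")))
```

The first query corresponds to identifying what a task depends on, and the second to working out what must be updated after a change.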

💫 LeaderBoard

Rubrics success rate across agent settings

Rubric pass rates on Workspace-Bench-Lite across multiple combinations of agent harnesses and backbone LLMs. See Details.

💽 Dataset Introduction

Workspace-Bench contains:

5 realistic worker profiles: Operations Manager, Logistics Manager, AI Product Manager, Researcher, and Backend Developer
74 file types across heterogeneous workspace environments
20,476 files, with workspaces scaling up to 20GB
388 tasks, each paired with an explicit file dependency graph
7,399 fine-grained rubrics for evaluation
Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation cost by about 70%
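As a quick sanity check on scale, the figures in the list above can be combined into two derived numbers. This is only arithmetic on the quoted totals; the averages say nothing about how rubrics are distributed across individual tasks.

```python
# Totals quoted in the list above.
TASKS = 388
RUBRICS = 7_399
LITE_TASKS = 100

# Mean rubrics per task (an average, not a per-task guarantee).
rubrics_per_task = RUBRICS / TASKS
# Fraction of all tasks that the Lite subset covers.
lite_task_share = LITE_TASKS / TASKS

print(f"mean rubrics per task: {rubrics_per_task:.1f}")
print(f"Lite share of tasks:   {lite_task_share:.1%}")
```

The result is roughly 19 rubrics per task, with Lite covering about 26% of the tasks.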

🚀 Quick Start

Follow these steps to download Workspace-Bench-Lite and run a one-task smoke evaluation.

Prerequisites

Docker
Python 3
API credentials for the agent you want to run
An Anthropic-compatible API endpoint for the judge model

cd evaluation
cp .env.example .env

Fill in .env before running an evaluation. For the default smoke command below, set KIMIK25_BASE_URL and KIMIK25_API_KEY. For rubric judging, also set JUDGE_BASE_URL, JUDGE_MODEL, and JUDGE_API_KEY. The judge endpoint must be Anthropic-compatible because agent_as_a_judge.py runs the judge through the ClaudeCode harness.
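Before starting a run, it can help to confirm that the variables named above are actually set. The sketch below is not part of the repository; it assumes .env contains plain KEY=VALUE lines (no `export` prefix, no multi-line values), and the helper names are hypothetical.

```python
from pathlib import Path

# Variable names taken from the paragraph above.
AGENT_VARS = ["KIMIK25_BASE_URL", "KIMIK25_API_KEY"]
JUDGE_VARS = ["JUDGE_BASE_URL", "JUDGE_MODEL", "JUDGE_API_KEY"]


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_vars(values, required):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


def check_env_file(path=".env"):
    """Return the missing variable names for the file at `path`."""
    env_path = Path(path)
    if not env_path.is_file():
        return AGENT_VARS + JUDGE_VARS
    values = parse_env(env_path.read_text(encoding="utf-8"))
    return missing_vars(values, AGENT_VARS + JUDGE_VARS)


if __name__ == "__main__":
    missing = check_env_file()
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it from the evaluation directory so that the default `.env` path resolves to the file created above.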

Download Data

Download the Lite task set and workspace files:

python3 scripts/download_hf_assets.py --lite --workspaces

Build Environment

docker compose -f docker/docker-compose.yaml build
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/bootstrap.sh

Run One Task

Run a single-task smoke evaluation with Codex:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset smoke

Check the report:

python3 scripts/assert_agent_runner_report.py \
  output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json

The expected output is:

[ok] output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json: 1/1 passed

This report only checks whether the agent run completed and produced the expected output files. To score the task against its rubrics, run agent_as_a_judge.py inside Docker:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  python3 -u /workspace/Workspace-Bench/evaluation/src/agent_as_a_judge.py \
  --task-dir /workspace/Workspace-Bench/evaluation/output/Codex--Kimi-K2.5--Smoke \
  --eval-yaml /workspace/Workspace-Bench/evaluation/runs/judge.yaml \
  --overwrite

Rubric judgments are written into each task directory as:

evaluation/output/Codex--Kimi-K2.5--Smoke/100/rubrics_judge--{JUDGE_MODEL}.json

Task outputs, logs, and judge artifacts are written to:

evaluation/output/Codex--Kimi-K2.5--Smoke/
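To see which tasks in a run already have rubric judgments, one option is to scan the run directory for files matching the naming pattern above. This sketch is not part of the repository; `judged_tasks` is a hypothetical helper, and it assumes every immediate subdirectory of the run directory is a task directory. It only lists file names and does not read the JSON, since the schema is not described here.

```python
from pathlib import Path


def judged_tasks(run_dir):
    """Map each task directory name to its rubric judgment file names."""
    run_path = Path(run_dir)
    result = {}
    if not run_path.is_dir():
        return result
    for task_dir in sorted(p for p in run_path.iterdir() if p.is_dir()):
        # Pattern from this README: rubrics_judge--{JUDGE_MODEL}.json
        files = sorted(f.name for f in task_dir.glob("rubrics_judge--*.json"))
        result[task_dir.name] = files
    return result


if __name__ == "__main__":
    run = "evaluation/output/Codex--Kimi-K2.5--Smoke"
    for task, files in judged_tasks(run).items():
        status = ", ".join(files) if files else "not judged yet"
        print(f"{task}: {status}")
```

An empty list for a task means the agent run produced a directory but no judge file has been written for it yet.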

Run Workspace-Bench-Lite

Run the 100-task Lite benchmark:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset lite

Then judge the completed Lite run:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  python3 -u /workspace/Workspace-Bench/evaluation/src/agent_as_a_judge.py \
  --task-dir /workspace/Workspace-Bench/evaluation/output/Codex--Kimi-K2.5--Lite \
  --eval-yaml /workspace/Workspace-Bench/evaluation/runs/judge.yaml \
  --parallel \
  --workers 3

Other Run Configs

You can change the harness, model, and dataset split from the command line:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness openclaw \
  --model glm-5.1 \
  --dataset lite

Supported harness values are codex, openclaw, deepagent, and claudecode. Common model aliases include gpt-5.4, gemini-3.1-pro, kimi-k2.5, glm-5.1, minimax-m2.7, grok-4.3, and qwen-3.6. When using claudecode, the selected model endpoint must be compatible with the Anthropic API. For a custom provider model, add --model-id, --model-name, and --env-prefix. Completed run outputs are stored under evaluation/output/.
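When scripting several runs, the command above can be assembled programmatically. The sketch below is not part of the repository; `build_run_command` is a hypothetical helper that only uses the flags and values shown in this README. It validates the harness against the supported list and the dataset against the three splits used here (smoke, lite, full). The custom-provider flags are left out because their expected values are not described here.

```python
import shlex

# Harness values listed in the paragraph above.
SUPPORTED_HARNESSES = {"codex", "openclaw", "deepagent", "claudecode"}
# Dataset values that appear in this README's commands.
KNOWN_DATASETS = {"smoke", "lite", "full"}

COMPOSE_PREFIX = [
    "docker", "compose", "-f", "docker/docker-compose.yaml",
    "run", "--rm", "workspace-bench",
    "bash", "/workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh",
]


def build_run_command(harness, model, dataset):
    """Return the argument list for one benchmark run."""
    if harness not in SUPPORTED_HARNESSES:
        raise ValueError(f"unsupported harness: {harness!r}")
    if dataset not in KNOWN_DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return COMPOSE_PREFIX + [
        "--harness", harness,
        "--model", model,
        "--dataset", dataset,
    ]


# Print the command as a shell-safe string; pass the list to
# subprocess.run(...) instead to actually execute it.
print(shlex.join(build_run_command("openclaw", "glm-5.1", "lite")))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.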

Run the Full Benchmark

Download the full task set:

python3 scripts/download_hf_assets.py --full

Then run the full benchmark:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset full

Visualize Results

To browse runs and rubric judgments in the web dashboard (requires Node.js):

cd viz
npm install
npm run dev

The dashboard is served at http://localhost:5173 and automatically discovers results under evaluation/output/.

🔎 Publications

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

@misc{tang2026workspacebench10benchmarkingai,
      title={Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies}, 
      author={Zirui Tang and Xuanhe Zhou and Yumou Liu and Linchun Li and Weizheng Wang and Hongzhang Huang and Jun Zhou and Jiachen Song and Shaoli Yu and Jinqi Wang and Zihang Zhou and Hongyi Zhou and Yuting Lv and Jinyang Li and Jiashuo Liu and Ruoyu Chen and Chunwei Liu and GuoLiang Li and Jihua Kang and Fan Wu},
      year={2026},
      eprint={2605.03596},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.03596}
}
