OpenDataBox/Workspace-Bench


Workspace-Bench

Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies


📰 News

👋 Overview

Workspace-Bench is a benchmark for evaluating AI agents on workspace tasks with large-scale file dependencies. It is built to study a capability we call Workspace Learning: whether an agent can identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a real worker's workspace.

Workspace-Bench framework overview

💫 Leaderboard

Rubrics success rate across agent settings

Rubric pass rates on Workspace-Bench-Lite across multiple combinations of agent harnesses and backbone LLMs. See Details.

💽 Dataset Introduction

Workspace-Bench dataset distribution

Workspace-Bench contains:

  • 5 realistic worker profiles: Operations Manager, Logistics Manager, AI Product Manager, Researcher, and Backend Developer
  • 74 file types across heterogeneous workspace environments
  • 20,476 files, with workspaces scaling up to 20GB
  • 388 tasks, each paired with an explicit file dependency graph
  • 7,399 fine-grained rubrics for evaluation
  • Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation cost by about 70%

🚀 Quick Start

Follow these steps to download Workspace-Bench-Lite and run a one-task smoke evaluation.

Prerequisites

  • Docker
  • Python 3
  • API credentials for the agent you want to run
  • An Anthropic-compatible API endpoint for the judge model

Create the environment file from the template:

cd evaluation
cp .env.example .env

Fill .env before running an evaluation. For the default smoke command below, set KIMIK25_BASE_URL and KIMIK25_API_KEY. For rubric judging, also set JUDGE_BASE_URL, JUDGE_MODEL, and JUDGE_API_KEY; the judge endpoint must be Anthropic-compatible because agent_as_a_judge.py runs the judge through the ClaudeCode harness.
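Before starting a run, it can help to confirm that the variables named above are actually set. The following is a minimal sketch, not part of the repository: the helper names `parse_env` and `missing_keys` are hypothetical, and the parser assumes plain `KEY=VALUE` lines, which may be simpler than what `.env.example` really uses.

```python
from pathlib import Path

# Variable names taken from the paragraph above.
REQUIRED_KEYS = [
    "KIMIK25_BASE_URL",
    "KIMIK25_API_KEY",
    "JUDGE_BASE_URL",
    "JUDGE_MODEL",
    "JUDGE_API_KEY",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: no `export` prefixes, multi-line values, or interpolation.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty, in listed order."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # run from the evaluation/ directory
    if not env_path.exists():
        print("no .env found; run `cp .env.example .env` first")
    else:
        missing = missing_keys(env_path.read_text(encoding="utf-8"))
        print("missing or empty:", ", ".join(missing) if missing else "none")
```

If you only plan to run the agent and skip rubric judging, the three `JUDGE_*` entries can be dropped from `REQUIRED_KEYS`.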

Download Data

Download the Lite task set and workspace files:

python3 scripts/download_hf_assets.py --lite --workspaces

Build Environment

docker compose -f docker/docker-compose.yaml build
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/bootstrap.sh

Run One Task

Run a single-task smoke evaluation with Codex:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset smoke

Check the report:

python3 scripts/assert_agent_runner_report.py \
  output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json

The expected output is:

[ok] output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json: 1/1 passed

This report only checks whether the agent run completed and produced the expected output files. To score the task against its rubrics, run agent_as_a_judge.py inside Docker:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  python3 -u /workspace/Workspace-Bench/evaluation/src/agent_as_a_judge.py \
  --task-dir /workspace/Workspace-Bench/evaluation/output/Codex--Kimi-K2.5--Smoke \
  --eval-yaml /workspace/Workspace-Bench/evaluation/runs/judge.yaml \
  --overwrite

Rubric judgments are written into each task directory as:

evaluation/output/Codex--Kimi-K2.5--Smoke/100/rubrics_judge--{JUDGE_MODEL}.json

Task outputs, logs, and judge artifacts are written to:

evaluation/output/Codex--Kimi-K2.5--Smoke/
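To see which tasks in a run already have rubric judgments, one option is to scan the run directory for files matching the naming pattern above. This is a rough sketch under stated assumptions: it treats every immediate subdirectory of the run directory as a task directory (as the `100/` path above suggests, though the real layout may include other folders), it does not read the JSON contents, and `judged_status` is a hypothetical helper name.

```python
from pathlib import Path


def judged_status(run_dir: str) -> dict[str, list[str]]:
    """Map each subdirectory name to the judge files found directly inside it.

    Assumption: each immediate subdirectory of run_dir is one task directory.
    """
    status = {}
    for task_dir in sorted(Path(run_dir).iterdir()):
        if not task_dir.is_dir():
            continue  # skip top-level files such as agent_runner_report.json
        files = sorted(p.name for p in task_dir.glob("rubrics_judge--*.json"))
        status[task_dir.name] = files
    return status


if __name__ == "__main__":
    run_dir = Path("evaluation/output/Codex--Kimi-K2.5--Smoke")
    if run_dir.is_dir():
        for task, files in judged_status(str(run_dir)).items():
            print(task, "->", files or "not judged yet")
    else:
        print(f"{run_dir} not found; run the smoke evaluation first")
```

An empty list for a task means no judge file was found there, which may indicate that `agent_as_a_judge.py` has not been run for it yet.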

Run Workspace-Bench-Lite

Run the 100-task Lite benchmark:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset lite

Then judge the completed Lite run:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  python3 -u /workspace/Workspace-Bench/evaluation/src/agent_as_a_judge.py \
  --task-dir /workspace/Workspace-Bench/evaluation/output/Codex--Kimi-K2.5--Lite \
  --eval-yaml /workspace/Workspace-Bench/evaluation/runs/judge.yaml \
  --parallel \
  --workers 3

Other Run Configs

You can change the harness, model, and dataset split from the command line:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness openclaw \
  --model glm-5.1 \
  --dataset lite

Supported harness values are codex, openclaw, deepagent, and claudecode. Common model aliases include gpt-5.4, gemini-3.1-pro, kimi-k2.5, glm-5.1, minimax-m2.7, grok-4.3, and qwen-3.6. When using claudecode, the selected model endpoint must be compatible with the Anthropic API. For a custom provider model, add --model-id, --model-name, and --env-prefix. Completed run outputs are stored under evaluation/output/.
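When launching several configurations, it may be convenient to assemble the command programmatically and reject unsupported harness names early. The sketch below only rebuilds the `docker compose` invocation shown above. `build_run_command` is a hypothetical helper, the dataset list is limited to the splits that appear in this README, and model names are passed through unchecked because the alias list above is described as non-exhaustive.

```python
import shlex

# Harness values listed in the paragraph above.
SUPPORTED_HARNESSES = ("codex", "openclaw", "deepagent", "claudecode")
# Dataset splits that appear in this README's commands (assumed complete).
DATASETS = ("smoke", "lite", "full")


def build_run_command(harness: str, model: str, dataset: str) -> list[str]:
    """Return the argv list for one run-benchmark.sh invocation."""
    if harness not in SUPPORTED_HARNESSES:
        raise ValueError(f"unsupported harness: {harness!r}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset split: {dataset!r}")
    return [
        "docker", "compose", "-f", "docker/docker-compose.yaml",
        "run", "--rm", "workspace-bench",
        "bash", "/workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh",
        "--harness", harness,
        "--model", model,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    cmd = build_run_command("openclaw", "glm-5.1", "lite")
    # Print a shell-ready string; alternatively pass `cmd` to subprocess.run.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting concerns.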

Run the Full Benchmark

Download the full task set:

python3 scripts/download_hf_assets.py --full

Then run the full benchmark:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset full

Visualize Results

To browse runs and rubric judgments in the web dashboard (requires Node.js):

cd viz
npm install
npm run dev

The dashboard will be available at http://localhost:5173 and automatically discovers results under evaluation/output/.

🔎 Publications

@misc{tang2026workspacebench10benchmarkingai,
      title={Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies}, 
      author={Zirui Tang and Xuanhe Zhou and Yumou Liu and Linchun Li and Weizheng Wang and Hongzhang Huang and Jun Zhou and Jiachen Song and Shaoli Yu and Jinqi Wang and Zihang Zhou and Hongyi Zhou and Yuting Lv and Jinyang Li and Jiashuo Liu and Ruoyu Chen and Chunwei Liu and GuoLiang Li and Jihua Kang and Fan Wu},
      year={2026},
      eprint={2605.03596},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.03596}
}

🤝 Acknowledgement

Lark / Feishu, Shanghai Jiao Tong University
