New Experiment

Predefined Benchmarks

Select a benchmark to pre-fill the form below, or configure the experiment manually.

GAIA

Diverse real-world tasks with pass/fail ground truth.

466 tasks · paper

Coming Soon

τ-bench

Realistic, tool-augmented conversations between a user and an agent.

120 tasks · paper

Coming Soon

AgentBench

Multi-environment agent evaluation (OS, DB, web, games).

1091 tasks · paper

Coming Soon

No datasets found. Upload one or enter an ID manually.

Number of rows to use. Leave blank to use all rows.

llm_judge
60%
20%
20%