Select a benchmark to pre-fill the form below, or configure manually.
Diverse real-world tasks with pass/fail ground truth.
466 tasks · paper
Realistic, tool-augmented user/agent conversations.
120 tasks · paper
Multi-environment agent evaluation (OS, DB, web, games).
1091 tasks · paper
No datasets found. Upload one or enter an ID manually.
Number of rows to use. Leave blank to use all rows.