New Experiment

Predefined Benchmarks

Select a benchmark to pre-fill the form below, or configure the experiment manually.

Diverse real-world tasks with pass/fail ground truth.

466 tasks · paper

Coming Soon

Realistic, tool-augmented conversations between a user and an agent.

120 tasks · paper

Coming Soon

Multi-environment agent evaluation (OS, DB, web, games).

1091 tasks · paper

Coming Soon