Optimize any agentic workflow against cost, speed, and quality.
Drop in a high-level task or an existing graph. AutoAW co-evolves topology, prompts, models, and tools — searching the Pareto frontier of your utility function so you ship the cheapest version that still hits your bar.
From a brittle prototype to a Pareto-optimal pipeline.
AutoAW doesn't just tune prompts — it rewrites the graph. Below is the actual diff from the customer-support copilot experiment.
Two paradigms. One optimizer. No assumptions.
Multi-agent committees and agentic glue are different bets on where reasoning should live. AutoAW makes neither bet upfront — it searches both and lets your fitness function settle the argument.
| Dimension | Multi-Agent Committee | Agentic Glue + Skills |
|---|---|---|
| Latency | High — each hand-off is a fresh LLM call | Low — one LLM, tools run as fast code |
| Context | Fragmented — state re-interpreted at each hop | Unified — one context window, no drift |
| Debugging | Complex — trace across N reasoning chains | Standard — did the LLM call the skill? Did it run? |
| Cost | Proportional to agent count × task length | Proportional to conductor reasoning only |
AutoAW seeds its initial population with candidates spanning both paradigms. The multi-objective fitness function — penalizing cost and latency directly — creates natural selection pressure toward whichever architecture is actually cheaper and faster for your specific task and dataset.
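To make the selection pressure concrete, here is a minimal sketch of Pareto dominance over the three objectives named above. It is not AutoAW's implementation: the `Candidate` type, the `dominates` and `pareto_frontier` helpers, and every number in `population` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float    # higher is better
    cost_usd: float   # lower is better
    latency_s: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = (
        a.quality >= b.quality
        and a.cost_usd <= b.cost_usd
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost_usd < b.cost_usd
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical population spanning both paradigms (made-up numbers).
population = [
    Candidate("committee-5-agents", quality=0.80, cost_usd=0.90, latency_s=9.0),
    Candidate("committee-3-agents", quality=0.78, cost_usd=0.95, latency_s=9.5),
    Candidate("conductor+skills", quality=0.76, cost_usd=0.20, latency_s=2.0),
    Candidate("single-call", quality=0.60, cost_usd=0.05, latency_s=0.8),
]

frontier = pareto_frontier(population)
print([c.name for c in frontier])
```

In this toy population the three-agent committee drops out because the five-agent committee beats it on all three objectives. The other three survive because each one wins on at least one axis.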
The genetic operators handle the transitions: mutate_structure and the Compaction operator collapse N agents into a single conductor when the fitness landscape rewards it; the Delegation Split operator spawns parallel agents when the task benefits from specialization. No topology is privileged — the search finds where your task lives.
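One way to picture the two structural moves is sketched below, under heavy assumptions. The page names a Compaction operator and a Delegation Split operator but does not give their signatures, so the `Agent` and `Workflow` types and the `compact` and `delegation_split` functions are hypothetical stand-ins, not AutoAW's API.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    prompt: str
    tools: list[str] = field(default_factory=list)


@dataclass
class Workflow:
    agents: list[Agent]


def compact(wf: Workflow) -> Workflow:
    """Hypothetical compaction: collapse N agents into one conductor owning every tool."""
    if len(wf.agents) <= 1:
        return wf
    tools: list[str] = []
    for agent in wf.agents:
        for tool in agent.tools:
            if tool not in tools:  # de-duplicate while keeping first-seen order
                tools.append(tool)
    prompt = "\n".join(f"[{a.name}] {a.prompt}" for a in wf.agents)
    return Workflow([Agent("conductor", prompt, tools)])


def delegation_split(wf: Workflow, agent_name: str) -> Workflow:
    """Hypothetical split: replace one multi-tool agent with one specialist per tool."""
    out: list[Agent] = []
    for agent in wf.agents:
        if agent.name == agent_name and len(agent.tools) >= 2:
            out.extend(Agent(f"{agent.name}:{t}", agent.prompt, [t]) for t in agent.tools)
        else:
            out.append(agent)
    return Workflow(out)


# Illustrative two-agent committee (agent and tool names are made up).
committee = Workflow([
    Agent("classifier", "Label the ticket.", ["kb_search"]),
    Agent("drafter", "Write a reply.", ["kb_search", "crm_lookup"]),
])
merged = compact(committee)
split = delegation_split(merged, "conductor")
print(len(merged.agents), [a.name for a in split.agents])
```

Both functions return a new workflow rather than editing in place, so a search loop can score parent and child side by side and keep whichever the fitness function prefers.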
Four steps. One Pareto frontier.
Describe the task
Either give a high-level task — "triage support tickets and propose a draft reply" — or import an existing workflow (LangGraph, DSPy, your own Python).
Pick datasets & weights
Choose an eval dataset (yours or one of 18 included). Set the utility weights — quality, $/run, p50 latency. AutoAW does the rest.
AutoAW searches
A search loop swaps models, splits/merges agents, edits prompts, prunes tools, and re-evaluates every candidate. The Pareto frontier fills in as it runs.
Promote & deploy
Pick any point on the frontier. Export as a single graph (JSON, Python, or a hosted endpoint). Fork to keep optimizing.
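The search step can be sketched as a small evolutionary loop that keeps an archive of non-dominated candidates. This is a toy under stated assumptions, not AutoAW's algorithm: `MODELS`, `evaluate`, `mutate`, and `search` are hypothetical, the scores are made up, and only two of the moves listed above (model swaps and agent-count changes) are modeled.

```python
import random

# Hypothetical per-call profile for each model: (quality, $ per call, seconds per call).
MODELS = {"small": (0.60, 0.03, 1.0), "large": (0.80, 0.10, 2.0)}


def evaluate(cfg):
    """Stand-in for running the eval dataset; returns (quality, cost, latency)."""
    q, c, t = MODELS[cfg["model"]]
    n = cfg["agents"]
    quality = min(1.0, q + 0.03 * (n - 1))  # assumed small gain per extra agent
    return (round(quality, 3), round(c * n, 3), round(t * n, 3))


def mutate(cfg, rng):
    """Either swap the model or add/remove one agent (clamped to 1..4)."""
    child = dict(cfg)
    if rng.random() < 0.5:
        child["model"] = rng.choice(sorted(MODELS))
    else:
        child["agents"] = max(1, min(4, child["agents"] + rng.choice([-1, 1])))
    return child


def dominates(a, b):
    """a is no worse on quality (max), cost (min), latency (min), and not identical."""
    return a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2] and a != b


def search(seed_cfg, generations=50, seed=0):
    rng = random.Random(seed)
    archive = {tuple(sorted(seed_cfg.items())): evaluate(seed_cfg)}
    for _ in range(generations):
        parent = dict(rng.choice(sorted(archive)))
        child = mutate(parent, rng)
        score = evaluate(child)
        if any(dominates(s, score) for s in archive.values()):
            continue  # child is beaten on every axis; discard it
        # Drop archive members the child dominates, then admit the child.
        archive = {k: s for k, s in archive.items() if not dominates(score, s)}
        archive[tuple(sorted(child.items()))] = score
    return archive


frontier = search({"model": "small", "agents": 1})
for cfg, score in sorted(frontier.items()):
    print(dict(cfg), score)
```

The archive plays the role of the frontier that fills in as the search runs: every entry is a configuration that nothing found so far beats on all three axes, and any of them could be exported.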
What it found in our last run.
Task: customer-support copilot, 1,400-ticket eval. Objective = 0.7·quality + 0.25·cost⁻¹ + 0.05·speed.
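As a worked example of the objective above, the sketch below scores two made-up candidates. The weights come from the stated objective, but the page does not say how cost⁻¹ and speed are scaled. This example assumes both are expressed relative to a reference run (`ref_cost_usd / cost_usd` and `ref_latency_s / latency_s`); that normalization, the `utility` function, and all input numbers are assumptions.

```python
# Weights taken from the stated objective: 0.7·quality + 0.25·cost⁻¹ + 0.05·speed.
WEIGHTS = {"quality": 0.70, "cost_inv": 0.25, "speed": 0.05}


def utility(quality, cost_usd, latency_s, ref_cost_usd, ref_latency_s):
    """Scalar utility; cost and latency are normalized against a reference run (assumed)."""
    cost_inv = ref_cost_usd / cost_usd  # 1.0 when as cheap as the reference
    speed = ref_latency_s / latency_s   # 1.0 when as fast as the reference
    return (
        WEIGHTS["quality"] * quality
        + WEIGHTS["cost_inv"] * cost_inv
        + WEIGHTS["speed"] * speed
    )


# Hypothetical candidates, both normalized against the cheapest and fastest run.
accurate = utility(quality=0.80, cost_usd=0.10, latency_s=2.0,
                   ref_cost_usd=0.05, ref_latency_s=1.0)
frugal = utility(quality=0.70, cost_usd=0.05, latency_s=1.0,
                 ref_cost_usd=0.05, ref_latency_s=1.0)
print(round(accurate, 3), round(frugal, 3))
```

Under these assumptions the cheaper candidate wins (0.79 against 0.71) despite a ten-point quality gap, because halving cost is worth 0.125 of utility under the 0.25 weight. Raising the quality weight or changing the normalization would flip that ordering.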
Live: top results on GAIA.
Public benchmark: anyone can submit. AutoAW-optimized graphs rank above single-model baselines on both quality and cost.
| # | Team / submission | Submitted | Configuration | Quality | Cost / run | Latency p50 |
|---|---|---|---|---|---|---|
| 1 | AutoAW · optimized (best · pareto) | May 18 | ensemble (CS+G5m) | 0.741 | $4.20 | 2.4s |
| 2 | Anthropic baseline | Apr 30 | Claude Sonnet 4.5 | 0.692 | $14.10 | 5.8s |
| 3 | OpenAI baseline | Apr 11 | GPT-5 | 0.688 | $26.80 | 8.7s |
| 4 | Cosine.ai | Mar 22 | Genie-2 | 0.671 | $19.40 | 9.2s |
| 5 | Google DeepMind | Apr 04 | Gemini 2.5 Pro | 0.659 | $11.90 | 5.1s |
| 6 | Manus AI | Feb 28 | manus-r1 | 0.641 | $8.30 | 4.4s |