v0.4 · invite-only beta

Optimize any agentic workflow
against cost, speed, and quality.

Drop in a high-level task or an existing graph. AutoAW co-evolves topology, prompts, models, and tools — searching the Pareto frontier of your utility function so you ship the cheapest version that still hits your bar.

Start an experiment See experiments

74.1% on GAIA · vs 69.2% Sonnet 4.5 baseline

6.7× cheaper · vs same-quality GPT-5 single-agent

38 min to SOTA · 94 candidates evaluated

exp-7c4e · customer-support-copilot · gen 18/40

running

quality

0.732

+0.041 vs gen 1

cost / run

$4.62

−84% vs gen 1

p50 latency

2.4s

−72% vs gen 1

02 · diff

From a brittle prototype to a Pareto-optimal pipeline.

AutoAW doesn't just tune prompts — it rewrites the graph. Below is the actual diff from the customer-support copilot experiment.

before · v0 · hand-written

quality

0.41

cost/run

$38.50

p50

12.4s

nodes

6 (1 model)

22 generations

after · gen 22 · AutoAW-optimized

quality

0.741

cost/run

$4.20

p50

2.4s

nodes

4 (3 models)
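
The relative changes implied by the two panels above can be checked with a few lines of Python. This is only arithmetic on the figures shown on this page; the dictionary keys are illustrative names, not AutoAW fields.

```python
# Before/after metrics copied from the diff panels above.
before = {"quality": 0.41, "cost_per_run": 38.50, "p50_latency_s": 12.4}
after = {"quality": 0.741, "cost_per_run": 4.20, "p50_latency_s": 2.4}


def relative_change(old: float, new: float) -> float:
    """Signed fractional change from old to new (negative means a reduction)."""
    return (new - old) / old


changes = {name: relative_change(before[name], after[name]) for name in before}

for name, change in changes.items():
    # Prints one signed percentage per metric.
    print(f"{name}: {change:+.1%}")
```

With these inputs, cost per run falls by roughly 89% and p50 latency by roughly 81%, while quality rises by about 0.33 in absolute terms.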

04 · architectures

Two paradigms. One optimizer. No assumptions.

Multi-agent committees and agentic glue are different bets on where reasoning should live. AutoAW makes neither bet upfront — it searches both and lets your fitness function settle the argument.

multi-agent committee

LLM-to-LLM hand-offs · fragmented context · high coordination overhead

wins when

· Security or permission isolation is required between agents

· Domain knowledge is too large for a single context window

· Task genuinely needs competing hypotheses (debate topology)

agentic glue + skills

one conductor · deterministic tools · unified context

wins when

· Standard engineering, data extraction, or enterprise automation tasks

· Cost and latency are in the objective, so the fitness function will steer the search toward this architecture

· One capable model can hold the full reasoning context end-to-end

Dimension	Multi-Agent Committee	Agentic Glue + Skills
Latency	High — each hand-off is a fresh LLM call	Low — one LLM, tools run as fast code
Context	Fragmented — state re-interpreted at each hop	Unified — one context window, no drift
Debugging	Complex — trace across N reasoning chains	Standard — did the LLM call the skill? Did it run?
Cost	Proportional to agent count × task length	Proportional to conductor reasoning only
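
To make the Cost and Latency rows concrete, here is a back-of-the-envelope model in Python. It is a rough sketch, not AutoAW's cost accounting: the `LLMCall` type, both functions, and every number are made-up placeholders, and it assumes committee hand-offs run sequentially.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCall:
    usd: float      # assumed price of one LLM call (placeholder)
    seconds: float  # assumed latency of one LLM call (placeholder)


def committee(n_agents: int, steps: int, call: LLMCall) -> tuple[float, float]:
    """Every agent makes one LLM call per task step; calls run back to back."""
    calls = n_agents * steps
    return calls * call.usd, calls * call.seconds


def glue(conductor_calls: int, tool_calls: int, call: LLMCall,
         tool_seconds: float = 0.05) -> tuple[float, float]:
    """Only the conductor calls the LLM; tools are plain code with no LLM cost."""
    cost = conductor_calls * call.usd
    latency = conductor_calls * call.seconds + tool_calls * tool_seconds
    return cost, latency


call = LLMCall(usd=0.02, seconds=1.5)
committee_cost, committee_latency = committee(n_agents=3, steps=4, call=call)
glue_cost, glue_latency = glue(conductor_calls=4, tool_calls=8, call=call)
print(f"committee: ${committee_cost:.2f}, {committee_latency:.1f}s")
print(f"glue:      ${glue_cost:.2f}, {glue_latency:.1f}s")
```

Under these placeholder numbers the committee pays for 12 LLM calls and the conductor for 4. The gap narrows or reverses when the task really needs the isolation or specialization listed under "wins when".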

How AutoAW discovers which wins — without assuming either

AutoAW seeds its initial population with candidates spanning both paradigms. The multi-objective fitness function — penalizing cost and latency directly — creates natural selection pressure toward whichever architecture is actually cheaper and faster for your specific task and dataset.

The genetic operators handle the transitions: mutate_structure and the Compaction operator collapse N agents into a single conductor when the fitness landscape rewards it; the Delegation Split operator spawns parallel agents when the task benefits from specialization. No topology is privileged — the search finds where your task lives.

Compaction

N agents → 1 conductor + skills

→ glue

Delegation Split

1 conductor → k parallel agents

→ committee

Critique Inject

Insert validator after any node

→ hybrid
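
The three operators above can be pictured as rewrites on a small graph structure. The Python below is a minimal sketch under stated assumptions: `Node`, `Workflow`, and the three functions are hypothetical names invented for illustration, not AutoAW's internal representation or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "agent", "conductor", "skill", or "validator"


@dataclass(frozen=True)
class Workflow:
    nodes: tuple[Node, ...]
    edges: tuple[tuple[str, str], ...]  # directed (source, target) pairs


def compaction(wf: Workflow) -> Workflow:
    """N agents -> 1 conductor plus one skill per former agent."""
    agents = [n for n in wf.nodes if n.kind == "agent"]
    agent_names = {n.name for n in agents}
    kept = tuple(n for n in wf.nodes if n.kind != "agent")
    conductor = Node("conductor", "conductor")
    skills = tuple(Node(f"{n.name}_skill", "skill") for n in agents)
    edges = set()
    for src, dst in wf.edges:
        src = conductor.name if src in agent_names else src
        dst = conductor.name if dst in agent_names else dst
        if src != dst:  # agent-to-agent hand-offs vanish inside the conductor
            edges.add((src, dst))
    edges.update((conductor.name, s.name) for s in skills)
    return Workflow(kept + (conductor,) + skills, tuple(sorted(edges)))


def delegation_split(wf: Workflow, conductor: str, k: int) -> Workflow:
    """1 conductor -> k parallel agents sharing its inputs and outputs."""
    agents = tuple(Node(f"{conductor}_agent_{i}", "agent") for i in range(k))
    nodes = tuple(n for n in wf.nodes if n.name != conductor) + agents
    edges = []
    for src, dst in wf.edges:
        if dst == conductor:
            edges.extend((src, a.name) for a in agents)
        elif src == conductor:
            edges.extend((a.name, dst) for a in agents)
        else:
            edges.append((src, dst))
    return Workflow(nodes, tuple(edges))


def critique_inject(wf: Workflow, after: str) -> Workflow:
    """Insert a validator between `after` and all of its successors."""
    validator = Node(f"{after}_validator", "validator")
    edges = [(validator.name if src == after else src, dst) for src, dst in wf.edges]
    edges.append((after, validator.name))
    return Workflow(wf.nodes + (validator,), tuple(edges))


committee = Workflow(
    nodes=(Node("triage", "agent"), Node("drafter", "agent"), Node("reviewer", "agent")),
    edges=(("triage", "drafter"), ("drafter", "reviewer")),
)
glue = compaction(committee)                               # -> glue
split = delegation_split(glue, conductor="conductor", k=2)  # -> committee
hybrid = critique_inject(committee, after="drafter")       # -> hybrid
```

Each function returns a new `Workflow` instead of mutating its input, so a search loop can keep parent and child candidates side by side.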

05 · how it works

Four steps. One Pareto frontier.

STEP 01

Describe the task

Either give a high-level task — "triage support tickets and propose a draft reply" — or import an existing workflow (LangGraph, DSPy, your own Python).

# task.yaml

goal: "Triage support tickets,

  propose draft reply with citations"

STEP 02

Pick datasets & weights

Choose an eval dataset (yours or one of 18 included). Set the utility weights — quality, $/run, p50 latency. AutoAW does the rest.

STEP 03

AutoAW searches

A search loop swaps models, splits/merges agents, edits prompts, prunes tools, and re-evaluates every candidate. The Pareto frontier fills in as it runs.
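
As a rough picture of such a loop, the sketch below mutates a toy candidate and keeps an archive of non-dominated results. Everything in it is an assumption made for illustration: the `Candidate` fields, the `MODEL_PROFILES` numbers, and the `evaluate` formula are invented stand-ins, and only model swaps and agent splits/merges are modeled (no prompt edits or tool pruning).

```python
import random
from dataclasses import dataclass

# Hypothetical model tiers: (base quality, USD per call, seconds per call).
MODEL_PROFILES = {"small": (0.55, 0.5, 0.6), "medium": (0.65, 2.0, 1.2), "large": (0.72, 6.0, 2.5)}


@dataclass(frozen=True)
class Candidate:
    model: str
    n_agents: int


def evaluate(c: Candidate) -> tuple[float, float, float]:
    """Toy stand-in for an eval run: returns (quality, cost, latency)."""
    base_quality, usd, seconds = MODEL_PROFILES[c.model]
    quality = min(1.0, base_quality + 0.03 * (c.n_agents - 1))
    return quality, usd * c.n_agents, seconds * c.n_agents


def dominates(a, b) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def mutate(c: Candidate, rng: random.Random) -> Candidate:
    op = rng.choice(["swap_model", "split", "merge"])
    if op == "swap_model":
        return Candidate(rng.choice(list(MODEL_PROFILES)), c.n_agents)
    if op == "split":
        return Candidate(c.model, min(c.n_agents + 1, 4))
    return Candidate(c.model, max(c.n_agents - 1, 1))


def update_frontier(frontier: dict, cand: Candidate) -> None:
    """Add cand unless it is dominated; evict anything it dominates."""
    score = evaluate(cand)
    if cand in frontier or any(dominates(s, score) for s in frontier.values()):
        return
    for other in [o for o, s in frontier.items() if dominates(score, s)]:
        del frontier[other]
    frontier[cand] = score


def search(generations: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    frontier: dict = {}
    update_frontier(frontier, Candidate("large", 3))  # hand-written starting point
    for _ in range(generations):
        parent = rng.choice(list(frontier))
        update_frontier(frontier, mutate(parent, rng))
    return frontier


frontier = search(generations=200)
for cand, (quality, cost, latency) in sorted(frontier.items(), key=lambda kv: kv[1][1]):
    print(f"{cand.model} x{cand.n_agents}: quality={quality:.2f} cost=${cost:.2f} p50={latency:.1f}s")
```

The archive only ever holds candidates that no other evaluated candidate beats on all three objectives, which is one simple way a frontier can fill in as the search runs.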

STEP 04

Promote & deploy

Pick any point on the frontier. Export as a single graph (JSON, Python, or a hosted endpoint). Fork to keep optimizing.

$ autoaw promote candidate-B

✓ deployed → exp-7c4e/prod

endpoint: https://run.autoaw.io/v1/...

06 · results

What it found in our last run.

Task: customer-support copilot, 1,400-ticket eval. Objective = 0.7·quality + 0.25·cost⁻¹ + 0.05·speed.
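
The objective above can be evaluated directly once units are fixed. The page does not say how cost and speed are scaled, so the Python sketch below assumes cost in USD per run (so cost⁻¹ is 1/USD) and speed as the reciprocal of p50 latency in seconds. The `utility` function and its parameter names are hypothetical.

```python
def utility(quality: float, cost_usd: float, p50_latency_s: float,
            w_quality: float = 0.7, w_cost: float = 0.25, w_speed: float = 0.05) -> float:
    """Weighted objective: 0.7*quality + 0.25*(1/cost) + 0.05*speed.

    Assumption: speed = 1 / p50 latency in seconds; no further normalization.
    """
    return w_quality * quality + w_cost * (1.0 / cost_usd) + w_speed * (1.0 / p50_latency_s)


# Figures taken from the before/after diff on this page.
v0 = utility(quality=0.41, cost_usd=38.50, p50_latency_s=12.4)
gen22 = utility(quality=0.741, cost_usd=4.20, p50_latency_s=2.4)
print(f"v0: {v0:.3f}  gen 22: {gen22:.3f}")
```

Under these assumed units the quality term dominates the score. A real setup would likely normalize cost and latency to comparable ranges before weighting.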

GAIA

74.1%

+4.9 pts vs. prior SOTA

baseline Sonnet 4.5: 69.2

Cost / run

$4.20

6.7× cheaper

baseline GPT-5: $26.80

Latency p50

2.4s

3.6× faster

baseline GPT-5: 8.7s

Time to SOTA

≈ 38 min wall-clock

94 candidates explored

07 · leaderboard

Live: top results on GAIA.

Public benchmark; anyone can submit. AutoAW-optimized graphs rank above single-model baselines on both quality and cost.

See experiments

#	Team / submission	Configuration	Quality	Cost / run	Latency p50
#1	AutoAW · optimized (best · pareto) submitted May 18	ensemble (CS+G5m)	0.741	$4.20	2.4s
#2	Anthropic baseline submitted Apr 30	Claude Sonnet 4.5	0.692	$14.10	5.8s
#3	OpenAI baseline submitted Apr 11	GPT-5	0.688	$26.80	8.7s
#4	Cosine.ai submitted Mar 22	Genie-2	0.671	$19.40	9.2s
#5	Google DeepMind submitted Apr 04	Gemini 2.5 Pro	0.659	$11.90	5.1s
#6	Manus AI submitted Feb 28	manus-r1	0.641	$8.30	4.4s

Request a demo & quotation

See AutoAW in action on your own workflow.

We'll walk you through a live demo and put together a custom quote for your team.

Request a demo
