v0.4 · invite-only beta

Optimize any agentic workflow
against cost, speed, and quality.

Drop in a high-level task or an existing graph. AutoAW co-evolves topology, prompts, models, and tools — searching the Pareto frontier of your utility function so you ship the cheapest version that still hits your bar.

74.1% on GAIA · vs 69.2% Sonnet 4.5 baseline
6.7× cheaper · vs same-quality GPT-5 single-agent
38 min to SOTA · 94 candidates evaluated
exp-7c4e · customer-support-copilot · gen 18/40
running
task → planner (cs-4.5) → executor·a (g2-flash) + executor·b (g2-flash) → judge (haiku) → answer
quality
0.732
+0.041 vs gen 1
cost / run
$4.62
−84% vs gen 1
p50 latency
2.4s
−72% vs gen 1
02 · diff

From a brittle prototype to a Pareto-optimal pipeline.

AutoAW doesn't just tune prompts — it rewrites the graph. Below is the actual diff from the customer-support copilot experiment.

before · v0 · hand-written
task → react-agent (gpt-5) · tools: docs, crm.api, web.search, calc, knowledge → answer
quality
0.41
cost/run
$38.50
p50
12.4s
nodes
6 (1 model)
22 generations
after · gen 22 · AutoAW-optimized
task → planner (cs-4.5) → exec·a (g2-flash) + exec·b (g2-flash) → judge (haiku)
quality
0.741
cost/run
$4.20
p50
2.4s
nodes
4 (3 models)
04 · architectures

Two paradigms. One optimizer. No assumptions.

Multi-agent committees and agentic glue are different bets on where reasoning should live. AutoAW makes neither bet upfront — it searches both and lets your fitness function settle the argument.

multi-agent committee
nodes: task, planner (gpt-5), researcher (cs-4.5), writer (cs-4.5), validator (haiku), answer
edges: cross-talk, feedback loop
LLM-to-LLM hand-offs · fragmented context · high coordination overhead
wins when
· Security or permission isolation is required between agents
· Domain knowledge is too large for a single context window
· Task genuinely needs competing hypotheses (debate topology)
agentic glue + skills
nodes: task, conductor (claude-sonnet), answer
skills: web_search, sql_query, code_exec (deterministic tools, no LLM calls)
one conductor · deterministic tools · unified context
wins when
· Standard engineering, data extraction, or enterprise automation tasks
· Cost and latency are in the objective; the fitness function will find this design on its own
· One capable model can hold the full reasoning context end-to-end
| Dimension | Multi-Agent Committee | Agentic Glue + Skills |
| --- | --- | --- |
| Latency | High: each hand-off is a fresh LLM call | Low: one LLM, tools run as fast code |
| Context | Fragmented: state re-interpreted at each hop | Unified: one context window, no drift |
| Debugging | Complex: trace across N reasoning chains | Standard: did the LLM call the skill? Did it run? |
| Cost | Proportional to agent count × task length | Proportional to conductor reasoning only |
How AutoAW discovers which wins — without assuming either

AutoAW seeds its initial population with candidates spanning both paradigms. The multi-objective fitness function — penalizing cost and latency directly — creates natural selection pressure toward whichever architecture is actually cheaper and faster for your specific task and dataset.
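
To make the selection pressure concrete, here is a minimal sketch of a Pareto dominance filter over evaluated candidates. `Candidate`, `dominates` and `pareto_front` are illustrative names, not AutoAW's API, and the four candidates use made-up numbers chosen only to show the mechanics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float   # higher is better
    cost: float      # dollars per run, lower is better
    latency: float   # p50 seconds, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.quality > b.quality or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical candidates (illustrative numbers, not experiment results).
candidates = [
    Candidate("A", 0.74, 4.20, 2.4),
    Candidate("B", 0.70, 2.10, 1.8),
    Candidate("C", 0.69, 5.00, 3.0),   # worse than A on all three axes
    Candidate("D", 0.78, 9.00, 4.0),
]
front = pareto_front(candidates)
print([c.name for c in front])  # → ['A', 'B', 'D']
```

C drops out because A beats it on quality, cost and latency at once. A, B and D survive because each is best on at least one trade-off, which is why the frontier is a set of options rather than a single winner.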

The genetic operators handle the transitions: mutate_structure and the Compaction operator collapse N agents into a single conductor when the fitness landscape rewards it; the Delegation Split operator spawns parallel agents when the task benefits from specialization. No topology is privileged — the search finds where your task lives.

Compaction
N agents → 1 conductor + skills
→ glue
Delegation Split
1 conductor → k parallel agents
→ committee
Critique Inject
Insert validator after any node
→ hybrid
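
The three operators above can be sketched as pure functions over a small graph type. This is an assumption-heavy illustration: the `Node` and `Workflow` types, the function signatures and the edge-rewiring rules are hypothetical and chosen for brevity. The page does not specify how AutoAW represents graphs or implements `mutate_structure`.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Node:
    name: str
    model: str
    tools: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Workflow:
    nodes: Tuple[Node, ...]
    edges: Tuple[Tuple[str, str], ...]  # (source, target) node names


def compaction(wf: Workflow, model: str) -> Workflow:
    """N agents -> 1 conductor that inherits every agent's tools as skills."""
    skills = tuple(dict.fromkeys(t for n in wf.nodes for t in n.tools))
    return Workflow((Node("conductor", model, skills),), ())


def delegation_split(wf: Workflow, name: str, k: int) -> Workflow:
    """Replace one node with k parallel clones wired to the same neighbours."""
    target = next(n for n in wf.nodes if n.name == name)
    clones = tuple(replace(target, name=f"{name}-{i}") for i in range(1, k + 1))
    nodes = tuple(n for n in wf.nodes if n.name != name) + clones
    edges = []
    for src, dst in wf.edges:
        if src == name:
            edges.extend((c.name, dst) for c in clones)
        elif dst == name:
            edges.extend((src, c.name) for c in clones)
        else:
            edges.append((src, dst))
    return Workflow(nodes, tuple(edges))


def critique_inject(wf: Workflow, after: str, model: str) -> Workflow:
    """Insert a validator node between `after` and its successors."""
    edges = [("validator", d) if s == after else (s, d) for s, d in wf.edges]
    edges.append((after, "validator"))
    return Workflow(wf.nodes + (Node("validator", model),), tuple(edges))


wf = Workflow(
    nodes=(Node("planner", "cs-4.5"),
           Node("exec", "g2-flash", ("web_search",)),
           Node("judge", "haiku")),
    edges=(("planner", "exec"), ("exec", "judge")),
)
committee = delegation_split(wf, "exec", 2)      # toward a committee
glue = compaction(committee, "claude-sonnet")    # back toward glue
hybrid = critique_inject(wf, "exec", "haiku")    # validator after exec
print(committee.edges)
print(glue.nodes)
print(hybrid.edges)
```

Keeping the operators pure (each returns a new `Workflow`) makes it cheap to hold many candidates in a population at once and to compare a child against its parent.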
05 · how it works

Four steps. One Pareto frontier.

STEP 01

Describe the task

Either give a high-level task — "triage support tickets and propose a draft reply" — or import an existing workflow (LangGraph, DSPy, your own Python).

# task.yaml
goal: "Triage support tickets,
  propose draft reply with citations"
STEP 02

Pick datasets & weights

Choose an eval dataset (yours or one of 18 included). Set the utility weights — quality, $/run, p50 latency. AutoAW does the rest.

quality · cost · speed
STEP 03

AutoAW searches

A search loop swaps models, splits/merges agents, edits prompts, prunes tools, and re-evaluates every candidate. The Pareto frontier fills in as it runs.
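
As a rough mental model of that loop, here is a toy evolutionary search over two objectives. Everything in it is an assumption made for illustration: the candidate encoding `(model tier, agent count, tool count)`, the `MODEL_TIERS` numbers, and the `evaluate` stub, which stands in for running a graph on your eval dataset. It is not AutoAW's actual search procedure.

```python
import random

# Toy stand-ins: (base quality, dollars per agent call). Not real model pricing.
MODEL_TIERS = {"small": (0.55, 0.5), "medium": (0.70, 2.0), "large": (0.80, 8.0)}


def evaluate(cand):
    """Toy scorer returning (quality, cost); a real run would execute the graph."""
    model, agents, tools = cand
    base_quality, price = MODEL_TIERS[model]
    quality = min(1.0, base_quality + 0.03 * (agents - 1) + 0.01 * tools)
    cost = price * agents + 0.05 * tools
    return quality, cost


def mutate(cand, rng):
    """Apply one random edit: swap model, split or merge agents, prune a tool."""
    model, agents, tools = cand
    op = rng.choice(["swap_model", "split", "merge", "prune_tool"])
    if op == "swap_model":
        model = rng.choice(sorted(MODEL_TIERS))
    elif op == "split":
        agents = min(agents + 1, 4)
    elif op == "merge":
        agents = max(agents - 1, 1)
    else:
        tools = max(tools - 1, 0)
    return (model, agents, tools)


def dominated(a, b):
    """True if score b is at least as good as score a on both axes and differs."""
    return b[0] >= a[0] and b[1] <= a[1] and b != a


def search(seed_cand, generations=20, children=8, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    frontier = {seed_cand: evaluate(seed_cand)}
    for _ in range(generations):
        parents = list(frontier)
        for _ in range(children):
            child = mutate(rng.choice(parents), rng)
            frontier[child] = evaluate(child)
        # Re-evaluate the whole pool and keep only non-dominated candidates.
        frontier = {c: s for c, s in frontier.items()
                    if not any(dominated(s, o) for o in frontier.values())}
    return frontier


frontier = search(("large", 1, 5))
for cand, (quality, cost) in sorted(frontier.items(), key=lambda kv: kv[1][1]):
    print(cand, round(quality, 3), round(cost, 2))
```

The loop only ever breeds from the current frontier, so compute goes to candidates that are already the best known trade-off somewhere along the cost and quality axes.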

STEP 04

Promote & deploy

Pick any point on the frontier. Export as a single graph (JSON, Python, or a hosted endpoint). Fork to keep optimizing.

$ autoaw promote candidate-B
✓ deployed → exp-7c4e/prod
endpoint: https://run.autoaw.io/v1/...
06 · results

What it found in our last run.

Task: customer-support copilot, 1,400-ticket eval. Objective = 0.7·quality + 0.25·cost⁻¹ + 0.05·speed.
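
For readers who want to plug numbers into that objective, here is a small sketch. It assumes cost⁻¹ means 1 / (dollars per run) and speed means 1 / (p50 latency in seconds), with no further normalization. The page does not state the units or scaling, so treat the absolute scores as illustrative. The `utility` function name is hypothetical.

```python
def utility(quality, cost_usd, latency_s,
            w_quality=0.7, w_cost=0.25, w_speed=0.05):
    """Weighted objective: 0.7*quality + 0.25*(1/cost) + 0.05*(1/latency).

    Assumes raw, unnormalized inputs; a production objective would likely
    rescale each term so the weights are comparable.
    """
    if cost_usd <= 0 or latency_s <= 0:
        raise ValueError("cost and latency must be positive")
    return w_quality * quality + w_cost / cost_usd + w_speed / latency_s


# Figures quoted on this page for the optimized graph and the GPT-5 baseline.
optimized = utility(quality=0.741, cost_usd=4.20, latency_s=2.4)
baseline = utility(quality=0.688, cost_usd=26.80, latency_s=8.7)
print(round(optimized, 3), round(baseline, 3))  # → 0.599 0.497
```

Under these assumptions the quality term dominates the score, so the cost and speed weights mostly act as tie-breakers between candidates of similar quality.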

GAIA
74.1%
+4.9 pts vs. baseline
baseline Sonnet 4.5: 69.2
Cost / run
$4.20
6.7× cheaper
baseline GPT-5: $26.80
Latency p50
2.4s
3.6× faster
baseline GPT-5: 8.7s
Generations to SOTA
22
≈ 38 min wall-clock
94 candidates explored
07 · leaderboard

Live: top results on GAIA.

Public benchmark; anyone can submit. AutoAW-optimized graphs rank above single-model baselines on both quality and cost.

See experiments
| # | Team / submission | Configuration | Quality | Cost / run | Latency p50 |
| --- | --- | --- | --- | --- | --- |
| #1 | AutoAW · optimized (best · pareto), submitted May 18 | ensemble (CS+G5m) | 0.741 | $4.20 | 2.4s |
| #2 | Anthropic baseline, submitted Apr 30 | Claude Sonnet 4.5 | 0.692 | $14.10 | 5.8s |
| #3 | OpenAI baseline, submitted Apr 11 | GPT-5 | 0.688 | $26.80 | 8.7s |
| #4 | Cosine.ai, submitted Mar 22 | Genie-2 | 0.671 | $19.40 | 9.2s |
| #5 | Google DeepMind, submitted Apr 04 | Gemini 2.5 Pro | 0.659 | $11.90 | 5.1s |
| #6 | Manus AI, submitted Feb 28 | manus-r1 | 0.641 | $8.30 | 4.4s |
Request a demo & quotation

See AutoAW in action on your own workflow.

We'll walk you through a live demo and put together a custom quote for your team.
Request a demo