Most agent pipelines run everything on the flagship model. That's like using a Formula 1 car to get groceries. The routing logic to fix this is simpler than it sounds — classify task complexity upfront, route to the cheapest capable model, escalate only when needed. Two free patterns that cover 80% of the savings.
What it is: Before routing any task to a model, run it through a lightweight classifier that assigns it a complexity tier. The classifier itself runs on the cheapest available model — classification is simple enough that Haiku does it reliably. The output is a routing decision that determines which model handles the actual task.
The insight that makes this work: most tasks in a 24/7 agent pipeline are low complexity. File reads, log summaries, status checks, format conversions, simple rewrites. These don't need Opus. Running them on Haiku costs roughly 1/75th as much and produces output that's indistinguishable in quality for those use cases.
Every task maps to one of three tiers. The tier determines the model. This is the entire routing system.
TIER_1: # Route to: claude-haiku-3-5 (or equivalent)
- Format conversion (JSON → markdown, etc.)
- File read and summarize (< 2000 words)
- Extract specific fields from structured data
- Simple classification (is this email urgent? yes/no)
- Log parsing and status reporting
- Template fill-in (data already provided)
- Any task where output format is fully specified
TIER_2: # Route to: claude-sonnet-4 (or equivalent)
- Write original content (blog posts, guides, emails)
- Multi-step reasoning with 3–5 steps
- Code review and debugging
- Synthesis of multiple sources into coherent output
- Quality evaluation (does this meet the bar?)
- Tasks requiring domain knowledge but not deep judgment
TIER_3: # Route to: claude-opus-4 (or equivalent)
- Strategic decisions with significant consequences
- Complex multi-step planning (> 5 interdependent steps)
- Output that represents the business publicly
- Nightly self-improvement cycle (judgment-intensive)
- Anything where a mistake costs > 30 minutes to fix
The classifier runs on Haiku every time before a task executes. It's fast (sub-second), cheap (~$0.0001 per call), and accurate enough for the purpose. It doesn't need to be perfect — it needs to avoid routing Tier-3 work to Tier-1 models. Over-routing to a higher tier is always safe; under-routing is the failure mode.
# Task complexity classifier
# Input: task description
# Output: TIER_1, TIER_2, or TIER_3 + one-line reason
Classify the complexity of this task using these criteria:
TIER_1 if: structured input → structured output, no judgment required,
output format fully specified, no original content creation.
TIER_2 if: requires writing, reasoning, synthesis, or quality evaluation.
Multiple steps but clear path. Domain knowledge helpful.
TIER_3 if: requires strategic judgment, significant consequences if wrong,
represents the business publicly, or failure costs > 30 min to fix.
When uncertain, go one tier HIGHER. Never go lower.
Respond with ONLY: TIER_[1|2|3]: [one-line reason]
Task: {task_description}
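As a sketch, parsing the classifier's reply could look like this. The function name and the fail-safe default are my own; the only thing taken from the text is the `TIER_[1|2|3]: reason` reply format.

```python
import re

def parse_tier(response: str) -> tuple[str, str]:
    """Parse a 'TIER_N: reason' reply into (tier, reason)."""
    m = re.match(r"(TIER_[123]):\s*(.*)", response.strip())
    if not m:
        # Unparseable classification: fail safe by routing to the top tier,
        # consistent with "when uncertain, go one tier HIGHER"
        return "TIER_3", "classifier output unparseable"
    return m.group(1), m.group(2)
```

An unparseable reply routes upward rather than downward, matching the rule that over-routing is safe and under-routing is the failure mode.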
Here's what this routing table looks like in a real nightly cycle (March 2026 numbers, Anthropic API pricing):
| Task | Tier | Model | Cost/run | vs. Opus |
|---|---|---|---|---|
| Read and summarize log file | 1 | Haiku 3.5 | $0.0003 | 97% less |
| Write Library content update | 2 | Sonnet 4 | $0.004 | 83% less |
| Nightly improvement decision | 3 | Opus 4 | $0.024 | — |
| Extract fields from JSON | 1 | Haiku 3.5 | $0.0001 | 99% less |
| Draft Discord announcement | 2 | Sonnet 4 | $0.003 | 87% less |
| Nightly cycle total (routed) | — | — | $0.032 | 91% less |
| Nightly cycle total (all Opus) | — | — | $0.36 | — |
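A quick sanity check on those totals, with per-task costs copied from the rows above (the small drift from the $0.032 row is rounding in the per-task figures):

```python
# Per-task costs from the routing table, in order
routed = [0.0003, 0.004, 0.024, 0.0001, 0.003]
routed_total = sum(routed)                     # ≈ $0.031
all_opus_total = 0.36                          # same cycle, everything on Opus
savings = 1 - routed_total / all_opus_total    # ≈ 0.91
print(f"${routed_total:.3f} routed, {savings:.0%} less than all-Opus")
```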
Don't route creative work to Tier 1. The cost savings are real, but so is the quality degradation on creative tasks. Haiku writing a Library content piece produces noticeably worse output than Sonnet. The classifier should catch this — but verify that it does for your specific task types before trusting the routing blindly.
In a mature pipeline, roughly 60% of tasks are Tier 1, 30% are Tier 2, and 10% are Tier 3. The Tier 1 tasks that were previously running on Opus now run on Haiku — a ~75x cost reduction on 60% of your volume. That's where the 80–95% headline comes from. Your specific number depends on your task mix.
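Under that 60/30/10 mix, the blended saving can be back-of-enveloped with the per-call costs from the table above (illustrative figures, not a benchmark; the ratios matter more than the absolutes):

```python
# Representative per-call costs taken from the routing table
cost = {"haiku": 0.0003, "sonnet": 0.004, "opus": 0.024}
# Mature-pipeline task mix: 60% Tier 1, 30% Tier 2, 10% Tier 3
mix = {"haiku": 0.60, "sonnet": 0.30, "opus": 0.10}

blended = sum(mix[m] * cost[m] for m in mix)   # expected cost per routed task
saving = 1 - blended / cost["opus"]            # vs running everything on Opus
print(f"{saving:.0%} cheaper")                 # prints: 84% cheaper
```

That lands inside the 80–95% headline range; a mix heavier in Tier 1 tasks pushes it toward the top of the range.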
What it is: Start every task at the lowest appropriate tier. If the output fails quality validation, escalate automatically to the next tier and retry. This combines with the classifier — the classifier picks the starting tier, the cascade handles quality failures without manual intervention.
The cascade is what makes aggressive downward routing safe. You're not gambling on Haiku succeeding — you're betting it will succeed on Tier 1 tasks (it will, ~95% of the time) and setting up automatic escalation for the 5% it doesn't handle well.
# Three-tier cascade with automatic escalation
TIER_ORDER = ["TIER_1", "TIER_2", "TIER_3"]
MODELS = {
    "TIER_1": "claude-haiku-3-5",
    "TIER_2": "claude-sonnet-4",
    "TIER_3": "claude-opus-4"
}

def escalate(tier):
    # TIER_1 → TIER_2 → TIER_3; None once TIER_3 is exhausted
    next_index = TIER_ORDER.index(tier) + 1
    return TIER_ORDER[next_index] if next_index < len(TIER_ORDER) else None

def route_and_execute(task, starting_tier):
    tier = starting_tier
    while tier is not None:
        model = MODELS[tier]
        result = call_model(model, task)
        # Run quality validation (format compliance only)
        quality = validate_output(result, task.expected_format)
        if quality.passes:
            log(f"SUCCESS tier={tier} model={model} task={task.id}")
            return result
        log(f"QUALITY_FAIL tier={tier} reason={quality.reason} escalating...")
        tier = escalate(tier)
    # If Tier 3 also fails, this is a task spec problem
    raise TaskFailure("All tiers failed: review task spec")
This is the key that makes the cascade work. Quality validation must be fast and cheap — you're running it after every model call, so it needs to be Tier 1 itself (run on Haiku). It checks for format compliance, not content quality. Content quality is subjective and expensive to evaluate; format compliance is objective and fast.
# Output quality validator
# Checks format compliance, not content quality
Validate this output against the expected format.
Expected format: {task.expected_format}
Actual output: {result}
Check ALL of these:
□ Is the output non-empty?
□ Does it match the expected format exactly?
□ Are all required fields present?
□ Is the output complete (not truncated)?
□ Are there obvious signs of model refusal or confusion?
Respond with ONLY:
PASS
or
FAIL: [which check failed] [what was found instead]
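One way to turn that PASS/FAIL reply into the `quality` object the cascade expects might be the following; the `Quality` class is a hypothetical stand-in for whatever structure your pipeline uses:

```python
from dataclasses import dataclass

@dataclass
class Quality:
    passes: bool
    reason: str = ""

def parse_validation(response: str) -> Quality:
    """Map the validator's PASS / FAIL reply onto a structured result."""
    text = response.strip()
    if text == "PASS":
        return Quality(passes=True)
    if text.startswith("FAIL:"):
        return Quality(passes=False, reason=text[len("FAIL:"):].strip())
    # Anything else means the validator itself misbehaved: treat as a failure
    # so the cascade escalates rather than shipping an unvalidated output
    return Quality(passes=False, reason=f"unrecognized validator reply: {text!r}")
```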
In production, the escalation rate from Tier 1 → Tier 2 is about 5%. From Tier 2 → Tier 3 is about 2%. This means that for every 100 tasks started at Tier 1, you end up running 100 Haiku calls, 5 Sonnet calls, and maybe 1 Opus call. That's a dramatically cheaper profile than 100 Opus calls.
Track your escalation rate. If it's above 15%, your classifier is miscategorizing — too many tasks are starting at Tier 1 that belong at Tier 2. If it's 0%, either your classifier is too conservative (starting everything at Tier 2+) or your quality validator isn't checking strictly enough.
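Those thresholds are easy to wire into monitoring, along these lines (the 15% and 0% values come from the paragraph above; the function name and messages are illustrative):

```python
def check_escalation_health(tier1_escalation_rate: float) -> str:
    """Flag an unhealthy Tier 1 → Tier 2 escalation rate."""
    if tier1_escalation_rate > 0.15:
        return "classifier miscategorizing: Tier-1 starts that belong at Tier 2"
    if tier1_escalation_rate == 0.0:
        return "suspicious: classifier too conservative or validator too lax"
    return "healthy"
```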
Log every escalation with enough context to retrain your classifier later. After 30 days you'll have data showing exactly which task types the classifier consistently miscategorizes. That data improves the classifier prompt for the next month.
{"ts":"2026-03-05T09:12:34Z","task_id":"nightly-003","task_type":"library-update",
"started_tier":"TIER_1","final_tier":"TIER_2","escalations":1,
"fail_reason":"Output was 3 words, expected 200+ word content block",
"cost_tier1":"$0.0003","cost_tier2":"$0.004","total":"$0.0043"}
Start with 30% of your pipeline. Pick your most routine, structured tasks (log summaries, format conversions, status checks). Route those through the cascade first. Measure for two weeks. Once the escalation rate stabilizes, expand to the next category. Don't route strategic work through the cascade until you trust the validator.
If your agent answers the same class of question repeatedly, you're paying full price every time. Semantic caching stores model outputs keyed by embedding similarity — so semantically identical queries hit cache instead of the API. This pattern covers the embedding approach, cache invalidation strategy, and the specific similarity threshold that avoids...
Long-running agents accumulate context until they hit the window limit — then either fail or compress badly. Context window tiering manages this proactively: summary compression at 60% fill, hard checkpoint at 80%, clean reset with state handoff at 95%. The specific prompts, the state-preservation format, and how to verify the handoff...
Without spending limits, a misconfigured cascade or a stuck retry loop can spend your monthly API budget in one night. Cost circuit breakers set hard limits per task, per agent session, and per day — and trip an alert before the damage compounds. The exact threshold values I use, how to wire them into OpenClaw cron...
Semantic Caching, Context Window Tiering, and Cost Circuit Breakers are in the Library. $9/month — 30-day money-back guarantee.