Most agent pipelines run everything on the flagship model. That's like using a Formula 1 car to get groceries. The routing logic to fix this is simpler than it sounds — classify task complexity upfront, route to the cheapest capable model, escalate only when needed. Two free patterns that cover 80% of the savings.
What it is: Before routing any task to a model, run it through a lightweight classifier that assigns it a complexity tier. The classifier itself runs on the cheapest available model — classification is simple enough that Haiku does it reliably. The output is a routing decision that determines which model handles the actual task.
The insight that makes this work: most tasks in a 24/7 agent pipeline are low complexity. File reads, log summaries, status checks, format conversions, simple rewrites. These don't need Opus. Running them on Haiku costs roughly 1/75th as much and produces output that's indistinguishable in quality for those use cases.
Every task maps to one of three tiers. The tier determines the model. This is the entire routing system.
TIER_1: # Route to: claude-haiku-3-5 (or equivalent)
- Format conversion (JSON → markdown, etc.)
- File read and summarize (< 2000 words)
- Extract specific fields from structured data
- Simple classification (is this email urgent? yes/no)
- Log parsing and status reporting
- Template fill-in (data already provided)
- Any task where output format is fully specified
TIER_2: # Route to: claude-sonnet-4 (or equivalent)
- Write original content (blog posts, guides, emails)
- Multi-step reasoning with 3–5 steps
- Code review and debugging
- Synthesis of multiple sources into coherent output
- Quality evaluation (does this meet the bar?)
- Tasks requiring domain knowledge but not deep judgment
TIER_3: # Route to: claude-opus-4 (or equivalent)
- Strategic decisions with significant consequences
- Complex multi-step planning (> 5 interdependent steps)
- Output that represents the business publicly
- Nightly self-improvement cycle (judgment-intensive)
- Anything where a mistake costs > 30 minutes to fix
The classifier runs on Haiku every time before a task executes. It's fast (sub-second), cheap (~$0.0001 per call), and accurate enough for the purpose. It doesn't need to be perfect — it needs to avoid routing Tier-3 work to Tier-1 models. Over-routing to a higher tier is always safe; under-routing is the failure mode.
# Task complexity classifier
# Input: task description
# Output: TIER_1, TIER_2, or TIER_3 + one-line reason
Classify the complexity of this task using these criteria:
TIER_1 if: structured input → structured output, no judgment required,
output format fully specified, no original content creation.
TIER_2 if: requires writing, reasoning, synthesis, or quality evaluation.
Multiple steps but clear path. Domain knowledge helpful.
TIER_3 if: requires strategic judgment, significant consequences if wrong,
represents the business publicly, or failure costs > 30 min to fix.
When uncertain, go one tier HIGHER. Never go lower.
Respond with ONLY: TIER_[1|2|3]: [one-line reason]
Task: {task_description}
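As a sketch, parsing the classifier's reply could look like this. The function name and the fail-safe default are my own; the only thing taken from the text is the `TIER_[1|2|3]: reason` reply format.

```python
import re

def parse_tier(response: str) -> tuple[str, str]:
    """Parse a 'TIER_N: reason' reply into (tier, reason)."""
    m = re.match(r"(TIER_[123]):\s*(.*)", response.strip())
    if not m:
        # Unparseable classification: fail safe by routing to the top tier,
        # consistent with "when uncertain, go one tier HIGHER"
        return "TIER_3", "classifier output unparseable"
    return m.group(1), m.group(2)
```

An unparseable reply routes upward rather than downward, matching the rule that over-routing is safe and under-routing is the failure mode.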
Here's what this routing table looks like in a real nightly cycle (March 2026 numbers, Anthropic API pricing):
| Task | Tier | Model | Cost/run | vs. Opus |
|---|---|---|---|---|
| Read and summarize log file | 1 | Haiku 3.5 | $0.0003 | 97% less |
| Write Library content update | 2 | Sonnet 4 | $0.004 | 83% less |
| Nightly improvement decision | 3 | Opus 4 | $0.024 | — |
| Extract fields from JSON | 1 | Haiku 3.5 | $0.0001 | 99% less |
| Draft Discord announcement | 2 | Sonnet 4 | $0.003 | 87% less |
| Nightly cycle total (routed) | — | — | $0.032 | 91% less |
| Nightly cycle total (all Opus) | — | — | $0.36 | — |
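A quick sanity check on those totals, with per-task costs copied from the rows above (the small drift from the $0.032 row is rounding in the per-task figures):

```python
# Per-task costs from the routing table, in order
routed = [0.0003, 0.004, 0.024, 0.0001, 0.003]
routed_total = sum(routed)                     # ≈ $0.031
all_opus_total = 0.36                          # same cycle, everything on Opus
savings = 1 - routed_total / all_opus_total    # ≈ 0.91
print(f"${routed_total:.3f} routed, {savings:.0%} less than all-Opus")
```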
Don't route creative work to Tier 1. The cost savings are real, but so is the quality degradation on creative tasks. Haiku writing a Library content piece produces noticeably worse output than Sonnet. The classifier should catch this — but verify that it does for your specific task types before trusting the routing blindly.
In a mature pipeline, roughly 60% of tasks are Tier 1, 30% are Tier 2, and 10% are Tier 3. The Tier 1 tasks that were previously running on Opus now run on Haiku — a ~75x cost reduction on 60% of your volume. That's where the 80–95% headline comes from. Your specific number depends on your task mix.
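Under that 60/30/10 mix, the blended saving can be back-of-enveloped with the per-call costs from the table above (illustrative figures, not a benchmark; the ratios matter more than the absolutes):

```python
# Representative per-call costs taken from the routing table
cost = {"haiku": 0.0003, "sonnet": 0.004, "opus": 0.024}
# Mature-pipeline task mix: 60% Tier 1, 30% Tier 2, 10% Tier 3
mix = {"haiku": 0.60, "sonnet": 0.30, "opus": 0.10}

blended = sum(mix[m] * cost[m] for m in mix)   # expected cost per routed task
saving = 1 - blended / cost["opus"]            # vs running everything on Opus
print(f"{saving:.0%} cheaper")                 # prints: 84% cheaper
```

That lands inside the 80–95% headline range; a mix heavier in Tier 1 tasks pushes it toward the top of the range.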
What it is: Start every task at the lowest appropriate tier. If the output fails quality validation, escalate automatically to the next tier and retry. This combines with the classifier — the classifier picks the starting tier, the cascade handles quality failures without manual intervention.
The cascade is what makes aggressive downward routing safe. You're not gambling on Haiku succeeding — you're betting it will succeed on Tier 1 tasks (it will, ~95% of the time) and setting up automatic escalation for the 5% it doesn't handle well.
# Three-tier cascade with automatic escalation
TIER_ORDER = ["TIER_1", "TIER_2", "TIER_3"]
MODELS = {
    "TIER_1": "claude-haiku-3-5",
    "TIER_2": "claude-sonnet-4",
    "TIER_3": "claude-opus-4"
}

def escalate(tier):
    # TIER_1 → TIER_2 → TIER_3; None once TIER_3 is exhausted
    next_index = TIER_ORDER.index(tier) + 1
    return TIER_ORDER[next_index] if next_index < len(TIER_ORDER) else None

def route_and_execute(task, starting_tier):
    tier = starting_tier
    while tier is not None:
        model = MODELS[tier]
        result = call_model(model, task)
        # Run quality validation (format compliance only)
        quality = validate_output(result, task.expected_format)
        if quality.passes:
            log(f"SUCCESS tier={tier} model={model} task={task.id}")
            return result
        log(f"QUALITY_FAIL tier={tier} reason={quality.reason} escalating...")
        tier = escalate(tier)
    # If Tier 3 also fails, this is a task spec problem
    raise TaskFailure("All tiers failed: review task spec")
This is the key that makes the cascade work. Quality validation must be fast and cheap — you're running it after every model call, so it needs to be Tier 1 itself (run on Haiku). It checks for format compliance, not content quality. Content quality is subjective and expensive to evaluate; format compliance is objective and fast.
# Output quality validator
# Checks format compliance, not content quality
Validate this output against the expected format.
Expected format: {task.expected_format}
Actual output: {result}
Check ALL of these:
□ Is the output non-empty?
□ Does it match the expected format exactly?
□ Are all required fields present?
□ Is the output complete (not truncated)?
□ Are there obvious signs of model refusal or confusion?
Respond with ONLY:
PASS
or
FAIL: [which check failed] [what was found instead]
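One way to turn that PASS/FAIL reply into the `quality` object the cascade expects might be the following; the `Quality` class is a hypothetical stand-in for whatever structure your pipeline uses:

```python
from dataclasses import dataclass

@dataclass
class Quality:
    passes: bool
    reason: str = ""

def parse_validation(response: str) -> Quality:
    """Map the validator's PASS / FAIL reply onto a structured result."""
    text = response.strip()
    if text == "PASS":
        return Quality(passes=True)
    if text.startswith("FAIL:"):
        return Quality(passes=False, reason=text[len("FAIL:"):].strip())
    # Anything else means the validator itself misbehaved: treat as a failure
    # so the cascade escalates rather than shipping an unvalidated output
    return Quality(passes=False, reason=f"unrecognized validator reply: {text!r}")
```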
In production, the escalation rate from Tier 1 → Tier 2 is about 5%. From Tier 2 → Tier 3 is about 2%. This means that for every 100 tasks started at Tier 1, you end up running 100 Haiku calls, 5 Sonnet calls, and maybe 1 Opus call. That's a dramatically cheaper profile than 100 Opus calls.
Track your escalation rate. If it's above 15%, your classifier is miscategorizing — too many tasks are starting at Tier 1 that belong at Tier 2. If it's 0%, either your classifier is too conservative (starting everything at Tier 2+) or your quality validator isn't checking strictly enough.
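Those thresholds are easy to wire into monitoring, along these lines (the 15% and 0% values come from the paragraph above; the function name and messages are illustrative):

```python
def check_escalation_health(tier1_escalation_rate: float) -> str:
    """Flag an unhealthy Tier 1 → Tier 2 escalation rate."""
    if tier1_escalation_rate > 0.15:
        return "classifier miscategorizing: Tier-1 starts that belong at Tier 2"
    if tier1_escalation_rate == 0.0:
        return "suspicious: classifier too conservative or validator too lax"
    return "healthy"
```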
Log every escalation with enough context to retrain your classifier later. After 30 days you'll have data showing exactly which task types the classifier consistently miscategorizes. That data improves the classifier prompt for the next month.
{"ts":"2026-03-05T09:12:34Z","task_id":"nightly-003","task_type":"library-update",
"started_tier":"TIER_1","final_tier":"TIER_2","escalations":1,
"fail_reason":"Output was 3 words, expected 200+ word content block",
"cost_tier1":"$0.0003","cost_tier2":"$0.004","total":"$0.0043"}
Start with 30% of your pipeline. Pick your most routine, structured tasks (log summaries, format conversions, status checks). Route those through the cascade first. Measure for two weeks. Once the escalation rate stabilizes, expand to the next category. Don't route strategic work through the cascade until you trust the validator.
If your agent answers the same class of question repeatedly, you're paying full price every time. Semantic caching stores model outputs keyed by embedding similarity — so semantically identical queries hit cache instead of the API. This pattern covers the embedding approach, cache invalidation strategy, and the specific similarity threshold that avoids...
Long-running agents accumulate context until they hit the window limit — then either fail or compress badly. Context window tiering manages this proactively: summary compression at 60% fill, hard checkpoint at 80%, clean reset with state handoff at 95%. The specific prompts, the state-preservation format, and how to verify the handoff...
Without spending limits, a misconfigured cascade or a stuck retry loop can spend your monthly API budget in one night. Cost circuit breakers set hard limits per task, per agent session, and per day — and trip an alert before the damage compounds. The exact threshold values I use, how to wire them into OpenClaw cron...
Semantic Caching, Context Window Tiering, and Cost Circuit Breakers are in the Library. $9/month — 30-day money-back guarantee.