Cost Optimization ⏱ 30 min to implement ✓ Production tested

Multi-Model Routing: Which Model for Which Task

Most people default to one model for everything. Default to a frontier model and you overpay on simple tasks; default to a cheap one and quality collapses on complex tasks. The right architecture routes each task to a model based on complexity, cost, and speed. This is the routing logic running in production at Ask Patrick — it cuts API costs 60–70% without sacrificing quality on anything that matters.

The Three-Tier Model Stack

| Tier | Model | Use When |
| --- | --- | --- |
| Tier 1 — Fast | Haiku / Flash / GPT-4o mini | Classification, routing, simple extraction, yes/no decisions |
| Tier 2 — Balanced | Sonnet / GPT-4o | Research, content drafting, code gen, analysis |
| Tier 3 — Deep | Opus / o1 / GPT-4 Turbo | Complex reasoning, high-stakes decisions, architecture planning |

The Routing Decision Tree

Is this task < 100 tokens input AND a simple extraction/classification?
  → Tier 1 (Haiku)

Does this task require: multi-step reasoning, code gen, or research synthesis?
  → Tier 2 (Sonnet) — unless stakes are high

Is this task: irreversible action, major strategy, complex debugging?
  → Tier 3 (Opus)

Is a human waiting for this response synchronously?
  → Tier 1 or 2 only (latency matters)

Is this an async background job?
  → Any tier, optimize for quality
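The tree above can be sketched as a rule-based pre-router. The `Task` fields here are hypothetical flags you would derive from your own request metadata, not part of any API described in this article:

```python
from dataclasses import dataclass

@dataclass
class Task:
    input_tokens: int
    is_simple: bool        # simple extraction / classification
    needs_reasoning: bool  # multi-step reasoning, code gen, research synthesis
    high_stakes: bool      # irreversible action, major strategy, complex debugging
    interactive: bool      # a human is waiting on this response synchronously

def pick_tier(task: Task) -> int:
    """Walk the decision tree top to bottom, returning a model tier (1-3)."""
    if task.input_tokens < 100 and task.is_simple:
        return 1
    if task.high_stakes:
        # High stakes wants Tier 3, but a blocked human caps us at Tier 2
        return 2 if task.interactive else 3
    if task.needs_reasoning:
        return 2
    return 1
```

The one design choice worth calling out: the synchronous-human check overrides the stakes check, because latency is a hard constraint while quality is a soft one.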

Pattern 1: Classifier → Worker Split

Use a cheap Tier 1 model to classify the request, then route to the appropriate worker.

ROUTING_PROMPT = """Classify this task into exactly one category.
Return ONLY the category name, nothing else.

Categories:
- simple_extraction (get a specific piece of data from text)
- classification (categorize into predefined options)
- content_generation (write something new)
- code_task (write, review, or debug code)
- complex_reasoning (multi-step analysis or strategy)

Task: {task}"""

def route_task(task: str) -> str:
    # haiku() is a thin wrapper around the Tier 1 model call; normalize the
    # label so stray casing from the model doesn't break the lookup
    category = haiku(ROUTING_PROMPT.format(task=task)).strip().lower()
    
    routing_map = {
        "simple_extraction": "haiku",
        "classification": "haiku",
        "content_generation": "sonnet",
        "code_task": "sonnet",
        "complex_reasoning": "opus"
    }
    return routing_map.get(category, "sonnet")  # default to sonnet

Cost: ~$0.00025 per routing decision (Haiku). Pays for itself on first Opus-to-Sonnet redirect.
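A quick back-of-envelope check on that figure, assuming Claude 3 Haiku's published pricing of $0.25 per million input tokens and $1.25 per million output tokens (the token counts below are illustrative estimates, not measured values):

```python
# Assumed Claude 3 Haiku pricing, in dollars per token
HAIKU_INPUT_PER_TOKEN = 0.25 / 1_000_000
HAIKU_OUTPUT_PER_TOKEN = 1.25 / 1_000_000

def routing_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one classification call."""
    return (input_tokens * HAIKU_INPUT_PER_TOKEN
            + output_tokens * HAIKU_OUTPUT_PER_TOKEN)

# ~900 prompt tokens (template + task) and ~5 output tokens (the category name)
print(round(routing_cost(900, 5), 6))  # ≈ 0.000231
```

Even a single redirected Opus call (typically cents, not fractions of a cent) covers thousands of routing decisions.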

Pattern 2: Draft → Review with Escalation

Draft cheap, review with a peer model, escalate only on failure.

def draft_with_review(task: str, quality_threshold: float = 0.8) -> str:
    # Step 1: Draft with Sonnet
    draft = sonnet(f"Complete this task: {task}")
    
    # Step 2: Self-review with Haiku (cheap QA pass)
    review_prompt = f"""Rate the quality of this response on a scale of 0.0 to 1.0.
Consider: accuracy, completeness, clarity.
Return ONLY a float. Nothing else.

Task: {task}
Response: {draft}"""
    
    try:
        score = float(haiku(review_prompt).strip())
    except ValueError:
        score = 0.0  # unparsable review output → treat as failed QA, escalate
    
    if score >= quality_threshold:
        return draft
    
    # Step 3: Escalate to Opus only when Sonnet fails QA
    return opus(f"""Improve this response. Make it accurate and complete.

Task: {task}
Draft (score {score}/1.0): {draft}

Return the improved version only.""")
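Why this pattern pays off is just expected-value arithmetic: you always pay for the draft and the QA pass, but only pay for Opus on the fraction of tasks that fail review. A sketch under assumed per-call costs (illustrative numbers, not this article's production data):

```python
# Assumed average cost per call, in dollars (illustrative)
SONNET_CALL = 0.015   # Tier 2 draft
HAIKU_CALL = 0.0005   # Tier 1 QA pass
OPUS_CALL = 0.075     # Tier 3 escalation

def expected_cost(escalation_rate: float) -> float:
    """Average cost per task when a fraction of drafts fail QA."""
    return SONNET_CALL + HAIKU_CALL + escalation_rate * OPUS_CALL

# At a 10% escalation rate, versus sending everything straight to Opus:
print(round(expected_cost(0.10), 4))  # ≈ 0.023 vs 0.075
```

The break-even is easy to reason about: the pipeline loses only if the escalation rate climbs high enough that the draft-plus-review overhead exceeds what you save on the tasks that never reach Opus.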

Pattern 3: Context-Aware Model Selection

Different parts of your agent loop need different models. My actual production config:

# agent-model-config.yaml
loops:
  heartbeat:
    model: haiku
    max_tokens: 512

3 more patterns + real production numbers inside

The rest of this item covers Pattern 3 (context-aware model selection), Pattern 4 (token budget routing), Pattern 5 (fallback chains), and the exact cost breakdown from 30 days of production data.

  • Complete YAML config for multi-loop agent setups
  • Token-aware routing function (copy-paste ready)
  • Full fallback chain with error handling
  • Real monthly cost comparison: $340 → $18
  • Exact routing rules for Haiku / Sonnet / Opus by task type
Get Library Access — $9/mo →

Includes all 40+ library items + Daily Briefing. 30-day money-back guarantee.
