The Core Tradeoffs
Before diving into specific models, understand the axes you're optimizing on:
- Instruction-following — Does the model reliably do what the system prompt says, every time?
- Tool-calling reliability — Does it format function calls correctly, pick the right tool, and handle edge cases gracefully?
- Context window — How much memory does your agent need? Long chains of tool calls eat context fast.
- Speed — Agents often make multiple LLM calls per task. Slow models compound.
- Cost — A model that costs 10x more per call costs 10x more at scale.
- Local vs. cloud — Privacy, latency, and cost all shift when you run locally.
No model wins on all axes. You're always trading something.
What "Instruction-Following" Actually Means for Agents
With chatbots, instruction-following means "responds in the right tone." With agents, it means something harder:
- Does it respect "DO NOT do X" rules consistently, even under pressure?
- Does it stay in character when the user pushes back?
- Does it exit a loop correctly instead of repeating itself?
- Does it call tools in the order you specified?
- Does it avoid hallucinating tool names that don't exist?
The models that shine in benchmarks don't always shine in production agents. Test your actual system prompt, not just general capability benchmarks.
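One concrete way to catch hallucinated tool names and malformed calls is to validate every tool call against a registry before executing it. The sketch below is a minimal illustration; the tool names and the `call` dict shape are assumptions for the example, not any specific provider's API.

```python
# Validate a model's proposed tool call before executing it.
# Tool registry: name -> set of required argument names (illustrative).
KNOWN_TOOLS = {
    "search_docs": {"query"},
    "create_ticket": {"title", "body"},
}

def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks safe to run."""
    problems = []
    name = call.get("name")
    if name not in KNOWN_TOOLS:
        problems.append(f"unknown tool: {name!r}")  # hallucinated tool name
        return problems
    expected = KNOWN_TOOLS[name]
    got = set(call.get("arguments", {}))
    if missing := expected - got:
        problems.append(f"missing arguments: {sorted(missing)}")
    if extra := got - expected:
        problems.append(f"unexpected arguments: {sorted(extra)}")
    return problems

# A one-character typo in the tool name is caught instead of executed.
print(validate_tool_call({"name": "serch_docs", "arguments": {"query": "refunds"}}))
# → ["unknown tool: 'serch_docs'"]
```

Logging these validation failures per model is a cheap way to compare tool-calling reliability during testing.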
Tier 1: Heavy Lifters (Complex Reasoning, Multi-Step Planning)
Best for: orchestrator agents, planning layers, anything that needs real judgment.
Claude 3.5 Sonnet / Claude 3 Opus — Exceptionally strong instruction-following. Very reliable on complex system prompts with lots of rules. Good tool-calling. Not the cheapest.
GPT-4o — Strong all-rounder. Very mature tool-calling support. Slightly more forgiving of ambiguous prompts (sometimes a pro, sometimes a con — you want your agent to be strict).
Gemini 1.5 Pro — Worth considering if you need a massive context window (1M tokens). Good for agents that need to read a lot of documents before acting.
When to use: Orchestrators, agents that make judgment calls, anything where a mistake has real consequences.
Tier 2: Balanced (Most Production Agents)
Best for: the majority of agent tasks — tool use, structured output, reliable loops.
Claude 3.5 Haiku — Fast and cheap with strong instruction-following. Surprisingly capable. Good starting point.
GPT-4o-mini — Fast, cheap, reliable tool-calling. A lot of teams use this as their workhorse.
Gemini 1.5 Flash — Fast and inexpensive. Good if you're already in the Google ecosystem.
When to use: Worker agents, high-volume tasks, anything running in a loop where you're paying per call.
Tier 3: Local Models (Privacy, Cost Control)
Best for: sensitive data, offline operation, high-volume tasks where cloud costs hurt.
Qwen 2.5 7B / 14B — Arguably the best tool-calling performance in the local model category. Reliable JSON output, good instruction-following. The 14B runs comfortably, quantized, on a machine with 16GB of RAM.
Llama 3.1 8B / 70B — Strong general capability. The 70B approaches GPT-4o-mini quality on many tasks. Requires more RAM.
Mistral 7B / Nemo — Fast, lightweight. Good for simple tasks where you need sub-second response times.
What you give up: Local models tend to be less reliable on complex system prompts, more likely to ignore rules in edge cases, and require more prompt engineering to coax consistent behavior. Budget extra time for testing.
When to use: Data that can't leave your machine, high-frequency tasks where cloud costs are prohibitive, offline environments.
A Decision Framework
Answer these questions in order:
1. Does this agent handle sensitive data? Yes → go local or use a provider with a data processing agreement.
2. How many LLM calls does one task make? 1-3 → cost is less of a concern, use the best model. 10+ → cost compounds fast, pick a balanced tier model.
3. How complex is the system prompt? Simple (under 500 tokens, few rules) → any model handles this. Complex (long rules, many edge cases, strict persona) → Tier 1 only. Test extensively.
4. Does it use tools? Yes → test tool-calling specifically, not just general chat quality. Run 50+ real tasks and count failures.
5. What's the context window demand? Estimate: (system prompt tokens) + (average tool output tokens × number of calls) + (user context tokens). Add 30% buffer. Make sure your model can handle it.
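The estimate in step 5 is simple enough to keep as a helper function. A minimal sketch, where all token counts are rough numbers you measure from your own traces:

```python
# Back-of-envelope context budget: system prompt + tool outputs + user context,
# plus a 30% buffer, per step 5 of the decision framework.
def context_budget(system_prompt_tokens: int,
                   avg_tool_output_tokens: int,
                   num_tool_calls: int,
                   user_context_tokens: int,
                   buffer: float = 0.30) -> int:
    base = (system_prompt_tokens
            + avg_tool_output_tokens * num_tool_calls
            + user_context_tokens)
    return int(base * (1 + buffer))

# Example: 1,200-token system prompt, 10 tool calls averaging 800 tokens each,
# 2,000 tokens of user context.
print(context_budget(1200, 800, 10, 2000))  # → 14560
```

If the number lands near a candidate model's context limit, assume it will exceed it in practice: tool outputs are the term with the highest variance.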
Testing Before You Commit
Don't pick a model based on benchmarks alone. Run your actual system prompt through your actual workflow, 50+ times, and measure:
- Success rate — Did it complete the task correctly?
- Tool error rate — How often did it call the wrong tool or malform a call?
- Rule adherence — Did it ever violate a "never do X" instruction?
- Failure modes — When it fails, does it fail gracefully (stops and asks) or catastrophically (loops, hallucinates, takes wrong action)?
Build a simple eval harness. It doesn't have to be fancy — even a spreadsheet tracking pass/fail across 50 test cases tells you more than any benchmark.
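A harness in that spirit can be a few dozen lines. This sketch assumes a `run_agent` callable (a placeholder for your actual agent entry point) that returns a per-case result; the stub agent at the bottom exists only to make the example runnable.

```python
from dataclasses import dataclass

@dataclass
class Result:
    passed: bool
    tool_error: bool = False      # wrong tool chosen or malformed call
    rule_violation: bool = False  # broke a "never do X" instruction

def evaluate(cases, run_agent) -> dict:
    """Run every test case through the agent and aggregate the metrics above."""
    results = [run_agent(case) for case in cases]
    n = len(results)
    return {
        "success_rate": sum(r.passed for r in results) / n,
        "tool_error_rate": sum(r.tool_error for r in results) / n,
        "rule_violations": sum(r.rule_violation for r in results),
    }

# Stub agent for illustration: fails every 10th case.
report = evaluate(list(range(50)), lambda c: Result(passed=c % 10 != 0))
print(report)  # → {'success_rate': 0.9, 'tool_error_rate': 0.0, 'rule_violations': 0}
```

Re-running the same 50 cases against two candidate models gives you a direct, like-for-like comparison that no public benchmark can.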
Common Mistakes
Using the most expensive model for everything. Most of your agent's calls don't require genius-level reasoning. Using a cheaper model for routine steps and a smarter model for decision nodes is usually the right architecture.
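The cheap-by-default, smart-at-decision-nodes split can be as simple as a routing function in your orchestrator. The model names and `step` shape below are illustrative assumptions:

```python
# Route each step to a cheap worker model by default, escalating to a
# stronger model only at decision nodes the orchestrator flags.
CHEAP_MODEL = "gpt-4o-mini"        # routine steps: extraction, formatting, tool glue
SMART_MODEL = "claude-3-5-sonnet"  # decision nodes: planning, judgment calls

def pick_model(step: dict) -> str:
    """Choose a model based on a per-step flag set by the orchestrator."""
    return SMART_MODEL if step.get("is_decision_node") else CHEAP_MODEL

print(pick_model({"task": "summarize tool output"}))
# → gpt-4o-mini
print(pick_model({"task": "choose next action", "is_decision_node": True}))
# → claude-3-5-sonnet
```

The useful property is that the routing decision is explicit and testable, rather than buried in whichever model name was hardcoded at each call site.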
Assuming benchmark performance = agent performance. Benchmarks measure a model's peak capability. Agents need consistent, reliable behavior across thousands of calls. Different thing.
Not testing tool-calling specifically. A model can write beautiful prose and still mangle JSON schemas. Test the part that matters.
Ignoring context window consumption. Agents burn through context fast. A model with a small context window will silently start forgetting early instructions mid-task. Know your consumption pattern before production.
Setting it and forgetting it. Model providers update their models. GPT-4o today is not GPT-4o in six months. Re-test after major version bumps.
The Bottom Line
For most teams getting started: Claude 3.5 Haiku or GPT-4o-mini for worker agents, Claude 3.5 Sonnet or GPT-4o for orchestrators. If you're running locally: Qwen 2.5 14B for tool-calling tasks.
Then test. Measure. Swap if needed.