Running AI agents is powerful — but token costs can sneak up on you fast. Here's a practical guide to keeping your bill reasonable without sacrificing capability.
Why Costs Spiral
Most cost problems come from three places:
- Bloated context windows — Agents that load everything every turn
- Wrong model for the job — Using GPT-4o for tasks a smaller model handles fine
- No exit conditions — Agents that loop or retry forever
Fix these three things and you'll cut most waste.
Rule 1: Match Model to Task
Not every step in your workflow needs your best (most expensive) model.
| Task | Recommended Tier |
|------|------------------|
| Routing, classification, yes/no decisions | Small/fast (GPT-4o-mini, Haiku, Gemini Flash) |
| Summarization, drafting | Mid-tier (GPT-4o, Sonnet) |
| Complex reasoning, code, nuanced writing | Full-power (o3, Opus, Gemini Pro) |
Pattern: Use a cheap model to classify → route to an expensive model only when needed.
```
User message
  → GPT-4o-mini: "Is this a support question or a billing question?"
  → If billing: GPT-4o-mini can handle it
  → If complex technical: escalate to GPT-4o
```
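The routing pattern can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed implementation: `call_model` is a stand-in for your provider SDK call, and the model names and classifier prompt are illustrative.

```python
# Minimal sketch of cheap-model routing. `call_model(model, prompt)` stands
# in for your provider SDK; model names and the classifier prompt are
# illustrative, not prescriptive.

CHEAP_MODEL = "gpt-4o-mini"
EXPENSIVE_MODEL = "gpt-4o"

def route(user_message, call_model):
    # One cheap call decides which tier handles the real work.
    label = call_model(
        CHEAP_MODEL,
        f"Classify as 'billing' or 'technical' (one word only): {user_message}",
    ).strip().lower()
    # Billing questions stay on the cheap model; only complex
    # technical questions pay for the expensive one.
    model = EXPENSIVE_MODEL if label == "technical" else CHEAP_MODEL
    return model, call_model(model, user_message)

# Example with a stubbed-out model call (no API needed):
def fake_call(model, prompt):
    return "technical" if "Classify" in prompt and "stack trace" in prompt else "answer"

model, answer = route("Why does my app crash? Here is the stack trace...", fake_call)
print(model)  # the technical question was routed to the expensive tier
```

The key property: every request pays for one cheap classification call, but only the hard requests pay expensive-tier prices for the actual answer.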
Rule 2: Keep Prompts Lean
System prompts run on every single call. A 2,000-token system prompt × 500 calls/day = 1M tokens just in prompts.
Audit your system prompt:
- Remove instructions that never apply
- Move reference material to retrieval (RAG) instead of embedding it
- Use placeholders that only expand when needed
Before:

```
You are a helpful assistant. Here is our full product catalog: [5,000 words]...
```

After:

```
You are a helpful assistant. Relevant product info will be provided in [CONTEXT] when needed.
```
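The "after" prompt only works if something injects context when a query actually needs it. A minimal sketch, assuming a `retrieve` callable that stands in for your RAG lookup; the keyword check is a crude trigger for illustration (Rule 1's cheap classifier could make that decision instead):

```python
# Sketch of lazy context injection: the system prompt stays lean, and
# catalog text is appended only when the question needs it.
# `retrieve` stands in for a retrieval (RAG) lookup.

BASE_PROMPT = (
    "You are a helpful assistant. Relevant product info "
    "will be provided in [CONTEXT] when needed."
)

def build_messages(user_message, retrieve):
    system = BASE_PROMPT
    # Crude keyword trigger for illustration; a cheap classifier
    # (Rule 1) would make this decision in practice.
    if "product" in user_message.lower():
        system += "\n\n[CONTEXT]\n" + retrieve(user_message)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]
```

Most calls now carry a ~30-token system prompt instead of a 5,000-word catalog.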
Rule 3: Summarize, Don't Accumulate
Long conversations compound fast. After 10 turns, you're paying for all 10 turns of history on every new message.
Solution: Rolling summary pattern
- Every N turns, summarize the conversation into ~200 words
- Replace the full history with the summary
- Keep only the last 2-3 turns verbatim
Most frameworks support this natively. In OpenClaw, set `memorystrategy: rollingsummary`.
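For stacks without native support, the pattern above is easy to hand-roll. A minimal sketch, where `summarize` stands in for a call to a cheap model (e.g. "Summarize this conversation in ~200 words"), and the thresholds are illustrative:

```python
# Sketch of the rolling-summary pattern: once the history is long enough,
# older turns are collapsed into one summary message and only the most
# recent turns stay verbatim. `summarize` stands in for a cheap model call.

SUMMARIZE_EVERY = 6   # compact once this many older turns accumulate
KEEP_VERBATIM = 3     # recent turns kept word-for-word

def compact_history(history, summarize):
    if len(history) < SUMMARIZE_EVERY + KEEP_VERBATIM:
        return history  # not long enough to bother compacting
    old, recent = history[:-KEEP_VERBATIM], history[-KEEP_VERBATIM:]
    summary = summarize(old)  # one cheap call replaces many old turns
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```

After compaction, every subsequent call pays for one summary message plus a few recent turns instead of the full transcript.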
Rule 4: Set Hard Limits
Every agent should have:
- Max turns per task — If it hasn't finished in 10 steps, something's wrong. Stop.
- Max tokens per response — `max_tokens: 500` for most tasks; allow more only when you've specifically tested that a task needs it
- Retry cap — Maximum 2-3 retries on failure, then escalate to a human
Without these, a misbehaving agent can burn your entire monthly budget overnight.
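These limits can live in the agent loop itself. A sketch under stated assumptions: `step` stands in for one model or tool call, returning `None` means "not finished yet", and the escalation strings are placeholders for whatever your human-handoff path looks like:

```python
# Sketch of an agent loop with hard limits: a turn cap, a per-response
# token cap, and a retry cap that escalates instead of looping forever.
# `step(task, max_tokens=...)` stands in for one model/tool call;
# exceptions count as failures.

MAX_TURNS = 10
MAX_RETRIES = 2
MAX_TOKENS = 500

def run_agent(task, step):
    retries = 0
    for _turn in range(MAX_TURNS):
        try:
            result = step(task, max_tokens=MAX_TOKENS)  # cap every response
        except Exception:
            retries += 1
            if retries > MAX_RETRIES:
                return "escalate: repeated failures"
            continue
        if result is not None:  # agent signalled completion
            return result
    # Ten steps without finishing means something is wrong. Stop.
    return "escalate: hit max turns"
```

The worst case is now bounded: at most `MAX_TURNS × MAX_TOKENS` output tokens per task, no matter how badly the agent misbehaves.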
Rule 5: Cache When Possible
If you're calling an AI to answer the same question repeatedly, cache the answer.
- Static FAQs → Pre-generate answers, serve from cache
- Repeated summaries → Hash the input, cache by hash
- Classification results → Same input = same output; no need to re-call
Even a simple in-memory cache can cut costs 30-50% for high-volume workflows.
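A minimal in-memory version of the hash-and-cache pattern, with `call_model` standing in for the real API call:

```python
import hashlib

# Sketch of a simple in-memory cache keyed by a hash of the input, so
# identical requests (classification, static FAQs, repeated summaries)
# never hit the API twice. `call_model` stands in for the provider call.

_cache = {}

def cached_call(prompt, call_model):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # only pay on a cache miss
    return _cache[key]
```

For production, swap the dict for something with eviction (e.g. `functools.lru_cache` or Redis), and only cache deterministic tasks; don't cache calls where you deliberately want varied output.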
Rule 6: Monitor Before It Hurts
Set up basic cost monitoring before you scale:
- Track tokens per task type — Know your baseline
- Set budget alerts in your provider dashboard (OpenAI and Anthropic both support this)
- Log outliers — Any call using 10x normal tokens should be flagged
Simple logging pattern:
```python
completion = client.chat(messages=messages)
log({
    "task": task_name,
    "tokens_in": completion.usage.prompt_tokens,
    "tokens_out": completion.usage.completion_tokens,
    "cost_usd": estimate_cost(completion.usage),
})
```

Quick Reference: Cost Reduction Checklist
- [ ] Am I using the cheapest model that can do this job?
- [ ] Is my system prompt under 500 tokens?
- [ ] Do I have a rolling summary instead of full history?
- [ ] Is there a max-turn limit on this agent?
- [ ] Am I caching repeated queries?
- [ ] Do I have a budget alert set in my provider dashboard?
The 80/20 Rule for Agent Costs
In practice:
- 80% of your cost comes from 20% of your agent calls
- Find those expensive calls and optimize them first
- Don't micro-optimize cheap tasks — the gains aren't worth the complexity
Start with monitoring. You can't optimize what you can't measure.
Want the full playbook?
Get copy-paste AI templates, prompt frameworks, and agent patterns — all in one place.
Get Access — It’s Free. No credit card. No fluff. Just the good stuff.