One of the most common reasons AI agents fail silently isn't bad prompts or flaky APIs. It's context overflow. The agent fills its context window, starts "forgetting" early instructions, and quietly produces garbage — while you assume everything is fine.
Here's how to prevent it.
Why This Matters
Most LLMs have context windows between 8K and 200K tokens. Sounds huge. But long-running agents accumulate:
- System prompt (500–2,000 tokens)
- Tool schemas (200–1,000 tokens each)
- Conversation history (grows every turn)
- Tool call results (can be massive — think web pages or file contents)
- Chain-of-thought reasoning (if visible)
A single web scrape can dump 10,000 tokens into context. After 5–6 tool calls, you're pushing limits even on 128K models.
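Several snippets below assume a `count_tokens` helper. A minimal stand-in is sketched here; the ~4-characters-per-token ratio is only a rough heuristic for English text, so swap in your model's real tokenizer (e.g. tiktoken for OpenAI models, or the provider's token-counting API) for anything load-bearing:

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Replace with your model's actual tokenizer in production."""
    return max(1, len(text) // 4)

# A 40,000-character scraped web page is roughly 10,000 tokens
page = "x" * 40_000
print(count_tokens(page))  # → 10000
```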
The Core Problem: Silent Degradation
When context fills up, models don't throw an error. They:
- Start ignoring early instructions
- Lose track of the original goal
- Begin hallucinating facts from earlier in the conversation
- Produce confident-sounding nonsense
You won't always notice. The output looks reasonable. This is the dangerous part.
Pattern 1: Summarize-and-Reset
The most reliable approach. After N turns (or when context exceeds a threshold), compress the conversation into a dense summary and start a fresh context with only that summary.
```
[After turn 10 or ~50K tokens]

Summarizer prompt:
"Summarize the following agent conversation. Capture:
- The original goal
- What has been accomplished so far
- Key facts discovered
- What still needs to be done
- Any open questions or blockers
Be dense. Omit pleasantries, filler, and tool call metadata.
Target: 500 tokens or less."
```
Then restart with:
```
System: [original system prompt]

User: Here is a compressed summary of what we've done so far:
[summary]

Now continue from where we left off. The next task is: [next task]
```
Works well for: Research agents, multi-step workflows, anything that takes more than 10 turns.
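The reset loop itself is small. A minimal sketch, assuming a `summarize` function that sends the summarizer prompt above to the model and returns a dense summary (the name and threshold are illustrative):

```python
MAX_TURNS = 10

def maybe_reset(history, summarize):
    """Compress history into a summary and start a fresh context.
    `summarize` is assumed to call the model with the summarizer prompt."""
    if len(history) < MAX_TURNS * 2:  # user + assistant pairs
        return history
    summary = summarize(history)
    return [{
        "role": "user",
        "content": (
            "Here is a compressed summary of what we've done so far:\n"
            f"{summary}\n"
            "Now continue from where we left off."
        ),
    }]
```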
Pattern 2: Selective Tool Result Truncation
Most tool results don't need to live in context forever. Truncate aggressively at ingestion time.
```python
def add_tool_result(result: str, max_tokens: int = 1000) -> str:
    # count_tokens / truncate_to_tokens are tokenizer helpers
    tokens = count_tokens(result)
    if tokens > max_tokens:
        # Keep the first max_tokens tokens, then mark the cut
        truncated = truncate_to_tokens(result, max_tokens)
        return truncated + f"\n\n[...{tokens - max_tokens} tokens truncated.]"
    return result
```

The key insight: once the agent has acted on a tool result, you rarely need the full result in context. A one-sentence summary is enough.
Pattern 3: Sliding Window History
Instead of keeping all messages, keep only the last N turns plus the system prompt.
```python
MAX_HISTORY_TURNS = 8

def build_context(system_prompt, full_history, current_message):
    recent = full_history[-MAX_HISTORY_TURNS * 2:]  # user + assistant pairs
    return [
        {"role": "system", "content": system_prompt},
        *recent,
        {"role": "user", "content": current_message},
    ]
```

Risk: the agent loses early context. Mitigate by injecting a brief "memory" block at the top of the system prompt with key facts from older turns.
Pattern 4: External Memory with Retrieval
For agents that run for hours or days, don't store memory in context at all. Store facts externally and retrieve relevant ones at each turn.
Simple version using a local file:
```
Agent discovers: "User's deadline is March 15"
→ Write to memory store: {"fact": "deadline is March 15", "timestamp": ..., "source": "turn_3"}

Next turn:
→ Query memory: "What do I know about deadlines?"
→ Inject relevant facts into context: "Known facts: deadline is March 15"
```

Tools like Mem0, Zep, or even a simple SQLite DB work here. The OpenClaw Library has ready-to-use configs for this pattern.
Pattern 5: Token Budget Awareness
Build token counting into your agent loop. Check remaining budget before each tool call.
```python
CONTEXT_LIMIT = 128_000
SAFETY_BUFFER = 10_000

def remaining_budget(messages):
    used = sum(count_tokens(m["content"]) for m in messages)
    return CONTEXT_LIMIT - used - SAFETY_BUFFER

def should_compress(messages):
    return remaining_budget(messages) < 20_000
```

When the budget drops below your threshold, trigger Pattern 1 (summarize-and-reset) before continuing.
Putting It Together: A Simple Context Manager
```python
class AgentContextManager:
    def __init__(self, system_prompt, model="claude-3-5-sonnet"):
        self.system_prompt = system_prompt
        self.model = model
        self.history = []
        self.turn_count = 0

    def add_turn(self, role, content):
        self.history.append({"role": role, "content": content})
        self.turn_count += 1
        # Compress every 10 turns or when context is getting full
        if self.turn_count % 10 == 0 or self.is_near_limit():
            self.compress()

    def is_near_limit(self):
        total = sum(count_tokens(m["content"]) for m in self.history)
        return total > 80_000  # compress at 80K to stay safe

    def compress(self):
        summary = summarize_history(self.history)
        self.history = [
            {"role": "user", "content": f"[Context summary]: {summary}"},
            {"role": "assistant", "content": "Understood. Continuing from where we left off."},
        ]

    def get_messages(self):
        return [{"role": "system", "content": self.system_prompt}] + self.history
```

Quick Reference: Which Pattern to Use
| Situation | Best Pattern |
|-----------|--------------|
| Research task, 10–30 tool calls | Summarize-and-reset |
| Retrieval-heavy (web scraping, docs) | Truncate tool results |
| Conversational agent with memory | Sliding window + memory injection |
| Multi-day autonomous agent | External memory with retrieval |
| All of the above | Token budget awareness as a baseline |
Red Flags to Watch For
- Agent starts contradicting earlier decisions
- Outputs reference facts that were never stated
- Agent "forgets" constraints from the system prompt
- Response quality degrades after many turns
- Agent appears to restart the task from scratch
If you see any of these, check your context management first.
Where to Go From Here
The Ask Patrick Library has working implementations of all five patterns, including:
- A drop-in `ContextManager` class for Python agent loops
- OpenClaw memory configs for persistent external memory
- Prompt templates for the summarize-and-reset pattern
- Token counting utilities for Claude, GPT-4, and Gemini
Available to Library + Briefing subscribers at askpatrick.co.
Want the full playbook?
Get copy-paste AI templates, prompt frameworks, and agent patterns — all in one place.
Get Access — It's Free

No credit card. No fluff. Just the good stuff.