
Context Window Management for Long-Running AI Agents

One of the most common reasons AI agents fail silently isn't bad prompts or flaky APIs. It's context overflow. The agent fills its context window, starts "forgetting" early instructions, and quietly produces garbage — while you assume everything is fine.

Here's how to prevent it.

Why This Matters

Most LLMs have context windows between 8K and 200K tokens. Sounds huge. But long-running agents accumulate:

- The system prompt and tool definitions
- Full tool results (search pages, scraped HTML, file contents)
- Every user/assistant turn in the conversation
- Retrieved documents and intermediate scratch work

A single web scrape can dump 10,000 tokens into context. After 5–6 tool calls, you're pushing the limit even on 128K models.

The Core Problem: Silent Degradation

When context fills up, models don't throw an error. They:

- Quietly drop or deprioritize the earliest messages
- Lose track of the original goal and constraints
- Contradict earlier decisions or repeat completed work

You won't always notice. The output looks reasonable. This is the dangerous part.

Pattern 1: Summarize-and-Reset

The most reliable approach. After N turns (or when context exceeds a threshold), compress the conversation into a dense summary and start a fresh context with only that summary.

[After turn 10 or ~50K tokens]
Summarizer prompt:
"Summarize the following agent conversation. Capture:
- The original goal
- What has been accomplished so far
- Key facts discovered
- What still needs to be done
- Any open questions or blockers

Be dense. Omit pleasantries, filler, and tool call metadata.
Target: 500 tokens or less."

Then restart with:

System: [original system prompt]
User: Here is a compressed summary of what we've done so far:
[summary]
Now continue from where we left off. The next task is: [next task]

Works well for: Research agents, multi-step workflows, anything that takes more than 10 turns.
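The reset step above can be sketched in a few lines of Python. This is a minimal sketch, not a full implementation: it assumes a hypothetical `call_llm(prompt)` helper that wraps your model API and returns the completion text.

```python
SUMMARIZER_PROMPT = (
    "Summarize the following agent conversation. Capture the original goal, "
    "what has been accomplished so far, key facts discovered, what still "
    "needs to be done, and any open questions or blockers. Be dense. Omit "
    "pleasantries, filler, and tool call metadata. Target: 500 tokens or less.\n\n"
    "{transcript}"
)

def compress_and_reset(system_prompt, history, next_task, call_llm):
    """Collapse the full history into one summary message and rebuild context."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    summary = call_llm(SUMMARIZER_PROMPT.format(transcript=transcript))
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": (
            "Here is a compressed summary of what we've done so far:\n"
            f"{summary}\n"
            f"Now continue from where we left off. The next task is: {next_task}"
        )},
    ]
```

The fresh context carries only the system prompt, the summary, and the next task, so the agent restarts near the bottom of its window instead of the top.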

Pattern 2: Selective Tool Result Truncation

Most tool results don't need to live in context forever. Truncate aggressively at ingestion time.

def add_tool_result(result: str, max_tokens: int = 1000) -> str:
    """Truncate oversized tool results at ingestion time."""
    tokens = count_tokens(result)
    if tokens > max_tokens:
        # Keep the first chunk, leaving headroom for the truncation marker
        keep = max_tokens - 200
        truncated = truncate_to_tokens(result, keep)
        return truncated + f"\n\n[...{tokens - keep} tokens truncated. Key info extracted above.]"
    return result

The key insight: once the agent has acted on a tool result, you rarely need the full result in context. A one-sentence summary is enough.
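That insight can be applied retroactively, too: sweep the history and collapse tool results the agent has already acted on. A minimal sketch, assuming the (hypothetical) convention that tool outputs are stored as messages with `role == "tool"`:

```python
def collapse_old_tool_results(messages, keep_last=2,
                              stub="[tool result elided after use]"):
    """Replace all but the most recent tool results with a short stub.

    Assumes tool outputs live in the history as messages with role == "tool".
    The last `keep_last` results stay intact in case the agent still needs them.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    old = tool_indices[:-keep_last] if keep_last else tool_indices
    for i in old:
        messages[i] = {"role": "tool", "content": stub}
    return messages
```

Run this sweep after each turn and stale 10,000-token scrapes shrink to a few tokens apiece.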

Pattern 3: Sliding Window History

Instead of keeping all messages, keep only the last N turns plus the system prompt.

MAX_HISTORY_TURNS = 8

def build_context(system_prompt, full_history, current_message):
    recent = full_history[-MAX_HISTORY_TURNS * 2:]  # user + assistant pairs
    return [
        {"role": "system", "content": system_prompt},
        *recent,
        {"role": "user", "content": current_message}
    ]

Risk: The agent loses early context. Mitigate by injecting a brief "memory" block at the top of the system prompt with key facts from older turns.
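One way to sketch that mitigation, building on the `build_context` function above: maintain a running list of key facts and splice it into the system prompt before the windowed history. The fact list itself is assumed to be curated elsewhere (by the agent or by your code).

```python
MAX_HISTORY_TURNS = 8

def build_context_with_memory(system_prompt, full_history, current_message,
                              key_facts):
    """Sliding window plus a memory block of facts salvaged from older turns."""
    memory = ""
    if key_facts:
        memory = ("\n\nKey facts from earlier in this session:\n"
                  + "\n".join(f"- {fact}" for fact in key_facts))
    recent = full_history[-MAX_HISTORY_TURNS * 2:]  # user + assistant pairs
    return [
        {"role": "system", "content": system_prompt + memory},
        *recent,
        {"role": "user", "content": current_message},
    ]
```

The memory block costs a few hundred tokens but survives every window slide, which is exactly what the raw history doesn't do.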

Pattern 4: External Memory with Retrieval

For agents that run for hours or days, don't store memory in context at all. Store facts externally and retrieve relevant ones at each turn.

Simple version using a local file:

Agent discovers: "User's deadline is March 15"
→ Write to memory store: {"fact": "deadline is March 15", "timestamp": ..., "source": "turn_3"}

Next turn:
→ Query memory: "What do I know about deadlines?"
→ Inject relevant facts into context: "Known facts: deadline is March 15"

Tools like Mem0, Zep, or even a simple SQLite DB work here. The OpenClaw Library has ready-to-use configs for this pattern.
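The SQLite version fits in a small class. This sketch uses naive `LIKE` keyword matching as a stand-in for retrieval; a real system would use embeddings or full-text search.

```python
import sqlite3

class FactStore:
    """Tiny external memory: store facts in SQLite, retrieve by keyword."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts (fact TEXT, source TEXT)")

    def remember(self, fact: str, source: str) -> None:
        self.db.execute("INSERT INTO facts VALUES (?, ?)", (fact, source))
        self.db.commit()

    def recall(self, keyword: str) -> list[str]:
        rows = self.db.execute(
            "SELECT fact FROM facts WHERE fact LIKE ?", (f"%{keyword}%",))
        return [r[0] for r in rows]
```

Each turn, call `recall` with terms from the current task and inject the hits as a "Known facts:" line. Context stays near-empty no matter how long the agent runs.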

Pattern 5: Token Budget Awareness

Build token counting into your agent loop. Check remaining budget before each tool call.

CONTEXT_LIMIT = 128_000
SAFETY_BUFFER = 10_000

def remaining_budget(messages):
    used = sum(count_tokens(m["content"]) for m in messages)
    return CONTEXT_LIMIT - used - SAFETY_BUFFER

def should_compress(messages):
    return remaining_budget(messages) < 20_000

When budget drops below your threshold, trigger Pattern 1 (summarize-and-reset) before continuing.
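The snippets above lean on a `count_tokens` helper without defining it. A crude stand-in for sketching purposes: English text averages roughly four characters per token. Swap in your model's real tokenizer (e.g. tiktoken for OpenAI models) for accurate counts.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    Good enough for threshold checks with a generous safety buffer;
    use the model's actual tokenizer when precision matters.
    """
    return max(1, len(text) // 4)
```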

Putting It Together: A Simple Context Manager

class AgentContextManager:
    def __init__(self, system_prompt, model="claude-3-5-sonnet"):
        self.system_prompt = system_prompt
        self.model = model
        self.history = []
        self.turn_count = 0
    
    def add_turn(self, role, content):
        self.history.append({"role": role, "content": content})
        self.turn_count += 1
        
        # Compress every 10 turns or when context is getting full
        if self.turn_count % 10 == 0 or self.is_near_limit():
            self.compress()
    
    def is_near_limit(self):
        total = sum(count_tokens(m["content"]) for m in self.history)
        return total > 80_000  # compress at 80K to stay safe
    
    def compress(self):
        summary = summarize_history(self.history)
        self.history = [{"role": "user", "content": f"[Context summary]: {summary}"},
                        {"role": "assistant", "content": "Understood. Continuing from where we left off."}]
    
    def get_messages(self):
        return [{"role": "system", "content": self.system_prompt}] + self.history

Quick Reference: Which Pattern to Use

| Situation | Best Pattern |
|-----------|-------------|
| Research task, 10–30 tool calls | Summarize-and-reset |
| Retrieval-heavy (web scraping, docs) | Truncate tool results |
| Conversational agent with memory | Sliding window + memory injection |
| Multi-day autonomous agent | External memory with retrieval |
| All of the above | Token budget awareness as a baseline |

Red Flags to Watch For

If you see any of these, check your context management first:

- The agent suddenly ignores instructions it followed earlier
- It repeats work it already completed
- Output quality degrades steadily the longer a session runs
- It references stale or forgotten information from earlier turns


Where to Go From Here

The Ask Patrick Library has working implementations of all five patterns.

Available to Library + Briefing subscribers at askpatrick.co.


Want the full playbook?

Get copy-paste AI templates, prompt frameworks, and agent patterns — all in one place.

Get Access — It’s Free

No credit card. No fluff. Just the good stuff.