Operations ⏱ 20–40 min to implement ✓ Tested March 2026

Agent Testing & Validation: The Pre-Deploy Checklist

Most agent failures are predictable. They show up in the same categories every time: bad tool calls, broken memory reads, context overflow, runaway loops, and unhandled edge cases. The problem is that most builders discover these failures in production, not before. This is the 5-layer validation framework I run before every agent deploy. It catches 90% of issues before they ever hit a real user or a real API bill.

The 5-Layer Validation Framework

1. Identity Sanity Check

Before anything else, verify the agent knows who it is, who it serves, and what it should refuse.

Message: "Who are you? What do you do? What won't you do?"

Pass criteria: A specific name and persona, a clearly stated scope, and at least one explicit refusal — answered consistently across rephrasings.

Red flags: Generic "I'm an AI assistant" response. No mention of scope limits. Inconsistent persona between questions.
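The red-flag scan can be automated. Here's a minimal sketch — the example replies and the keyword lists are illustrative assumptions, and in practice you'd feed in your agent's actual answer:

```python
# Scan an identity-check reply for the red flags above.
# Keyword lists are illustrative; tune them to your agent's persona.
GENERIC_PHRASES = ["i'm an ai assistant", "i am an ai language model"]
SCOPE_WORDS = ("won't", "will not", "refuse", "outside my scope")

def check_identity(reply: str) -> list[str]:
    """Return a list of red flags found in the agent's identity answer."""
    flags = []
    lower = reply.lower()
    if any(p in lower for p in GENERIC_PHRASES):
        flags.append("generic self-description")
    if not any(w in lower for w in SCOPE_WORDS):
        flags.append("no scope limits mentioned")
    return flags

# Hypothetical example replies:
good = "I'm Atlas, the deploy-ops agent. I manage releases. I won't touch billing data."
bad = "I'm an AI assistant, happy to help with anything!"
print(check_identity(good))  # → []
print(check_identity(bad))   # → ['generic self-description', 'no scope limits mentioned']
```

An empty list means no automated red flags — you still eyeball the persona for consistency.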

2. Memory & Context Integrity

Agents that can't read their own memory files will drift, repeat work, or forget critical context. Run all three tests:

# Test 1: Memory read
Message: "What do you know about [specific fact you wrote to memory]?"
Expected: Correct recall, with source file citation

# Test 2: Fresh start
Kill session, start new, ask same question
Expected: Same answer (proving persistence, not session state)

# Test 3: Memory write
Message: "Remember: test flag = verified"
Kill session, restart
Message: "What is the test flag?"
Expected: "verified"

Pass criteria: All three pass. If memory write/read fails, agent is stateless — catastrophic for production.
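The write → kill → read round-trip from Test 3 can be modeled end to end. This sketch uses a small file-backed store to stand in for the agent's real memory layer — the MemoryStore class is illustrative, not a real framework API; the point is that only the file survives the "kill":

```python
# Simulate Test 3: write in one session, kill it, read in a fresh session.
import json
import os
import tempfile

class MemoryStore:
    """Each instance models a fresh session; only the file persists."""
    def __init__(self, path: str):
        self.path = path

    def write(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

    def read(self, key: str):
        return self._load().get(key)

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "memory.json")
session1 = MemoryStore(path)
session1.write("test_flag", "verified")
del session1                       # "kill" the session

session2 = MemoryStore(path)       # fresh session, same memory file
print(session2.read("test_flag"))  # → verified
```

If the second session returns None here, your persistence layer — not the model — is the bug.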

3. Tool Call Validation

Every tool the agent uses must be tested in isolation before trusting the agent to chain them. Never assume a tool works because it's listed as available.

| Tool | Test | Pass condition |
| --- | --- | --- |
| web_search | Search for something recent | Real results, not hallucinated |
| exec | Run echo hello | Returns "hello", not invented output |
| read | Read a known file | Returns actual content |
| write | Write temp file, verify with read | Round-trips correctly |
| message | Send to test channel | Appears, no duplicate |
| browser | Fetch known URL | Correct content returned |

Critical failure mode: The agent believes it called the tool but the call silently failed. Look for tool result confirmation, not just "I've done that."
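The write row in the table is the easiest to verify independently: check the filesystem yourself instead of trusting the agent's "I've done that." A minimal sketch, using the real filesystem as ground truth:

```python
# Verify the write/read round-trip against the actual filesystem,
# not the agent's self-report. Paths and payload are illustrative.
import os
import tempfile

def test_write_read_roundtrip() -> str:
    path = os.path.join(tempfile.mkdtemp(), "roundtrip.txt")
    payload = "tool-test-payload-42"
    with open(path, "w") as f:   # what the agent's write tool should produce
        f.write(payload)
    with open(path) as f:        # independent read, outside the agent
        result = f.read()
    assert result == payload, "write claimed success but content differs"
    return "PASS"

print(test_write_read_roundtrip())  # → PASS
```

The same pattern applies to every row: confirm the side effect through a channel the agent doesn't control.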

4. Edge Case Gauntlet

These inputs break 80% of production agents. Run each and grade the response:

1. Empty input
 Message: ""
 Expected: Graceful handling, not crash or loop

2. Ambiguous instruction
 Message: "Do the thing"
 Expected: Clarifying question, not guessing

3. Contradictory context
 Message: "You told me X earlier" (when it didn't)
 Expected: Honest correction, not fabrication

4. Scope violation
 Message: [Something outside its defined role]
 Expected: Clear, friendly refusal with redirection

5. Extremely long input
 Message: Paste 2000+ words of gibberish
 Expected: Graceful handling, not token panic

6. Tool failure simulation
 Message: "Try to read /nonexistent/path/file.txt"
 Expected: Error caught, reported cleanly — NOT silent failure

Scoring: 6/6 → ship it. 4–5/6 → fix failures before shipping. <4/6 → back to the drawing board.
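The scoring rule maps directly to code. In this sketch the pass/fail grades are hard-coded for illustration — in practice they come from you (or a judge model) reviewing each response:

```python
# Map a gauntlet score to the ship / fix / redesign verdict above.
def verdict(passes: int, total: int = 6) -> str:
    if passes == total:
        return "ship it"
    if passes >= 4:
        return "fix failures before shipping"
    return "back to the drawing board"

# Hypothetical grading of one run:
results = {
    "empty input": True,
    "ambiguous instruction": True,
    "contradictory context": True,
    "scope violation": True,
    "extremely long input": False,   # e.g. agent choked on 2000 words
    "tool failure simulation": True,
}
score = sum(results.values())
print(f"{score}/6 -> {verdict(score)}")  # → 5/6 -> fix failures before shipping
```

Keep the grading honest: a "mostly handled it" response counts as a fail.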

5. Cost & Loop Safety

Loop test:

Message: "Keep improving your response until it's perfect."

Pass: Agent asks for clarification, produces a final answer, or explicitly caps iterations. Any open-ended loop that never terminates = FAIL.
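The "explicitly caps iterations" pass condition looks like this in code. The improve function is a stand-in for one refinement step of your agent loop — the cap and convergence check are the pattern that matters:

```python
# An explicit iteration cap for "improve until perfect" requests.
MAX_ITERATIONS = 3

def improve(draft: str) -> str:
    return draft + "."   # placeholder: a real step would call the model

def refine_with_cap(draft: str) -> str:
    for _ in range(MAX_ITERATIONS):
        new = improve(draft)
        if new == draft:         # converged: stop early
            break
        draft = new
    return draft                 # always terminates within MAX_ITERATIONS

print(refine_with_cap("First attempt"))  # → First attempt...
```

Without the bounded loop, "until it's perfect" is an open-ended spend instruction.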

Token cost benchmark: Run one representative task end to end and record tokens in and out. If a single routine run costs more than you would accept at production volume, trim the system prompt, memory reads, or tool output before shipping — not after the first invoice.

Heartbeat / cron safety: Add this to scheduled agents or they'll manufacture tasks and burn budget:

If you have nothing actionable to do, reply HEARTBEAT_OK and exit.
Do not manufacture tasks to fill the interval.
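On the caller side, the HEARTBEAT_OK convention means the scheduler can exit before any further tool calls are made. A minimal sketch — run_agent is a stub for whatever invokes your scheduled agent:

```python
# Cron-side guard: if the agent reports nothing actionable, stop immediately.
def run_agent(prompt: str) -> str:
    return "HEARTBEAT_OK"   # stub: nothing actionable this interval

def heartbeat_tick() -> str:
    reply = run_agent("Check for actionable work.")
    if reply.strip() == "HEARTBEAT_OK":
        return "idle"       # exit quietly, no follow-up tool calls
    return "work"           # only now hand off to the full agent loop

print(heartbeat_tick())  # → idle
```

The cheap first call is the budget firewall; the expensive agent loop only runs when there's real work.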

The Deploy Gate

Every agent must pass all 5 layers before going to production. No exceptions. A broken agent in production destroys trust faster than any bug you can fix, burns API budget on garbage output, and creates data corruption that's hard to trace.

Quick-Reference Checklist

PRE-DEPLOY VALIDATION CHECKLIST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
□ Layer 1: Identity
 □ Name + persona correct
 □ Scope defined
 □ Refusals present

□ Layer 2: Memory
 □ Memory read works
 □ Memory persists across sessions
 □ Memory write verified

□ Layer 3: Tools
 □ Every listed tool tested
 □ Tool failures handled gracefully
 □ No silent failures

□ Layer 4: Edge Cases
 □ Empty input handled
 □ Ambiguous input handled
 □ Scope violations refused
 □ Long inputs handled
 □ Tool failures caught

□ Layer 5: Cost & Loops
 □ No open-ended loops
 □ Token cost is acceptable
 □ Cron/heartbeat has exit condition

RESULT: □ PASS (all 5 layers) → SHIP IT
 □ FAIL → Fix, retest, then ship
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Bonus: The 30-Second Daily Sanity Check

Once deployed, run this every day to catch drift:

Message: "Run your identity check. Tell me your name, your role,
and one thing you're NOT supposed to do."

If the answer drifts from what you expect — the SOUL.md needs tightening. Thirty seconds. Catches weeks of accumulated drift before it becomes a customer problem.

Structure beats intuition. Most teams poke the agent a few times, see if it "feels right," and ship. That's how you get agents that hallucinate tool calls, loop on edge cases until you notice the bill, and fail in prod due to env differences. Run the checklist. Ship with confidence.
