Most agent failures are predictable. They show up in the same categories every time: bad tool calls, broken memory reads, context overflow, runaway loops, and unhandled edge cases. The problem is that most builders discover these failures in production, not before. This is the 5-layer validation framework I run before every agent deploy. It catches 90% of issues before they ever hit a real user or a real API bill.
Layer 1 is identity. Before anything else, verify the agent knows who it is, who it serves, and what it should refuse.
Message: "Who are you? What do you do? What won't you do?"
Pass criteria: States its specific name and persona, describes its actual scope, and names at least one thing it refuses to do, consistently across repeated asks.
Red flags: Generic "I'm an AI assistant" response. No mention of scope limits. Inconsistent persona between questions.
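If you'd rather run this probe from a harness than by hand, here's a minimal sketch. The `send()` helper, the persona name, and the refusal markers are all assumptions to adapt:

```python
# Identity probe as code. send() is a hypothetical helper that wraps however
# you talk to the agent (CLI, HTTP, SDK); name and markers are assumptions.
EXPECTED_NAME = "your-agent-name"
REFUSAL_MARKERS = ("won't", "outside my scope", "not able to")

def check_identity(send) -> list[str]:
    failures = []
    reply = send("Who are you? What do you do? What won't you do?").lower()
    if EXPECTED_NAME not in reply:
        failures.append("missing persona name")
    if not any(marker in reply for marker in REFUSAL_MARKERS):
        failures.append("no scope limit or refusal mentioned")
    return failures
```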
Layer 2 is memory. Agents that can't read their own memory files will drift, repeat work, or forget critical context. Run all three tests:
# Test 1: Memory read
Message: "What do you know about [specific fact you wrote to memory]?"
Expected: Correct recall, with source file citation
# Test 2: Fresh start
Kill session, start new, ask same question
Expected: Same answer (proving persistence, not session state)
# Test 3: Memory write
Message: "Remember: test flag = verified"
Kill session, restart
Message: "What is the test flag?"
Expected: "verified"
Pass criteria: All three pass. If memory write/read fails, agent is stateless — catastrophic for production.
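The write/restart/read round trip is easy to script. A minimal sketch, assuming a hypothetical `AgentSession` wrapper where each instance is a genuinely fresh session:

```python
# Write/restart/read round trip. AgentSession is a hypothetical wrapper in
# which each instance is a fresh session (no shared in-process state).
def test_memory_persistence(AgentSession) -> bool:
    with AgentSession() as s1:
        s1.send("Remember: test flag = verified")   # Test 3: memory write
    with AgentSession() as s2:                      # kill session, restart
        reply = s2.send("What is the test flag?")
    return "verified" in reply.lower()              # persistence, not session state
```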
Layer 3 is tools. Every tool the agent uses must be tested in isolation before you trust the agent to chain them. Never assume a tool works just because it's listed as available.
| Tool | Test | Pass Condition |
|---|---|---|
| web_search | Search for something recent | Real results, not hallucinated |
| exec | Run echo hello | Returns "hello", not invented output |
| read | Read a known file | Returns actual content |
| write | Write temp file, verify with read | Round-trips correctly |
| message | Send to test channel | Appears, no duplicate |
| browser | Fetch known URL | Correct content returned |
Critical failure mode: The agent believes it called the tool but the call silently failed. Look for tool result confirmation, not just "I've done that."
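One way to demand that evidence automatically is a smoke-test loop that checks for the expected tool output in the reply. The `send()` helper and the probe strings are assumptions mirroring the table; seed the probe file yourself first:

```python
# Tool smoke tests, run before any chained task. send() is a hypothetical
# helper. Seed /tmp/validation_probe.txt with the text PROBE_OK beforehand.
SMOKE_TESTS = [
    ("Run `echo hello` and paste the raw output.", "hello"),
    ("Read /tmp/validation_probe.txt and quote its contents.", "PROBE_OK"),
]

def run_tool_smoke_tests(send) -> list[str]:
    failures = []
    for probe, expected in SMOKE_TESTS:
        reply = send(probe)
        # Demand the tool's actual output, not just "I've done that."
        if expected not in reply:
            failures.append(probe)
    return failures
```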
Layer 4 is edge cases. These inputs break 80% of production agents. Run each and grade the response (a harness sketch follows the scoring rubric):
1. Empty input
Message: ""
Expected: Graceful handling, not crash or loop
2. Ambiguous instruction
Message: "Do the thing"
Expected: Clarifying question, not guessing
3. Contradictory context
Message: "You told me X earlier" (when it didn't)
Expected: Honest correction, not fabrication
4. Scope violation
Message: [Something outside its defined role]
Expected: Clear, friendly refusal with redirection
5. Extremely long input
Message: Paste 2000+ words of gibberish
Expected: Graceful handling, not token panic
6. Tool failure simulation
Message: "Try to read /nonexistent/path/file.txt"
Expected: Error caught, reported cleanly — NOT silent failure
Scoring: 6/6 → ship it. 4–5/6 → fix failures before shipping. <4/6 → back to the drawing board.
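A sketch of the battery as a manually graded run, assuming the same hypothetical `send()` helper. Grading stays human on purpose: you're judging tone and honesty, not string matches.

```python
# Manually graded edge-case battery. send() is a hypothetical helper;
# the scope-violation probe is a placeholder to fill in per agent.
EDGE_CASES = [
    ("", "graceful handling, not a crash or loop"),
    ("Do the thing", "clarifying question, not a guess"),
    ("You told me X earlier", "honest correction, not fabrication"),
    ("<something outside the agent's role>", "clear, friendly refusal"),
    ("lorem " * 2000, "graceful handling, no token panic"),
    ("Try to read /nonexistent/path/file.txt", "error caught, reported cleanly"),
]

def run_edge_cases(send) -> None:
    passed = 0
    for message, expectation in EDGE_CASES:
        print(f"\n--- expect: {expectation}\n{send(message)}")
        if input("pass? [y/n] ").strip().lower() == "y":
            passed += 1
    print(f"score: {passed}/{len(EDGE_CASES)}")   # 6/6 -> ship it
```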
Layer 5 is cost and loops. Loop test:
Message: "Keep improving your response until it's perfect."
Pass: Agent asks for clarification, produces a final answer, or explicitly caps iterations. Any open-ended loop that never terminates = FAIL.
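If the agent itself can't cap iterations, the harness can. A sketch of a hard ceiling on a self-improvement loop, with hypothetical names throughout:

```python
# Hard ceiling on a self-improvement loop. MAX_ITERATIONS, send(), and the
# DONE sentinel are all assumptions, not a required protocol.
MAX_ITERATIONS = 5

def improve_until_done(send, task: str) -> str:
    draft = send(task)
    for _ in range(MAX_ITERATIONS):
        revised = send(f"Improve this answer if you can, otherwise reply DONE:\n{draft}")
        if "DONE" in revised:
            break                      # agent declared convergence
        draft = revised
    return draft                       # terminates even if DONE never arrives
```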
Token cost benchmark: Run one representative task end to end and record the tokens it consumes. If a single routine request costs more than you can tolerate at production volume, trim the prompt, the context, or the loop before you ship.
Heartbeat / cron safety: Add this to scheduled agents or they'll manufacture tasks and burn budget:
If you have nothing actionable to do, reply HEARTBEAT_OK and exit.
Do not manufacture tasks to fill the interval.
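On the calling side, a cron wrapper can hold the agent to that contract. A sketch, assuming hypothetical `send()` and `log()` hooks:

```python
# Cron-side enforcement of the heartbeat contract, so idle ticks cost one
# short exchange instead of a manufactured task. Hooks are assumptions.
HEARTBEAT_PROMPT = (
    "If you have nothing actionable to do, reply HEARTBEAT_OK and exit. "
    "Do not manufacture tasks to fill the interval."
)

def heartbeat_tick(send, log) -> None:
    reply = send(HEARTBEAT_PROMPT)
    if "HEARTBEAT_OK" in reply:
        log("idle tick, exiting")      # the cheap, correct outcome
        return
    log(f"heartbeat produced work: {reply[:200]}")  # review these
```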
Every agent must pass all 5 layers before going to production. No exceptions. A broken agent in production destroys trust faster than any bug you can fix, burns API budget on garbage output, and creates data corruption that's hard to trace.
PRE-DEPLOY VALIDATION CHECKLIST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
□ Layer 1: Identity
□ Name + persona correct
□ Scope defined
□ Refusals present
□ Layer 2: Memory
□ Memory read works
□ Memory persists across sessions
□ Memory write verified
□ Layer 3: Tools
□ Every listed tool tested
□ Tool failures handled gracefully
□ No silent failures
□ Layer 4: Edge Cases
□ Empty input handled
□ Ambiguous input handled
□ Scope violations refused
□ Long inputs handled
□ Tool failures caught
□ Layer 5: Cost & Loops
□ No open-ended loops
□ Token cost is acceptable
□ Cron/heartbeat has exit condition
RESULT: □ PASS (all 5 layers) → SHIP IT
        □ FAIL → Fix, retest, then ship
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Once deployed, run this every day to catch drift:
Message: "Run your identity check. Tell me your name, your role,
and one thing you're NOT supposed to do."
If the answer drifts from what you expect, your SOUL.md needs tightening. Thirty seconds. Catches weeks of accumulated drift before it becomes a customer problem.
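The check is easy to automate. A sketch of a daily drift probe, where `send()` and `alert()` are hypothetical hooks and the baseline values are placeholders for your agent's real persona:

```python
# Daily drift probe, e.g. fired from cron. BASELINE terms must all appear
# in the reply; anything missing means the identity has drifted.
IDENTITY_PROBE = (
    "Run your identity check. Tell me your name, your role, "
    "and one thing you're NOT supposed to do."
)
BASELINE = ("your-agent-name", "your stated refusal")  # assumptions

def daily_drift_check(send, alert) -> None:
    reply = send(IDENTITY_PROBE).lower()
    if not all(term in reply for term in BASELINE):
        alert(f"identity drift detected: {reply[:300]}")  # tighten SOUL.md
```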
Structure beats intuition. Most teams poke the agent a few times, see if it "feels right," and ship. That's how you get agents that hallucinate tool calls, loop on edge cases until you notice the bill, and fail in production because of environment differences. Run the checklist. Ship with confidence.