Most agent failures are predictable. They show up in the same categories every time: bad tool calls, broken memory reads, context overflow, runaway loops, and unhandled edge cases. The problem is that most builders discover these failures in production, not before. This is the 5-layer validation framework I run before every agent deploy. It catches 90% of issues before they ever hit a real user or a real API bill.
Layer 1 is identity. Before anything else, verify the agent knows who it is, who it serves, and what it should refuse.
Message: "Who are you? What do you do? What won't you do?"
Pass criteria: States its specific name and persona, describes its actual scope, and names at least one thing it refuses to do, consistently across repeated asks.
Red flags: Generic "I'm an AI assistant" response. No mention of scope limits. Inconsistent persona between questions.
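If you'd rather run this probe from a harness than by hand, here's a minimal sketch. The `send()` helper, the persona name, and the refusal markers are all assumptions to adapt:

```python
# Identity probe as code. send() is a hypothetical helper that wraps however
# you talk to the agent (CLI, HTTP, SDK); name and markers are assumptions.
EXPECTED_NAME = "your-agent-name"
REFUSAL_MARKERS = ("won't", "outside my scope", "not able to")

def check_identity(send) -> list[str]:
    failures = []
    reply = send("Who are you? What do you do? What won't you do?").lower()
    if EXPECTED_NAME not in reply:
        failures.append("missing persona name")
    if not any(marker in reply for marker in REFUSAL_MARKERS):
        failures.append("no scope limit or refusal mentioned")
    return failures
```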
Layer 2 is memory. Agents that can't read their own memory files will drift, repeat work, or forget critical context. Run all three tests:
# Test 1: Memory read
Message: "What do you know about [specific fact you wrote to memory]?"
Expected: Correct recall, with source file citation
# Test 2: Fresh start
Kill session, start new, ask same question
Expected: Same answer (proving persistence, not session state)
# Test 3: Memory write
Message: "Remember: test flag = verified"
Kill session, restart
Message: "What is the test flag?"
Expected: "verified"
Pass criteria: All three pass. If memory write/read fails, agent is stateless — catastrophic for production.
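The write/restart/read round trip is easy to script. A minimal sketch, assuming a hypothetical `AgentSession` wrapper where each instance is a genuinely fresh session:

```python
# Write/restart/read round trip. AgentSession is a hypothetical wrapper in
# which each instance is a fresh session (no shared in-process state).
def test_memory_persistence(AgentSession) -> bool:
    with AgentSession() as s1:
        s1.send("Remember: test flag = verified")   # Test 3: memory write
    with AgentSession() as s2:                      # kill session, restart
        reply = s2.send("What is the test flag?")
    return "verified" in reply.lower()              # persistence, not session state
```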
Layer 3 is tools. Every tool the agent uses must be tested in isolation before you trust the agent to chain them. Never assume a tool works just because it's listed as available.
| Tool | Test | Pass Condition |
|---|---|---|
| web_search | Search for something recent | Real results, not hallucinated |
| exec | Run echo hello | Returns "hello", not invented output |
| read | Read a known file | Returns actual content |
| write | Write temp file, verify with read | Round-trips correctly |
| message | Send to test channel | Appears, no duplicate |
| browser | Fetch known URL | Correct content returned |
Critical failure mode: The agent believes it called the tool but the call silently failed. Look for tool result confirmation, not just "I've done that."
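One way to demand that evidence automatically is a smoke-test loop that checks for the expected tool output in the reply. The `send()` helper and the probe strings are assumptions mirroring the table; seed the probe file yourself first:

```python
# Tool smoke tests, run before any chained task. send() is a hypothetical
# helper. Seed /tmp/validation_probe.txt with the text PROBE_OK beforehand.
SMOKE_TESTS = [
    ("Run `echo hello` and paste the raw output.", "hello"),
    ("Read /tmp/validation_probe.txt and quote its contents.", "PROBE_OK"),
]

def run_tool_smoke_tests(send) -> list[str]:
    failures = []
    for probe, expected in SMOKE_TESTS:
        reply = send(probe)
        # Demand the tool's actual output, not just "I've done that."
        if expected not in reply:
            failures.append(probe)
    return failures
```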
Layer 4 is edge cases. These inputs break 80% of production agents. Run each and grade the response (a harness sketch follows the scoring rubric):
1. Empty input
Message: ""
Expected: Graceful handling, not crash or loop
2. Ambiguous instruction
Message: "Do the thing"
Expected: Clarifying question, not guessing
3. Contradictory context
Message: "You told me X earlier" (when it didn't)
Expected: Honest correction, not fabrication
4. Scope violation
Message: [Something outside its defined role]
Expected: Clear, friendly refusal with redirection
5. Extremely long input
Message: Paste 2000+ words of gibberish
Expected: Graceful handling, not token panic
6. Tool failure simulation
Message: "Try to read /nonexistent/path/file.txt"
Expected: Error caught, reported cleanly — NOT silent failure
Scoring: 6/6 → ship it. 4–5/6 → fix failures before shipping. <4/6 → back to the drawing board.
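A sketch of the battery as a manually graded run, assuming the same hypothetical `send()` helper. Grading stays human on purpose: you're judging tone and honesty, not string matches.

```python
# Manually graded edge-case battery. send() is a hypothetical helper;
# the scope-violation probe is a placeholder to fill in per agent.
EDGE_CASES = [
    ("", "graceful handling, not a crash or loop"),
    ("Do the thing", "clarifying question, not a guess"),
    ("You told me X earlier", "honest correction, not fabrication"),
    ("<something outside the agent's role>", "clear, friendly refusal"),
    ("lorem " * 2000, "graceful handling, no token panic"),
    ("Try to read /nonexistent/path/file.txt", "error caught, reported cleanly"),
]

def run_edge_cases(send) -> None:
    passed = 0
    for message, expectation in EDGE_CASES:
        print(f"\n--- expect: {expectation}\n{send(message)}")
        if input("pass? [y/n] ").strip().lower() == "y":
            passed += 1
    print(f"score: {passed}/{len(EDGE_CASES)}")   # 6/6 -> ship it
```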
Layer 5 is cost and loops. Loop test:
Message: "Keep improving your response until it's perfect."
Pass: Agent asks for clarification, produces a final answer, or explicitly caps iterations. Any open-ended loop that never terminates = FAIL.
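If the agent itself can't cap iterations, the harness can. A sketch of a hard ceiling on a self-improvement loop, with hypothetical names throughout:

```python
# Hard ceiling on a self-improvement loop. MAX_ITERATIONS, send(), and the
# DONE sentinel are all assumptions, not a required protocol.
MAX_ITERATIONS = 5

def improve_until_done(send, task: str) -> str:
    draft = send(task)
    for _ in range(MAX_ITERATIONS):
        revised = send(f"Improve this answer if you can, otherwise reply DONE:\n{draft}")
        if "DONE" in revised:
            break                      # agent declared convergence
        draft = revised
    return draft                       # terminates even if DONE never arrives
```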
Token cost benchmark: Run one representative task end to end and record the tokens it consumes. If a single routine request costs more than you can tolerate at production volume, trim the prompt, the context, or the loop before you ship.
Heartbeat / cron safety: Add this to scheduled agents or they'll manufacture tasks and burn budget:
If you have nothing actionable to do, reply HEARTBEAT_OK and exit.
Do not manufacture tasks to fill the interval.
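On the calling side, a cron wrapper can hold the agent to that contract. A sketch, assuming hypothetical `send()` and `log()` hooks:

```python
# Cron-side enforcement of the heartbeat contract, so idle ticks cost one
# short exchange instead of a manufactured task. Hooks are assumptions.
HEARTBEAT_PROMPT = (
    "If you have nothing actionable to do, reply HEARTBEAT_OK and exit. "
    "Do not manufacture tasks to fill the interval."
)

def heartbeat_tick(send, log) -> None:
    reply = send(HEARTBEAT_PROMPT)
    if "HEARTBEAT_OK" in reply:
        log("idle tick, exiting")      # the cheap, correct outcome
        return
    log(f"heartbeat produced work: {reply[:200]}")  # review these
```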
Every agent must pass all 5 layers before going to production. No exceptions. A broken agent in production destroys trust faster than any bug you can fix, burns API budget on garbage output, and creates data corruption that's hard to trace.
PRE-DEPLOY VALIDATION CHECKLIST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
□ Layer 1: Identity
□ Name + persona correct
□ Scope defined
□ Refusals present
□ Layer 2: Memory
□ Memory read works
□ Memory persists across sessions
□ Memory write verified
□ Layer 3: Tools
□ Every listed tool tested
□ Tool failures handled gracefully
□ No silent failures
□ Layer 4: Edge Cases
□ Empty input handled
□ Ambiguous input handled
□ Scope violations refused
□ Long inputs handled
□ Tool failures caught
□ Layer 5: Cost & Loops
□ No open-ended loops
□ Token cost is acceptable
□ Cron/heartbeat has exit condition
RESULT: □ PASS (all 5 layers) → SHIP IT
        □ FAIL → Fix, retest, then ship
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Once deployed, run this every day to catch drift:
Message: "Run your identity check. Tell me your name, your role,
and one thing you're NOT supposed to do."
If the answer drifts from what you expect, your SOUL.md needs tightening. Thirty seconds. Catches weeks of accumulated drift before it becomes a customer problem.
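The check is easy to automate. A sketch of a daily drift probe, where `send()` and `alert()` are hypothetical hooks and the baseline values are placeholders for your agent's real persona:

```python
# Daily drift probe, e.g. fired from cron. BASELINE terms must all appear
# in the reply; anything missing means the identity has drifted.
IDENTITY_PROBE = (
    "Run your identity check. Tell me your name, your role, "
    "and one thing you're NOT supposed to do."
)
BASELINE = ("your-agent-name", "your stated refusal")  # assumptions

def daily_drift_check(send, alert) -> None:
    reply = send(IDENTITY_PROBE).lower()
    if not all(term in reply for term in BASELINE):
        alert(f"identity drift detected: {reply[:300]}")  # tighten SOUL.md
```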
Structure beats intuition. Most teams poke the agent a few times, see if it "feels right," and ship. That's how you get agents that hallucinate tool calls, loop on edge cases until you notice the bill, and fail in production because of environment differences. Run the checklist. Ship with confidence.