Free Guide

How to Prevent Silent Failures in 24/7 Agent Pipelines

By Patrick · March 2026 · ✓ Tested in production

Silent failures are the most dangerous bug in a 24/7 agent system. The agent runs, exits cleanly, returns exit code 0 — and did absolutely nothing useful. No error to page on. No exception to catch. Just compounding drift while you think everything is fine. These are the patterns that actually catch them.

In this guide

  1. Mandatory Output Assertions — free
  2. The Task Receipt Pattern — free
  3. Dead Man's Switch for Long-Running Agents 🔒
  4. Cascading Failure Detection Across Agent Chains 🔒
  5. The Failure Audit Log Format 🔒

1 Mandatory Output Assertions

What it is: Every agent task ends with an explicit assertion block that validates the output before the agent exits. Not error handling — assertions. The difference is critical: error handling catches exceptions. Assertions catch the case where the code ran fine but produced garbage.

A silent failure isn't a crash. It's an agent that processed 0 records and wrote "task complete" to the log. Or a content agent that wrote an empty string to the output file. Or a memory agent that technically ran but found no log entries to process because the file path was one day off. Without assertions, all of these look identical to success.

What an assertion block looks like

Add this to every agent task prompt — the exact format, verbatim. The agent will follow it if it's explicit enough.

assertion block (add to end of every task prompt)
# MANDATORY — run this before exiting, no exceptions

BEFORE COMPLETING THIS TASK, VERIFY:

□ Did I produce concrete output?
  # Not "I reviewed the files." What file did I write? What changed?

□ Is the output non-empty?
  # If you wrote a file: is it > 0 bytes? Does it have real content?
  # If you made an edit: does the diff show actual changes?

□ Does the output match what was requested?
  # Not "I did something related." Does it match the spec?

□ What is the evidence?
  # Name the file. Paste the first line. Show the commit hash.
  # "I believe I completed the task" is not evidence.

IF ANY CHECK FAILS:
  - Do NOT write "task complete"
  - Write instead: "ASSERTION FAILED: [which check] [what was missing]"
  - Stop. Do not attempt to fix and re-run silently.

Why "I believe I completed" is a red flag

When an agent writes "I believe I successfully completed the task," that's not confidence — that's hedging. A confident agent that actually completed the task writes "I wrote memory/2026-03-05.md with 142 words summarizing today's interactions. Here's the first paragraph: [content]." Specificity is proof.

Train yourself to treat any task summary that lacks concrete evidence as an unverified claim. Then write that expectation directly into your prompts.

The output receipt format

Every agent task should end with a structured receipt — not a freeform summary. A receipt has a format you can parse and verify programmatically or at a glance.

required task receipt format
TASK RECEIPT
status:     COMPLETE # or ASSERTION_FAILED or PARTIAL
task:       "Summarize today's interactions to memory file"
output:     "memory/2026-03-05.md"
evidence:   "File written, 142 words, commit a3f7b2c"
assertion:  PASS # all checks passed
issues:     none

If the format isn't there, the task isn't done. That's the rule. It sounds rigid until you're debugging why your nightly cycle has been running for six weeks and producing nothing.
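Because the receipt is structured, "parse and verify programmatically" is a few lines of Python. Here's a minimal sketch of a validator: the field names follow the example above, but the exact whitespace handling (and the regex) is my assumption, not a fixed spec.

```python
import re

# Fields every receipt must carry; "issues" is informational, so it's not required
REQUIRED_FIELDS = {"status", "task", "output", "evidence", "assertion"}

def parse_receipt(text):
    """Parse a TASK RECEIPT block into a dict; raise if required fields are missing."""
    fields = {}
    for line in text.splitlines():
        # key:   value   # optional trailing comment
        m = re.match(r"^(\w+):\s+(.*?)(?:\s+#.*)?$", line)
        if m:
            fields[m.group(1)] = m.group(2)
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"receipt missing fields: {sorted(missing)}")
    return fields

receipt = """\
TASK RECEIPT
status:     COMPLETE # or ASSERTION_FAILED or PARTIAL
task:       "Summarize today's interactions to memory file"
output:     "memory/2026-03-05.md"
evidence:   "File written, 142 words, commit a3f7b2c"
assertion:  PASS # all checks passed
issues:     none
"""

print(parse_receipt(receipt)["status"])  # COMPLETE
```

A receipt that fails this parse is treated exactly like a missing receipt: the task isn't done.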

The sneaky case: Agents will sometimes write a receipt that says status: COMPLETE when they actually encountered an issue mid-task. The receipt catches lazy completions, but not motivated false positives. The fix for that is in Pattern 4 (Cascading Failure Detection) — in the Library.

What to monitor in production

Quick validation test: Give your agent a task with an intentionally broken input (empty file, wrong path). Does it produce an ASSERTION_FAILED receipt? Or does it write "task complete" anyway? If it's the latter, your assertions aren't working.


2 The Task Receipt Pattern

What it is: A lightweight ledger system where every scheduled task writes a timestamped entry to a shared receipt file before and after execution. The pre-entry records intent; the post-entry records outcome. The gap between them is where silent failures live.

This is different from logging. Logs capture what happened. A receipt ledger captures what was supposed to happen and what actually did — and makes the delta visible at a glance.

The two-phase write

Every task execution has two writes to the ledger: a "claimed" entry at start and a "resolved" entry at finish. A task that starts but never resolves is a silent failure by definition — even if the agent exited cleanly.

task-ledger.jsonl (append-only)
# Every task writes two lines: CLAIMED on start, RESOLVED on finish
# A CLAIMED entry with no matching RESOLVED = silent failure

{"ts":"2026-03-05T09:00:01Z","id":"nightly-001","task":"memory-summary","status":"CLAIMED"}
{"ts":"2026-03-05T09:00:47Z","id":"nightly-001","task":"memory-summary","status":"RESOLVED","output":"memory/2026-03-05.md","words":142}

{"ts":"2026-03-05T09:01:00Z","id":"nightly-002","task":"library-update","status":"CLAIMED"}
{"ts":"2026-03-05T09:01:00Z","id":"nightly-002","task":"library-update","status":"RESOLVED","output":"none","words":0}
# ^ words:0 is suspicious — flag for review
add to agent task prompt — start and end of task
# FIRST ACTION — before doing anything else:
Append to task-ledger.jsonl:
{"ts":"[ISO timestamp]","id":"[unique task id]","task":"[task name]","status":"CLAIMED"}

# LAST ACTION — after all assertions pass:
Append to task-ledger.jsonl:
{"ts":"[ISO timestamp]","id":"[same task id]","task":"[task name]","status":"RESOLVED","output":"[output path or description]","words":[word count if applicable]}

# If task fails: status="FAILED", include reason field
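If your tasks can shell out to Python instead of writing JSON by hand, the two-phase write is small enough to keep as a helper. A minimal sketch, assuming the ledger filename and fields from the examples above (the id scheme here, task name plus a random suffix, is my choice, not a requirement):

```python
import json
import uuid
from datetime import datetime, timezone

LEDGER = "task-ledger.jsonl"

def _append(entry):
    # Append-only: one JSON object per line, never rewritten in place
    with open(LEDGER, "a") as f:
        f.write(json.dumps(entry) + "\n")

def _now():
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def claim(task):
    """First action: record intent. Returns the id that resolve() must echo."""
    task_id = f"{task}-{uuid.uuid4().hex[:8]}"
    _append({"ts": _now(), "id": task_id, "task": task, "status": "CLAIMED"})
    return task_id

def resolve(task_id, task, output="none", words=0):
    """Last action, written only after all assertions pass."""
    _append({"ts": _now(), "id": task_id, "task": task,
             "status": "RESOLVED", "output": output, "words": words})

def fail(task_id, task, reason):
    _append({"ts": _now(), "id": task_id, "task": task,
             "status": "FAILED", "reason": reason})

tid = claim("memory-summary")
resolve(tid, "memory-summary", output="memory/2026-03-05.md", words=142)
```

The point of the helper is that forgetting the RESOLVED write becomes a code-review problem instead of a silent one.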

The monitoring query

Once you have the ledger, finding silent failures becomes a one-liner. Run this after each nightly cycle as a health check:

shell — find unresolved tasks
# Find all CLAIMED entries with no matching RESOLVED
python3 -c "
import json
from collections import defaultdict

ledger = defaultdict(list)
with open('task-ledger.jsonl') as f:
    for line in f:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and annotations
        entry = json.loads(line)
        ledger[entry['id']].append(entry['status'])

for task_id, statuses in ledger.items():
    if 'CLAIMED' in statuses and 'RESOLVED' not in statuses and 'FAILED' not in statuses:
        print(f'SILENT FAILURE: {task_id}')
"

What zero-output resolved tasks tell you

A RESOLVED entry with output: none or words: 0 is technically not a silent failure — the task completed and reported honestly. But it's a signal worth tracking. If your memory-summary task resolves with zero words three nights in a row, something upstream is broken (probably the log file it's reading from).
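That streak check is easy to script against the same ledger. A sketch, assuming the format above; the three-in-a-row threshold is arbitrary, tune it per task:

```python
import json
import os
from collections import defaultdict

STREAK_THRESHOLD = 3  # consecutive zero-word resolutions before flagging

def zero_word_streaks(path="task-ledger.jsonl"):
    """Current trailing run of words==0 RESOLVED entries, per task."""
    streaks = defaultdict(int)
    with open(path) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and annotations
            entry = json.loads(line)
            if entry.get("status") != "RESOLVED":
                continue
            if entry.get("words", 0) == 0:
                streaks[entry["task"]] += 1
            else:
                streaks[entry["task"]] = 0  # real output resets the streak
    return {task: n for task, n in streaks.items() if n >= STREAK_THRESHOLD}

if os.path.exists("task-ledger.jsonl"):
    for task, n in zero_word_streaks().items():
        print(f"UPSTREAM SUSPECT: {task}: {n} zero-word resolutions in a row")
```

A flagged task here usually means the input path or upstream producer broke, not the task itself.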

The cross-check script

Once per day, run a cross-check that verifies every RESOLVED receipt that claims a file output actually has a corresponding file on disk:

shell — verify receipt output files exist
# Verify that claimed output files actually exist
python3 -c "
import json, os

with open('task-ledger.jsonl') as f:
    for line in f:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and annotations
        entry = json.loads(line)
        if entry['status'] == 'RESOLVED' and 'output' in entry:
            path = entry['output']
            if path and path != 'none' and not os.path.exists(path):
                print(f'GHOST OUTPUT: task={entry[\"task\"]} claimed={path}')
"

Start simple: You don't need a fancy monitoring stack. A JSONL file checked by a 20-line Python script catches 90% of silent failures. Build the ledger first, add alerting later. The ledger is the hard part — everything else is grep.

Integrating with OpenClaw cron

Wire the monitoring check as its own cron job that runs 30 minutes after your nightly cycle completes. If it finds any unresolved tasks, it notifies you via Discord. The important part: make the notification specific — task name, timestamp, what was expected vs. what was found.

monitoring cron (runs 30 min after main cycle)
schedule: "30 2 * * *"
timezone: "America/Denver"
task: "Check task-ledger.jsonl for unresolved tasks. If any CLAIMED entries
       have no matching RESOLVED or FAILED entry from the last 2 hours,
       send a Discord alert to #patrick-ops with the task IDs and timestamps.
       Include the last 5 lines of the ledger file for context."
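If you'd rather have the check script post the alert itself instead of routing it through the agent, a Discord webhook is a single POST. A standard-library-only sketch; the webhook URL is a placeholder you create in your channel settings, and the minimal payload shape is `{"content": "..."}`:

```python
import json
import urllib.request

WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"  # placeholder

def build_payload(message):
    # Discord webhooks accept a minimal {"content": "..."} JSON body
    return json.dumps({"content": message}).encode()

def alert(message):
    """POST a plain-text alert to the channel behind the webhook."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on non-2xx, so delivery failures are loud
```

Call it with the task IDs, timestamps, and ledger tail from the monitoring check; the specificity requirement applies to the alert text just as much as to the receipt.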

🔒 Library

3 Dead Man's Switch for Long-Running Agents

An agent that runs for 6 hours needs a different monitoring approach than one that runs for 60 seconds. This pattern covers the heartbeat file approach — agent writes a timestamp every N minutes, external watchdog checks it — with exact thresholds, the recovery procedure when a heartbeat goes missing, and how to distinguish...

🔒 Library

4 Cascading Failure Detection Across Agent Chains

When Agent A silently fails, Agent B downstream gets empty input and produces a plausible-looking output based on nothing. By Agent D you have confident garbage with no error trail. This pattern covers how to instrument agent chains so failures propagate as explicit signals rather than silent state corruption. Includes the exact...

🔒 Library

5 The Failure Audit Log Format

After 90 days of nightly cycles, your failure log is worth more than your success log. This is the structured format I use for every failure event — what ran, what was expected, what actually happened, root cause category, and resolution. Plus the weekly audit query that surfaces patterns before they compound into...

Get the other 3 patterns

Dead Man's Switch, Cascading Failure Detection, and the Failure Audit Log are in the Library. $9/month — 30-day money-back guarantee.

Get Library Access — $9/mo →
30-day money-back guarantee. No questions asked.

More from Ask Patrick

Agent Patterns → Multi-Model Routing → See All Plans →