Debugging & Reliability ⏱ 7 min read ✓ Tested March 2026

Debugging a Stuck Agent: The 5-Step Diagnostic Protocol

It's 2:00 AM. Your nightly agent was supposed to finish by 2:30. It's now 3:45. No output. No error. No heartbeat update. The process is either silently looping, waiting on something that will never arrive, or simply dead. You have to diagnose this without a debugger, often without logs, often without knowing what it was doing when it stopped. Here's the exact protocol I use. It resolves 90% of stuck agent incidents in under 15 minutes.

Golden Rule: Do not restart the agent until you've completed at least Step 3. A restart clears the evidence. Session context, in-progress state, and the failure mode all disappear. You'll fix nothing — you'll just move the problem to the next cycle. Collect the data first. Restart second.

Step 1: Establish Actual State (~2 minutes)

Don't assume the agent is stuck. Verify.

# Is the process actually running?
ps aux | grep "openclaw\|node\|python" | grep -v grep

# CPU usage — is anything consuming cycles? (Linux batch mode; on macOS use `top -l 1`)
top -b -n 1 | grep -E "node|python|openclaw"

# When did the heartbeat last update?
cat shared/status/agentname.heartbeat
date -u

# What files were modified recently?
find /workspace -newer /workspace/shared/status/agentname.heartbeat -type f | head -20

The recently modified files tell you what the agent last touched before going silent. Often points directly at the stuck operation.
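The same freshness check can run inside your own monitoring. A minimal sketch in Python, assuming the heartbeat path from the commands above (the path and the 10-minute staleness threshold are this article's examples, not a fixed convention):

```python
import os
import time

def heartbeat_age_seconds(path):
    """Seconds since the heartbeat file was last touched, or None if it's missing."""
    try:
        return time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return None

def is_stale(path, max_age=600):
    """Treat a missing heartbeat, or one older than max_age seconds, as stuck."""
    age = heartbeat_age_seconds(path)
    return age is None or age > max_age
```

A missing heartbeat file is treated the same as a stale one: either way, the agent hasn't checked in.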

Step 2: Classify the Failure Mode (~3 minutes)

Based on Step 1, you're dealing with one of five modes:

| Symptom | Mode | Jump To |
| --- | --- | --- |
| Process gone, no output | Crash | Step 4 → check crash logs |
| Running, CPU 0%, no output | Blocked | Step 3 → find what it's waiting on |
| Running, CPU 100%, no output | Loop | Step 3 → find the loop condition |
| Process gone, partial output | OOM / timeout kill | Step 4 → check system logs |
| Running, output is wrong | Logic error | Step 5 → trace execution |

Write down which mode you're in. It determines the next 10 minutes.
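The decision table above reduces to a small function if you want to automate triage. A sketch (the mode labels and the 5% CPU threshold are my assumptions, not a standard):

```python
def classify(process_alive, cpu_percent, has_output, output_correct=True):
    """Map the Step 1 observations onto the five failure modes."""
    if not process_alive:
        # Partial output before death suggests the OS killed it mid-run.
        return "oom-or-timeout-kill" if has_output else "crash"
    if not has_output:
        # Alive but silent: idle CPU means blocked, pegged CPU means looping.
        return "blocked" if cpu_percent < 5 else "loop"
    return "ok" if output_correct else "logic-error"
```

Feed it the answers from the `ps`, `top`, and heartbeat checks and it names the mode for you.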

Step 3: Find What It's Waiting On (~5 minutes)

For Blocked Agents (CPU 0%)

# Check for stale lockfiles — directory-style locks created with mkdir
find /workspace -name "*.lock" -type d

# ...and file-style locks, oldest first (GNU find; BSD/macOS find lacks -printf)
find /workspace -name "*.lock" -printf "%T@ %p\n" | sort -n

# A lockfile >10 min old during an active cycle is almost certainly stale.
# Safe to remove:
rm -rf /workspace/path/to/resource.lock

# Is it waiting on a hung API call?
# Check open connections for the process PID
lsof -p <PID> -i | grep ESTABLISHED

A connection to api.anthropic.com or similar held for 10+ minutes = hung API call. The agent needs a timeout — it doesn't have one. Kill and restart with a timeout added.
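If the agent code can't easily pass a timeout to its HTTP library, any blocking call can be wrapped. A sketch using a worker thread (function name and defaults are mine; note the hung worker is abandoned, not killed — Python can't forcibly stop a thread):

```python
import concurrent.futures

def call_with_timeout(fn, *args, timeout=30, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread.

    Returns its result, or raises TimeoutError after `timeout` seconds
    instead of letting the caller hang forever.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    finally:
        # Return control immediately; don't block waiting for a hung worker.
        pool.shutdown(wait=False)
```

Prefer the library's native timeout where one exists (e.g. `requests.get(url, timeout=30)`); the wrapper only stops your agent from waiting — it doesn't close the hung connection.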

# Is it waiting on a file that will never arrive?
lsof -p <PID> | grep -E "REG|FIFO"
wc -l /workspace/agents/agentname/queue/tasks.jsonl

For Looping Agents (CPU 100%)

An infinite loop in an agent almost always traces back to a retry with no maximum attempts, a poll on a condition that can never become true, or the agent re-processing its own output. To narrow it down, first check whether it's generating output anywhere:

# Is it generating output anywhere?
tail -f /workspace/memory/$(date +%Y-%m-%d).md

Sometimes a looping agent IS writing — it's just writing the wrong thing in a loop. Watching the daily memory file live often reveals the exact retry pattern.
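Spotting that retry pattern can be automated. A sketch that flags any line repeated heavily in the tail of a log (the window and threshold defaults are arbitrary choices of mine):

```python
from collections import Counter

def detect_retry_pattern(log_path, window=100, threshold=5):
    """Return lines repeated `threshold`+ times in the last `window` lines of the log."""
    with open(log_path) as f:
        tail = [line.strip() for line in f][-window:]
    return [line for line, count in Counter(tail).items()
            if line and count >= threshold]
```

Run it against the daily memory file; a single line repeated dozens of times is usually the exact call being retried.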

Step 4: Check Logs and Crash Evidence (~3 minutes)

# OpenClaw session logs
openclaw logs --tail 100

# Find log files written in the last two hours
find ~/.openclaw -name "*.log" -mmin -120 -type f -exec tail -n 50 {} +

# macOS system logs — OOM kills
log show --predicate 'eventMessage contains "killed"' --last 2h

# Linux
journalctl -k --since "2 hours ago" | grep -i "killed\|oom"
dmesg | tail -50 | grep -i "killed\|oom"

An OOM kill will show the process name and memory at kill time. Fix: stop loading entire files into memory. Stream or chunk instead.
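The streaming fix is mechanical: never call `.read()` on a file that might be large. A sketch, assuming a line-oriented file like the `tasks.jsonl` queue mentioned earlier:

```python
def stream_lines(path):
    """Yield one line at a time; memory use stays flat regardless of file size."""
    with open(path) as f:
        for line in f:          # iterating the handle reads lazily
            yield line.rstrip("\n")

def read_chunks(path, chunk_size=1 << 20):
    """Yield fixed-size binary chunks (default 1 MiB) for non-line-oriented files."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```

Either generator keeps peak memory proportional to one line or one chunk, not to the whole file.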

# Check for crash output
cat /tmp/agentname-stderr.log 2>/dev/null || echo "No stderr log found"

A missing stderr log is itself data — the crash happened before the process could write anything, pointing to import errors, syntax errors, or permission issues at startup.

Step 5: Trace the Last Execution (~2 minutes)

# Read the last daily memory file
cat /workspace/memory/$(date +%Y-%m-%d).md

# Check for partial output files (parentheses group the -o alternatives)
find /workspace \( -name "*.partial" -o -name "*.tmp" -o -name "*.wip" \) -exec ls -la {} + 2>/dev/null

Agents that write-then-rename (good practice) leave .tmp files on crash. These tell you exactly what the agent was writing when it died. The last tool call before silence is usually the point of failure.
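If your agent doesn't already write-then-rename, it should. A minimal sketch of the pattern (the `.tmp` suffix matches the convention above):

```python
import os

def atomic_write(path, data):
    """Write to a sibling .tmp file, then rename it into place.

    Readers never see a half-written file; a crash mid-write leaves
    only the .tmp behind as evidence.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())    # force bytes to disk before the rename
    os.replace(tmp, path)       # atomic on POSIX filesystems
```

`os.replace` overwrites the destination atomically, so a reader sees either the old complete file or the new one, never a mix.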

Restart Checklist

Before restarting, confirm all of these:

- You know which failure mode you hit (Step 2).
- You've captured the logs and crash evidence you need (Step 4).
- You know, or have at least recorded, what the agent was doing when it stopped (Step 5).
- You've applied a fix, or written down the root cause for the post-mortem.

Only then restart:

# Clear stale state
find /workspace -name "*.lock" -mmin +10 -exec rm -rf {} +

# Restart the agent
openclaw restart agentname

Post-Mortem: Fix It So It Doesn't Repeat

Every stuck agent incident has a preventable root cause. The five most common and their fixes:

| Root Cause | Fix |
| --- | --- |
| API call with no timeout | Add timeout=30 to every external call |
| Retry loop with no max | Add max_attempts=3 to every retry |
| Stale lockfile from crash | Add stale lock cleanup at agent startup |
| OOM from loading large file | Stream or chunk file reads |
| Waiting on empty queue | Add queue empty check before blocking read |
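The retry fix is worth standardizing across every agent. A sketch of a bounded retry with exponential backoff (the names and defaults are mine, not from any particular library):

```python
import time

def retry(fn, max_attempts=3, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff.

    Re-raises the last error after max_attempts instead of looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise               # bounded: give up instead of spinning
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The key property is the `raise` on the last attempt: a failure surfaces as a crash with a traceback (Step 4 territory) rather than a silent 100%-CPU loop.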

Apply the fix immediately. Then write it to MEMORY.md so future-you doesn't debug the same thing again. The goal isn't to never have stuck agents. It's to never be confused about why.
