Debugging & Reliability ⏱ 7 min read ✓ Tested March 2026

Debugging a Stuck Agent: The 5-Step Diagnostic Protocol

It's 2:00 AM. Your nightly agent was supposed to finish by 2:30. It's now 3:45. No output. No error. No heartbeat update. The process is either silently looping, waiting on something that will never arrive, or simply dead. You have to diagnose this without a debugger, often without logs, often without knowing what it was doing when it stopped. Here's the exact protocol I use. It resolves 90% of stuck agent incidents in under 15 minutes.

Golden Rule: Do not restart the agent until you've completed at least Step 3. A restart clears the evidence. Session context, in-progress state, and the failure mode all disappear. You'll fix nothing — you'll just move the problem to the next cycle. Collect the data first. Restart second.

Step 1: Establish Actual State (~2 minutes)

Don't assume the agent is stuck. Verify.

# Is the process actually running?
ps aux | grep "openclaw\|node\|python" | grep -v grep

# CPU usage — is anything consuming cycles? (Linux batch mode; on macOS use `top -l 1`)
top -b -n 1 | grep -E "node|python|openclaw"

# When did the heartbeat last update?
cat shared/status/agentname.heartbeat
date -u

# What files were modified recently?
find /workspace -newer /workspace/shared/status/agentname.heartbeat -type f | head -20

The recently modified files tell you what the agent last touched before going silent. Often points directly at the stuck operation.
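The same freshness check can run inside your own monitoring. A minimal sketch in Python, assuming the heartbeat path from the commands above (the path and the 10-minute staleness threshold are this article's examples, not a fixed convention):

```python
import os
import time

def heartbeat_age_seconds(path):
    """Seconds since the heartbeat file was last touched, or None if it's missing."""
    try:
        return time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return None

def is_stale(path, max_age=600):
    """Treat a missing heartbeat, or one older than max_age seconds, as stuck."""
    age = heartbeat_age_seconds(path)
    return age is None or age > max_age
```

A missing heartbeat file is treated the same as a stale one: either way, the agent hasn't checked in.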

Step 2: Classify the Failure Mode (~3 minutes)

Based on Step 1, you're dealing with one of five modes:

| Symptom | Mode | Jump To |
| --- | --- | --- |
| Process gone, no output | Crash | Step 4 → check crash logs |
| Running, CPU 0%, no output | Blocked | Step 3 → find what it's waiting on |
| Running, CPU 100%, no output | Loop | Step 3 → find the loop condition |
| Process gone, partial output | OOM / timeout kill | Step 4 → check system logs |
| Running, output is wrong | Logic error | Step 5 → trace execution |

Write down which mode you're in. It determines the next 10 minutes.
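The decision table above reduces to a small function if you want to automate triage. A sketch (the mode labels and the 5% CPU threshold are my assumptions, not a standard):

```python
def classify(process_alive, cpu_percent, has_output, output_correct=True):
    """Map the Step 1 observations onto the five failure modes."""
    if not process_alive:
        # Partial output before death suggests the OS killed it mid-run.
        return "oom-or-timeout-kill" if has_output else "crash"
    if not has_output:
        # Alive but silent: idle CPU means blocked, pegged CPU means looping.
        return "blocked" if cpu_percent < 5 else "loop"
    return "ok" if output_correct else "logic-error"
```

Feed it the answers from the `ps`, `top`, and heartbeat checks and it names the mode for you.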

Step 3: Find What It's Waiting On (~5 minutes)

For Blocked Agents (CPU 0%)

# Check for stale lockfiles — directory-style locks created with mkdir
find /workspace -name "*.lock" -type d

# ...and file-style locks, oldest first (GNU find; BSD/macOS find lacks -printf)
find /workspace -name "*.lock" -printf "%T@ %p\n" | sort -n

# A lockfile >10 min old during an active cycle is almost certainly stale.
# Safe to remove:
rm -rf /workspace/path/to/resource.lock

# Is it waiting on a hung API call?
# Check open connections for the process PID
lsof -p <PID> -i | grep ESTABLISHED

A connection to api.anthropic.com or similar held for 10+ minutes = hung API call. The agent needs a timeout — it doesn't have one. Kill and restart with a timeout added.
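If the agent code can't easily pass a timeout to its HTTP library, any blocking call can be wrapped. A sketch using a worker thread (function name and defaults are mine; note the hung worker is abandoned, not killed — Python can't forcibly stop a thread):

```python
import concurrent.futures

def call_with_timeout(fn, *args, timeout=30, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread.

    Returns its result, or raises TimeoutError after `timeout` seconds
    instead of letting the caller hang forever.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    finally:
        # Return control immediately; don't block waiting for a hung worker.
        pool.shutdown(wait=False)
```

Prefer the library's native timeout where one exists (e.g. `requests.get(url, timeout=30)`); the wrapper only stops your agent from waiting — it doesn't close the hung connection.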

# Is it waiting on a file that will never arrive?
lsof -p <PID> | grep -E "REG|FIFO"
wc -l /workspace/agents/agentname/queue/tasks.jsonl

For Looping Agents (CPU 100%)

An infinite loop in an agent almost always traces back to a retry with no maximum attempts, a poll on a condition that can never become true, or the agent re-processing its own output. To narrow it down, first check whether it's generating output anywhere:

# Is it generating output anywhere?
tail -f /workspace/memory/$(date +%Y-%m-%d).md

Sometimes a looping agent IS writing — it's just writing the wrong thing in a loop. Watching the daily memory file live often reveals the exact retry pattern.
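Spotting that retry pattern can be automated. A sketch that flags any line repeated heavily in the tail of a log (the window and threshold defaults are arbitrary choices of mine):

```python
from collections import Counter

def detect_retry_pattern(log_path, window=100, threshold=5):
    """Return lines repeated `threshold`+ times in the last `window` lines of the log."""
    with open(log_path) as f:
        tail = [line.strip() for line in f][-window:]
    return [line for line, count in Counter(tail).items()
            if line and count >= threshold]
```

Run it against the daily memory file; a single line repeated dozens of times is usually the exact call being retried.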

Step 4: Check Logs and Crash Evidence (~3 minutes)

# OpenClaw session logs
openclaw logs --tail 100

# Find log files written in the last two hours
find ~/.openclaw -name "*.log" -mmin -120 -type f -exec tail -n 50 {} +

# macOS system logs — OOM kills
log show --predicate 'eventMessage contains "killed"' --last 2h

# Linux
journalctl -k --since "2 hours ago" | grep -i "killed\|oom"
dmesg | tail -50 | grep -i "killed\|oom"

An OOM kill will show the process name and memory at kill time. Fix: stop loading entire files into memory. Stream or chunk instead.
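The streaming fix is mechanical: never call `.read()` on a file that might be large. A sketch, assuming a line-oriented file like the `tasks.jsonl` queue mentioned earlier:

```python
def stream_lines(path):
    """Yield one line at a time; memory use stays flat regardless of file size."""
    with open(path) as f:
        for line in f:          # iterating the handle reads lazily
            yield line.rstrip("\n")

def read_chunks(path, chunk_size=1 << 20):
    """Yield fixed-size binary chunks (default 1 MiB) for non-line-oriented files."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```

Either generator keeps peak memory proportional to one line or one chunk, not to the whole file.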

# Check for crash output
cat /tmp/agentname-stderr.log 2>/dev/null || echo "No stderr log found"

A missing stderr log is itself data — the crash happened before the process could write anything, pointing to import errors, syntax errors, or permission issues at startup.

Step 5: Trace the Last Execution (~2 minutes)

# Read the last daily memory file
cat /workspace/memory/$(date +%Y-%m-%d).md

# Check for partial output files (parentheses group the -o alternatives)
find /workspace \( -name "*.partial" -o -name "*.tmp" -o -name "*.wip" \) -exec ls -la {} + 2>/dev/null

Agents that write-then-rename (good practice) leave .tmp files on crash. These tell you exactly what the agent was writing when it died. The last tool call before silence is usually the point of failure.
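If your agent doesn't already write-then-rename, it should. A minimal sketch of the pattern (the `.tmp` suffix matches the convention above):

```python
import os

def atomic_write(path, data):
    """Write to a sibling .tmp file, then rename it into place.

    Readers never see a half-written file; a crash mid-write leaves
    only the .tmp behind as evidence.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())    # force bytes to disk before the rename
    os.replace(tmp, path)       # atomic on POSIX filesystems
```

`os.replace` overwrites the destination atomically, so a reader sees either the old complete file or the new one, never a mix.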

Restart Checklist

Before restarting, confirm all of these:

- You know which failure mode you hit (Step 2).
- You've captured the logs and crash evidence you need (Step 4).
- You know, or have at least recorded, what the agent was doing when it stopped (Step 5).
- You've applied a fix, or written down the root cause for the post-mortem.

Only then restart:

# Clear stale state
find /workspace -name "*.lock" -mmin +10 -exec rm -rf {} +

# Restart the agent
openclaw restart agentname

Post-Mortem: Fix It So It Doesn't Repeat

Every stuck agent incident has a preventable root cause. The five most common and their fixes:

| Root Cause | Fix |
| --- | --- |
| API call with no timeout | Add timeout=30 to every external call |
| Retry loop with no max | Add max_attempts=3 to every retry |
| Stale lockfile from crash | Add stale lock cleanup at agent startup |
| OOM from loading large file | Stream or chunk file reads |
| Waiting on empty queue | Add queue empty check before blocking read |
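The retry fix is worth standardizing across every agent. A sketch of a bounded retry with exponential backoff (the names and defaults are mine, not from any particular library):

```python
import time

def retry(fn, max_attempts=3, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff.

    Re-raises the last error after max_attempts instead of looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise               # bounded: give up instead of spinning
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The key property is the `raise` on the last attempt: a failure surfaces as a crash with a traceback (Step 4 territory) rather than a silent 100%-CPU loop.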

Apply the fix immediately. Then write it to MEMORY.md so future-you doesn't debug the same thing again. The goal isn't to never have stuck agents. It's to never be confused about why.
