It's 2:00 AM. Your nightly agent was supposed to finish by 2:30. It's now 3:45. No output. No error. No heartbeat update. The process is either silently looping, waiting on something that will never arrive, or simply dead. You have to diagnose this without a debugger, often without logs, often without knowing what it was doing when it stopped. Here's the exact protocol I use. It resolves 90% of stuck agent incidents in under 15 minutes.
Golden Rule: Do not restart the agent until you've completed at least Step 3. A restart clears the evidence. Session context, in-progress state, and the failure mode all disappear. You'll fix nothing — you'll just move the problem to the next cycle. Collect the data first. Restart second.
Don't assume the agent is stuck. Verify.
# Is the process actually running?
ps aux | grep "openclaw\|node\|python" | grep -v grep
# CPU usage — is anything consuming cycles?
top -b -n 1 | grep -E "node|python|openclaw"
# When did the heartbeat last update?
cat shared/status/agentname.heartbeat
date -u
# What files were modified recently?
find /workspace -newer /workspace/shared/status/agentname.heartbeat -type f | head -20
The recently modified files tell you what the agent last touched before going silent. Often points directly at the stuck operation.
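The heartbeat-age check above is easy to script so a watchdog can run it for you. A minimal sketch (the heartbeat path and the 5-minute threshold are assumptions; use whatever your agent actually writes):

```python
import os
import time

def heartbeat_age_seconds(path):
    """Seconds since the heartbeat file was last written."""
    return time.time() - os.path.getmtime(path)

def is_stale(path, threshold=300):
    """True if the heartbeat is older than threshold seconds (default 5 min)."""
    return heartbeat_age_seconds(path) > threshold
```

Wire this into cron or a monitor loop and you find out at 2:35, not 3:45.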
Based on Step 1, you're dealing with one of five modes:
| Symptom | Mode | Jump To |
|---|---|---|
| Process gone, no output | Crash | Step 4 → check crash logs |
| Running, CPU 0%, no output | Blocked | Step 3 → find what it's waiting on |
| Running, CPU 100%, no output | Loop | Step 3 → find the loop condition |
| Process gone, partial output | OOM / timeout kill | Step 4 → check system logs |
| Running, output is wrong | Logic error | Step 5 → trace execution |
Write down which mode you're in. It determines the next 10 minutes.
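The table is mechanical enough to encode. A sketch of the same triage as a function (the mode names and the 90% CPU cutoff are my labels, not a standard):

```python
def classify(process_running, cpu_percent, has_output, output_correct=True):
    """Map the Step 1 observations onto a failure mode (mirrors the table above)."""
    if not process_running:
        # dead process: partial output suggests the OS killed it mid-run
        return "oom-or-timeout-kill" if has_output else "crash"
    if has_output and not output_correct:
        return "logic-error"
    # alive but silent: busy CPU means a loop, idle CPU means blocked
    return "loop" if cpu_percent > 90 else "blocked"
```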
# Check for stale lockfiles (some tools lock via mkdir, so check directories too)
find /workspace -name "*.lock" -type d
find /workspace -name "*.lock" -type f -printf "%T@ %p\n" | sort -n
# (-printf is GNU find; on macOS use: find /workspace -name "*.lock" -exec stat -f "%m %N" {} +)
# A lockfile >10 min old during an active cycle is almost certainly stale.
# Safe to remove:
rm -rf /workspace/path/to/resource.lock
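Rather than removing stale locks by hand at 2 AM, bake the cleanup into agent startup. A sketch of a hypothetical `clean_stale_locks` helper (the 10-minute threshold matches the rule of thumb above; adjust to your cycle length):

```python
import os
import shutil
import time

def clean_stale_locks(root, stale_after=600):
    """Remove *.lock files and directories untouched for stale_after seconds.
    Run at agent startup so a crashed cycle never wedges the next one."""
    removed = []
    now = time.time()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in list(dirnames) + filenames:
            if not name.endswith(".lock"):
                continue
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) <= stale_after:
                continue  # recent enough to be a live lock
            if name in dirnames:
                shutil.rmtree(path, ignore_errors=True)
                dirnames.remove(name)  # don't descend into the removed dir
            else:
                os.remove(path)
            removed.append(path)
    return removed
```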
# Is it waiting on a hung API call?
# Check open connections for the process PID
lsof -p <PID> -i | grep ESTABLISHED
A connection to api.anthropic.com or similar held for 10+ minutes = hung API call. The agent needs a timeout — it doesn't have one. Kill and restart with a timeout added.
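When the client library doesn't expose a timeout parameter, you can still bound any blocking call from the outside. A sketch using a worker thread (the `call_with_timeout` name is mine; note the honest caveat that Python can't kill the hung thread, but the caller regains control):

```python
import concurrent.futures

def call_with_timeout(fn, *args, timeout=30, **kwargs):
    """Run a blocking call in a worker thread; give up after `timeout` seconds.
    The hung worker can't be killed, but the caller gets control back and can
    log, alert, and decide what to do instead of hanging forever."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)  # don't block waiting on a hung worker
```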
# Is it waiting on a file that will never arrive?
lsof -p <PID> | grep -E "REG|FIFO"
wc -l /workspace/agents/agentname/queue/tasks.jsonl
The most common infinite-loop causes: a retry with no max attempts, a poll on an empty queue with no exit condition, and a wait on a condition that nothing will ever change.
# Is it generating output anywhere?
tail -f /workspace/memory/$(date +%Y-%m-%d).md
Sometimes a looping agent IS writing — it's just writing the wrong thing in a loop. Watching the daily memory file live often reveals the exact retry pattern.
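The fix for the retry pattern is always the same: cap the attempts and back off between them. A minimal sketch, assuming a zero-argument callable:

```python
import time

def retry(fn, max_attempts=3, base_delay=1.0):
    """Call fn, retrying with exponential backoff; fail loudly after
    max_attempts instead of looping forever."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # surface the real error to logs instead of retrying silently
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The `raise` on the last attempt is the point: a capped retry turns a silent loop into a visible crash with a stack trace.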
# OpenClaw session logs
openclaw logs --tail 100
# Find log files directly
find ~/.openclaw -name "*.log" -mmin -120 -type f | xargs tail -n 50
# macOS system logs — OOM kills
log show --predicate 'eventMessage contains "killed"' --last 2h
# Linux
journalctl -k --since "2 hours ago" | grep -i "killed\|oom"
dmesg | tail -50 | grep -i "killed\|oom"
An OOM kill will show the process name and memory at kill time. Fix: stop loading entire files into memory. Stream or chunk instead.
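"Stream or chunk" in practice: process the file in fixed-size pieces so memory stays flat regardless of file size. A sketch counting lines in 1 MiB chunks:

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newlines reading 1 MiB at a time; memory use is constant
    no matter how large the file is."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

The same loop shape works for hashing, grepping, or feeding a parser: read a chunk, process it, discard it.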
# Check for crash output
cat /tmp/agentname-stderr.log 2>/dev/null || echo "No stderr log found"
A missing stderr log is itself data — the crash happened before the process could write anything, pointing to import errors, syntax errors, or permission issues at startup.
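The prevention is to capture stderr at launch time, before the agent's own logging exists. A sketch of a hypothetical launcher (the `launch_agent` name and stderr path are illustrative):

```python
import subprocess

def launch_agent(cmd, stderr_path):
    """Start the agent with stderr appended to a file, so even a crash during
    import (before any logger is configured) leaves evidence behind."""
    with open(stderr_path, "ab") as err:
        # the child keeps its own copy of the fd after Popen returns
        return subprocess.Popen(cmd, stderr=err)
```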
# Read the last daily memory file
cat /workspace/memory/$(date +%Y-%m-%d).md
# Check for partial output files
find /workspace -name "*.partial" -o -name "*.tmp" -o -name "*.wip" | xargs ls -la 2>/dev/null
Agents that write-then-rename (good practice) leave .tmp files on crash. These tell you exactly what the agent was writing when it died. The last tool call before silence is usually the point of failure.
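The write-then-rename pattern itself is short enough to show. A sketch, assuming a text payload and a POSIX filesystem where `os.replace` is atomic:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write to a .tmp file in the target directory, then rename into place.
    A crash mid-write leaves a telltale .tmp file; the target is never corrupt."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk before the rename
        os.replace(tmp, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within one filesystem.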
Before restarting, confirm all of these:
- You've identified the failure mode and recorded what you found in `memory/YYYY-MM-DD.md`

Only then restart:
# Clear stale state
find /workspace -name "*.lock" -mmin +10 -exec rm -rf {} +
# Restart the agent
openclaw restart agentname
Every stuck agent incident has a preventable root cause. The five most common and their fixes:
| Root Cause | Fix |
|---|---|
| API call with no timeout | Add timeout=30 to every external call |
| Retry loop with no max | Add max_attempts=3 to every retry |
| Stale lockfile from crash | Add stale lock cleanup at agent startup |
| OOM from loading large file | Stream or chunk file reads |
| Waiting on empty queue | Add queue empty check before blocking read |
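For the last fix, the shape is: check before you block, and return a sentinel when there's nothing to do. A sketch against a JSONL task queue like the one checked in Step 3 (the `next_task` name is mine):

```python
import json
import os

def next_task(queue_path):
    """Peek the first task in a JSONL queue, returning None when the queue
    file is missing or empty instead of blocking on it."""
    if not os.path.exists(queue_path) or os.path.getsize(queue_path) == 0:
        return None
    with open(queue_path) as f:
        line = f.readline().strip()
    return json.loads(line) if line else None
```

A `None` return lets the agent sleep and re-check on its own schedule, which is interruptible; a blocking read is not.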
Apply the fix immediately. Then write it to MEMORY.md so future-you doesn't debug the same thing again. The goal isn't to never have stuck agents. It's to never be confused about why.
Debugging is downstream of prevention. Timeouts on every external call, capped retries, lock cleanup at startup, streamed reads, and a heartbeat you actually monitor catch these issues before they become incidents. No special tooling required: everything you need is already on the box the agent runs on.