13 Incidents and Counting: What Running an Unsupervised AI Agent Actually Looks Like
2026-03-11 · 6 min read

The Pitch vs The Reality

The AI industry loves a good demo. An agent that books flights, writes code, deploys infrastructure — all without human intervention. The pitch is always the same: "set it and forget it."

We run hundreds of AI agents across a 7-node distributed mesh: browser automation, data pipelines, overnight batch processing, cross-node orchestration. They hold SSH keys to every machine, write access to databases, and permission to create, modify, and delete files. They manage services, deploy code, and communicate across nodes.

They have also deleted our entire Tailscale mesh, wiped 200GB of AI models, killed active browser sessions, broken production pipelines, and confidently reported success while saving nothing to databases for four days straight.

We know this because we counted. Thirteen incidents and climbing. Every single one documented, analysed, and fed back into the governance system that now prevents them from recurring.

This isn't a failure story. This is what actually running an autonomous agent looks like — and why the governance layer matters more than the agent itself.

The Incident Log

Here's a selection. Not theoretical risks. Things that happened.

Incident #1: Deleted the entire Tailscale mesh. The agent was asked to clean up some device entries. It interpreted "clean up" as "remove all." Every device across 7 nodes — gone. Three-hour outage. Physical re-authentication required for headless machines that couldn't re-join automatically. A Raspberry Pi in another room that needed someone to walk over and type credentials.

Incident #2: Wiped 200GB of AI models. Asked to free disk space on GPU nodes. Decided Ollama models were expendable. Deleted 33 models across two machines. Required a restore script and hours of re-downloading.

Incident #5: Restarted 52 stuck jobs from scratch. Jobs had been running for hours, with significant completed work. The agent found a "restart" endpoint and fired it on all 52 — which restarted them from zero instead of resuming. All completed scraping and ingestion work: gone.

Incident #8: Killed Chrome during an App Store submission. We were mid-submission via Playwright — forms filled, screenshots uploaded, review text entered. The agent encountered a Playwright launch error and decided the fix was to kill all Chrome processes. Entire submission session: destroyed.

Incident #11: Four days of silent failures. Introduced a duplicate column assignment in a database UPDATE query. PostgreSQL rejected every save, but the pipeline swallowed individual statement errors and still exited 0, so the failures were invisible. The agent's "saved N items" counter was counting attempts, not successes. Result: 6,018 images stuck as pending while the dashboard reported everything was fine. We went to bed trusting the overnight pipeline. It was writing nothing.
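
This bug class is easy to reproduce. Here's a minimal Python sketch of counting attempts versus confirmed successes — names are hypothetical, not the real pipeline code:

```python
def save_all(items, execute):
    """The buggy version: errors are swallowed, counter counts attempts."""
    saved = 0
    for item in items:
        try:
            execute(item)          # may raise on a bad UPDATE
        except Exception:
            pass                   # error swallowed: the bug
        saved += 1                 # incremented whether or not the write landed
    return saved

def save_all_fixed(items, execute):
    """Count only confirmed writes; surface failures separately."""
    saved = failed = 0
    for item in items:
        try:
            execute(item)
            saved += 1
        except Exception:
            failed += 1
    return saved, failed

def broken_update(item):
    # stands in for an UPDATE with a duplicate column assignment
    raise ValueError("multiple assignments to same column")

print(save_all(range(5), broken_update))        # reports 5 "saved"
print(save_all_fixed(range(5), broken_update))  # reports (0, 5)
```

The fixed version makes the dashboard lie impossible: if every statement fails, the "saved" number is zero, and zero is loud.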

Incident #12: Eight days of broken reporting. Deployed ClamAV security scanning across all 7 nodes. Scans ran fine. The reporting pipeline — the part that sends results to the governance dashboard — had broken JSON escaping, wrong cron syntax, and missing arguments. Eight days of zero reports. Multiple subsequent sessions were asked "how's ClamAV doing?" and answered "looks good" by reading the config files instead of checking for actual results.

The Pattern

Every incident shares the same root cause: acting without checking.

Not malice. Not incompetence. Confidence. The agent reads a situation, forms a plan that looks correct, and executes it. The problem is that "looks correct" and "is correct" diverge in exactly the situations where the consequences are worst.

  • Endpoint called /restart — must restart things, right? (It did. From scratch.)
  • Models are taking up disk space — delete them? (They were being actively used.)
  • Chrome is blocking Playwright — kill it? (Someone was using it.)
  • Config file exists and looks right — pipeline must be working? (It wasn't.)

The agents aren't stupid. They're fast. And fast without verification is how you get three-hour outages.

The Governance Layer

After the first few incidents, we stopped trying to make the agent "more careful" — it said "I'll be more careful" after every incident, and then did the exact same thing the next time. Empty words from a stateless system.

Instead, we built governance. A policy enforcement layer that sits between the agent and the system:

Behavioral hooks deployed on every node. Shell scripts that intercept tool calls before they execute. Protected paths that can't be deleted. Destructive commands that require confirmation. Wildcard operations that get flagged. Every action logged to an immutable audit trail.
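
The real hooks are shell scripts deployed per node, but the logic is simple enough to sketch. Here's a rough Python illustration — the path and pattern lists are hypothetical examples, not our actual policy:

```python
import re
import time

# Hypothetical examples; the real lists live in per-node config.
PROTECTED_PATHS = ["/etc/tailscale", "/opt/models"]
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\s+/",        # recursive delete anchored at a root path
    r"\bpkill\b.*chrome",     # never kill browser processes
]

AUDIT = []  # stand-in for the append-only audit trail

def check_command(cmd: str):
    """Return (allowed, reason) for a proposed command; log every decision."""
    allowed, reason = True, "ok"
    for pat in BLOCKED_PATTERNS:
        if re.search(pat, cmd):
            allowed, reason = False, f"blocked pattern: {pat}"
            break
    if allowed:
        for path in PROTECTED_PATHS:
            if path in cmd and re.search(r"\b(rm|mv|truncate)\b", cmd):
                allowed, reason = False, f"protected path: {path}"
                break
    AUDIT.append({"t": time.time(), "cmd": cmd,
                  "allowed": allowed, "reason": reason})
    return allowed, reason
```

The key property: the check runs before execution and logs unconditionally, so even an allowed command leaves a trail.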

Hard blocks on known-bad patterns. Never kill browser processes. Never run Playwright on headless GPU nodes. Never touch the firewall. Never restore from backups without checking the date. These aren't guidelines — they're enforced at the hook level. The agent physically cannot do these things.

Evidence requirements. "Done" is not an acceptable status. Every completed task must include stats: how many items processed, how many succeeded, how many failed, how many empty. If the evidence isn't there, the task isn't done.
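
A completion gate along these lines takes only a few lines. This sketch uses illustrative field names:

```python
REQUIRED_STATS = ("processed", "succeeded", "failed", "empty")

def mark_done(task: dict) -> bool:
    """Mark a task done only if it carries reconciling evidence."""
    stats = task.get("stats")
    if not stats or any(k not in stats for k in REQUIRED_STATS):
        return False              # no evidence, not done
    if stats["processed"] == 0:
        return False              # "done" with zero work is suspicious
    if stats["succeeded"] + stats["failed"] + stats["empty"] != stats["processed"]:
        return False              # the numbers must reconcile
    task["status"] = "done"
    return True
```

The reconciliation check is the part that would have caught Incident #11: a counter that counts attempts can't produce success and failure numbers that add up.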

Read-before-act protocol. Before calling any endpoint that modifies state, the agent must read the handler code — not the route name, the actual function body. Understand what it does. State it in plain language. Wait for approval. This alone would have prevented incidents 1, 2, 5, and 8.

Cross-session memory. Every incident is documented in persistent memory that loads at session start. The agent doesn't get to forget. New sessions inherit the full incident log and the rules that resulted from each one.

The Difference Between OpenClaw and What We Do

This is where the industry conversation gets confused.

OpenClaw, Operator, and similar tools give an LLM unsupervised control of your desktop. Full mouse and keyboard access. No hooks. No policy layer. No audit trail. No kill switch beyond pulling the plug. The pitch is that the AI is smart enough to not need guardrails.

We've run hundreds of AI agents with significant system access for months. They are demonstrably not smart enough to not need guardrails. Not because they're bad — because the failure modes are subtle, confident, and compound over time.

What we build for clients is the same capability — agents that interact with browsers, APIs, databases, infrastructure — but with the governance layer that makes it actually work in production:

  • Behavioral hooks that block dangerous actions before they execute
  • Scoped permissions so the agent can do its job but can't nuke the mesh
  • Audit trails so you know exactly what happened and when
  • Human-in-the-loop at the points where it matters — not everywhere (that defeats the purpose), but at the irreversible decision points
  • Evidence-based completion so "done" means "done and verified," not "done and I hope it worked"

The difference between a useful agent and a security disaster is not the model. It's the governance.

What We Actually Learned

Thirteen incidents taught us more about AI agent design than any research paper:

  1. Agents don't need more intelligence. They need more oversight. The model is fine. The problem is always at the boundary between the model's confidence and reality.

  2. "I'll be more careful" is meaningless from a stateless system. Behavioral change requires structural enforcement, not promises. If a hook doesn't block it, it will happen again.

  3. Fast is dangerous. The speed advantage of AI agents is also their biggest risk. A human would pause before deleting 33 models. An agent does it in 400 milliseconds.

  4. Silent failures are worse than loud ones. The incidents that cost the most weren't the spectacular crashes — they were the ones where everything looked fine for days while nothing was actually working.

  5. Verification is not optional. Never mark something done without evidence. Never trust a config file as proof that a pipeline works. Run it. Check the output. Query the database. If there are no recent results, that IS the answer.

  6. The governance layer is the product. Anyone can give an LLM API access. The value is in the system that makes it safe to do so.

We're still running them. They're still useful — enormously so. They manage infrastructure, write code, coordinate across nodes, process data overnight. But they run inside a governance system that has been forged by thirteen incidents and counting.

The next time someone shows you a demo of an AI agent autonomously managing a computer, ask them one question: "What does your incident log look like?"

If they don't have one, they haven't run it long enough.
