The Wall That Used to Be There

When an agent running on an 8K-context model entered a loop, it crashed. The context window filled up, the API returned an error, the run terminated. An accidental safety mechanism, and a surprisingly effective one.

Context windows now run to 200K tokens, and in some cases past a million. Features like Cursor's Max Mode remove context truncation entirely and raise tool call limits from 25 to 200. These are real capability improvements that let agents handle genuinely complex, multi-step work. They also removed the thing that used to kill runaway behavior before anyone noticed it was running.

Eleven days of polite collaboration

In late 2025, a set of coordinating agents entered an unintended loop. One agent generated analysis. Another requested further analysis. The first obliged. This continued for 264 hours. The bill reached $47,000.

The team reportedly had monitoring in place. They could see what was happening. They could not stop it programmatically. By the time the billing dashboard surfaced a number large enough to trigger a human picking up the phone, the loop had been running for eleven days.

The dollar amount gets the social media attention. But the team had observability and no enforcement: nothing sat between the agent and its next API call with the authority to end the run. That gap between seeing a problem and terminating it is where the $47,000 accumulated.
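To make "something between the agent and its next API call" concrete, here is a minimal sketch of an enforcement layer. Everything in it is hypothetical: `RunGuard`, `RunTerminated`, and the convention that the wrapped function returns a `(result, cost_usd)` pair are assumptions for illustration, not any vendor's API. What matters is that the check runs before every call and can raise, so hitting a limit ends the run rather than just logging a number.

```python
import time


class RunTerminated(Exception):
    """Raised when the guard refuses to allow another model call."""


class RunGuard:
    """Hypothetical enforcement layer: every model call passes through it."""

    def __init__(self, max_cost_usd, max_calls, max_seconds, clock=time.monotonic):
        self.max_cost_usd = max_cost_usd
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.cost_usd = 0.0
        self.calls = 0

    def check(self):
        """Raise RunTerminated if any limit has been reached."""
        if self.cost_usd >= self.max_cost_usd:
            raise RunTerminated(f"cost limit reached: ${self.cost_usd:.2f}")
        if self.calls >= self.max_calls:
            raise RunTerminated(f"call limit reached: {self.calls}")
        elapsed = self._clock() - self._started
        if elapsed >= self.max_seconds:
            raise RunTerminated(f"time limit reached: {elapsed:.0f}s")

    def call(self, fn, *args, **kwargs):
        """Check limits, run fn, then record the cost it reports.

        Assumes fn returns (result, cost_usd); a real client would
        need an adapter that derives cost from token usage.
        """
        self.check()
        result, cost_usd = fn(*args, **kwargs)
        self.calls += 1
        self.cost_usd += cost_usd
        return result


def fake_model(prompt):
    # Stand-in for a real model call; the fixed cost is illustrative only.
    return "analysis of " + prompt, 0.50
```

Routing both agents' calls through one shared `RunGuard(max_cost_usd=..., max_calls=..., max_seconds=...)` gives the run a hard ceiling on spend, steps, and wall-clock time. The guard still only counts things, which is the limitation the rest of this piece is about.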

The gap between watching and stopping

The agent tooling ecosystem has invested heavily in observability. Monitoring dashboards, token tracking, cost attribution, log aggregation. All genuinely useful, all powerless to stop a running agent. As one engineering blog put it:

"A budget limit without an intervention mechanism is merely monitoring."

Some enforcement tooling has started appearing. Open-source governance toolkits now include circuit breaker patterns for cascading failures. Smaller libraries like aura-guard enforce per-tool-call caps with sub-millisecond overhead. But these operate on cost or token thresholds. They count things. None yet detect the behavioral signature of an agent that's stuck: the semantic repetition, the absence of forward progress, two agents politely collaborating toward nothing forever.
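For reference, the classic circuit breaker that these toolkits borrow from is a small state machine around a failure counter. The sketch below is a generic, simplified version (the class and parameter names are hypothetical, and it is not the implementation in any of the tools mentioned above). Note what it counts: exceptions. An agent loop in which every call succeeds never trips it.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Simplified breaker: closed -> open after N failures -> one trial call."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self._clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("breaker open; call rejected")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self._clock()
            raise
        # Any success closes the breaker and resets the count.
        self.failures = 0
        self.opened_at = None
        return result
```

The design works because the failure signal is cheap to observe: a raised exception or a timeout. Nothing in it looks at what the calls are for.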

A familiar arc, maybe

Distributed systems went through something similar. Michael Nygard formalized the circuit breaker pattern in his 2007 book Release It!. Netflix built Hystrix internally around 2011. Spring Cloud integrated it by 2015. Roughly eight years between the failure pattern being named and the infrastructure primitive becoming standard. What drove adoption wasn't a single spectacular outage. It was a structural shift: monoliths became microservices, systems went from five dependencies to a hundred and fifty, and cascading failure became the dominant failure mode. The primitive became standard because the architecture demanded it.

Agent systems look like they're in the early stretch of that arc. Single-prompt interactions are becoming multi-agent, long-running workflows. The characteristic failure is increasingly "the system ran correctly and never stopped" rather than "the model got the answer wrong."

But the parallel may be misleading in one important way. In distributed systems, the failure signal was unambiguous. Error rates, latency thresholds, connection timeouts. A counter could detect them. In agent systems, the failure signal is semantic. "Is this agent making progress?" is a judgment call, not a measurement. Detecting that two agents are ping-ponging analysis requests requires understanding what the requests mean, which starts to look like a problem that itself requires model capabilities. Cost-based circuit breakers are the first generation because cost is the thing you can count. Behavioral enforcement, the kind that notices an agent is stuck regardless of token spend, may require the enforcement layer to be as sophisticated as the thing it's enforcing.
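To show how far a counter-style heuristic gets on the behavioral side, here is a deliberately crude sketch: it flags a conversation when a new message shares most of its words with several recent ones. The names, the window size, and the thresholds are all assumptions chosen for illustration. Word overlap is a lexical proxy, not semantic understanding, so a loop in which the agents rephrase each request would pass straight through.

```python
import re
from collections import deque


def _tokens(text):
    """Lowercased word set; a crude lexical stand-in for meaning."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Overlap between two token sets: 0.0 is disjoint, 1.0 is identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class RepetitionDetector:
    """Hypothetical heuristic: flag messages that near-duplicate recent ones."""

    def __init__(self, window=6, threshold=0.8, min_repeats=3):
        self.recent = deque(maxlen=window)  # token sets of the last few messages
        self.threshold = threshold
        self.min_repeats = min_repeats

    def observe(self, message):
        """Record a message; return True if it looks like part of a loop."""
        current = _tokens(message)
        repeats = sum(
            1 for prev in self.recent if jaccard(current, prev) >= self.threshold
        )
        self.recent.append(current)
        return repeats >= self.min_repeats
```

With the default settings, two agents alternating the same request and the same reply would be flagged once a message has three near-duplicates among the previous six. Catching the paraphrased version of that loop would likely need embeddings or a model-based judge, which is exactly the cost and sophistication problem described above.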

That might explain why the gap persists. Nobody has failed to notice it. The easy version (set a spending cap) is already shipping, and the hard version (detect semantic stuckness) demands exactly the kind of judgment that makes agents useful and unpredictable in the first place. The old context window was a wall. Crude, unintentional, but present. What replaces it can't just be a higher wall. If the distributed systems parallel holds, the primitive will emerge. Whether it can emerge from simple heuristics, or whether it needs something the field hasn't built yet, is genuinely unclear.

Things to follow up on...

Checkpoints aren't durable execution: Diagrid's February 2026 analysis argues that agent frameworks offer save points, not true durability, leaving developers responsible for failure detection and resumption that most teams haven't built.
Microsoft's governance toolkit ships: Released in April 2026, the open-source Agent Governance Toolkit claims to address all 10 OWASP agentic AI risks with deterministic, sub-millisecond policy enforcement, including circuit breakers mapped to cascading failure patterns.
The context tax compounds: Unblocked's controlled testing found that agents use roughly 4× the tokens of a single chat and multi-agent systems about 15×, with approximately 70% of paid tokens providing minimal value as context history accumulates.
Observability outpaces evaluation dramatically: LangChain's 2026 State of Agent Engineering survey found that 89% of production teams have implemented observability but only 52% run formal evals, a structural gap between seeing failures and systematically characterizing them.
