The Crash That Stopped Coming and the Engineer Who Had to Replace It

Dag Halvorsen is a platform engineer at a mid-size logistics company who, until recently, was sleeping through the night. He spent six years on Kubernetes autoscaling before moving into agent orchestration, a transition he describes as "going from herding cattle to herding cats that occasionally hallucinate." We spoke over video call. A Post-it note stuck to his monitor read "RAN ≠ SUCCEEDED" in red marker. It had clearly been there a while.

Dag is not a real person, though the engineering problems he describes are drawn from documented production incidents, public framework limitations, and published practitioner analysis. His name is a coincidence, he insists, and has nothing to do with directed acyclic graphs.

You've said the moment everything changed was a model migration. Walk me through it.

Dag: We were running agents on a 128K-token model. Standard setup: task decomposition, tool calls, loop until done. Agents would occasionally get stuck in loops, but honestly? The problem was self-correcting. Context window fills up, the API throws a length error, the agent crashes, we get an alert, someone looks at it. Ugly, but terminal.

Then we migrated to a model with a million-token window.¹

First overnight batch run, I wake up, check the dashboard, and there's a single agent session that's been running for nine hours. No errors. No crashes. Just... running. Accumulating tokens at an exponential rate because each step appends its output to the context for the next step.² The bill for that one session was a car payment.

A car payment.

Dag: A used car payment. But the dollar amount isn't the point. The context-length error had been functioning as a loop terminator. Nobody designed it that way. Nobody documented it as a safety mechanism. It was an accident of constraints. And when the constraint disappeared, the safety disappeared with it, and nobody noticed until the invoice arrived.

So you set a token limit and moved on?

Dag: That's the easy problem. Hard cap on tokens per session, hard cap on steps per run, done by lunch. The hard problem is everything that comes after you decide to enforce something.
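A minimal sketch of the two hard caps Dag mentions, assuming the caller can report token usage after each step. The names `SessionBudget`, `BudgetExceeded`, and `charge` are hypothetical and not taken from any framework.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a session crosses one of its hard caps."""


@dataclass
class SessionBudget:
    max_tokens: int
    max_steps: int
    tokens_used: int = 0
    steps_taken: int = 0

    def charge(self, tokens: int) -> None:
        """Record one finished step and fail fast if either cap is crossed."""
        self.steps_taken += 1
        self.tokens_used += tokens
        if self.steps_taken > self.max_steps:
            raise BudgetExceeded(f"step cap {self.max_steps} exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token cap {self.max_tokens} exceeded")


# Hypothetical per-step token usage; a real loop would read this from the API response.
budget = SessionBudget(max_tokens=1_000, max_steps=10)
stop_reason = None
try:
    for step_tokens in [400, 400, 400]:
        budget.charge(step_tokens)
except BudgetExceeded as exc:
    stop_reason = str(exc)
```

Because usage is charged after a step completes, this version can overshoot the token cap by one step. A stricter variant would estimate the next step's cost and refuse to start it.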

Take LangGraph, which a lot of teams build on. LangGraph saves your graph state to a checkpoint store at each superstep. Good. But if your process crashes (not the agent's reasoning, the actual process hosting it), nobody knows. No watchdog. No heartbeat. No supervisor checking whether your workflow is still alive.³ A crashed agent just sits there. Silent. Until a human notices.

In Kubernetes, if a pod stops responding to liveness probes, the system restarts it. That's been table stakes for a decade. Agent frameworks don't have liveness probes. They have checkpoints, which tell you where you were, not whether you're still going.
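The liveness idea can be sketched as a heartbeat table checked by a supervisor. This is an illustration only: `HeartbeatWatchdog` and its methods are hypothetical names, and a real deployment would need the supervisor to run outside the agent's process and keep heartbeats in shared storage, since an in-process watchdog dies with the process it is watching.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Tracks the last heartbeat per run and reports runs that have gone quiet."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat: dict[str, float] = {}

    def beat(self, run_id: str) -> None:
        """Called by the agent host after each step to signal it is still alive."""
        self._last_beat[run_id] = self._clock()

    def stalled(self) -> list[str]:
        """Return run ids whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            run_id
            for run_id, last in self._last_beat.items()
            if now - last > self.timeout_s
        )


# Demo with a fake clock so the behaviour does not depend on wall time.
fake_now = [0.0]
watchdog = HeartbeatWatchdog(timeout_s=30, clock=lambda: fake_now[0])
watchdog.beat("run-a")   # last seen at t=0
fake_now[0] = 20.0
watchdog.beat("run-b")   # last seen at t=20
fake_now[0] = 40.0
stalled_runs = watchdog.stalled()
```

A checkpoint answers "where was this run?"; the heartbeat answers "is anything still driving it?", which is the question the supervisor needs before it restarts or alerts.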

So you're building the missing layer yourself.

Dag: And the missing layer is where all the interesting problems live.

Here's the one I can't stop thinking about: what counts as a failure? In a normal system, failure has a shape. HTTP 500. Timeout. Exception. In an agent system, the model returns HTTP 200 while confidently producing garbage. One study found that somewhere between 41% and 87% of agent failures come back as successful HTTP responses.⁴

So if I wire my circuit breaker to status codes, which is how every circuit breaker library works, it stays closed while the agent hallucinates its way through my entire workflow.

That means you need semantic failure detection.

Dag: Right, and now you're asking the enforcement layer to understand what the agent is doing. Which means you need a second model evaluating the first model. Or you define your termination conditions so precisely, before the agent ever runs, that you can check them mechanically. One practitioner put it well: termination logic is architecture, not an afterthought.⁵ You need to know what "done" looks like in measurable terms before you write a single line of agent code.
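One way to read "check them mechanically" is a breaker that counts a response as a success only when it passes a validator defined before the agent runs. The sketch below assumes a hypothetical `has_tracking_id` check and a simplified breaker with no half-open state. It is not the API of any circuit breaker library.

```python
from typing import Callable


class SemanticCircuitBreaker:
    """Opens after N consecutive results that fail a semantic check, whatever the HTTP status."""

    def __init__(self, is_valid: Callable[[dict], bool], failure_threshold: int = 3):
        self.is_valid = is_valid
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.open = False

    def record(self, status_code: int, payload: dict) -> bool:
        """Return True if the result counts as a success; trip the breaker otherwise."""
        ok = status_code == 200 and self.is_valid(payload)
        if ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open = True
        return ok


def has_tracking_id(payload: dict) -> bool:
    """Hypothetical definition of 'done' for a shipment step: a non-empty tracking id."""
    value = payload.get("tracking_id")
    return isinstance(value, str) and value.strip() != ""


breaker = SemanticCircuitBreaker(is_valid=has_tracking_id, failure_threshold=3)
for _ in range(3):
    # HTTP 200 every time, but the payload is useless.
    breaker.record(200, {"tracking_id": ""})
```

A breaker keyed only on status codes would have counted all three calls as successes and stayed closed.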

Even then you hit loop detection. Hash-based detection catches exact repetition. Agent varies its query parameters slightly each iteration? Different hash. Same loop. No detection.⁶
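The gap between exact and near-duplicate detection can be shown with two fingerprints over the same tool calls. The normalization rules here (lowercasing, collapsing whitespace, dropping keys the caller marks as volatile) are assumptions for illustration; which arguments are safe to ignore depends on the tool, and ignoring a pagination key would misclassify legitimate paging as a loop.

```python
import hashlib
import json
from collections import Counter


def exact_fingerprint(tool: str, args: dict) -> str:
    """Hash the call exactly as issued."""
    blob = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def normalized_fingerprint(tool: str, args: dict, volatile_keys: frozenset) -> str:
    """Hash the call after dropping volatile keys and normalizing string values."""
    stable = {}
    for key, value in args.items():
        if key in volatile_keys:
            continue
        if isinstance(value, str):
            value = " ".join(value.lower().split())
        stable[key] = value
    return exact_fingerprint(tool, stable)


def is_looping(fingerprints: list, max_repeats: int = 3) -> bool:
    """Flag a loop when any fingerprint appears max_repeats times or more."""
    return any(count >= max_repeats for count in Counter(fingerprints).values())


# Three calls that differ only cosmetically (hypothetical tool and arguments).
calls = [
    ("search", {"q": "Late shipments", "page": 1}),
    ("search", {"q": "late  shipments", "page": 2}),
    ("search", {"q": "LATE shipments", "page": 3}),
]
exact = [exact_fingerprint(tool, args) for tool, args in calls]
normalized = [normalized_fingerprint(tool, args, frozenset({"page"})) for tool, args in calls]
exact_detects = is_looping(exact)
normalized_detects = is_looping(normalized)
```

Normalization narrows the gap but does not close it: an agent that rephrases its query each time still produces distinct fingerprints, which is why a step cap remains the backstop.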

You mentioned the enforcement layer creates its own problems.

Dag: This took me weeks to fully internalize.

When you kill an agent mid-execution, you've guaranteed partial completion. Step one created an order. Step two was supposed to charge the card. You killed it between steps. Now you have an orphan order in the database.

Retry the whole chain? Duplicate order. Skip step one on retry? Only works if you know step one succeeded, and "succeeded" is different from "ran."⁷

This is the Saga pattern from distributed transactions, solved since the '90s. Each step gets a compensating action, a rollback. If step three fails, you fire compensations for steps two and one in reverse order.⁸ Payments infrastructure has done this forever. Idempotency keys, deduplication tables, the whole stack.
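A minimal sketch of the compensation order described here, assuming each step is a pair of plain callables. The step names and the `run_saga` helper are hypothetical; a production saga would also persist its progress so that compensation survives a crash of the coordinator itself.

```python
from typing import Callable

# (name, action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[], None]]] = []
    try:
        for name, action, compensate in steps:
            action()
            completed.append((name, compensate))
            log.append(f"did:{name}")
    except Exception:
        # The failing step never completed, so only earlier steps are rolled back.
        for name, compensate in reversed(completed):
            compensate()
            log.append(f"undid:{name}")
    return log


def fail() -> None:
    raise RuntimeError("carrier API rejected the shipment")


def noop() -> None:
    pass


saga_log = run_saga([
    ("create_order", noop, noop),
    ("charge_card", noop, noop),
    ("book_shipment", fail, noop),  # step three fails
])
```

Here `saga_log` records both forward steps followed by their compensations, with `charge_card` undone before `create_order`.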

But here's the wrinkle with agents: traditional idempotency assumes deterministic replay. Same input, same output. LLMs don't do that. You retry with the same prompt, you might get different tool selections, different arguments, a completely different reasoning path.⁹ So you can't just replay. You have to cache the decisions on first execution and replay from the cache, which means your enforcement layer now includes a decision-replay store, which is basically a workflow engine.
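The decision-replay idea can be sketched as a log keyed by run and step: the model is consulted only the first time a step is reached, and a retry reads the recorded decision instead of asking again. `DecisionLog` and `flaky_model_choice` are hypothetical, and the in-memory dict stands in for what would need to be durable storage.

```python
import random
from typing import Callable


class DecisionLog:
    """Records the first decision made at each (run, step) and replays it on retry."""

    def __init__(self) -> None:
        self._decisions: dict[tuple[str, int], dict] = {}

    def decide(self, run_id: str, step: int, choose: Callable[[], dict]) -> dict:
        key = (run_id, step)
        if key not in self._decisions:
            # First execution: consult the (non-deterministic) chooser and record it.
            self._decisions[key] = choose()
        # Retry: return the recorded decision without calling the chooser again.
        return self._decisions[key]


def flaky_model_choice() -> dict:
    """Stand-in for an LLM call that may pick a different tool each time."""
    return {
        "tool": random.choice(["create_order", "lookup_order"]),
        "nonce": random.random(),
    }


decisions = DecisionLog()
first = decisions.decide("run-42", 1, flaky_model_choice)
retry = decisions.decide("run-42", 1, flaky_model_choice)
```

The retry returns the same recorded object as the first call, so the tool selection and arguments cannot drift between attempts.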

At which point you have to ask: if the correct behavior requires deterministic execution, recorded decisions, compensating transactions, and mechanical termination checks... did I need an agent here at all? Or did I need a workflow engine with an LLM in one node?

That sounds like a crisis of faith.

Dag: [laughs] More like a refinement of scope. The agent's superpower is dynamic replanning. Open-ended exploration. Handling the unexpected. But the enforcement layer's entire job is to suppress dynamic replanning when the path is already defined.

Those two things are in tension. And the enforcement layer has to win, every time, or you get the car payment.

What's the Post-it note about?

Dag: [glances at monitor] Daily reminder. A completion record that says a step "ran" tells you nothing about whether it succeeded. A tool call that returned garbage is not semantically complete, even if it executed without errors.¹⁰ Every piece of retry logic, every circuit breaker, every compensation trigger depends on that distinction. Most frameworks don't make it.
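The distinction on the Post-it can be made explicit by recording two separate facts per step. In this sketch, `StepRecord`, `execute`, and the verifier are hypothetical; the point is only that retry logic consults `succeeded`, never `ran`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepRecord:
    name: str
    ran: bool = False        # the call returned without raising
    succeeded: bool = False  # the output also passed a semantic check


def execute(name: str, call: Callable[[], object], verify: Callable[[object], bool]) -> StepRecord:
    """Run a step and record 'ran' and 'succeeded' separately."""
    record = StepRecord(name)
    try:
        output = call()
    except Exception:
        # Raised before returning: neither flag is set, and side effects are unknown.
        return record
    record.ran = True
    record.succeeded = verify(output)
    return record


def safe_to_skip_on_retry(record: StepRecord) -> bool:
    """Only a verified success may be skipped when the chain is retried."""
    return record.succeeded


# A tool call that 'ran' but returned an unusable response (hypothetical payload).
charge = execute(
    "charge_card",
    call=lambda: "<html>Bad Gateway</html>",
    verify=lambda out: isinstance(out, dict) and out.get("status") == "charged",
)
```

Here `charge.ran` is true while `charge.succeeded` is false, so a retry must not treat the step as done.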

Where does this end up?

Dag: I think we're about eighteen months from agent frameworks shipping with the enforcement primitives that workflow engines have had for twenty years. Watchdogs, heartbeats, distributed locking, compensation registries. The patterns exist. They're just not default yet.¹¹

Until then, every team running agents in production is building this layer themselves, badly, at 2 AM, after the first car payment.

Footnotes

1. As of June 2026, thirteen frontier models ship context windows of 1 million tokens or more. See https://www.morphllm.com/llm-context-window-comparison
2. TrueFoundry documented this cost shape: a 4,000-token initial context doubling at each step reaches 128,000 tokens at step 5 with a 32× cost increase. See https://www.truefoundry.com/blog/rate-limiting-ai-agents-preventing-llm-api-exhaustion
3. Diagrid's analysis of LangGraph production gaps: "If your process crashes, no one knows. There is no supervisor, no watchdog, and no heartbeat." See https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
4. UC Berkeley's MAST study (NeurIPS 2025) found that 41–87% of agent failures returned HTTP 200.
5. Glen Rhodes, "Termination Logic Is the Underrated Design Problem in Agentic AI Systems" (March 8, 2026). See https://glenrhodes.com/prediction-termination-logic-is-the-underrated-design-problem-in-agentic-ai-systems-not-model-quality-or-prompt-design/
6. aiqnahub.com analysis: "The graph cycles forever because no branch condition evaluates to the terminal state. The state hash does not change between cycles." See https://www.aiqnahub.com/agentic-workflow-loop-forever/
7. TianPan.co: "A tool call that ran to completion but returned an unparseable response is not semantically complete." See https://tianpan.co/blog/2026-04-20-idempotency-llm-pipelines
8. Motomtech: "CreateOrder succeeds, then ChargeCard fails permanently. The order is created and unpaid. Retrying the sequence creates a second order." See https://www.motomtech.com/blog-post/ai-agent-retries-idempotency-tool-failures/
9. TianPan.co: "Traditional idempotency breaks when outputs are stochastic." See https://tianpan.co/blog/2026-04-20-idempotency-llm-pipelines
10. See gate_count vs. completion_count distinction in https://getaxonflow.com/blog/idempotency-boundaries-in-multi-system-ai-automation/
11. "Idempotency keys come from payment processing. Exponential backoff with jitter comes from every retry library shipped this decade. Saga compensation comes from distributed transactions in the 1990s." See https://www.digitalapplied.com/blog/agentic-workflow-anti-patterns-orchestration-mistakes-2026
