A request times out. You send it again. A connection drops. You reconnect. Retry logic is so fundamental to distributed systems that most frameworks ship it by default. It keeps systems reliable.
The trouble starts when the thing being retried already took effect. The server processed the payment, the acknowledgment got lost in transit, the client retried, the server processed it again. Two charges, one order. Distributed systems engineers have dealt with this for decades. The solution is idempotency: assign a unique key to each operation. If the server sees the same key twice, it returns the cached result instead of re-executing. Stripe popularized this pattern. It works because the operation behind the key is deterministic. Same input, same output, same side effects. The key is a reliable deduplication mechanism because the thing it represents behaves identically every time.
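A minimal sketch of the pattern, assuming a toy in-memory server. The `PaymentServer` class, its `charge` method, and the customer and amount values are hypothetical and only for illustration. This is not Stripe's API, and a real server would store keys durably.

```python
import uuid


class PaymentServer:
    """Toy in-memory server (hypothetical); a real one would persist keys."""

    def __init__(self):
        self._results = {}  # idempotency key -> cached result
        self.charges = []   # side effects that actually happened

    def charge(self, idempotency_key, customer, amount_cents):
        # Seen this key before: return the stored result, do not re-execute.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((customer, amount_cents))
        result = {
            "charge_id": f"ch_{len(self.charges)}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


server = PaymentServer()
key = str(uuid.uuid4())  # generated once per logical operation, reused on retry

first = server.charge(key, "cust_1", 5000)
retry = server.charge(key, "cust_1", 5000)  # e.g. the acknowledgment was lost
```

The retry gets back the same result as the first call, and `server.charges` holds a single entry. The client has to generate the key once and reuse it. A fresh key per attempt defeats the mechanism.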
Now replace the deterministic executor with an LLM.
An agent calls a tool, the call times out, the orchestrator retries. But the first call completed. One practitioner described watching an agent double-charge a test customer $847 because the Stripe API call timed out and the retry logic didn't check whether the first charge had landed. So far, classical problem. The next part is where agent systems make it structurally worse.
In a traditional system, the retried operation is identical to the original. In an agent system, the LLM may reason differently on the second attempt. Different token sampling, different tool selection, a different sequence of actions entirely. The side effects still need to be deterministic, but the executor producing them is inherently nondeterministic. An idempotency key assumes the operation behind it is stable. An April 2026 paper formalized this as three challenges classical workflow engines never faced: non-deterministic node output, reasoning failures as a primary error mode, and non-idempotent retry. When the executor might take a different path on every invocation, the key doesn't know what it's deduplicating. It's a return address on a letter whose contents change every time you open the envelope.
Checkpointing doesn't close the gap either. A checkpoint records where the agent was. It says nothing about what the world looks like now. If the agent called an API and crashed before the framework logged success, the checkpoint shows the step as incomplete even though the API call landed. Resume, re-execute, duplicate. LangGraph recently made NodeTimeoutError retryable by default, with an explicit warning: if your nodes have side effects that aren't idempotent, opt out before upgrading.
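The window is easy to reproduce in miniature. The sketch below is a hypothetical simulation, not any framework's checkpointing API: `world` stands in for the external system, `checkpoint` for the framework's record, and `Crash` for a process dying between the two writes.

```python
class Crash(Exception):
    """Simulated process death (hypothetical, for illustration)."""


def run_step(world, checkpoint, crash_before_commit):
    # Skip the step only if the checkpoint says it finished.
    if checkpoint.get("charge") == "done":
        return
    world.append("charge")           # the side effect lands in the real world
    if crash_before_commit:
        raise Crash()                # dies before the checkpoint is written
    checkpoint["charge"] = "done"    # success is recorded only here


world, checkpoint = [], {}

try:
    run_step(world, checkpoint, crash_before_commit=True)
except Crash:
    pass  # the checkpoint still shows the step as incomplete

# Resume from the checkpoint: the step runs again.
run_step(world, checkpoint, crash_before_commit=False)
```

After the resume, `world` contains two charges while the checkpoint records one completed step. The checkpoint is internally consistent and still wrong about the world.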
One architectural response already exists: separate reasoning from execution. The agent proposes intent; a deterministic layer validates and executes. Temporal's integration with OpenAI's Agents SDK works this way, replaying workflows against recorded history and skipping completed steps. The LLM's nondeterminism becomes irrelevant because you never re-ask it on retry. But this boundary has weight. It requires architecting the entire system around a division that constrains the very thing that made agents attractive: their ability to improvise, to chain actions flexibly, to handle situations nobody pre-specified. The O'Reilly Radar framing calls it the missing layer. Call it a leash.
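A rough sketch of that division, under loud assumptions: `propose_intents` is a stand-in for the LLM, `Executor` is a hypothetical deterministic layer, and none of this reflects Temporal's or the Agents SDK's actual API. The plan is recorded once, and retries replay the recorded history and skip completed steps.

```python
import random

ALLOWED_TOOLS = {"charge", "email"}  # hypothetical validation rule


def propose_intents(rng):
    """Stand-in for the LLM: may return a different plan on each call."""
    charge = ("charge", {"customer": "cust_1", "amount_cents": 84700})
    email = ("email", {"to": "cust_1"})
    return rng.choice([[charge], [charge, email]])


class Executor:
    """Deterministic layer: records the plan once, replays it on retry."""

    def __init__(self):
        self.plan = None        # recorded intents
        self.history = {}       # step index -> recorded result
        self.side_effects = []  # what actually touched the world

    def run(self, rng, fail_after=None):
        if self.plan is None:
            self.plan = propose_intents(rng)  # the model is asked only once
        for i, (tool, args) in enumerate(self.plan):
            if i in self.history:
                continue  # completed step: skip instead of re-executing
            if tool not in ALLOWED_TOOLS:
                raise ValueError(f"intent rejected: {tool}")
            self.side_effects.append((tool, args))
            self.history[i] = "ok"
            if fail_after == i:
                raise TimeoutError("simulated timeout after the step landed")
        return self.history


ex = Executor()
try:
    ex.run(random.Random(0), fail_after=0)  # first attempt times out
except TimeoutError:
    pass
ex.run(random.Random(1))  # retry: a different seed never reaches the model
```

Each planned step executes exactly once, whichever plan the stand-in model picked. Note that the sketch records the result before the simulated timeout. The crash window between effect and record is a separate problem that this layer does not solve on its own.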
Most agent systems don't have this boundary. Reasoning and execution happen in the same pass. Retry the pass, and you get a new reasoning trace producing new side effects in a world that already absorbed the old ones. Reliability infrastructure, functioning correctly, introduces the errors it was built to prevent. For decades, making systems more reliable has also made them more correct. Agent systems break that coupling.
Things to follow up on...
- Idempotency keys for agents: A developer built agent-ledger, a small Python library that hashes workflow ID, tool name, and arguments into a deduplication key at the tool-call boundary, replaying stored results on retry instead of re-executing.
- Two kinds of retry: A practitioner analysis on idempotency in LLM pipelines distinguishes transient retries (network flake, want the same result) from sampling retries (bad output, want a different result), arguing they require opposite handling and that conflating them is a primary source of duplicate side effects.
- Framework defaults are shifting: LangGraph's recent decision to make NodeTimeoutError retryable by default prompted an explicit upgrade warning about non-idempotent side effects, while CrewAI's checkpointing behavior remains notably underdocumented in production scenarios.
- The missing kernel boundary: An O'Reilly Radar piece argues that agents need the equivalent of an OS kernel boundary separating unprivileged reasoning from privileged state modification, where every side effect passes through a deterministic validation layer before touching the real world.
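For the first item, a sketch of what hashing at the tool-call boundary could look like. This is an assumption-laden illustration, not agent-ledger's actual API: `dedup_key`, `ToolLedger`, and the `charge` tool are hypothetical names, and the SHA-256-over-canonical-JSON scheme is one plausible choice.

```python
import hashlib
import json


def dedup_key(workflow_id, tool_name, args):
    # Canonical JSON so that argument order does not change the hash.
    payload = json.dumps(
        [workflow_id, tool_name, args], sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToolLedger:
    """Hypothetical ledger: replays stored results for repeated tool calls."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times a tool really ran

    def call(self, workflow_id, tool_name, args, tool_fn):
        key = dedup_key(workflow_id, tool_name, args)
        if key in self._results:
            return self._results[key]  # replay, no side effect
        self.executions += 1
        result = tool_fn(**args)
        self._results[key] = result
        return result


def charge(customer, amount_cents):
    # Stand-in tool; a real one would call a payment API.
    return {"customer": customer, "amount_cents": amount_cents, "status": "charged"}


ledger = ToolLedger()
ledger.call("wf-1", "charge", {"customer": "cust_1", "amount_cents": 84700}, charge)
# Identical retry (argument order differs, content does not): replayed.
ledger.call("wf-1", "charge", {"amount_cents": 84700, "customer": "cust_1"}, charge)
after_identical_retry = ledger.executions

# The model reasons differently on retry and changes one argument: new key.
ledger.call("wf-1", "charge", {"customer": "cust_1", "amount_cents": 84699}, charge)
after_changed_retry = ledger.executions
```

The identical retry is deduplicated, but the retry with a slightly different argument hashes to a new key and executes again. A content-derived key protects against repeated calls, not against an executor that changes its mind.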

