Retry logic in distributed systems works because of a quiet agreement between two sides. The client sends the same request. The server recognizes it as a duplicate. An idempotency key ties them together: hash the operation's defining parameters, hold the key constant across attempts, and the server returns the cached result instead of executing again. Stripe's API has enforced this for years. Same input, same execution path. The key identifies the operation because the operation doesn't change between attempts.
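A minimal sketch of that agreement, assuming an in-memory server and a key derived by hashing the canonicalized parameters. The names `idempotency_key` and `IdempotentServer` are hypothetical, and this is not Stripe's implementation (Stripe accepts a client-supplied key in a request header); it only illustrates the contract described above.

```python
import hashlib
import json


def idempotency_key(operation: str, params: dict) -> str:
    """Derive a stable key from the operation name and its defining parameters."""
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps({"op": operation, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotentServer:
    """Toy server: executes each key once, then returns the cached result."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def handle(self, key: str, operation: str, params: dict) -> dict:
        if key in self._results:
            # Duplicate request: return the earlier result, do not execute again.
            return self._results[key]
        self.executions += 1
        result = {"status": "ok", "operation": operation, "execution": self.executions}
        self._results[key] = result
        return result


server = IdempotentServer()
params = {"customer": "cus_test", "amount_cents": 84700}
key = idempotency_key("charge_card", params)

first = server.handle(key, "charge_card", params)
retry = server.handle(key, "charge_card", params)  # same key, same cached result
```

The retry reaches the server, but the operation runs once. Everything depends on the client producing the same key on the second attempt.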
Agent systems inherit this retry discipline, but the precondition underneath it falls apart.
When an LLM sits in the decision loop, each tool call may execute deterministically once selected. The selection itself varies. Token sampling variance means the same prompt can produce different tool choices, different parameters, different ordering, and turning the temperature down does not eliminate it. Researchers sampled 1,000 completions at temperature zero with an identical prompt and got 80 unique outputs. OpenAI's own documentation describes its API as "mostly deterministic" even with the seed parameter set. In batched inference, tokens from different users compete for the same processing slots, introducing variability entirely outside your control.
For a single call, this variance is manageable. Agents don't make single calls, though. They make sequences. An agent processing an order might call three tools: create the order, charge the card, send the confirmation. If the second call times out, the orchestrator retries. On retry, the LLM reasons again from the same context. The AgentCgroup study found that three runs of the same task produced entirely different solution strategies, with 1.8× variance in execution time. Different tools, different files, different paths. Whether this variance is a property of current models that will eventually be engineered away, or something intrinsic to how token sampling works at the architectural level, is genuinely open. What's observable right now is that the variance exists and it's substantial.
Individual tools can each carry a perfect idempotency key. The key identifies a specific tool call. It says nothing about whether that tool call should have been made at all, or in what order.
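A sketch of why per-call keys stop helping once the plan changes. Both plans below are invented for illustration, as are the tool names and the `memo` field: they stand in for what a model might emit on a first attempt and on a retry. Every call is deduplicated by its own key, yet the card is still charged twice.

```python
import hashlib
import json


def call_key(tool: str, params: dict) -> str:
    """Per-call idempotency key: a hash of the tool name and its parameters."""
    canonical = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical plan from the first attempt.
first_plan = [
    ("create_order", {"order_id": "ord_1", "items": ["poster"]}),
    ("charge_card", {"order_id": "ord_1", "amount_cents": 84700}),
    ("send_confirmation", {"order_id": "ord_1"}),
]

# Hypothetical plan from the retry: same intent, different ordering,
# and one extra parameter on the charge.
retry_plan = [
    ("charge_card", {"order_id": "ord_1", "amount_cents": 84700, "memo": "poster order"}),
    ("create_order", {"order_id": "ord_1", "items": ["poster"]}),
    ("send_confirmation", {"order_id": "ord_1"}),
]


def run(plan, seen, log):
    """Execute each call unless its key has already been seen."""
    for tool, params in plan:
        key = call_key(tool, params)
        if key in seen:
            continue  # tool-level deduplication works as designed
        seen.add(key)
        log.append(tool)


seen, log = set(), []
run(first_plan[:2], seen, log)  # charge went through, then the response timed out
run(retry_plan, seen, log)      # the orchestrator retries and the model re-plans

charge_count = log.count("charge_card")
```

The `create_order` call is correctly skipped on retry because its key matches. The charge is not, because the re-planned call is, as far as the key is concerned, a different operation.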
One developer documented an agent double-charging a test customer $847 because a Stripe call timed out and the retry didn't check whether the first charge had already succeeded. Both payments went through. Another described a fulfillment agent double-triggering a print order because the confirmation arrived during a retry window. Ordinary network conditions, producing real damage, because the orchestrator is probabilistic. And when teams encounter these failures, the natural response is to fix the specific retry implementation: add a check, deduplicate the call. The symptom gets patched. The category error, that the retry model assumes a determinism the orchestrator can't provide, persists underneath, ready to produce the next variant.
Demos don't surface any of this. In a demo, every tool call succeeds on the first attempt, so the retry path never fires. But even if someone added retry logic to a demo, the absence of real side effects means a duplicate tool call produces no observable damage. The double-charge doesn't cost anyone money. The duplicate email goes to a test inbox nobody reads. The problem can't be surfaced in that environment because the conditions that make it destructive don't exist there.
The practitioners who've worked through this converge on the same fix: on retry, don't re-ask the LLM. Cache the decision sequence from the first run, replay it deterministically, and apply idempotency keys to the replayed tool calls. The LLM's nondeterminism becomes irrelevant because the LLM doesn't run again.
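One way that fix might look, as a sketch under several assumptions: plans and results live in memory (a real system would need durable storage), the planner is a stand-in for the LLM, and `FakePayments` simulates a downstream API that honors idempotency keys. All class and function names here are hypothetical.

```python
import hashlib
import json


def step_key(task_id, index, tool, params):
    """Key for one step of a cached plan; stable across replays of that plan."""
    raw = json.dumps([task_id, index, tool, params], sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ReplayingOrchestrator:
    def __init__(self, planner, tools):
        self.planner = planner  # stand-in for the LLM; called at most once per task
        self.tools = tools      # tool name -> callable(params, key)
        self.plans = {}         # task_id -> cached decision sequence
        self.done = {}          # step key -> result of a completed step

    def run(self, task_id, prompt):
        if task_id not in self.plans:
            self.plans[task_id] = self.planner(prompt)
        outputs = []
        for index, (tool, params) in enumerate(self.plans[task_id]):
            key = step_key(task_id, index, tool, params)
            if key not in self.done:
                # The key travels with the call so the tool can deduplicate
                # a step whose first response was lost.
                self.done[key] = self.tools[tool](params, key)
            outputs.append(self.done[key])
        return outputs


class FakePayments:
    """Simulated payment API: charges once per key, loses the first response."""

    def __init__(self):
        self.charges = {}
        self.fail_next = True

    def charge(self, params, key):
        if key not in self.charges:
            self.charges[key] = params["amount_cents"]
        if self.fail_next:
            self.fail_next = False
            raise TimeoutError("response lost after the charge went through")
        return "charged"


planner_calls = []


def fake_planner(prompt):
    """Deliberately unstable: would emit a different memo on every call."""
    planner_calls.append(prompt)
    memo = f"attempt {len(planner_calls)}"
    return [
        ("charge_card", {"amount_cents": 84700, "memo": memo}),
        ("send_confirmation", {"to": "test@example.com"}),
    ]


payments = FakePayments()
sent = []


def send_confirmation(params, key):
    sent.append(params["to"])
    return "sent"


orchestrator = ReplayingOrchestrator(
    fake_planner,
    {"charge_card": payments.charge, "send_confirmation": send_confirmation},
)

try:
    orchestrator.run("task-1", "process the order")
except TimeoutError:
    # Retry replays the cached plan; the planner is not consulted again.
    outputs = orchestrator.run("task-1", "process the order")
```

The planner here would have produced a different charge on a second call. It never gets the chance, so the replayed call carries the original key and the payment side records a single charge.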
An elegant solution, and a quiet concession. The retry contract inherited from distributed systems doesn't survive a probabilistic orchestrator. The most reliable way to retry an agent's work is to keep the agent out of the retry path entirely.
Things to follow up on...
- Retry rates in production: The AI Agent Reliability Report 2025 found that LLM agents retry tool calls 15–30% of the time due to timeouts, validation errors, or model uncertainty, which means non-idempotent actions face retry conditions routinely, not as rare edge cases.
- Partial completion and orphans: A detailed breakdown of the CreateOrder-then-ChargeCard failure pattern, where retrying a sequence after partial success creates orphan records and duplicate writes, shows why tool-level idempotency alone doesn't solve sequence-level failures.
- The saga pattern for agents: The distributed systems playbook for multi-step rollback, including sagas, leases, and the outbox pattern, is being adapted specifically for agent tool chains where compensating actions need to unwind partially completed work.
- Durable execution as infrastructure: Temporal and similar frameworks that guarantee exactly-once activity execution are emerging as the recommended substrate for agent workflows that touch real-world side effects, effectively moving the reliability contract below the LLM layer.
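
For the saga item above, a minimal sketch of the compensating-action idea, assuming each step is paired with an undo function. The step names and the `run_saga` helper are hypothetical; frameworks such as Temporal provide their own mechanisms, which this does not attempt to reproduce.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    If an action raises, undo the completed steps in reverse order.
    Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undid {done_name}")
            return False, log
        log.append(f"did {name}")
        completed.append((name, compensate))
    return True, log


orders = set()


def create_order():
    orders.add("ord_1")


def cancel_order():
    orders.discard("ord_1")


def charge_card():
    raise RuntimeError("card declined")


def refund_card():
    pass  # never reached in this example; the charge did not complete


ok, log = run_saga([
    ("create_order", create_order, cancel_order),
    ("charge_card", charge_card, refund_card),
])
```

The failed charge triggers `cancel_order`, so no orphan order record is left behind.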

