LLM pricing looks straightforward. Providers publish per-token rates for input and output. You multiply, you budget, you move on. For single-turn inference, this works fine. For agent workloads, it produces estimates that are wrong by an order of magnitude or more.
The reason is a single mechanism: context accumulation. It takes about three minutes to understand. Walk into a conversation where a CAIO mentions agent costs without understanding it, and you have nothing useful to say.
The mechanism
Every major LLM API is stateless. The model retains nothing between calls. Anthropic's documentation states it plainly: "The Messages API is stateless, which means that you always send the full conversational history to the API." OpenAI's documentation is equally explicit: all previous input tokens are billed on every subsequent call, even when using their newer state management features.
So when an agent takes its second action, the application doesn't send just the new request. It resends the first request, the first response, every tool result from step one, and then the new request. Step three carries all of that plus its own new content. Step four carries everything from steps one through three.
The total input tokens across N turns follow a triangular series: N × (N+1) / 2. A 10-step agent loop doesn't cost 10 turns of input. It costs 55. A 20-step loop costs 210.
| Agent turns | Cumulative input cost (units) | Naive per-turn estimate | Actual multiplier |
|---|---|---|---|
| 1 | 1 | 1 | 1× |
| 5 | 15 | 5 | 3× |
| 10 | 55 | 10 | 5.5× |
| 20 | 210 | 20 | 10.5× |
If you hear someone call this quadratic growth, this is what they mean.
That's the math assuming each step adds roughly the same amount of new content, which is generous. Agent tool results are often large and variable. A step that retrieves a document or parses a codebase can dump thousands of tokens into the context, and every subsequent step hauls that payload forward. When later turns carry disproportionately large tool outputs, the actual cost multiplier climbs well beyond what the clean triangular math predicts.
Prompt caching, which most providers now offer, reduces the per-token price on repeated prefixes. It does not change the accumulation pattern. The context still grows, the tokens still get billed (at a discount), and the shape of the cost curve doesn't flatten. The Stanford team's preprint found that input tokens dominate cost "even when token caching is enabled."
The FinOps Foundation named this pattern Context Window Creep and called it:
"The single greatest hidden cost in most production AI applications. In a long-running conversation, this creep can easily dwarf the cost of the actual output being generated, leading to shocking and unexpected cost overruns."
What the research shows
A multi-institution team spanning Stanford, Michigan, Google DeepMind, Microsoft AI, and MIT published a preprint in April 2026 analyzing token consumption across eight frontier models on real software engineering tasks. Not yet peer-reviewed, but the methodology is transparent and the affiliations are serious.
Three findings from the paper are worth your time.
First, agentic tasks consume orders of magnitude more tokens than single-turn tasks in the same domain. The Stanford Digital Economy Lab's summary characterizes this as up to 1,000 times more tokens than code reasoning or code chat. Their description is apt: the agent builds "one big pricey context snowball." Same coding domain, same models, same benchmarks. The difference is purely structural: agents loop, and loops accumulate.
Second, input tokens dominate the cost. Most people's instinct is to focus on what the model generates, since output feels like the work. But in agent workloads, the model re-reads everything on every turn, and that re-reading is where the money goes.
Third, cost is highly variable and essentially unpredictable. The same agent running the same task can consume up to 30 times more tokens on one run than another. Models consistently underestimate their own token spend when asked to predict it. Task complexity doesn't reliably predict cost. More tokens don't even reliably produce better results; accuracy often peaks at intermediate token levels. You cannot budget for agent workloads the way you budget for API calls with predictable payloads.
The Stanford team found the same agent on the same task can vary by 30× across runs. That variance breaks any fixed-price procurement model. It's the kind of problem that makes a contracting officer's eye twitch.
Context engineering as the production discipline
The industry's answer is aggressive pruning of what the model carries forward. The practice has acquired a name: context engineering. The term diffused quickly enough through AI engineering communities that no single origin point stuck, but the concept is now well-established. It has quietly become the dominant production skill for teams deploying agents at scale. Production quality, it turns out, depends heavily on what context gets selected, filtered, compressed, and carried forward.
Augment Code, an AI coding assistant company, published an engineering guide analyzing context waste in agent loops. Their central claim, drawn from SWE-bench measurements: 39.9 to 59.7 percent of accumulated tool-result tokens are removable without performance loss. A sourcing caveat: Augment Code is a commercial vendor synthesizing external research, not conducting independent peer-reviewed work. But the finding aligns with what the Stanford team observed about the disconnect between token volume and task accuracy.
The practical techniques look like what you'd expect from a discipline still forming its boundaries. Summarize completed work phases. Store essential state in external memory. Spawn fresh sub-agents with clean contexts. Augment Code found that compression instructions scheduled every 10 to 15 tool calls achieved 22.7 percent token savings while matching baseline accuracy. They also found that naive approaches backfire: passive prompting, just asking the model to be concise, yielded only 6 percent savings and actually degraded accuracy.
A real tension runs through all of this. The model needs context to reason well. Pruning too aggressively means the agent loses track of rejected approaches, implicit constraints, decisions made three steps ago that still matter. The summarization step is inherently lossy. Context engineering is a tradeoff between cost and coherence, and best practices haven't stabilized. Anyone who tells you they've solved this cleanly is selling something.
Where your IDAM intuition applies and where it misleads
Think of the agent's accumulated context as a session token the application must present on every API call. The API is stateless, so the client carries state. The cost of carrying state scales with what's in it. The FinOps Foundation's Context Window Creep maps onto a familiar pattern: an ever-growing credential presented and validated on every request. You already understand stateless architectures that push state to the client, and you understand that this creates scaling problems.
Your intuition is useful up to a point. In identity, the entire design philosophy keeps tokens compact. Session management exists precisely to avoid resending full state. The LLM world doesn't have that luxury. The model genuinely needs the full context to reason coherently. No equivalent of a session store exists that the model can reference by ID. The context must travel with every call because the model has no other way to know what it's doing.
When you hear a buyer talk about "managing agent context," they're describing a problem with no clean architectural solution yet. Think early days of federation, where everyone agreed the problem was real and nobody agreed on the mechanism.
What to listen for
When a CAIO or CISO mentions agent costs, listen for three things.
First, whether they're thinking in per-call terms or accumulated-context terms. If they're budgeting based on single-turn token pricing multiplied by expected turns, they're underestimating by at least an order of magnitude. If they mention context engineering, context windows, or pruning strategies, they're past the first misconception and into the real operational problem.
Second, whether they've accounted for cost variability, not just cost magnitude. The Stanford team's 30× variance across identical runs means fixed-price-per-task models don't hold. If the buyer is modeling agent costs as predictable per-task expenses, they haven't hit this wall yet.
Third, whether they raise prompt caching as a solution. It helps on price-per-token but doesn't change the accumulation pattern. Knowing the difference tells you whether they've read the marketing or run the workload.
A question worth asking: "How are you modeling token costs for multi-turn agent workloads versus single-turn inference?" If they have an answer, you'll learn something about their maturity. If they don't, you've surfaced a problem they'll need to solve before any agent deployment survives its first invoice.
Things to follow up on...
- Agents can't self-budget: The Stanford team found that models consistently underestimate their own token spend when asked to predict costs before execution, which is the fundamental bottleneck for result-based agent pricing.
- FinOps is now mainstream: The State of FinOps 2026 survey reports 98% of respondents now manage AI spend, up from 31% just two years earlier, signaling that AI cost governance has moved from emerging concern to standard practice.
- Context files can backfire: Augment Code's research found that repository context files like AGENTS.md can increase inference costs by over 20% per session while sometimes reducing task success on complex tasks.
- Per-token prices keep dropping: One aggregator tracked an approximately 80% decline in LLM API prices between early 2025 and early 2026, yet enterprise AI bills are rising because volume and context accumulation outpace unit cost reductions.

