The Context Tax

By Leigh Garrity— May 8, 2026

Token-based pricing is simple until you run an agent. Then it's a different conversation.

Here's the mechanism. Language model providers charge per token — a token being roughly three-quarters of a word, or about four characters. You pay for tokens going in (your prompt, your instructions, everything the model reads) and tokens coming out (the model's response). Rates vary by model tier, but a working mental model for a capable frontier model: somewhere in the range of $10–15 per million input tokens, $30–60 per million output tokens. A single, well-constructed query — system prompt, user message, response — might run 1,000 tokens total. At those rates, that's less than two cents.

That math is what makes demos look cheap.

Token pricing: Providers charge separately for tokens the model reads (input) and tokens it generates (output), billed per million. A single call to a capable model costs fractions of a cent; the problem is what happens when you chain calls together.

The Loop Changes Everything

An agent isn't a single call. It's a loop: the model reasons, calls a tool, gets a result, reasons again, calls another tool, reasons again, produces output. Twenty turns is not unusual for a moderately complex workflow — a contract review, a procurement research task, a multi-step compliance check.

The model has no memory. Every call is stateless. To give the model context about what happened in turn 3 when you're on turn 12, you have to send turn 3's content again. And turn 4's. And turn 5's. The entire conversation history, every tool result, every reasoning step, rides along on every subsequent call, growing with each turn.

A procurement research agent makes the compounding concrete. The user asks it to find three compliant vendors for a software category, check their FedRAMP status, summarize pricing tiers, and flag any recent security incidents.

Turn 1: System prompt (500 tokens) + user query (150 tokens) = 650 tokens in. Model responds with a plan and a tool call (400 tokens out).

Turn 2: Everything from turn 1 (1,050 tokens) + tool result from the first search (900 tokens) + agent continuation (100 tokens) = 2,050 tokens in. Model reasons and calls another tool (400 tokens out).

Turn 3: Everything from turns 1 and 2 (3,450 tokens) + new tool result (900 tokens) + continuation (100 tokens) = 4,450 tokens in.

You see the shape. Each turn, the input grows by roughly 1,400 tokens — the prior response plus the tool result plus the continuation. By turn 10, you're sending 14,000 tokens on a call that started at 650. By turn 20, you're past 27,000 tokens on a single call.

Now run the math on total cost across all 20 turns. The naive estimate: 20 calls × 650 tokens = 13,000 input tokens. The actual total, summing across all turns as context accumulates: roughly 280,000 input tokens. That's a 21x multiplier on input alone — and input is the cheaper side. Add output tokens, add the fact that real tool results often run longer than 900 tokens (a document retrieval might return 3,000–5,000 tokens), and the 30–50x figure that practitioners cite becomes entirely plausible. Some workflows hit higher.

The demo ran five turns. The demo looked cheap. Production runs twenty turns, ten thousand times a day.

Context accumulation: In a multi-turn agent loop, the full conversation history — every tool result, every reasoning step — must be re-sent on every subsequent call because the model has no persistent memory. Input token counts grow with each turn, compounding total cost far beyond what per-call estimates suggest.

IDAM Concept Mapping

The concept that maps closest here is the session — specifically, the difference between session state and session tokens. In IDAM, you authenticate once, get a session token, and subsequent requests are cheap: the token is a pointer to state stored server-side. The session store holds the history; the token is just the key.

The LLM context window looks like a session from the outside. It accumulates state across turns. It represents "what has happened so far in this interaction." That part of the analogy holds.

The break is that there is no session store. The context window is not a pointer to state — it is the state, and you transmit it in full on every call. In IDAM terms, it's as if your session token contained the entire session history, and you had to send the full token on every request. A 20-turn agent loop doesn't cost 20x more than a 1-turn exchange; it costs orders of magnitude more, because the input payload grows with every turn. The cost model is fundamentally different from anything in the session paradigm.

Your IDAM intuition about "accumulated state across an interaction" is useful. Your intuition about what that state costs to maintain is not.

When You'll Need This

The conversation happens in two places. The first is the pilot review — a finance team or a CIO looking at an agent deployment that ran beautifully in demo and is now generating an invoice that doesn't match the projection. The second is the pre-procurement conversation, where a technically fluent buyer asks how you're thinking about inference costs at scale.

In both cases, the person across from you has probably done the naive math: calls per day × cost per call. That math is wrong by a factor of 30 to 50, and the gap is entirely explained by context accumulation. Naming that mechanism, and explaining why it's not a vendor problem or a pricing anomaly but a structural feature of how stateless inference works, is what separates a credible technical conversation from a shrug.

Context engineering is the production response to this problem. The term has become common enough in ML engineering circles that buyers with technical staff may have heard it; it's worth knowing what it means. Context engineering is the practice of aggressively managing what the model carries forward on each call: summarizing prior tool results rather than appending them verbatim, compressing conversation history at defined checkpoints, dropping intermediate reasoning steps once a decision is made, and using retrieval to inject only the relevant prior context rather than the full history. Done well, it can reduce effective token consumption by 60–80% on long agent loops without meaningfully degrading task performance.

Agent demos rarely survive contact with a finance team because demos are short. A five-turn demo with clean tool results doesn't reveal the accumulation problem. A twenty-turn production workflow does. Context engineering is what you put between the demo and the invoice.

Vendors are increasingly building context management into their agent frameworks — automatic summarization, sliding window approaches, selective retrieval. The implementations vary in quality and the field is moving fast enough that any specific capability claim deserves scrutiny. The underlying problem, though, is settled: without deliberate context management, inference costs in agent workloads scale superlinearly with task complexity, and the naive per-call estimate is not a useful planning number.

Context engineering: The practice of deliberately managing what context the model carries forward on each call — through summarization, compression, and selective retrieval — to prevent superlinear cost growth in multi-turn agent loops. It is the primary production technique for making agent deployments economically viable at scale.

The number your buyer needs isn't the per-call rate. It's the effective cost per completed task, which requires knowing how many turns the task takes and how aggressively the deployment manages context accumulation. If the vendor or the implementation team can't answer that question, the demo number is the wrong number to plan around.

Sourcing note: Token pricing figures are illustrative of current frontier model tiers as of mid-2026 and should be verified against current provider rate cards before use in customer-facing materials. The 30–50x cost multiplier reflects practitioner estimates in ML engineering communities; the actual multiplier is workload-dependent and should be modeled against specific agent loop characteristics.