Context Is the Cost: How Token-Based Billing Actually Works in Agent Workloads

By Leigh Garrity, May 9, 2026

The word "token" is doing double duty in AI infrastructure conversations, and if you let it stay ambiguous, the cost math won't make sense.

You already know what a token is. You've configured them, scoped them, set expiration policies on them. That's not the token we're talking about here. The LLM billing token is a different animal, and the collision between the two concepts is where most people's intuitions about agent costs go wrong.

What a Token Is (In This Context)

When a language model processes text, it doesn't read words. It reads chunks. The model's tokenizer breaks input into fragments — sometimes a full word, sometimes a syllable, sometimes punctuation — based on statistical patterns in its training data. "Authorization" might be two tokens. "auth" might be one. A JSON blob from a tool call is a pile of tokens, most of them brackets and quotes.

The exact mapping depends on the model and its tokenizer. What matters for billing is that every piece of text going into the model, and every piece of text coming out, gets measured in these chunks. Providers publish their pricing in dollars per million tokens — input tokens and output tokens priced separately, with output typically running three to five times higher than input. The asymmetry reflects compute reality: generating a token is more expensive than reading one.

A typical enterprise API call — a system prompt, a user message, some tool context — might consume 2,000 to 4,000 tokens. A document ingestion pipeline can burn through 100,000 before the first response. Neither number is the problem. The problem is what happens when you chain calls together.

token (LLM): The unit of text that language models process and providers use for billing: roughly a word fragment, priced per million, with separate rates for input and output. Not a credential, not a session identifier, not a signed assertion.

How the Bill Compounds

A single call is straightforward. You send tokens in, you get tokens out, you multiply by the rate, you pay. At $10 per million input tokens and $30 per million output tokens (illustrative figures for a frontier model in the government-accessible tier), a 3,000-token input with a 500-token response costs about $0.045. Cheap enough that nobody worries about it.
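The single-call arithmetic can be sketched in a few lines. This is a minimal illustration using the illustrative rates quoted above; the function and constant names are assumptions made for this example, not part of any provider's SDK.

```python
# Illustrative rates from the text, in dollars per million tokens.
INPUT_RATE_PER_M = 10.00
OUTPUT_RATE_PER_M = 30.00


def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float = INPUT_RATE_PER_M,
              output_rate: float = OUTPUT_RATE_PER_M) -> float:
    """Dollar cost of one model call; input and output are billed at separate rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


single_call = call_cost(3_000, 500)
print(f"${single_call:.3f}")  # $0.045
```

Swapping in a provider's published rates is a matter of changing the two constants; the structure of the calculation stays the same.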

The agent loop is where the math turns on you.

When an agent runs a multi-turn task — researching, calling tools, reasoning through intermediate steps, revising its approach — each new call to the model includes the full history of everything that came before. The model has no persistent memory between calls. It doesn't store what it learned in turn 3 and retrieve it in turn 7. It re-reads the entire conversation from the beginning, every single time.

Turn 1: system prompt (2,000 tokens) + initial task (500 tokens) = 2,500 tokens in.

Turn 2: everything from turn 1 (3,500 tokens, including the model's 1,000-token response) + a tool result (say, 4,000 tokens of retrieved text) = 7,500 tokens in.

Turn 3: everything from turns 1 and 2 (12,000 tokens, including a longer 4,500-token response) + another 4,000-token tool result = 16,000 tokens in.

The context window grows with every turn. By turn 10, you might be sending 60,000 tokens on a single call. By turn 20, you're approaching 120,000. The cost per call isn't flat — it's climbing the whole time.
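A rough simulation makes the growth visible. The sketch below assumes every turn adds a fixed 1,000-token response and a fixed 4,000-token tool result. Those sizes are assumptions for illustration, and real loops with longer responses or heavier tool output will grow faster.

```python
def context_sizes(turns: int, system_tokens: int = 2_000, task_tokens: int = 500,
                  response_tokens: int = 1_000,
                  tool_result_tokens: int = 4_000) -> list[int]:
    """Input tokens sent on each turn when the full history is re-sent every call."""
    sizes = []
    context = system_tokens + task_tokens  # what turn 1 sends
    for _ in range(turns):
        sizes.append(context)
        # The model's reply and the next tool result both join the history.
        context += response_tokens + tool_result_tokens
    return sizes


sizes = context_sizes(20)
print(sizes[:2])  # [2500, 7500]
print(sizes[-1])  # 97500
```

Even with these modest assumed sizes, the last call sends 39 times as many tokens as the first.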

The naive estimate for a 20-turn agent loop: 20 calls × 3,000 tokens = 60,000 tokens. The actual total, once you account for accumulated context and verbose tool results: closer to 1.5 to 2 million tokens. That's a 25 to 33x multiplier. It's not a rounding error. At the illustrative $10 per million input rate, it's the difference between a $0.60 task and a $15 to $20 task, and it scales with the richness of the tools the agent is calling.

Output tokens carry the higher per-token rate because the model generates them fresh. But in a long agent loop, input tokens are the volume problem. You're paying to re-read the entire conversation history on every turn, and that history gets longer every time.

context accumulation: The compounding growth of tokens sent to the model in a multi-turn agent loop, as each call includes the full history of prior turns, tool results, and reasoning steps. The cost per call increases with every turn, making total loop cost a function of the sum of an arithmetic series, not a flat per-call rate.
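The arithmetic-series framing can be checked with a closed-form sum. This sketch assumes a first call of 2,500 tokens and constant growth of 5,000 tokens per turn (an assumed 1,000-token response plus a 4,000-token tool result); the function names are hypothetical.

```python
def total_input_tokens(turns: int, first_call: int, growth_per_turn: int) -> int:
    """Sum of an arithmetic series: first_call, first_call + g, first_call + 2g, ..."""
    return turns * first_call + growth_per_turn * turns * (turns - 1) // 2


def naive_input_tokens(turns: int, flat_call: int) -> int:
    """The flat per-call estimate that ignores accumulated history."""
    return turns * flat_call


accumulated = total_input_tokens(20, first_call=2_500, growth_per_turn=5_000)
naive = naive_input_tokens(20, flat_call=3_000)
print(accumulated, naive, round(accumulated / naive, 1))  # 1000000 60000 16.7

# Doubling the number of turns roughly quadruples the total.
print(total_input_tokens(40, first_call=2_500, growth_per_turn=5_000))  # 4000000
```

The quadratic term is what catches people out: the growth-per-turn figure multiplies turns × (turns − 1) / 2, so heavier tool results and longer loops both push the ratio well past what this modest example shows.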

IDAM Concept Mapping

The LLM context window most resembles a session token or bearer token in one specific way: it carries state forward so the system knows what happened before. Without it, every call is stateless — the model has no idea you've been talking for 15 turns.

The analogy holds there, and then it breaks hard.

A session token is a pointer. It's a reference to state stored somewhere else — a database, a cache, a signed assertion that says "this session has these attributes." Retrieving that state is cheap. The token itself is small. The state lives elsewhere and gets fetched when needed.

An LLM context window is not a pointer. It is the state. Every byte of prior conversation, every tool result, every reasoning step — it all travels with the request, in full, on every call. The model re-reads it from scratch. There is no cache hit, no session lookup, no "retrieve the prior context cheaply." The compute cost scales with the size of the context on every single turn. That's the break, and it's the reason agent costs surprise people who've spent their careers thinking about identity infrastructure.
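The pointer-versus-state distinction can be shown as a difference in what travels with each request. The sketch below is purely illustrative: the session store, token value, and payload shapes are hypothetical and do not represent any particular identity product or model API.

```python
import json

# Hypothetical server-side session store: the token is only a lookup key.
SESSION_STORE = {"sess-7f3a": {"user": "analyst01", "roles": ["reviewer"]}}


def load_session(token: str) -> dict:
    """Pointer semantics: the state is fetched server-side, not sent by the caller."""
    return SESSION_STORE[token]


def session_request(token: str, message: str) -> str:
    """Payload with a session token: a small key plus the new message."""
    return json.dumps({"session": token, "message": message})


def llm_request(history: list[str], message: str) -> str:
    """Payload for a model call: every prior message plus the new one."""
    return json.dumps({"messages": history + [message]})


history = [f"turn {i}: " + "x" * 400 for i in range(15)]  # 15 turns of filler text
new_message = "What did we decide in turn 3?"

pointer_size = len(session_request("sess-7f3a", new_message))
state_size = len(llm_request(history, new_message))
print(pointer_size, state_size)  # the second payload grows with every turn
```

The session payload stays the same size no matter how long the session has run. The model payload grows with the history, and the billing follows the payload.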

What This Looks Like in a Real Procurement Conversation

Take an automated procurement review agent deployed at a civilian agency. Its job: ingest a vendor response to an RFP, cross-reference it against FAR requirements, flag compliance gaps, and produce a structured evaluation. Reasonable scope. The kind of thing that looks good in a demo.

The agent runs 20 turns. Turn 1: parse the RFP and identify evaluation criteria. Turns 2 through 8: retrieve and read relevant FAR clauses, one tool call per clause, each returning 3,000 to 5,000 tokens of regulatory text. Turns 9 through 15: read sections of the vendor response, cross-referencing against the criteria. Turns 16 through 20: synthesize findings, draft the evaluation, revise based on a quality check.

By turn 8, the model is reading 40,000 tokens of accumulated context on every call. By turn 15, it's closer to 80,000. The final synthesis call might send 120,000 tokens in and generate 3,000 tokens out.

A naive estimate ("it's 20 API calls, how expensive can it be?") might land at about $1 at the illustrative rates above. The actual bill, accounting for accumulation, is likely $10 to $15 per procurement review at those same rates. Run that agent across 500 solicitations in a fiscal year and you've got a line item that someone's going to notice.

That's the conversation you need to be ready for. Not from the CAIO, who probably understands this. From the CFO's office, or the program manager who approved the pilot budget based on the demo cost and is now looking at an invoice that's more than an order of magnitude higher.

The answer to that conversation is context engineering: the practice of aggressively pruning what the model carries forward. Summarizing completed reasoning steps instead of preserving them verbatim. Returning structured tool results instead of raw document text. Dropping intermediate exchanges once they've been incorporated into a synthesis. Done well, context engineering can reduce a 20-turn loop's token consumption by 60 to 80 percent without meaningfully degrading output quality. Done poorly, it degrades the agent's coherence and you've just traded a cost problem for a quality problem.
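The token arithmetic of one pruning strategy can be sketched as follows. This is a simplified model under stated assumptions: each turn adds 5,000 tokens of history, only the most recent three turns are kept verbatim, and each older turn is replaced by a 200-token summary. All of those sizes, and the function name, are hypothetical. The sketch says nothing about output quality and ignores the cost of producing the summaries.

```python
from typing import Optional


def loop_input_tokens(turns: int, system_tokens: int, turn_tokens: int,
                      keep_recent: Optional[int] = None,
                      summary_tokens: int = 200) -> int:
    """Total input tokens over a loop where each turn adds `turn_tokens` of history.

    With keep_recent=None the full history is re-sent verbatim. Otherwise only
    the last `keep_recent` turns stay verbatim and each older turn is replaced
    by a summary of `summary_tokens` (an assumed size; real summaries vary).
    """
    total = 0
    for completed in range(turns):  # number of turns already in the history
        if keep_recent is None or completed <= keep_recent:
            history = completed * turn_tokens
        else:
            older = completed - keep_recent
            history = older * summary_tokens + keep_recent * turn_tokens
        total += system_tokens + history
    return total


full = loop_input_tokens(20, system_tokens=2_500, turn_tokens=5_000)
pruned = loop_input_tokens(20, system_tokens=2_500, turn_tokens=5_000, keep_recent=3)
print(full, pruned, f"{1 - pruned / full:.0%}")  # 1000000 347200 65%
```

Under these assumptions the pruned loop sends about a third of the tokens. Whether an agent can actually work from 200-token summaries of older turns is the quality question the arithmetic cannot answer.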

This is why context engineering has quietly become the skill that separates production agent deployments from demo-quality ones. Architecture, not model capability. And it's a question your buyers' engineering teams will have to answer before any agent workflow survives its first quarterly review.

The demo looked great. They always do. Three turns, a clean example, a short system prompt. The production workload runs 20 turns on messy real-world documents with a system prompt that's grown to 4,000 tokens because every edge case someone encountered got appended to it.

Agent demos that look great rarely survive contact with a finance team. That's not a criticism of the technology. It's a description of where the real engineering work lives.

context engineering: The practice of actively managing what information the model carries forward across turns in an agent loop — summarizing, pruning, and restructuring context to control token accumulation without degrading output quality. The dominant cost-control discipline in production agent deployments.