Tokens, Context Windows, and Inference

By Leigh Garrity— May 8, 2026

The vocabulary behind every AI cost conversation — and why the word "token" is going to trip you up if you let it.

Three concepts sit underneath almost every serious AI budget conversation you're going to have in the next twelve months. They're not complicated. But if you walk in with the wrong definition of any of them — especially the first one — you'll be doing vocabulary repair work in front of a buyer who already has the right definition, and that's a bad place to be.

What They Are

Tokens are the atomic unit of text that a language model reads and writes. Not words. Not characters. Subword chunks, produced by a process called tokenization that breaks text into pieces the model can process numerically. In English, a token is roughly three-quarters of a word — "unbelievable" might tokenize as two or three pieces, "the" is almost always one, and a UUID or a piece of code might tokenize very inefficiently because the tokenizer was trained on natural language. The model sees nothing but tokens. It reads tokens in, runs computation, and produces tokens out. Every piece of text that touches the model goes through this process.

Context window is the total number of tokens the model can hold at once during a single inference call. Think of it as working memory with a hard ceiling. Everything the model has access to during that call — the system prompt, the conversation history, any documents you've fed it, tool outputs, the model's own prior responses — has to fit inside that window. Current frontier models commonly support 200,000 to 1,000,000 tokens, and that ceiling is still moving upward. But it is a ceiling. When you hit it, something gets dropped or the call fails, depending on how the application handles it.

Inference is the act of running the model. You send tokens in, the model runs, tokens come out. The model has no persistent state between calls — it doesn't "think" between your turns, doesn't remember the last conversation unless you explicitly include it in the next call's context. Each inference call is a complete, self-contained computation event. Every API call is an inference call. Every inference call costs money, and the cost is denominated in tokens.

“

• Token: The subword chunk a model reads and writes — roughly three-quarters of a word in English, the unit in which all model input and output is measured and priced. • Context window: The total token capacity available during a single inference call; everything the model can "see" at once, including history, documents, and tool outputs. • Inference: The computation event of running the model — tokens in, tokens out, billed accordingly, no persistent state between calls.

How It Works

A single enterprise API call — a user asks a question, the model answers — might consume 2,000 to 4,000 tokens. The user's prompt is a few hundred tokens. The system prompt (instructions, persona, guardrails) adds several hundred more. The model's response adds the rest. That's a simple, single-turn interaction.

Now add a document. A federal agency feeds a 50-page policy document into the context before asking the model to summarize it. That document alone might be 40,000 to 60,000 tokens. The model hasn't responded yet. You're already deep into the context window before the first output token appears.

The mechanism that makes this matter at scale is the one most buyers don't fully track: the context window is re-sent on every inference call. The model has no memory. If you want it to know what happened in turn 3 when it's responding in turn 7, you have to include turns 1 through 6 in the turn 7 call. The application layer handles this automatically, which means it's invisible to the user — and to the buyer reviewing the bill.

In a simple back-and-forth chat, this compounds slowly. In an agentic workflow, it compounds fast.

Take a procurement workflow agent: a user submits a contract review request. The agent calls a document retrieval tool (output: 8,000 tokens of contract text), a policy lookup tool (output: 3,000 tokens of applicable regulations), and a vendor history tool (output: 2,000 tokens of prior transaction data). The model reasons through the results, produces an intermediate summary, and asks a clarifying question. That's turn 2. Turn 3 now carries all of that — the original request, all three tool outputs, the model's reasoning, the clarifying question, and the user's answer. The user typed 15 words. The model is processing somewhere north of 15,000 tokens.

By turn 8 of a complex agent loop, the accumulated context can be ten times the size of the user's initial prompt. The initial prompt is not where the money is.

“

• Token accumulation: In multi-turn or agentic interactions, prior conversation turns and tool outputs are re-sent with every inference call, compounding token costs in ways that aren't visible to the end user but show up clearly in billing.

What This Means in a Budget Conversation

When a CAIO asks "what's this going to cost us at scale," the instinct is to think about volume: how many employees, how many queries per day. That math works for simple chat interfaces. It breaks down for agentic workflows, which is where most of the serious enterprise AI investment is actually going.

Agent cost scales with loop length multiplied by average context accumulation per turn — not queries per day. An agent that resolves a task in three turns with minimal tool use is a fundamentally different cost structure than one that runs eight turns with four tool calls each, even if the user experience looks similar from the outside.

This has a specific implication for public sector buyers, who are often building workflows around document-heavy processes — contract review, compliance checking, policy analysis. Those use cases are context-hungry by nature. A 200-page acquisition document doesn't get smaller because you're in a government environment. If anything, the regulatory context you have to include alongside it makes the token count larger.

The budget conversation worth having isn't "how much does a query cost?" It's "how long are your agent loops, and how much tool output are you accumulating per turn?" Those two numbers, more than anything else, determine what the bill looks like at scale.

Pricing structures vary by provider and change frequently enough that any specific number here would be stale within months. What's stable is the structure: input tokens and output tokens are priced separately, output tokens typically cost more, and the input token count for a long agent loop is dominated by accumulated context rather than user input. Bring that structure into the conversation.

“

• Cost concentration: In agentic workflows, token costs concentrate in accumulated context from prior turns and tool outputs — not in the user's initial prompt. Budget models that ignore loop length and tool output volume will underestimate spend significantly.

IDAM Concept Mapping

The OAuth Token Collision — and Where It Stops Working

The word "token" already means something to you, and that meaning is precise: an OAuth token is a credential. It represents an authorization grant. It has an issuer, an audience, a scope, an expiration time. It flows between systems as proof that a principal was authorized to do something. You count them, yes, but what you're really tracking is what they represent — who authorized what, to whom, for how long.

LLM tokens share the word and nothing else. They are not credentials. They don't represent anything. They have no issuer, no audience, no scope, no expiration. They are the text itself, chunked into pieces the model can process numerically. Counting LLM tokens is like counting words — it's a measure of volume, not a measure of authorization or identity. An OAuth token is a noun, a thing with semantics. An LLM token is closer to a unit of measurement, like a byte or a character, except the chunk size is variable and determined by a tokenizer trained on a specific corpus.

The collision matters because the mental model you carry for OAuth tokens will actively mislead you if you apply it here. When someone says "the agent consumed 80,000 tokens on that call," the right questions are: how did the context accumulate, what was the input/output split, and what's the per-token price? The wrong questions — who issued these tokens, what do they grant access to, are they still valid — don't have answers, because they're asking about the wrong thing entirely. The word is the same. The ontology is completely different. Don't let the word do work it can't do.

The synthesis piece at the end of this section pulls these concepts together with the others in the AI Foundations arc. The operational core here is narrower: tokens are how models measure text, context windows are how much text they can hold at once, and inference is the computation event that turns input tokens into output tokens. In agent loops, the accumulated context is where the real spend lives. And the OAuth token you've been working with for years is a completely different animal — same name, different species.

Know the difference before the meeting. You won't have time to work it out during.