Tokens, Context Windows, and Inference

By Leigh Garrity— May 9, 2026

Start here: the word "token" is going to cause you trouble

You already know what a token is. That's the problem.

In your world, a token is a credential artifact — an OAuth bearer token, a JWT, a refresh token. It carries identity claims, has a lifetime, gets validated against a key. When someone says "token" in an identity conversation, you know exactly what they mean and you know how to work with it.

In an AI conversation, "token" means something completely different. The concepts share a name and nothing else.

An LLM token is a chunk of text. Specifically, it's a sub-word unit that a language model uses to read and write language. The model doesn't see words; it sees tokens. "Unbelievable" might be three tokens: "un," "believ," "able." "The" is probably one. A space before a word is often part of the token that follows it. Rough rule of thumb: one token is about three-quarters of a word in English, which means 1,000 tokens is roughly 750 words, which means this article is somewhere around 2,000 tokens.

That's it. An LLM token is a text chunk. No identity claims, no cryptographic material, no expiration, no credential.

The word overlap is a genuine hazard. In a meeting where someone says "the model processes tokens," your IDAM-trained brain will fire an association that doesn't apply. Name the collision to yourself, set your existing mental model aside, and pick up the new one. They don't connect.

“

• Token (LLM): A sub-word chunk of text — roughly three-quarters of a word — that a language model uses as its basic unit of reading and writing. Not a credential, not an identity artifact, not related to OAuth in any way except the name.

The context window: working memory with a hard ceiling

Every time you run a language model, it operates on a sequence of tokens. That sequence is the context window — everything the model can see at once. Prior conversation turns, system instructions, documents you've fed it, tool outputs, its own previous responses. All of it lives in the context window, or it doesn't exist for the model.

Context windows are large now. Models commonly support 200,000 to 1,000,000 tokens, which at the three-quarters-of-a-word conversion rate means somewhere between 150,000 and 750,000 words. Enough to hold several novels, or a substantial chunk of an agency's policy documentation, or a long and complicated agent conversation with many tool calls.

The ceiling matters because it's absolute. When you hit it, something has to go. Either the model truncates older content, or the application manages what gets included, or the call fails. There's no graceful overflow. The model doesn't summarize what it can't fit; it simply can't see it.

The other thing that matters: the model retains nothing between inference calls. The context window isn't a persistent state that lives on a server somewhere. It's reconstructed from scratch every time. If you want the model to remember something from a prior turn, you have to include that prior turn in the next call's context. Long conversations get expensive for exactly this reason — each turn carries the full weight of everything that came before it, regardless of how much of it is still relevant.

“

• Context window: The total token sequence a model can process in a single inference call — now commonly 200K to 1M tokens. Everything the model knows about a conversation must fit inside it; nothing persists between calls unless explicitly included.

Inference: the compute event

Inference is the act of running the model. You send in a token sequence; the model produces a token sequence back. That's it.

Every inference call has two sides: the input tokens (everything you sent in — instructions, context, the user's message) and the output tokens (what the model generates back). Both sides cost compute. Both sides cost money. The pricing model for most commercial LLM APIs reflects this directly: you pay per input token and per output token, usually at different rates, with output tokens typically priced higher because generating text is more compute-intensive than reading it.

A simple question-and-answer exchange is cheap. A user asks something in 50 tokens, the system prompt adds another 500, the model responds in 300. You've spent roughly 850 tokens total. At current commercial rates, that's a fraction of a cent.

The math changes when agents enter the picture.

“

• Inference: A single model execution — tokens in, tokens out. Both sides are metered. The cost of a call scales with the total token count, not with the complexity of the question.

Where the money actually goes

An agent isn't a single inference call. It's a loop. The model reasons, calls a tool, gets a result, reasons again, calls another tool, gets another result, and so on until it produces a final answer. Each step in that loop is a separate inference call. And each inference call includes the full context: the original instructions, the user's request, every prior reasoning step, every tool call, every tool response.

Take a concrete example. An agent is helping a federal procurement officer research a vendor. Turn one: the user asks a question (50 tokens). The system prompt is 1,000 tokens. The model reasons and calls a web search tool. That's inference call one: roughly 1,100 tokens in, maybe 200 tokens out.

Turn two: the tool returns search results — 3,000 tokens of text. The model now processes the original context plus the search results plus its own prior reasoning. Inference call two: 4,300 tokens in, 400 tokens out.

Turn three: the model decides it needs to look up a specific document. The document retrieval returns 8,000 tokens. Inference call three: 12,700 tokens in, 500 tokens out.

By turn five, you're sending 30,000 tokens per call. By turn ten, you might be at 80,000. The user's original question — 50 tokens — is a rounding error. The accumulated context of tool outputs, reasoning traces, and prior responses is the actual cost driver.

Most AI pricing conversations miss this. Buyers see a per-token rate and imagine it applying to the user's message. The rate applies to everything in the context window, every time the model runs. In a long agent loop, that's a very different number.

For public sector deployments, the implication is direct. Agentic workflows that touch multiple data sources — document retrieval, database queries, API calls — accumulate context fast. A procurement automation workflow that seems simple from the outside might be running 15-20 inference calls per user request, each one carrying the full weight of every prior step. The cost model for that workflow looks nothing like the cost model for a chatbot.

Making it practical

When a CIO asks why their AI deployment costs more than projected, the answer is almost never "the model is expensive per token." It's usually "the agent is running more inference calls than expected, and each call is carrying more context than anyone modeled."

The follow-on question — "how do we control it?" — has real answers: context compression, selective retrieval instead of full document ingestion, caching repeated system prompts, limiting agent loop depth. Those are engineering decisions. But the seller's job in that conversation is to correctly identify the mechanism before the engineering team gets involved. Misdiagnosing token cost as a per-query problem leads to the wrong interventions.

The other conversation this surfaces in is procurement. Federal agencies buying AI capabilities through contracts increasingly encounter token-based pricing in API agreements. Understanding that input and output tokens are priced separately, that context accumulation multiplies both, and that agentic workflows have fundamentally different cost profiles than conversational interfaces — that's the difference between a contract that works and one that blows its budget in month two.

“

Okta Concept Mapping: Context Window ≠ Session

Where the analogy holds: A context window functions like a session in one important way — it's the bounded scope of what a system knows about a current interaction. Just as a session carries state for an authenticated user's current activity, the context window carries everything the model knows about the current conversation.

Where it breaks: A session is stateful and server-side. The server holds it; the client references it. A context window is stateless and reconstructed. There is no server-side memory of prior turns. Every inference call starts from zero and requires the application to reassemble the full context from scratch. The "memory" of a conversation exists only in the token sequence the application sends. If the application doesn't include a prior turn, the model has no access to it — not because it forgot, but because it never had it. This matters for how you think about agent architecture: the application layer, not the model, owns conversation state.

The three concepts as one mechanism

Tokens are the unit. The context window is the container. Inference is the event that consumes both and produces output.

The cost structure follows directly: you pay per inference event, scaled by how many tokens went in and how many came out. In a simple interaction, that's small. In an agent loop, the context window fills with accumulated evidence of everything the agent has done, and you pay to send that full context with every step.

The user's question is almost never the expensive part. Everything the agent learned on the way to answering it is.

“

• Inference cost: Priced per input token and per output token, billed at each model execution. In agent loops, the dominant cost is accumulated context — prior turns, tool outputs, reasoning traces — not the user's original request.