Tokens and the Context Window

By Leigh Garrity— May 6, 2026

What a Token Is

A token is a sub-word unit of text. In English, one token is roughly three to four characters. "Authorization" is four tokens. "SAML" is probably two. Whitespace, punctuation, and line breaks all count. The model never sees raw characters or whole words — it sees a numerical sequence of token IDs, each one corresponding to a chunk of text that its vocabulary was trained to recognize.

The tokenization happens before the model does anything. You submit text; a tokenizer converts it to a sequence of integers; the model processes those integers. The output comes back as integers, and a detokenizer converts them back to text. The model itself only ever sees numbers.

Languages with denser scripts — Arabic, Japanese, Korean — tokenize less efficiently than English. The same semantic content costs more tokens, which means it costs more money and consumes more of the model's capacity. For federal agencies with multilingual document requirements, this isn't a footnote.

What the Context Window Is

The context window is the total number of tokens a model can process in a single call — input and output combined. It's fixed by the model architecture. You cannot expand it at runtime. Think of it as the model's entire working memory for that one interaction: everything it knows, everything it's responding to, and everything it generates has to fit inside that boundary.

When a call exceeds the context limit, the model doesn't gracefully degrade. Depending on the API implementation, it either returns an error or silently truncates the input, usually from the beginning, which is often where your instructions live. Both outcomes cause problems.

Context window sizes vary widely across model tiers, and they've been growing. Some current models support windows of one million tokens or more. Smaller, faster models tend to have windows in the tens of thousands of tokens; frontier models have pushed into the hundreds of thousands. These numbers shift with each model release, so any specific figure is a snapshot.

How They Work Together

A typical enterprise API call — a user query with some system instructions — might consume two to four thousand tokens. Well within any modern context window, and the cost is negligible.

Document ingestion is where the math changes. A 500-page policy document runs roughly 125,000 words, which is somewhere around 165,000 tokens before you've added instructions or received a response. That fits inside a large-context model's window, but it's now the dominant cost driver for the call. At enterprise API pricing — which runs roughly in the range of a few dollars to fifteen dollars per million input tokens, depending on the model tier — a single document ingestion call can cost a dollar or more. Run that pipeline at scale across a federal document repository and the arithmetic becomes a budget conversation.

Output tokens typically cost more than input tokens, sometimes two to four times more. A model that generates a long summary or a detailed analysis is more expensive per call than one that returns a short answer. This asymmetry matters when you're estimating costs for an agency use case.

Long context windows exist, but they're not free and they're not uniformly reliable. Quality degrades for content that appears early in a very long context — the model's attention is less consistent at the edges. "Just use a bigger window" is a real option, but cost scales linearly with it.

IDAM Concept Mapping

The context window behaves like a stateless session. Everything the model knows about the interaction has to be present in the current call — there's no persistent state between API requests, just as a resource server in an OAuth flow holds no memory of previous requests once the token is validated. Your OAuth intuition gets you most of the way there. Where it breaks: the model can't "remember" a prior exchange the way a session cookie carries state across requests. If you want the model to know what was said two calls ago, you have to re-inject that content into the current context window yourself. Note also that "token" in this domain has nothing to do with access tokens, bearer tokens, or any identity artifact — the word is doing completely different work here, which is worth saying out loud before it creates confusion in a meeting.

What This Means in Practice

When a CIO asks whether a model can "read" a large document, that's a context window question. When procurement asks what the API will cost to run at scale, that's a token question. Both concerns live inside ML terminology, but they're architecture and budget questions at their core.

The practical ceiling to communicate: a model with a one-million-token window can technically ingest a very large document set in a single call. Whether it should is a different question, involving cost, latency, and quality at the edges of the window. The honest answer to "can it handle our document corpus?" is usually "yes, but here's what that costs and here's where the quality assumptions start to soften."

That's the answer that builds credibility in the room.