Tokens and the Context Window

By Leigh Garrity, May 9, 2026

The unit that runs the LLM economy — and why it has nothing to do with OAuth.

Every LLM vendor prices by token. Every benchmark measures in tokens. Every model has a context limit expressed in tokens. If you're going to have a credible conversation about AI infrastructure — with a CAIO, a procurement officer, a skeptical architect — you need to know what a token actually is, what a context window actually is, and exactly where your IDAM vocabulary will help you versus quietly set you up for an embarrassing moment.

Start with the token.

What a Token Is, Precisely

A token is a chunk of text. A sub-word unit, produced by a tokenizer algorithm that runs before the model sees any input at all. In English, a token is roughly four characters on average, which works out to about 750 words per 1,000 tokens. The word "unbelievable" might tokenize as three pieces: "un," "believ," "able." The word "cat" is probably one. A space before a word often gets bundled into the token that follows it.

The specific algorithm most production models use is called Byte Pair Encoding (BPE). The short version: BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pairs until it reaches a target vocabulary size, commonly 50,000 to 100,000 tokens, with some newer tokenizers going larger. The result is a vocabulary that handles common English words as single tokens and breaks down rare words, technical terms, or non-English text into smaller pieces. Code tokenizes differently than prose. Legal boilerplate tokenizes differently than conversational text. A document heavy in domain-specific terminology will use more tokens per word than the same document in plain English.
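
To make the merge loop concrete, here is a minimal toy sketch of BPE training and tokenization. It is an illustration under simplifying assumptions (character-level start, whole words as the corpus, no byte handling or special tokens), not how any particular vendor's tokenizer is implemented. The function names `learn_bpe_merges`, `merge_pair`, and `tokenize` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy training)."""
    # Each word starts as a tuple of single characters; Counter tracks frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply learned merges, in the order they were learned, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Trained on a corpus where "ab" is frequent, the sketch keeps "ab" as a single piece and falls back to individual characters for text it has never seen, which is the same reason rare terminology costs more tokens per word than plain English.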

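This matters for estimation: the 750-words-per-1,000-tokens figure is an average for plain English, so dense technical, legal, or non-English material will run higher.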

The context window is the total token capacity for a single model call. It is the sum of everything the model processes at once: the system prompt (instructions you give the model about how to behave), the user's input, any documents or data you've included, and the model's response. All of it counts against the same budget. When you hit the limit, the model cannot process what's outside it — the input gets truncated, or the call fails, depending on how the application handles the overflow.
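
The shared budget can be sketched as a simple sum. This is an illustrative model only: the `CallBudget` class and its field names are hypothetical, and real applications would get the counts from the vendor's tokenizer rather than hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class CallBudget:
    """Token counts for the parts of one model call (all values are estimates)."""
    system_prompt: int
    user_input: int
    documents: int
    max_output: int  # tokens reserved for the model's response

    def total(self) -> int:
        # Everything counts against the same context window.
        return self.system_prompt + self.user_input + self.documents + self.max_output


def fits(budget: CallBudget, context_window: int) -> bool:
    """True if the whole call, response included, fits in the window."""
    return budget.total() <= context_window


def overflow(budget: CallBudget, context_window: int) -> int:
    """Number of tokens over the limit (0 if the call fits)."""
    return max(0, budget.total() - context_window)
```

For example, a call with a 500-token system prompt, 200 tokens of user input, 6,000 tokens of documents, and 2,000 tokens reserved for output totals 8,700 tokens, which overflows an 8,000-token window by 700 and fits easily in a larger one.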

Current production models range from roughly 8,000 tokens on the low end to over one million tokens in models explicitly designed for long-context work. That million-token figure sounds like it solves every problem. It doesn't, and the reasons are worth understanding.

• Token: A sub-word chunk of text, roughly four characters in English, produced by a tokenizer before the model processes any input. The atomic unit by which LLMs ingest text and generate output — and the unit vendors use for pricing and benchmarking.

• Context window: The total token capacity of a single model call, encompassing system prompt, user input, any included documents, and the model's response. The hard ceiling on what a model can "see" at once.

How the Numbers Actually Accumulate

A typical enterprise API call — a user asks a question, the model answers — might consume 2,000 to 4,000 tokens. That's a few paragraphs of context plus a substantive response. Manageable, cheap, well within any modern model's window.

A document ingestion pipeline is a different animal. If you're passing a 50-page policy document to a model for summarization or analysis, you're looking at 25,000 to 40,000 tokens before the model generates a single word of output. A full contract review might run 80,000 tokens. A pipeline that processes a batch of agency reports overnight can burn through millions of tokens before anyone checks the logs. The numbers scale fast, and they scale in a direction that matters for budget conversations.

Vendors price by the token because the token is the atomic unit of computation. The model processes tokens; pricing is typically expressed per million tokens, split between input tokens (what you send) and output tokens (what the model generates). Output tokens generally cost more, since generating text is computationally heavier than reading it. As a rough anchor: at mid-2026 pricing for mainstream production models, you're looking at somewhere between $1 and $15 per million input tokens depending on the model tier, with output running two to five times that. These numbers move; treat them as order-of-magnitude anchors, not budget line items.
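
The per-million pricing model reduces to a few lines of arithmetic. The sketch below assumes a flat rate per side with no caching, batching, or tiered discounts; the function name `call_cost` and the example rates are placeholders, not any vendor's actual price list.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost in dollars for one call, with input and output priced separately.

    Rates are dollars per million tokens (assumed flat; real price sheets vary).
    """
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000


# Illustrative rates only: $5 per million input tokens, $15 per million output.
typical_call = call_cost(3_000, 1_000, 5.0, 15.0)
```

Note that in this example the 1,000 output tokens cost as much as the 3,000 input tokens, which is why output-heavy workloads deserve their own line in the estimate.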

Now, the quality ceiling. Long-context models exist and the window sizes are real. But two things happen as you approach the limit.

First, cost scales linearly with tokens. A call that uses 500,000 tokens costs roughly 100 times what a 5,000-token call costs, with no volume discount on the mechanism itself. For a procurement conversation, context window size is not just a capability spec — it's a cost driver, and a significant one for any use case involving large documents.

Second, quality degrades in ways that don't show up in the spec sheet. Research from Stanford published in 2023 (Liu et al., "Lost in the Middle") documented a consistent pattern: models perform significantly worse on information that appears in the middle of a long context compared to information at the beginning or end. The model's attention isn't uniformly distributed across a million tokens. Relevant content buried in the middle of a large document may effectively be invisible to the model even though it's technically "in context." Vendors have made progress on this — it's an active research area — but the pattern hasn't been fully solved, and any honest capability conversation should acknowledge it.

The practical implication: a larger context window is a capability, not a quality guarantee. Stuffing 800,000 tokens into a model that supports 1,000,000 tokens does not produce the same quality of reasoning as a well-structured 20,000-token call.

• Token accumulation: Input tokens (system prompt + user message + documents) plus output tokens (model response) make up the total for a call, and each side is billed at its own rate. Document-heavy use cases can reach 100,000+ input tokens per call before the model generates a response.

• Long-context ceiling: Quality degrades for information in the middle of very long contexts, and cost scales linearly with token count. Window size is a capability spec; it is not a quality guarantee.

Making This Practical

Here's the scenario: a CAIO at a civilian agency asks you whether a particular model can handle their contract review workflow. The agency processes about 200 contracts per month, each averaging 40 pages.

The estimation runs quickly once you have the token math. Forty pages of dense legal text runs roughly 500 words per page, so 20,000 words per contract. At 750 words per 1,000 tokens, that's about 27,000 tokens per document. Add a system prompt (call it 500 tokens) and a structured output request (another 500 tokens), and you're at roughly 28,000 tokens of input per contract. Output — a summary and flagged clauses — might run 2,000 tokens. Total: approximately 30,000 tokens per call.

At $5 per million input tokens and $15 per million output tokens (illustrative mid-tier pricing), that's roughly $0.14 for input plus $0.03 for output, or about $0.17 per contract in API costs. Two hundred contracts per month: about $34. The cost conversation is easy. The context window conversation is also easy: 30,000 tokens fits comfortably inside any modern production model.

Change the scenario slightly: the agency wants to analyze entire procurement packages, which can run 300 pages. Now you're at 200,000+ tokens per call. Still within a long-context model's window, but you're in the range where the "lost in the middle" problem becomes real, and you're paying roughly 7x more per document. The right answer might be a chunking strategy — breaking the document into sections and running multiple smaller calls — rather than a single massive context call. That's a different architecture conversation, and understanding why the tradeoff exists is what makes it credible.
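
One way to picture the chunking alternative is a greedy packer that groups document sections under a per-call token budget. This is a minimal sketch, assuming the document is already split into sections and using the rough 0.75-words-per-token rule instead of a real tokenizer; `estimate_tokens` and `chunk_sections` are hypothetical names, and production pipelines usually add overlap between chunks and a step that merges the per-chunk results.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from word count (about 0.75 words per token)."""
    return round(len(text.split()) / 0.75)


def chunk_sections(sections, max_tokens_per_chunk):
    """Greedily pack consecutive sections into chunks under a token budget.

    A single section larger than the budget still becomes its own chunk;
    this sketch does not split inside a section.
    """
    chunks, current, current_tokens = [], [], 0
    for section in sections:
        n = estimate_tokens(section)
        # Start a new chunk if adding this section would exceed the budget.
        if current and current_tokens + n > max_tokens_per_chunk:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes its own smaller call, which keeps the relevant text near the edges of a short context instead of buried in the middle of a long one.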

The estimation framework: words ÷ 0.75 = approximate tokens. Add system prompt overhead (typically 500–2,000 tokens for well-structured prompts). Multiply by your vendor's per-token rate. Check against the model's context limit. If you're above 50% of the window, ask whether a chunking approach would produce better results at lower cost.
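
The framework can be sketched as a single function. Everything here is an assumption for illustration: the name `estimate_call`, the `CallEstimate` fields, the 50% threshold default, and the rates and window sizes passed in. Rounding up the word-to-token conversion is a conservative choice of this sketch, not part of the rule of thumb.

```python
import math
from dataclasses import dataclass

WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio for English prose


@dataclass
class CallEstimate:
    input_tokens: int
    total_tokens: int
    cost_usd: float
    utilization: float       # fraction of the context window used
    fits: bool               # does the whole call fit in the window?
    consider_chunking: bool  # above the utilization threshold?


def estimate_call(word_count, overhead_tokens, output_tokens, context_window,
                  input_rate_per_m, output_rate_per_m, chunking_threshold=0.5):
    """Estimate tokens, cost, and window utilization for one model call."""
    # words / 0.75, rounded up, plus system prompt and instruction overhead
    input_tokens = math.ceil(word_count / WORDS_PER_TOKEN) + overhead_tokens
    total_tokens = input_tokens + output_tokens
    # Input and output are priced separately, per million tokens.
    cost_usd = (input_tokens * input_rate_per_m
                + output_tokens * output_rate_per_m) / 1_000_000
    utilization = total_tokens / context_window
    return CallEstimate(
        input_tokens=input_tokens,
        total_tokens=total_tokens,
        cost_usd=cost_usd,
        utilization=utilization,
        fits=total_tokens <= context_window,
        consider_chunking=utilization > chunking_threshold,
    )


# The 40-page contract: 20,000 words, 1,000 tokens of overhead, 2,000 of output,
# checked against an assumed 200,000-token window at illustrative $5/$15 rates.
contract = estimate_call(20_000, 1_000, 2_000, 200_000, 5.0, 15.0)
```

Running the same function on a 150,000-word procurement package against an assumed 300,000-token window puts utilization well above the 50% mark, which is the signal to ask the chunking question.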

• Estimation: Divide word count by 0.75 for a token approximation. Add system prompt overhead. Compare against the model's context window. Above 50% utilization, chunking strategies often outperform single large-context calls on both quality and cost.

IDAM Concept Mapping

Where your OAuth intuition helps — and exactly where it stops

The word "token" appears in both OAuth and LLM contexts, and the overlap ends there. An OAuth token is a security artifact: a bearer credential that represents a grant of authorization, issued by an authorization server, scoped to specific permissions, with an expiry and (in the JWT case) a verifiable signature. It is a security object. It participates in trust relationships.

An LLM token is a chunk of text. It has no issuer, no subject, no scopes, no expiry. It is not a credential. It does not represent a permission. It is a fragment of language that a tokenizer algorithm produced by splitting input text at statistically determined boundaries. The only things an OAuth token and an LLM token share are the word "token" and the property of being discrete units that a system processes.
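
The difference shows up clearly as data structures. The sketch below is illustrative only: the claim values, the three-entry vocabulary, and the names `AccessTokenClaims` and `decode` are made up for the example, and the "un / believ / able" split is the hypothetical one used earlier in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessTokenClaims:
    """Typical claims carried by a JWT-format OAuth access token."""
    iss: str    # issuer: the authorization server that minted the token
    sub: str    # subject: who the grant is for
    scope: str  # space-delimited permissions
    exp: int    # expiry, as seconds since the Unix epoch


# Hypothetical values, for illustration only.
oauth_token = AccessTokenClaims(
    iss="https://auth.example.gov",
    sub="user-1234",
    scope="contracts.read",
    exp=1_800_000_000,
)

# An LLM token is just an integer index into a tokenizer's vocabulary.
toy_vocab = {0: "un", 1: "believ", 2: "able"}  # toy vocabulary, not a real one
llm_tokens = [0, 1, 2]


def decode(token_ids, vocab):
    """Turn token ids back into text by concatenating vocabulary entries."""
    return "".join(vocab[i] for i in token_ids)
```

The first object answers "who granted what to whom, and until when." The second is a list of integers that decodes back to the word "unbelievable" and carries no trust semantics at all.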

In a buyer conversation, conflating the two is a recoverable error — but it signals that you're mapping vocabulary without checking whether the concepts align. If a CAIO uses "token" in an LLM context and you respond with something that only makes sense in an OAuth context (asking about token expiry, for instance, or token validation), the confusion will surface. The recovery is simple: "I want to make sure we're using 'token' the same way — in the LLM context, we're talking about sub-word text chunks, not bearer credentials. Different concept, same word." That sentence demonstrates exactly the kind of precision that builds trust with a technical buyer.

The reason vendors settled on the token as the unit of account is that it's the unit the model actually processes. The token is where the computation happens, which makes it the natural unit for pricing, benchmarking, and capacity planning. Once you have that, the context window is just the budget: how many of those units can this model hold in its head at once, and what does it cost to fill it?

The math is not complicated. The vocabulary collision with OAuth is the only real trap, and now you know where it is.