Token: Not the One You Know

By Leigh Garrity, May 8, 2026

In identity, a token carries claims. It has a subject, an issuer, a lifetime, a signature. You validate it, inspect it, revoke it. When someone says "token" in an IAM context, you're already thinking about bearer credentials, JWTs, the difference between access and refresh.

In AI, a token is a chunk of text. No claims, no signature, no lifetime. It's the unit the model reads in and writes out — a fragment of a word, a punctuation mark, a short common word that gets its own slot. The vocabulary is the same. The concept is completely different, and the collision will cost you credibility in the wrong room if you don't resolve it before you walk in.

What a Token Actually Is

Language models don't read words. They read tokens, which are sub-word units produced by a process called tokenization. In English, a token is roughly four characters on average — which means common short words like "the" or "and" are single tokens, while longer or rarer words get split. "Procurement" might be two tokens. "Cybersecurity" might be three. Whitespace, punctuation, and capitalization all affect the split.

A tokenizer is specific to the model family it was built for. It is trained on a large text corpus before the model itself is trained, then frozen, so the model only ever sees text through that tokenizer's splits. OpenAI publishes its tokenizers through an open-source library called tiktoken; Anthropic's Claude uses a different tokenizer. The splits aren't arbitrary: they're learned from the corpus so that frequent patterns get their own token and rare combinations get broken apart, which keeps the number of tokens needed to represent typical text low.
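To make "learned from the corpus" concrete, here is a toy sketch of byte-pair encoding (BPE), the merge-the-most-frequent-pair idea that many model tokenizers build on. It is not any vendor's actual tokenizer: production tokenizers work on bytes, pre-split text with patterns, and learn tens of thousands of merges. The corpus and merge count below are hypothetical, chosen only to show a frequent word collapsing into one token while a rare word stays in pieces.

```python
from collections import Counter


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE-style merges from a toy corpus.

    Each word starts as a sequence of single characters. On every round, the
    most frequent adjacent pair across the corpus is fused into one symbol.
    """
    # Each distinct word is a tuple of symbols, weighted by how often it occurs.
    vocab = Counter(tuple(word) for word in words)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the winning pair.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


def tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word by applying the learned merges in the order they were learned."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical toy corpus: "the" is common, "procurement" is rare.
corpus = ["the"] * 50 + ["then"] * 10 + ["procurement"] * 2
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
print(tokenize("then", merges))
print(tokenize("procurement", merges))
```

With only three merges, "then" becomes a single token and "procurement" stays as eleven single characters. A real tokenizer learns far more merges, so a word like "procurement" ends up as a handful of pieces rather than eleven, but the frequency-driven logic is the same.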

Why does this matter? Because the token is the vendor's unit of account. Every pricing page, every benchmark, every rate limit is expressed in tokens. When a vendor says a model has a 128,000-token context window, they mean 128,000 of these chunks — not 128,000 words, not 128,000 characters. When they charge $10 per million input tokens, they mean per million chunks. The model doesn't know what a word is. It only knows tokens.

A rough conversion that holds well enough for estimation: one token is approximately three-quarters of a word in English. Or, working the other direction, 100 words is roughly 130 tokens. Dense technical text (numbered sections, acronyms, table entries, legal boilerplate) tokenizes less efficiently than prose, producing more tokens per word. Federal documents are not prose.
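The rule of thumb above can be written as two small conversion functions. This is a sketch, assuming the 0.75-words-per-token figure for English prose; the `density_factor` parameter is a hypothetical knob for padding the estimate on dense documents, not a published constant. For exact counts, run the specific model's tokenizer instead.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose


def words_to_tokens(words: int, density_factor: float = 1.0) -> int:
    """Estimate token count from word count.

    density_factor > 1.0 inflates the estimate for text that tokenizes less
    efficiently than prose (acronyms, section numbers, tables). The factor is
    a hypothetical adjustment, not a vendor-published value.
    """
    return round(words / WORDS_PER_TOKEN * density_factor)


def tokens_to_words(tokens: int) -> int:
    """Estimate how many words of prose fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)


print(words_to_tokens(100))         # prose estimate
print(words_to_tokens(100, 1.2))    # same text, assumed 20% denser
print(tokens_to_words(128_000))     # prose words in a 128K window
```

`words_to_tokens(100)` returns 133, which the rule of thumb rounds to "roughly 130." Going the other way, a 128,000-token window holds about 96,000 words of ordinary prose, and fewer of dense federal text.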

• Token: The sub-word chunk a language model reads and generates, roughly four characters in English. The unit vendors use for pricing, benchmarking, and rate limits — not the bearer credential you're used to.

The Context Window Is a Hard Ceiling

Every call to a language model happens inside a context window. Think of it as the total working memory available for a single model interaction — everything the model can "see" at once. The document you send in, the instructions you give it, the conversation history, the response it writes back: all of it counts against the same ceiling, measured in tokens.

Modern context windows are large. As of this writing, several commercially available models support 128,000 tokens or more; some support 1 million or beyond. That sounds like a lot.

A 200-page federal procurement document (an RFP, a FISMA package, a system security plan) runs roughly 250 to 350 words per page, depending on how dense the formatting is. Call it 300 words per page, 200 pages: 60,000 words. At the 0.75-words-per-token conversion (about 1.33 tokens per word), that's approximately 80,000 tokens for the document alone. Add the instructions you give the model, and you're looking at 85,000 to 90,000 tokens before the model has written a single word of output.
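The page-count estimate works out like this. It is a sketch using the same prose rule of thumb; the 5,000 to 10,000 tokens of instruction overhead is an assumption chosen to match the 85,000 to 90,000 total above, not a measured figure.

```python
def estimate_document_tokens(pages: int, words_per_page: int,
                             words_per_token: float = 0.75) -> int:
    """Rough token estimate for a paginated document (prose rule of thumb)."""
    words = pages * words_per_page
    return round(words / words_per_token)


doc_tokens = estimate_document_tokens(pages=200, words_per_page=300)
print(f"document: {doc_tokens:,} tokens")

# Assumed instruction overhead, matching the 85K-90K total in the text.
for instruction_tokens in (5_000, 10_000):
    print(f"with {instruction_tokens:,} instruction tokens: "
          f"{doc_tokens + instruction_tokens:,}")

# The 250-350 words-per-page spread alone moves the estimate substantially.
low = estimate_document_tokens(200, 250)
high = estimate_document_tokens(200, 350)
print(f"range: {low:,} to {high:,} tokens")
```

The last two lines are the useful part: page density alone swings the estimate from about 67,000 to about 93,000 tokens, before any adjustment for acronyms and tables.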

That fits inside a 128,000-token window, but with less margin than the headline number suggests: dense formatting can push the real count well above the prose estimate, and conversation history and output draw on the same budget. It fits comfortably inside a 200,000-token window. It does not fit inside a 32,000-token window, which some older or smaller models still use. Before a buyer says "we'll just feed the model the whole document," someone needs to know the document's token count and the model's context limit. Those are two different numbers, and the gap between them is where projects stall.
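Comparing those two numbers is a one-line check once you remember that the output has to fit too. A minimal sketch, assuming a 90,000-token prompt and a hypothetical 4,000-token budget reserved for the response; real models may also cap output length separately from the window.

```python
def fits(prompt_tokens: int, context_window: int, reserved_output: int) -> bool:
    """True if the prompt plus the reserved output budget stays within the window."""
    return prompt_tokens + reserved_output <= context_window


prompt = 90_000          # document plus instructions, upper estimate
reserved_output = 4_000  # hypothetical output budget for a summary

for window in (32_000, 128_000, 200_000):
    headroom = window - prompt - reserved_output
    print(f"{window:>7,} tokens: fits={fits(prompt, window, reserved_output)}, "
          f"headroom={headroom:,}")
```

A negative headroom means the call has to be redesigned (split the document, retrieve only relevant sections, or switch models), not retried.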

The cost math is straightforward once you have the token count. If a model charges $10 per million input tokens, a single 80,000-token document costs $0.80 to send in. Run that same call a hundred times (iterating on prompts, testing outputs, running it for different users) and you're at $80. Run it across a hundred documents and each pass over the set costs $80; a hundred prompt iterations across that set is $8,000. These numbers are not alarming in isolation. They become alarming when a procurement team has assumed the cost is "basically free" because the demo was cheap.

Output tokens typically cost more than input tokens — often two to five times more, depending on the model. A model that charges $10 per million input tokens might charge $30 per million output tokens. A 2,000-word summary of that procurement document is roughly 2,600 output tokens, which at $30/million is about $0.08. Still not alarming per call. The math that matters is per-workflow, per-month, at scale.
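Putting input and output prices together gives a per-call figure you can scale. This sketch uses the article's illustrative prices ($10 and $30 per million tokens), which are assumptions, not any specific vendor's rate card.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one model call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Illustrative prices from the text, not a real vendor's pricing.
IN_PRICE, OUT_PRICE = 10.00, 30.00

per_call = call_cost(80_000, 2_600, IN_PRICE, OUT_PRICE)  # input $0.80 + output $0.078
print(f"one call:              ${per_call:.2f}")
print(f"100 calls:             ${per_call * 100:,.2f}")
print(f"100 docs x 100 runs:   ${per_call * 100 * 100:,.2f}")
```

Per call, output adds less than ten percent here because the summary is short relative to the document. Workflows that generate long outputs (drafting, rewriting, extraction into tables) shift that balance quickly.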

• Context window: The total token capacity of a single model call — a hard ceiling that counts everything: the document in, the instructions, the conversation history, and the response out. Exceeding it isn't an error you can retry; it's a design constraint you have to plan around.


IDAM Concept Mapping

Closest analog: the session. A context window behaves like a bounded, stateful container for a single interaction — it holds everything relevant to that exchange, and when the interaction ends, it's gone. A new call starts fresh, the same way a new session starts with no inherited state.

Where the analogy holds: Both are bounded, both are scoped to a single interaction, and both require you to think carefully about what you put in them.

Where it breaks, and this is the part that matters: Sessions don't have a hard capacity ceiling measured in billable units. You don't pay per claim in a JWT. And when a session hits a limit, you get a protocol error with a clear remediation path. When a context window fills up, one of two things happens: the application or API layer truncates what you sent, often silently dropping the oldest content so the model never sees it, or the API rejects the request and you have to decide what to cut. There's no 401 equivalent. There's no refresh token. You redesign the call.
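Silent truncation is easy to underestimate until you see what a common strategy actually drops. The sketch below keeps the most recent conversation turns that fit a token budget and discards the rest, one application-side approach among several; the turn labels and token counts are hypothetical.

```python
def fit_history(turns: list[tuple[str, int]], budget: int) -> list[tuple[str, int]]:
    """Keep the most recent turns whose token counts fit within the budget.

    turns: (label, token_count) pairs, oldest first. Walking backward from the
    newest turn, stop at the first turn that would exceed the budget; everything
    older is dropped. The model only sees what survives.
    """
    kept: list[tuple[str, int]] = []
    used = 0
    for label, tokens in reversed(turns):
        if used + tokens > budget:
            break
        kept.append((label, tokens))
        used += tokens
    return list(reversed(kept))


# Hypothetical conversation about an uploaded contract.
history = [("system instructions", 2_000), ("contract upload", 80_000),
           ("question 1", 500), ("answer 1", 3_000), ("question 2", 500)]

print([label for label, _ in fit_history(history, budget=50_000)])
```

With a 50,000-token budget, both the contract and the system instructions disappear, and question 2 is answered without the document it asks about. Nothing in the response signals that this happened.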

The Quality Problem That Fits Inside the Ceiling

Fitting inside the context window and getting good results from it are not the same problem. Vendor demos rarely surface the difference.

Research on long-context model performance, including work from the teams that build these models, consistently shows that models pay less attention to content in the middle of a long context than to content at the beginning and end. The phenomenon has a name in the research community: "lost in the middle." Send in a 200-page document and ask a question whose answer is buried on page 94, and you are likely to get worse results than if that same answer were on page 2 or page 198. The model isn't reading the way a human analyst reads. It's attending, and attention isn't uniform.

This matters for the procurement document scenario because federal buyers often have the largest, densest documents and the most specific questions. "Does this RFP require FedRAMP High authorization?" The answer might be in section 4.3.2 of a 200-page document. Whether the model finds it reliably depends on where that section falls in the token sequence and how the model's attention distributes across 80,000 tokens. Vendors are improving this. It's not solved.
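One way to check this for your own documents is a positional probe: place a known fact at different depths in the same text, ask the same question each time, and compare the answers. This is a minimal sketch of building the probe documents only, assuming you send each variant to the model under test yourself (the model call is omitted). The filler sections and the "needle" sentence are hypothetical.

```python
def build_probe(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert a known fact (the "needle") at a relative depth in filler text.

    depth=0.0 puts it first, 0.5 in the middle, 1.0 last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_paragraphs))
    paragraphs = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(paragraphs)


# Hypothetical filler and needle.
filler = [f"Section {i}: standard terms and conditions apply." for i in range(1, 201)]
needle = "Requirement: the system must hold a FedRAMP High authorization."

for depth in (0.0, 0.5, 1.0):
    doc = build_probe(filler, needle, depth)
    # Send `doc` plus "Does this RFP require FedRAMP High?" to the model (not shown).
    print(f"depth {depth:.1f}: needle starts at character {doc.index(needle):,}")
```

If answers at depth 0.5 are less reliable than at 0.0 and 1.0, the document fits but the use case still needs a different design, such as retrieving the relevant sections before asking.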

The ceiling matters. So does what happens inside it.

• Lost in the middle: The documented tendency for model attention to concentrate at the start and end of a long context, with degraded recall for content in the middle. A real constraint on long-document use cases, independent of whether the document fits.

When You'll Need This

The conversation where this comes up isn't usually technical. It's the CIO who says "we want to use AI to analyze our contracts" or the program manager who says "can we just upload the whole FISMA package?" The question underneath both of those is: what are the actual constraints, and what will it cost?

You now have the answer. Count the document's tokens, or estimate them from page count and density. Compare that count to the model's context window. Multiply by the vendor's input token price. Multiply again by however many times the workflow runs that call. That's the cost floor, before output tokens, before the engineering work to build the pipeline, before the iteration to get the prompt right.
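The four steps above fit in one small function. This is a sketch, not a pricing tool: the `Workflow` fields and the 500-runs-per-month workload are hypothetical, and the result deliberately excludes output tokens and engineering cost, matching the "cost floor" framing.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    doc_tokens: int           # counted or estimated document size
    instruction_tokens: int   # prompt/instruction overhead
    context_window: int       # model's advertised limit
    input_price_per_m: float  # dollars per million input tokens
    runs_per_month: int       # how often the workflow makes this call


def monthly_input_floor(w: Workflow) -> float:
    """Monthly input-token cost floor; excludes output tokens and engineering."""
    prompt = w.doc_tokens + w.instruction_tokens          # step 1: count
    if prompt > w.context_window:                         # step 2: compare
        raise ValueError(f"prompt of {prompt:,} tokens exceeds the "
                         f"{w.context_window:,}-token window; redesign the call")
    per_call = prompt * w.input_price_per_m / 1_000_000   # step 3: price
    return per_call * w.runs_per_month                    # step 4: volume


# Hypothetical workload: the 200-page document, analyzed 500 times a month.
w = Workflow(doc_tokens=80_000, instruction_tokens=5_000, context_window=128_000,
             input_price_per_m=10.00, runs_per_month=500)
print(f"${monthly_input_floor(w):,.2f} per month before output tokens")
```

Raising an error when the prompt exceeds the window mirrors the point made earlier: an oversized call is a design problem, so the estimate should refuse to produce a number rather than quietly assume truncation.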

The number is usually smaller than the buyer fears and larger than the vendor implied. Saying that out loud, with the math visible, is how you become the person in the room who actually understands what they're buying.

Recap

• Token: The sub-word chunk a language model reads and generates, roughly four characters in English. The vendor's unit for pricing and benchmarking. Not the bearer credential.

• Context window: The total token capacity of a single model call — a hard ceiling covering everything in and out. Exceeding it requires redesigning the call, not retrying it.

• Lost in the middle: Model attention degrades for content in the center of a long context. Fitting inside the window is necessary; it doesn't guarantee quality.

• Cost math: (input tokens ÷ 1,000,000) × price per million input tokens, multiplied by call volume. Output tokens typically cost two to five times more than input. Estimate at the workflow level, not the demo level.