What a token actually is
A token is a sub-word chunk of text, typically around four characters of English (a rule of thumb that Google's Gemini API documentation also gives for its model family), produced by a tokenizer before the model processes anything. The model never sees your text as text. It sees a sequence of integer IDs, each mapped to one of these chunks.
The word "red" is one token. The word "authentication" is three or four, depending on which model's tokenizer you're using. A period is a token. A space before a word can change which token that word maps to. The model processes these integer sequences the way you process sentences, except there's no grammar, no word boundaries, and no meaning baked in at this stage. Just chunks and their IDs.
The working approximation: one token equals roughly four English characters, or about three-quarters of a word. A 750-word document runs approximately 1,000 tokens. This holds for average English prose. It breaks for code (which tokenizes at roughly 1.5–2x the rate of prose due to syntax and special characters), structured data, non-English languages, and dense technical terminology. All of these tokenize less efficiently because the tokenizer encounters character sequences it hasn't learned to compress.
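The two rules of thumb above can be written down as a quick estimator. This is a minimal sketch for average English prose only; the function name `estimate_tokens` is made up for illustration, and a real tokenizer will give different counts, especially on code or non-English text.

```python
def estimate_tokens(text: str) -> dict:
    """Rough token estimates for English prose; not a substitute for a real tokenizer."""
    chars = len(text)
    words = len(text.split())
    return {
        "by_chars": round(chars / 4),     # ~4 characters per token
        "by_words": round(words / 0.75),  # ~0.75 words per token
    }

# A stand-in for a 750-word document (hypothetical sample text).
sample = "word " * 750
est = estimate_tokens(sample)
print(est)
```

The two estimates rarely agree exactly, which is a useful reminder that both are approximations.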
How the splitting works
The dominant tokenization algorithm is Byte Pair Encoding (BPE), originally developed as a data compression method, later adapted for natural language processing, and used in a byte-level form by OpenAI for its GPT models. OpenAI publishes its tokenizer as an open-source library called tiktoken. Google's models use a related but distinct toolkit called SentencePiece, which operates directly on raw text including whitespace rather than pre-tokenized words. The core principle is the same across both: build a vocabulary of common text fragments by learning which sequences appear most frequently in training data.
BPE doesn't build its vocabulary from a dictionary of words. It starts from individual bytes and iteratively merges the most frequent adjacent pairs into single tokens, building up a vocabulary of common fragments. Short, common words often survive intact. Longer or rarer words get broken into pieces the algorithm encountered more frequently during training.
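The merge loop described above can be sketched in a few lines. This is a toy illustration, not how tiktoken or SentencePiece is implemented: production tokenizers train on enormous corpora and add steps such as pre-splitting text before merging. The function name `train_bpe` and the sample string are assumptions for the example.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> tuple[list[bytes], list[tuple[bytes, bytes]]]:
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]  # start from single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no further compression is possible
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            # Replace each occurrence of the chosen pair with one merged token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

text = "low lower lowest low low"
tokens, merges = train_bpe(text, num_merges=3)
print(merges)
print(tokens)
```

After a few merges the frequent fragment "low" becomes a single token while the rarer endings stay split into smaller pieces, which is the same pattern the paragraph above describes at vocabulary scale.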
The word "tokenizing," run through OpenAI's GPT-4 tokenizer (the cl100k_base vocabulary), splits into three pieces: "token," "iz," "ing" (this Galileo post walks through actual tiktoken library output). Each piece maps to an integer. The model processes those three integers. The word "authentication" splits along boundaries the algorithm learned from frequency patterns: common suffixes like "tion" and "ing" tend to become single tokens, so the split falls somewhere like "auth," "ent," "ication." The exact split depends on which tokenizer vocabulary you're using. OpenAI's current models use a vocabulary called o200k_base with roughly 200,000 entries. An older model's tokenizer may split the same word differently.
Sit with that for a second. The same sentence, sent to two different models, can produce different token counts. The text hasn't changed. The measurement system has. Anthropic illustrated this concretely: Claude Opus 4.7 shipped with a new tokenizer that can produce up to 35% more tokens than its predecessor for the same input text, with the upper end showing up most on code and structured data (Finout's analysis draws on Anthropic's own migration guide). Same words, same rate card, different bill. When providers quote context window sizes and per-token prices, those numbers are denominated in their tokenizer's units. There is no universal token.
Capitalization, leading spaces, and position within a string all affect how text gets split. "Red" at the start of a sentence and " red" after a space produce different token IDs. The model processes a sequence of chunks that encode both the text and some of its structural context.
Recap
- Token: A sub-word chunk of text, roughly four English characters, produced by a tokenizer algorithm before the model processes anything. The model sees integer IDs, not words.
- Tokenizer: The algorithm that splits text into tokens. Different models use different tokenizers, so identical text can produce different token counts and different costs across providers.
The term collision
You've noticed the problem already. The AI world and the identity world both use the word "token" constantly, and they mean completely different things by it.
In IDAM, a token is a security artifact: scoped, expirable, issued by an identity provider, carrying authorization. In LLM conversations, a token is a sub-word text chunk carrying no identity, no authorization, no scope, no expiration. It's a unit of measurement, like a byte. The only bridge is that both are units that systems count and constrain: an OAuth token has a lifetime; an LLM token contributes to a budget. That's where the analogy stops bearing weight. When someone in a mixed AI-and-identity conversation says "token budget," they mean how many text chunks the model can handle. Ask which kind. You'll save everyone ten minutes.
The context window
The context window is the total number of tokens a model can handle in a single call. Input and output combined, drawn from the same pool. The "combined" part is the detail people miss.
Each model publishes a maximum token limit covering everything: the prompt you send, any documents you attach, the system instructions that shape the model's behavior, and the response the model generates. If a model has a 200,000-token context window and you send 180,000 tokens of input, you've left 20,000 tokens for the response. Send 199,000 and you get 1,000 back. The budget is zero-sum.
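The zero-sum arithmetic is simple enough to check in code. A minimal sketch, using the 200,000-token window from the example above; the helper name `max_output_tokens` is hypothetical, and it ignores the separate output caps that real APIs often impose on top of the window.

```python
def max_output_tokens(context_window: int, input_tokens: int) -> int:
    """Tokens left for the response once the input has been counted."""
    if input_tokens > context_window:
        raise ValueError("input alone exceeds the context window")
    return context_window - input_tokens

print(max_output_tokens(200_000, 180_000))  # 20000
print(max_output_tokens(200_000, 199_000))  # 1000
```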
Current flagship models from the major providers offer context windows of 1 million tokens or more (Anthropic's API documentation confirms this for Claude Opus 4.7, Opus 4.6, and Sonnet 4.6). That's roughly 750,000 words, which sounds like plenty. Three things eat into it.
Quality degrades with volume. Practitioners consistently report that model performance on retrieval and reasoning tasks drops as context fills, particularly when relevant information sits in the middle of a long input. A 1M-token window means the model can accept that much text. The spec says "capacity," but in practice you get capacity with diminishing returns.
Cost scales, sometimes nonlinearly. More tokens in, more money out. Some providers add a surcharge above a threshold. OpenAI's GPT-5.4 doubles its input price above approximately 272,000 tokens, applied to the entire request. Anthropic's current Claude Opus and Sonnet models offer their full 1M-token window at standard pricing with no long-context surcharge, a deliberate change from earlier models that charged a premium above 200K tokens. Same capability label, materially different cost architecture.
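To see why a whole-request surcharge matters, here is a rough sketch of threshold pricing. It assumes the behavior described above (the multiplier applies to every input token once the threshold is crossed) and uses the GPT-5.4 input rate from the pricing table later in this piece. The function name and parameters are illustrative, not any provider's API.

```python
from typing import Optional

def input_cost(input_tokens: int, rate_per_million: float,
               threshold: Optional[int] = None, multiplier: float = 1.0) -> float:
    """Input cost in dollars, with an optional long-context surcharge."""
    rate = rate_per_million
    if threshold is not None and input_tokens > threshold:
        rate *= multiplier  # assumption: surcharge applies to the entire request
    return input_tokens / 1_000_000 * rate

# Just under and just over an assumed ~272,000-token threshold at $2.50 per 1M.
below = input_cost(270_000, 2.50, threshold=272_000, multiplier=2.0)
above = input_cost(275_000, 2.50, threshold=272_000, multiplier=2.0)
flat = input_cost(275_000, 2.50)  # no surcharge, for comparison
print(f"{below:.4f} {above:.4f} {flat:.4f}")
```

Under these assumptions, adding a few thousand tokens near the threshold roughly doubles the input bill, which is the cliff worth knowing about before sizing a document-heavy workload.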
Overhead eats capacity before you start. System prompts (the instructions shaping model behavior) run 500–2,000 tokens. Conversation history in multi-turn interactions accumulates. Retrieved documents consume budget. In agentic workflows where the model calls tools and processes results across multiple steps, a single session can burn 500K+ tokens (this estimate from iternal.ai's token projection guide reflects agentic coding benchmarks; complex debugging and self-correction loops can push even higher). The context window is the gross number. Your usable space is the net.
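The gross-versus-net distinction can be sketched as a small budget object. The system prompt figure is the upper end of the range quoted above; the history, retrieval, and reserved-output numbers are hypothetical placeholders, and `ContextBudget` is an illustrative name rather than anything a provider ships.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    window: int               # gross context window, in tokens
    system_prompt: int = 0
    history: int = 0          # accumulated conversation turns
    retrieved_docs: int = 0
    reserved_output: int = 0  # space held back for the response

    def overhead(self) -> int:
        return (self.system_prompt + self.history
                + self.retrieved_docs + self.reserved_output)

    def usable(self) -> int:
        """Net tokens left for new input after overhead."""
        return max(0, self.window - self.overhead())

budget = ContextBudget(window=1_000_000, system_prompt=2_000,
                       history=40_000, retrieved_docs=150_000,
                       reserved_output=8_000)
print(budget.overhead(), budget.usable())  # 200000 800000
```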
Recap
- Context window: The total token budget for a single model call, shared between input and output. Current flagships offer 1M+ tokens, but quality degradation, cost, and overhead constrain practical use below the theoretical maximum.
- Long-context premium: Some providers charge more above a token threshold; others include the full window at standard rates. This difference matters for cost estimation on document-heavy workloads.
What it costs
Token pricing is denominated per million tokens, with input and output priced separately. Output tokens cost more, roughly 5–6x the input rate for the models in the table below, largely because output is generated one token at a time while input can be processed in parallel. All numbers below are current as of May 2026 and will change. Verify at the provider's pricing page before using them in any customer-facing context.
A snapshot across two providers at standard tiers:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5.4 | $2.50 | $15.00 |
| GPT-5.4 Nano | $0.20 | $1.25 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
The math for a single call: (input tokens ÷ 1,000,000 × input rate) + (output tokens ÷ 1,000,000 × output rate).
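That formula translates directly into code. The sketch below applies it to every row of the table for one hypothetical call of 10,000 input tokens and 1,000 output tokens; the rates are copied from the table, and the call size is an assumption chosen only to make the comparison easy to read.

```python
# (input rate, output rate) in dollars per 1M tokens, from the table above.
RATES = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 Nano": (0.20, 1.25),
    "Claude Opus 4.7": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}

def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Cost of one call in dollars."""
    return (input_tokens / 1_000_000 * input_rate
            + output_tokens / 1_000_000 * output_rate)

costs = {model: call_cost(10_000, 1_000, *rates) for model, rates in RATES.items()}
for model, cost in costs.items():
    print(f"{model:<18} ${cost:.5f}")
```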
Picture the actual conversation. You're in a discovery call with a federal civilian agency. The CIO's technical advisor asks what it would actually cost to run AI-assisted document summarization across their policy library. A real budget question. You need to walk through the math in real time, or at least know the shape of the answer well enough to keep the conversation moving.
A standard government page runs 250–300 words. Ten pages of policy text tokenizes to roughly 3,500–4,500 tokens for clean prose, closer to 5,000–6,000 for dense technical or legal language where uncommon terms split into more sub-word pieces. Add a system prompt (~500 tokens) and the user's query (~200 tokens), call it 5,000 input tokens. The model returns a 450-word executive summary: about 600 output tokens.
On GPT-5.4 at standard pricing: $0.0125 input + $0.009 output = roughly two cents per document.
Five hundred documents a day: about $11. A 30-day month of that: around $320. Small numbers per call. They compound in production. And they shift meaningfully when you change models, hit long-context thresholds, or scale to agentic workflows where the model makes multiple calls per task. You want to be the person in the room who can sketch this on a whiteboard when the conversation stalls on "but what does this actually cost."
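The whiteboard version of that estimate, written out. This sketch uses the GPT-5.4 rates from the table and the token counts from the summarization scenario; the 30-day month is an assumption.

```python
INPUT_RATE = 2.50    # dollars per 1M input tokens (GPT-5.4, from the table)
OUTPUT_RATE = 15.00  # dollars per 1M output tokens

input_tokens = 5_000   # ten pages plus system prompt and query
output_tokens = 600    # a 450-word summary
docs_per_day = 500
days_per_month = 30    # assumption

per_doc = (input_tokens / 1_000_000 * INPUT_RATE
           + output_tokens / 1_000_000 * OUTPUT_RATE)
daily = per_doc * docs_per_day
monthly = daily * days_per_month

print(f"per document: ${per_doc:.4f}")  # $0.0215
print(f"per day:      ${daily:.2f}")    # $10.75
print(f"per month:    ${monthly:.2f}")  # $322.50
```

Changing any one input (model rates, summary length, document volume) and rerunning is usually enough to answer the follow-up question in the room.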
Two more details for those conversations. First, OpenAI charges a 10% uplift for regional processing endpoints, directly relevant to public sector buyers with data residency requirements. Second, the input-output price gap means the cost profile of a workload depends as much on what the model produces as on what it receives. Tasks that generate long outputs (drafting, code generation) cost disproportionately more than tasks with short outputs (classification, summarization). When a buyer asks "what does this cost to run," the shape of the output matters as much as the size of the input.
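Both points can be illustrated with one small helper. The sketch compares a short-output task and a long-output task on the same 5,000-token input at GPT-5.4 rates, then applies a 10% regional uplift. The output token counts are hypothetical, and applying the uplift to the whole bill is an assumption; check the provider's pricing page for how it is actually calculated.

```python
def cost_breakdown(input_tokens: int, output_tokens: int,
                   input_rate: float, output_rate: float,
                   regional_uplift: float = 0.0) -> tuple[float, float]:
    """Return (total cost in dollars, share of pre-uplift cost due to output)."""
    input_cost = input_tokens / 1_000_000 * input_rate
    output_cost = output_tokens / 1_000_000 * output_rate
    base = input_cost + output_cost
    # Assumption: the uplift multiplies the whole bill, input and output alike.
    return base * (1 + regional_uplift), output_cost / base

classify = cost_breakdown(5_000, 10, 2.50, 15.00)    # one-label answer
draft = cost_breakdown(5_000, 4_000, 2.50, 15.00)    # multi-page draft
draft_regional = cost_breakdown(5_000, 4_000, 2.50, 15.00, regional_uplift=0.10)

for name, (total, share) in [("classify", classify), ("draft", draft),
                             ("draft, regional", draft_regional)]:
    print(f"{name:<16} ${total:.5f}  output share {share:.0%}")
```

With identical input, the drafting call costs several times more than the classification call, and nearly all of that cost comes from output tokens.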
Recap
- Token pricing: Billed per million tokens, input and output priced separately. Output costs roughly 5–6x the input rate for the models listed. A typical document summarization call on a mid-tier model costs roughly two cents.
- Cost drivers: Model choice, context length thresholds, output length, and regional processing requirements all move the number. Small per-call costs compound at production scale.
The short version
Tokens are sub-word text chunks, about four English characters each, produced by a tokenizer before the model touches anything. The context window is the total token budget for one call: input plus output, drawn from the same pool. Current flagships offer 1M+ tokens of context, but cost, quality degradation, and overhead mean the usable number is always lower. Pricing runs per million tokens with output costing several times more than input.
The word "token" in an AI conversation has nothing to do with the word "token" in an identity conversation. One is a unit of text measurement. The other is a security artifact carrying authorization. The first time that distinction keeps a discovery call on track instead of sideways for ten minutes, you'll know why it mattered.
Things to follow up on...
- Tokenizers aren't interchangeable units: Anthropic's Opus 4.7 tokenizer can produce up to 35% more tokens for identical input text, meaning the same document costs more on the new model even at unchanged per-token rates.
- The four-character rule breaks: A Galileo analysis running tiktoken on mixed content found the "characters ÷ 4" heuristic can miss by 37% on text with emoji, code, or special characters.
- Long-context pricing diverges by provider: OpenAI doubles input pricing above ~272K tokens on GPT-5.4 while Anthropic eliminated surcharges entirely for their 4.6-generation models, a contrast worth understanding before estimating costs on document-heavy workloads.
- Agentic workflows burn context fast: Complex debugging and self-correction loops in agentic coding workflows can consume 1–3.5 million tokens per task, pushing well past what single-document use cases would suggest.

