You know what a token is. It has scopes, an expiry, and an issuer. You've debugged token lifetimes at 2 AM and sold platforms that manage millions of them.
In an AI architecture conversation, "token" means something completely unrelated. And because the word is identical, your brain won't flag the moment you stop understanding what's being said. Same word, foreign meaning, no warning. In a meeting with a CAIO, that's the difference between tracking the conversation and silently losing it while nodding along.
What a token actually is in an LLM
An LLM token is a chunk of text — sometimes a whole word, sometimes a syllable, sometimes a punctuation mark. A preprocessing step called a tokenizer breaks input into pieces the model can operate on.
Most English words become one or two tokens. "The" is one token. "Authentication" might be three. A paragraph of plain English runs roughly four tokens for every three words.
That's the whole concept. A token is the unit the model reads in, the unit it writes out, and the unit you get billed for.
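To make the four-tokens-per-three-words rule of thumb concrete, here is a minimal sketch that estimates a token count from a word count. The function name and the ratio-based approach are assumptions for illustration only. Real tokenizers split text into learned subword pieces, so actual counts vary by model and by content.

```python
import math


def estimate_tokens(text: str) -> int:
    """Very rough estimate for plain English; this is NOT a real tokenizer.

    Applies the rule of thumb from the text: about 4 tokens per 3 words.
    """
    word_count = len(text.split())
    return math.ceil(word_count * 4 / 3)


sample = "The quick brown fox jumps over the lazy dog"
print(estimate_tokens(sample))  # 9 words -> 12 estimated tokens
```

Use a heuristic like this for back-of-envelope sizing only. For billing estimates, count tokens with the tokenizer of the specific model version in use.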
No issuer. No expiry. No scope. No trust chain. Your IDAM intuition about tokens does not transfer here. There is a place where it does transfer, and we'll get to it. For now, set it down.
The ruler keeps changing length
Different models use different tokenizers, and those tokenizers change between model versions. This is where a vocabulary problem starts costing money.
Anthropic's pricing documentation states that Claude Opus 4.7's tokenizer "may use up to 35% more tokens for the same fixed text" compared to its predecessor. Same input. Same JSON payload. Up to 35% more tokens billed, because the model slices text into finer pieces.
Anthropic says the finer tokenizer improves accuracy. But the cost implication is concrete: estimates built on Opus 4.6 token counts are wrong. The math was fine. The unit of measurement moved.
For public sector buyers, this compounds. Anthropic's inference_geo parameter restricts processing to US-only data centers at a 1.1x pricing multiplier. Stack that on the tokenizer change and the worst case is 1.35 × 1.1 ≈ 1.49, or nearly 50% above what the same workload cost on the previous model with global routing. None of this is visible on the headline rate card.
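The stacking arithmetic can be sketched in a few lines. This assumes the per-token rate itself is unchanged between model versions and uses the upper bound of the "up to 35%" figure, so it shows a worst case rather than a typical one. The constant and function names are illustrative.

```python
# Figures quoted in the text; the 35% is an upper bound, not a typical value.
TOKENIZER_INFLATION_MAX = 1.35  # "up to 35% more tokens" for the same text
US_ONLY_MULTIPLIER = 1.10       # US-only processing pricing multiplier


def compounded_multiplier(token_inflation: float, geo_multiplier: float) -> float:
    """Cost relative to the old model with global routing.

    Assumes the per-token rate is the same on both model versions.
    """
    return token_inflation * geo_multiplier


worst_case = compounded_multiplier(TOKENIZER_INFLATION_MAX, US_ONLY_MULTIPLIER)
print(f"{(worst_case - 1) * 100:.1f}% above baseline")  # 48.5% above baseline
```

The multipliers compound rather than add, which is why the total lands near 49% and not at 45%.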
Model pricing, tokenizer behavior, and context window sizes shift without notice. If a buyer tells you their AI costs are predictable, ask which model version they're running and when they last checked.
Context windows are working memory for one call
A context window is the total text a model can hold during a single interaction. Your prompt, any documents you've attached, the conversation history, the model's own prior outputs. All of it has to fit inside this window.
Current context windows across major providers have converged around 1 million tokens: Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro all sit at 1M. A year ago the standard was 128K-200K. Treat any specific figure as perishable.
A million tokens is roughly 750,000 words. Sounds enormous. It fills fast because the window holds everything: system instructions, retrieved documents, full conversation history, tool outputs. In an agent workflow making multiple calls, the accumulated context from prior steps gets passed forward into each new call, growing with every hop.
The context window exists for the duration of one API call. When the call ends, the window is gone. The model retains nothing. The next call starts empty unless your application explicitly repopulates it.
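A minimal sketch of what "the application repopulates it" looks like in practice. The `fake_model_call` function is a stand-in, not any provider's API. It only sees what is passed in, which is the point.

```python
def fake_model_call(messages: list[str]) -> str:
    """Stand-in for an inference call: it sees only what is passed in."""
    return f"reply based on {len(messages)} message(s)"


history: list[str] = []        # lives in the application, not in the model
sent_per_call: list[int] = []  # how many messages each call had to carry

for user_turn in ["first question", "second question", "third question"]:
    history.append(user_turn)
    sent_per_call.append(len(history))  # everything so far is resent
    reply = fake_model_call(history)    # the call itself keeps no state
    history.append(reply)

print(sent_per_call)  # [1, 3, 5]
```

Each turn resends the whole conversation, so the input side of every call grows even when the user's new message is short.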
Training data, context, and memory are three different things
Buyers conflate these constantly, and the conflation leads to bad architectural decisions.
- Training data is what the model learned from before deployment. Baked in. You can't change it, inspect it, or add to it after the fact. It's why the model knows what Python is. It's also why the model's knowledge has a cutoff date.
- Context is what you put in the context window for a single call. Documents, instructions, conversation history. The model uses it for that call only. When the call ends, the context is gone.
- Memory is an application-layer feature, not a model capability. When a chatbot "remembers" your name across sessions, the application is storing that information and injecting it back into the context window on subsequent calls. The model itself remembers nothing. The application does the remembering and pays the token cost to re-inject it every time.
When a buyer says "we need the model to know our policies," they could mean any of these three. The answer, the cost, and the security implications are different for each. Training is expensive and rare. Context is per-call and token-billed. Memory is an engineering pattern that raises its own questions: who decides what gets stored, who can retrieve it, and whose context window it lands in. Those are access control questions, and AI architectures don't have a standardized answer for them yet.
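The memory pattern described above can be sketched as an application-side store plus a prompt builder. Everything here is hypothetical, including the class name, the method names, and the prompt layout. It is meant to show where the access control questions sit, not how any particular product implements memory.

```python
class MemoryStore:
    """Hypothetical application-side store, keyed by user ID."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        # Policy question: who decides what gets stored?
        self._facts.setdefault(user_id, []).append(fact)

    def recall(self, user_id: str) -> list[str]:
        # Policy question: who may retrieve this, and for which user?
        return list(self._facts.get(user_id, []))


def build_prompt(store: MemoryStore, user_id: str, question: str) -> str:
    """Re-inject stored facts into the context for this one call."""
    memory_block = "\n".join(f"- {fact}" for fact in store.recall(user_id))
    return f"Known about user:\n{memory_block}\n\nQuestion: {question}"


store = MemoryStore()
store.remember("u1", "Name is Dana")
prompt = build_prompt(store, "u1", "What's my name?")
print(prompt)
```

The stored facts are paid for as input tokens on every call that includes them, and nothing in this sketch stops one user's facts from landing in another user's prompt. That check has to be designed in.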
Inference is running the model
Inference is sending input to a trained model and getting output back. The model applies what it already knows to what you just gave it. Nothing updates. Nothing gets retained.
Every API call to Claude, GPT, or Gemini is an inference call. Every message a user sends to a chatbot. Every step an agent takes. Every tool invocation. Each one is a separate inference call, billed separately.
You pay per token in and per token out. Input tokens (what you send) and output tokens (what the model generates) are priced differently, with output typically costing 3-5x more per token. When someone in the room says "inference costs," they mean the ongoing operational cost of using the model, metered in tokens, billed per call. The electric meter running every time someone asks a question.
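Here is a small sketch of the per-call arithmetic. The rates are hypothetical placeholders chosen to sit at the 5x end of the output-to-input ratio mentioned above. They are not any provider's actual prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def call_cost(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Dollar cost of one inference call."""
    total = (input_tokens * rates.input_per_million
             + output_tokens * rates.output_per_million)
    return total / 1_000_000


# Placeholder rates with output priced at 5x input.
EXAMPLE_RATES = Rates(input_per_million=5.00, output_per_million=25.00)

cost = call_cost(input_tokens=20_000, output_tokens=1_000, rates=EXAMPLE_RATES)
print(f"${cost:.3f}")  # $0.125
```

In this example the input side accounts for $0.10 of the $0.125, even though each output token costs five times as much. Volume on the input side outweighs the higher output rate.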
Where the money actually goes
Your instinct might be that the user's prompt is the expensive part. Look at where the tokens actually accumulate.
Researchers at Concordia University studied token consumption across 30 software development tasks in a multi-agent system. Input tokens, the accumulated context passed between agents, constituted 53.9% of total consumption. The iterative review stages where agents passed growing context back and forth accounted for 59.4% of all tokens. The human's original prompt was a rounding error.
This is a preprint, not peer-reviewed, and scoped to software development tasks using a specific framework. But the pattern it describes matches independent studies: in agentic architectures, input tokens outnumber output tokens by roughly 2:1 to 3:1. The accumulated context is where the real money goes.
With context windows at 1M tokens and agent loops running dozens of inference calls, a single workflow can burn through millions of tokens before producing a final output.
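A back-of-envelope sketch of how an agent loop reaches millions of tokens. The starting context, growth per call, and call count are made-up numbers for illustration. The model is simply that each call resends everything accumulated so far.

```python
def total_input_tokens(base_context: int, added_per_call: int, calls: int) -> int:
    """Sum input tokens across calls when each call resends all prior context."""
    total = 0
    context = base_context
    for _ in range(calls):
        total += context           # this call is billed for the full context
        context += added_per_call  # outputs and tool results carry forward
    return total


# Illustrative numbers: 10K starting context, 5K added per step, 30 steps.
print(total_input_tokens(base_context=10_000, added_per_call=5_000, calls=30))
# 2475000
```

No single call in this example exceeds 155K tokens, well inside a 1M window, yet the workflow is billed for roughly 2.5 million input tokens. The total grows with the square of the number of steps.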
The analogy break
Your background is genuinely useful here, but in a specific and slightly frustrating way.
In IDAM, identity propagates through the transaction chain. An OAuth token carries the user's identity from authentication through every downstream service call. You can trace who did what, when, to which resource. The audit trail is the architecture's reason for existing.
In AI inference, that chain breaks. When a user triggers an agent workflow that makes API calls to a model, the identity recorded for billing is typically the agent's service credential, not the human who started the work. AWS Bedrock launched IAM principal-based cost attribution in April 2026, and it can capture federated identity through session tags for direct API calls. That's a real step.
But the moment an agent layer sits between the human and the model, the human identity drops out of the billing record. The IAM principal becomes the agent's execution role. In a multi-agent chain with several hops, you might have three different service roles, none of which map back to the person who asked the original question.
There's no equivalent of token propagation through the agent chain. No standard for carrying the originating identity through delegated AI calls. No cross-provider attribution schema. The FinOps Foundation has recognized AI as a distinct cost management domain, but identity-aware attribution through agent architectures remains an engineering problem without a standards-level answer.
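To illustrate the gap, the sketch below models a three-hop agent chain with and without an application-level field that carries the originating user. The `originating_user` field, the role names, and the record shape are all hypothetical. As the text notes, no standard defines such a field, and this is not any provider's billing schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BillingRecord:
    principal: str                   # what the provider sees: the caller's role
    originating_user: Optional[str]  # hypothetical field, not a standard
    tokens: int


def record_call(agent_role: str, tokens: int,
                originating_user: Optional[str] = None) -> BillingRecord:
    """Build the record for one hop; the user survives only if passed along."""
    return BillingRecord(agent_role, originating_user, tokens)


def tokens_by_user(records: list[BillingRecord]) -> dict[str, int]:
    """Roll up spend by originating user, bucketing the rest as unattributed."""
    totals: dict[str, int] = {}
    for rec in records:
        key = rec.originating_user or "unattributed"
        totals[key] = totals.get(key, 0) + rec.tokens
    return totals


chain = ["planner-role", "retriever-role", "reviewer-role"]
untagged = [record_call(role, 1_000) for role in chain]
tagged = [record_call(role, 1_000, "alice@example.com") for role in chain]

print(tokens_by_user(untagged))  # {'unattributed': 3000}
print(tokens_by_user(tagged))    # {'alice@example.com': 3000}
```

The tagged case works only because every hop in this toy chain cooperates. In a real deployment each agent, framework, and provider would have to carry the field forward, which is exactly the part that has no standard.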
You see the gap because you've spent your career building the systems that solve this exact problem for human access. That instinct transfers cleanly. The assumption that someone has already solved it for AI — that one will get you in trouble. When a buyer asks "how do we know which team is spending what on AI," the honest answer is: partially, in specific configurations, with significant gaps once agents are involved. That kind of honesty builds credibility in the room, because the buyer's engineers already know the gaps exist. They're waiting to see if you do too.
Three things to carry into the meeting:
- Token means processing unit, not credential. No issuer, no expiry, no scope.
- Context windows are per-call working memory. They're where costs accumulate, especially in agent loops.
- The identity-to-cost-attribution problem your IDAM instincts immediately spot is real, recognized, and unsolved. Nobody has a clean answer yet.
Things to follow up on...
- AWS Bedrock cost attribution: AWS launched IAM principal-based cost allocation for Bedrock in April 2026, including support for federated identity session tags from providers like Okta and Entra ID — the closest thing to an IDAM-native answer for AI spend tracking.
- Claude Code cost escalation: Anthropic more than doubled its published estimate of per-developer Claude Code costs after Opus 4.7 became the default model, putting the average at $13/developer/active day and confirming that tokenizer changes have real budget impact at scale.
- FinOps Foundation on AI: The FinOps Foundation's State of FinOps 2026 survey found 98% of practitioners now manage AI spend (up from 31% two years ago), with 48% identifying generative AI as their least predictable cloud spending category.
- OpenAI's long-context surcharge model: Unlike Anthropic's flat-rate approach, OpenAI charges a 2x input / 1.5x output premium for GPT-5.4 prompts exceeding 272K tokens — a meaningful architectural distinction when buyers are comparing providers for large-context workloads.

