A large language model is a next-token predictor. It takes a sequence of text, predicts what comes next, appends that prediction, and predicts again. Every AI concept in this section and most of the AI conversations happening in your accounts right now build on that single mechanical fact.
The conversational fluency, the code generation, the uncanny ability to sound like it understands your question: all of it is the result of running that prediction loop thousands of times per response. That loop is the whole mechanism.
Where the prediction loop comes from
Every major LLM runs on the Transformer, an architecture introduced in 2017 by researchers at Google. The key innovation was an attention mechanism that lets the model weigh relationships between all parts of a sequence simultaneously rather than processing them in order. This made training massively parallelizable. Parallelizable training made internet-scale data feasible. And internet-scale data made everything that followed possible.
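To make "weigh relationships between all parts of a sequence" slightly more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The vectors are made-up toy numbers and the function names are illustrative. Real models add learned projections, many attention heads, and optimized matrix libraries, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # One relevance score per position; each is computed independently.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is a weighted blend of every position's value vector.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Toy 2-dimensional vectors standing in for three positions in a sequence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]

weights, mixed = attend(query, keys, values)
print(weights)  # the third position gets the largest weight
```

Each position's score depends only on the query and that position's key, so all of them can be computed at once. That independence is the property that made training parallelizable.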
The original Transformer had two halves: an encoder and a decoder. Today's LLMs (GPT-4, Claude, Gemini, Llama) use only the decoder half. Skip the distinction for now. The decoder's job is what carries forward: generate output one token at a time, always moving forward, always predicting the next token from everything before it.
Those tokens are sometimes whole words, sometimes fragments. The next lesson covers tokens properly. For now, treat "token" as "chunk of text the model reads and produces."
How the model learns to predict
The training method is self-supervised learning, and it matters because it explains both what the model knows and what it doesn't.
Take a sentence: "The agency submitted its budget request on Tuesday." During training, the model sees "The agency submitted its budget request on" and tries to predict "Tuesday." Training then compares the prediction with the word that actually came next and adjusts the model to make the right answer more likely. Every word in the training data is both the question and the answer. No human labels anything. The text supervises itself.
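A minimal sketch of how one sentence turns into many training examples without any human labeling. It assumes whitespace-separated words as a stand-in for tokens, and the function name is illustrative.

```python
def next_word_examples(sentence):
    """Split one sentence into (context, target) training pairs."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

sentence = "The agency submitted its budget request on Tuesday."
examples = next_word_examples(sentence)

for context, target in examples:
    print(f"{context!r} -> {target!r}")
```

Eight words yield seven question-and-answer pairs, and the last pair is the one described above: everything up to "on" as the question, "Tuesday." as the answer.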
Scale that to trillions of tokens across books, websites, code repositories, government documents, forum posts. The model learns patterns: statistical relationships between tokens that let it predict, with impressive accuracy, what comes next. Those patterns get encoded in the model's parameters (its weights), and once training finishes, the weights are frozen. The model ships with whatever it learned. It doesn't keep learning from new conversations.
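The idea of "patterns encoded in weights, then frozen" can be sketched with a deliberately simple stand-in: counting which word follows which. This is not how an LLM is trained (real models adjust billions of parameters with gradient-based optimization), and the corpus below is invented, but the shape is the same: learn from text once, then ship a fixed table.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Count which word follows which; the counts play the role of weights."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Freeze: convert counts to probabilities that nothing updates afterwards.
    weights = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        weights[prev] = {word: n / total for word, n in followers.items()}
    return weights

# Invented toy corpus standing in for the training data.
corpus = [
    "the agency submitted its budget request",
    "the agency submitted its annual report",
    "the agency approved its budget request",
]
weights = train(corpus)
print(weights["agency"])
```

Anything absent from the corpus is absent from the table. Ask this toy about a word it never saw and there is nothing to look up, which is the small-scale version of a model whose training data never included your buyer's internal documents.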
One thing to hold for public sector conversations: that training data is overwhelmingly public internet text. Your buyer's internal policies, SOPs, and acquisition documents were almost certainly never in it. The model's weights don't contain agency-specific knowledge, because the training data didn't either.
How it generates a response
When you send text to an LLM, the mechanical sequence is:
- The model reads your input.
- It produces a probability distribution across all possible next tokens, then selects one. (How it selects, and what controls that selection, is a later lesson.)
- It appends the selected token to the sequence.
- It feeds the extended sequence back in and predicts again.
- Repeat until it hits a stop signal or a length limit.
Predict, select, append, repeat. Anthropic describes Claude as a system that is trained to predict the next word and that "writes text one word at a time."
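The loop can be sketched in a few lines of Python. Everything here is a toy assumption: the probability table is hand-written, `predict` looks only at the last token (a real model conditions on the entire sequence), and `select` always takes the most probable token, which is only one of several possible selection strategies.

```python
# Hypothetical frozen "weights": for each token, probabilities of the next one.
WEIGHTS = {
    "the": {"agency": 0.6, "budget": 0.4},
    "agency": {"submitted": 0.7, "approved": 0.3},
    "submitted": {"its": 0.9, "<stop>": 0.1},
    "its": {"budget": 0.8, "report": 0.2},
    "budget": {"<stop>": 0.6, "request": 0.4},
}

def predict(sequence):
    """Return a probability distribution over possible next tokens."""
    return WEIGHTS.get(sequence[-1], {"<stop>": 1.0})

def select(distribution):
    """Greedy selection: take the most probable token."""
    return max(distribution, key=distribution.get)

def generate(prompt, max_tokens=10):
    sequence = list(prompt)                  # the model reads your input
    for _ in range(max_tokens):              # length limit
        token = select(predict(sequence))    # predict, then select
        if token == "<stop>":                # stop signal
            break
        sequence.append(token)               # append, then go around again
    return sequence

output = generate(["the"])
print(" ".join(output))
```

Note that `generate` never looks ahead: each token is appended before the next prediction is even computed.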
The part worth sitting with: the model commits to each token before it knows what the next one will be. The fluency of the output suggests a draft behind the curtain, some internal outline guiding the response. The curtain is empty. It's improvisation, one token at a time, all the way down.
In identity, a token is a credential (OAuth access token, SAML assertion). In AI, a token is a unit of text the model reads and generates. When a buyer says "token costs," they mean per-unit pricing for text processing, not credential lifecycle. If your instinct reaches for OAuth, you'll answer a question nobody asked.
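A back-of-the-envelope sketch of what "token costs" means in practice. The prices below are hypothetical placeholders, and the split between input and output pricing is an assumption for illustration; actual rates and pricing structure come from the vendor's price list.

```python
# Hypothetical prices in dollars per million tokens. Real rates vary by
# vendor and model and must be taken from the vendor's price list.
INPUT_PRICE_PER_MILLION = 3.00
OUTPUT_PRICE_PER_MILLION = 15.00

def request_cost(input_tokens, output_tokens):
    """Estimate the cost of one request from its token counts."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost

# One request: 2,000 tokens sent in, 500 tokens generated.
cost = request_cost(input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f} per request")
```

The unit being billed is text volume, in and out. Nothing in this arithmetic involves issuing, validating, or expiring a credential.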
Two consequences that matter before your next meeting
Everything in this section unpacks specific concepts. Two follow directly from the mechanism above, and you need them now.
Hallucination is mechanical. A model that predicts the next token from statistical patterns will sometimes produce sequences that are fluent, plausible, and wrong. A 2025 OpenAI analysis argues that hallucinations originate as classification errors: when incorrect statements can't be reliably distinguished from facts in the training distribution, the model generates them through "natural statistical pressures." The training objective and evaluation practices together create the condition. A more capable model still carries this risk, because prediction from patterns produces it structurally. A system optimized to produce what sounds right will sometimes produce things that sound right and aren't. A later lesson goes deep on this. For now, hold the mechanical intuition: it's the only explanation you need in a meeting, and it has the advantage of being true.
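A toy illustration of the mechanical intuition, not a reproduction of the OpenAI analysis. It assumes an invented three-sentence corpus and a simple word-follows-word counter in place of a real model. Every training sentence is "true" in this toy world, yet the predictor still produces a fluent statement that appears nowhere in its data.

```python
from collections import Counter, defaultdict

# Invented training data: every statement here is "true" in this toy world.
corpus = [
    "the budget was submitted on tuesday",
    "the report was submitted on friday",
    "the report was reviewed on friday",
]

# Learn which word most often follows which.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def complete(prompt, max_tokens=10):
    """Extend the prompt by repeatedly appending the most common next word."""
    words = prompt.split()
    for _ in range(max_tokens):
        followers = counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

answer = complete("the budget")
print(answer)            # the budget was submitted on friday
print(answer in corpus)  # False
```

"Friday" follows "on" more often than "Tuesday" does, so the predictor picks it. The output sounds right and contradicts the only thing the data says about the budget. Nothing malfunctioned; the predictor did exactly what it was built to do.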
"Just give it our docs" is not a one-line fix. When a buyer says this, they're imagining the model as a search engine: feed it documents, it retrieves answers. But the model's knowledge lives in only two places: its weights (patterns from training, frozen) and its context window (whatever text is included in the current request, which Anthropic's documentation calls "working memory"). You can put documents into the context window, and the model will predict based on them. But it's still predicting. When the conversation ends, that context is gone. Getting an LLM to reliably use your organization's documents is a real engineering problem with real solutions. It starts with understanding that you're augmenting a predictor.
An LLM has no system of record. Unlike a directory, its weights encode patterns, not retrievable facts, and its context window is temporary. When buyers talk about "giving the model our data," your access control instinct is correct, but the assumption that the model stores and retrieves like a directory is where IDAM intuition breaks.
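A minimal sketch of what "putting documents into the context window" amounts to: assembling text for a single request. The function name, the word-count budget standing in for a token limit, and the sample policy text are all hypothetical. No real vendor API is shown.

```python
def build_request(question, documents, max_words=50):
    """Assemble the text for ONE request; nothing is stored between calls."""
    parts = []
    budget = max_words - len(question.split())
    for doc in documents:
        size = len(doc.split())
        if size > budget:   # document does not fit in the remaining window
            continue
        parts.append(doc)
        budget -= size
    return "\n\n".join(parts + [f"Question: {question}"])

# Hypothetical agency document, included in the first request only.
docs = ["Policy 7: badge requests need supervisor approval."]

first = build_request("Who approves badge requests?", docs)
second = build_request("And who approves renewals?", [])

print("Policy 7" in first)   # True: the document is part of this request
print("Policy 7" in second)  # False: nothing carried over
```

The second request knows nothing about Policy 7 unless someone sends it again, and a document that exceeds the window never reaches the model at all. Deciding what to include, for which user, on every request, is where the engineering work and the access control questions live.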
What comes next
This piece is the frame. The lessons that follow fill it:
- What tokens actually are and why they cost money
- How input shapes output
- How you connect a predictor to real information
- Why hallucination resists easy fixes
- What controls shape the model's behavior when it selects from its predictions
Every one of them builds on the single mechanical fact you now hold. When a concept in a later lesson feels abstract, come back to the prediction loop. It's the ground truth.
Things to follow up on...
- The paper that started it: Vaswani et al.'s 2017 "Attention Is All You Need" is short and readable, and it introduced the Transformer architecture that every major LLM builds on today.
- Why hallucination persists structurally: Kalai et al.'s 2025 analysis for OpenAI walks through why even error-free training data and larger models don't eliminate hallucination, because evaluation practices reward guessing over admitting uncertainty.
- Context windows explained by Anthropic: Anthropic's own context window documentation is the clearest primary-source explanation of what "working memory" means for a model and why more context isn't automatically better.
- Where retrieval fits in: OpenAI's guide to optimizing LLM accuracy lays out when you need retrieval-augmented generation versus fine-tuning versus prompt engineering, which is the real decision tree behind "just give it our docs."

