An LLM is a neural network — a very large one — trained to do exactly one thing: given a sequence of text, predict what comes next. The complete training objective is next-token prediction. Not language understanding, not accurate question answering, not helpfulness. Predict the next token. The surprising part is how much capability falls out of doing that at sufficient scale across sufficiently diverse text.
Understanding that mechanism — why a narrow prediction task produces broad capability, and where the current limits actually are — is what makes the rest of the AI conversation legible. Every product claim, every procurement question, every CIO objection traces back to this.
What It Is, Precisely
A large language model is a neural network with billions of learned numerical parameters. Those parameters encode, in a form no human can read directly, the statistical relationships between tokens — fragments of text, roughly word-sized — learned by processing trillions of words of training data. At inference time, the model takes a sequence of tokens as input and produces a probability distribution over what token is likely to come next. It samples from that distribution, appends the result, and repeats.
The word "large" in the name refers to parameter count — current frontier models run in the hundreds of billions of parameters — and to training data scale, which is measured in trillions of tokens. Both matter, and the relationship between them turns out to be important in ways the field is still working through.
What an LLM is not: a database, a search index, or a rule system. It doesn't retrieve stored facts. It generates text by sampling from learned distributions. The distinction matters when a buyer asks "how does it know things?" — the honest answer is that it doesn't know things the way a database knows things, and the implications of that difference are real. (Failure modes from this architecture are covered in 1.7.)
• Definition: An LLM is a neural network trained to predict the next token in a sequence, across trillions of words of text. Its capabilities emerge from that training objective applied at scale — they are not explicitly programmed.
Why One Objective Produces Broad Capability
To reliably predict the next token in a Python script, a model has to learn Python syntax and semantics. To predict the next sentence in a clinical case study, it has to learn something about clinical reasoning. To predict the next paragraph in a legal brief, it has to learn how legal arguments are structured. Nobody programmed those things in. They're necessary to do the prediction task well, across enough examples.
The training corpus is essentially a compressed representation of human knowledge production — everything from Stack Overflow threads to federal regulatory filings to academic papers to Reddit arguments about Star Trek. Next-token prediction, applied at scale to that corpus, forces the model to learn the latent structure of each domain, because surface-level pattern matching isn't sufficient to predict accurately across diverse text types.
Code generation, policy reasoning, instruction following — these fell out of the training objective because code, policy documents, and instructions were all in the training data. The breadth is a consequence of the depth required by the task.
The scaling era — roughly 2018 to 2023 — was built on the observation that more parameters plus more data produced better prediction, and better prediction correlated with better downstream performance on almost everything. The relationship was regular enough that researchers formalized it as scaling laws: given a compute budget, there's an optimal allocation between model size and training tokens. The Chinchilla paper from DeepMind in 2022 established that earlier large models had been undertrained relative to their size — you could get equivalent performance from a smaller model trained on more data. That recalibrated how labs thought about the size-versus-data tradeoff.
• Emergence: Broad capability isn't a design choice — it's a consequence of the training objective applied at scale. The model learns domain structure because accurate prediction requires it. • Scaling laws: The relationship between model size, training data, and capability is regular enough to predict. More compute helps, but the allocation between size and data matters.
Where the Frontier Actually Is
Raw scaling is not done. But it's yielding diminishing returns relative to cost, and the qualitative improvements from simply making models bigger have slowed. The labs that have been most transparent about this — Anthropic's research team has published on it, and the pattern shows up in public benchmark trajectories — describe a shift from "pre-training improvements" to "post-training improvements" as the primary driver of capability gains.
Three post-training techniques are doing most of the current work.
Reinforcement Learning from Human Feedback (RLHF). After a model is pre-trained on next-token prediction, human raters evaluate pairs of model outputs and indicate which is better. Those preferences train a separate reward model. The reward model is then used to fine-tune the base LLM via reinforcement learning — nudging the model toward outputs that score higher on human preference. The result is a model that follows instructions more reliably, avoids certain output patterns, and behaves more like a useful assistant. RLHF doesn't change the architecture; it changes what the model optimizes for after the base training is complete.
Reasoning chains. Research starting around 2022 showed that prompting models to generate intermediate reasoning steps before producing a final answer dramatically improved performance on tasks requiring multi-step logic. The more recent development is training models to produce reasoning traces as part of their output by design — a trained behavior, not a prompting trick. These "reasoning models" allocate more computation to working through a problem before committing to an answer. The capability improvement is real and measurable on hard benchmarks. You're changing what the model does at inference time, not how many parameters it has.
Tool use. Models can be trained or prompted to call external systems — APIs, code interpreters, search indexes — as part of generating a response. The model learns to recognize when it needs external information, format a call to get it, and incorporate the result. A model calling an external API needs credentials, and those credentials need to be managed. That problem is covered in later articles; the point here is that it marks where the capability boundary of the model itself ends and the infrastructure question begins.
The labs that are ahead right now are ahead on post-training, not on parameter count.
• Post-training: RLHF, reasoning chains, and tool use are the current primary levers for capability improvement — not raw model size. • Reasoning models: A qualitatively different approach that allocates more inference-time computation to working through problems, not just a scaled-up base model.
When You'll Need This
A public sector CIO asks: "How is this different from the AI we evaluated three years ago?" The answer involves two things. First, the base capability has improved substantially — a product of scale and better training data. Second, the post-training work has made these models dramatically more reliable for specific task types. RLHF is why a current model follows a system prompt more consistently than an earlier one. Reasoning chains are why a current model can work through a multi-step procurement analysis without losing the thread. These are the result of specific training choices with documented mechanisms.
A procurement officer asks: "Can we audit what it knows and how it decides?" You have to be straight here: not in the way you can audit a rule system. The knowledge is encoded in billions of parameters that don't map to human-readable logic. This is a genuine constraint, not a gap that will be patched in the next release. The governance implications are real, and the field is still developing the tools to address them.
IDAM Concept Mapping: The Policy Decision Point
The closest IDAM analog to an LLM is a Policy Decision Point. A PDP takes an input — an access request — and produces an output — permit or deny — by evaluating it against stored policy logic. An LLM takes an input — a prompt — and produces an output — text — by evaluating it against stored learned weights. Neither retrieves a pre-computed answer; both compute a response from stored knowledge at runtime.
The analogy holds up to that point. Then it breaks, and the break is the lesson: a PDP is deterministic and auditable. The same request produces the same decision every time, and you can read the policy to understand why. An LLM is probabilistic by design — the same prompt produces different outputs on different runs — and the decision logic is not auditable in any meaningful sense. The weights encode billions of learned associations that no human can inspect directly. For a public sector buyer asking about governance and accountability, that difference is not a footnote. It's the central question.
• Analogy: LLMs compute responses from learned weights the way a PDP computes decisions from policy rules — neither retrieves pre-computed answers. • Break: A PDP is deterministic and auditable. An LLM is neither. That gap is where the governance conversation lives.

