Large Language Models

By Leigh Garrity, May 9, 2026

What It Is

A large language model (LLM) is a transformer-based neural network: an architecture that processes sequences of text by learning relationships between words across context windows that can span tens of thousands of words or more. The network has parameters (numerical weights, typically in the billions) that are adjusted during training until the model gets reliably good at its one job: given the text so far, predict what comes next.

Training happens on text at a scale that's hard to hold in your head. Frontier models are trained on trillions of tokens (a token is roughly a word or word fragment, not the OAuth kind) drawn from books, web pages, code repositories, and scientific literature. The model sees a sequence, predicts the next token, compares its prediction with the token that actually came next, and adjusts its weights slightly. Repeat this process across trillions of tokens, and something unexpected happens.

The model gets good at writing. And reasoning. And code. None of those capabilities were explicitly trained. They emerged.


• LLM: A neural network trained to predict the next word in a sequence, at a scale large enough that reasoning, writing, and code emerge as byproducts. The training objective is humble; the outputs are not.
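
For readers who want to see the predict, check, adjust loop in miniature, here is a toy sketch in Python. It is nowhere near a transformer: the "model" is just a table of scores for each pair of adjacent words, and the corpus, function names, learning rate, and epoch count are all made up for illustration. The training objective, though, is the same next-token prediction described above.

```python
import math

def softmax(row):
    """Turn a dict of raw scores into probabilities that sum to 1."""
    peak = max(row.values())
    exps = {tok: math.exp(score - peak) for tok, score in row.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def train_next_token_model(tokens, epochs=200, lr=0.5):
    """Toy next-token predictor: one weight per (previous word, next word) pair."""
    vocab = sorted(set(tokens))
    # These weights stand in for the billions of parameters in a real model.
    weights = {prev: {nxt: 0.0 for nxt in vocab} for prev in vocab}
    pairs = list(zip(tokens, tokens[1:]))
    for _ in range(epochs):
        for prev, actual in pairs:
            probs = softmax(weights[prev])          # predict
            for candidate in vocab:
                target = 1.0 if candidate == actual else 0.0
                # check and adjust: the cross-entropy gradient is (prob - target)
                weights[prev][candidate] -= lr * (probs[candidate] - target)
    return weights

def predict_next(weights, prev):
    """Return the most probable next word after `prev`."""
    probs = softmax(weights[prev])
    return max(probs, key=probs.get)

# Hypothetical nine-word corpus, chosen only for illustration.
corpus = "the user signs in and the user signs out".split()
model = train_next_token_model(corpus)
print(predict_next(model, "user"))  # prints "signs"
```

Nothing in the code mentions grammar or meaning. The table simply ends up encoding that "signs" follows "user" because that is what reduces prediction error on this corpus.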

How It Works

The emergence story is the part that trips people up, because it seems like it shouldn't work. Predicting the next word is a narrow task. But to predict the next word well across a corpus that includes federal acquisition regulations, Python tutorials, medical literature, and three decades of Stack Overflow, the model has to develop internal representations of grammar, logic, factual relationships, and rhetorical structure. It doesn't learn these things because anyone told it to. It learns them because they're useful for the prediction task.

This is the scaling law insight, and for several years it was the dominant story in AI research: make the model bigger, give it more data, and it gets better. A 2022 paper from DeepMind, the Chinchilla work, complicated this picture usefully. It showed that many large models were actually undertrained relative to their parameter count, meaning you could get equal or better performance from a smaller model trained on more data for the same compute budget. The simple "bigger = better" story gave way to a more precise "compute-optimal training requires scaling both parameters and data together."
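
The "scale both together" idea can be made concrete with back-of-the-envelope arithmetic. The sketch below rests on two assumptions that are not from this article: the common approximation that training compute is about 6 FLOPs per parameter per token, and the rule of thumb of roughly 20 training tokens per parameter that is often associated with the Chinchilla results. Treat both constants, and the function name, as illustrative rather than authoritative.

```python
import math

# Assumption: training FLOPs C is approximately 6 * N * D
# (N = parameters, D = training tokens).
FLOPS_PER_PARAM_TOKEN = 6
# Assumption: a rule-of-thumb ratio, not an exact figure.
TOKENS_PER_PARAM = 20

def compute_optimal_split(flops_budget):
    """Return (parameters, training_tokens) for a given compute budget."""
    # C = 6 * N * D and D = 20 * N, so N = sqrt(C / (6 * 20)).
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

params, tokens = compute_optimal_split(1e24)
print(f"{params:.2e} parameters, {tokens:.2e} tokens")
```

Under these assumptions, a budget of 1e24 FLOPs works out to roughly 91 billion parameters and 1.8 trillion tokens. Spending the same budget on a much larger model would leave it with too little data per parameter, which is the "undertrained" condition the paper described.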

Whether we've hit the ceiling on pre-training scale is genuinely contested right now. Some researchers argue that returns on raw parameter count are diminishing and that the low-hanging fruit of internet-scale text has been picked. Others argue the ceiling hasn't been found yet and that data quality, not quantity, is the binding constraint. What's less contested is that the biggest recent capability gains haven't come from larger models. They've come from what happens after the initial training.

Post-training techniques are where the current action is. Three matter for this conversation:

RLHF (Reinforcement Learning from Human Feedback) is the technique that made these models usable rather than just technically impressive. Human raters compare and rank model outputs, those judgments are used to train a reward model, and the language model is then tuned to produce outputs that the reward model scores well. A frontier model today doesn't just complete your sentence in a statistically plausible way; it responds to what you actually wanted. The gap between the original GPT-3 and ChatGPT came down mostly to RLHF and related fine-tuning, not parameter count.
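
One common way to turn human judgments into a training signal is to collect pairwise preferences ("output A is better than output B") and fit a reward model so that preferred outputs score higher. The sketch below shows only the standard pairwise loss for that step. The function name and the example scores are hypothetical, and a real pipeline adds a reinforcement learning stage on top.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the human-preferred output scores higher, large otherwise.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward scores for two outputs to the same prompt.
agrees_with_raters = preference_loss(2.0, -1.0)    # preferred output scored higher
disagrees_with_raters = preference_loss(-1.0, 2.0)  # preferred output scored lower
print(agrees_with_raters < disagrees_with_raters)   # prints True
```

Training the reward model means pushing this loss down across many rated pairs, so its scores track what the raters preferred.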

Chain-of-thought reasoning is the practice of training or prompting a model to reason step-by-step before producing a final answer. Asking a model to show its work measurably reduces errors on multi-step tasks such as arithmetic and logic problems. No larger model is required, just a different approach to how the model uses the capacity it already has.

Tool use is the capability that connects most directly to your accounts. A model with tool use can call external APIs, run code, search the web, or query a database during a conversation. It's one way to reach outside the frozen knowledge state described below, but it's also where identity questions begin: you now have an AI agent making authenticated requests on behalf of a user, and someone has to decide what that agent is allowed to do.
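
To make the "someone has to decide" point concrete, here is a minimal sketch of a gate that sits between an agent and its tools. Everything in it is hypothetical: the `Delegation` and `ToolGateway` names, the tool names, and the shape of the log entries are invented for illustration and do not describe any particular product or standard.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Delegation:
    """Hypothetical record of what a user has allowed an agent to do on their behalf."""
    user: str
    agent: str
    allowed_tools: frozenset

@dataclass
class ToolGateway:
    """Checks each tool call against the delegation and records the decision."""
    audit_log: list = field(default_factory=list)

    def authorize(self, delegation, tool_name):
        allowed = tool_name in delegation.allowed_tools
        # Log every decision, allowed or denied, so access can be reviewed later.
        self.audit_log.append({
            "user": delegation.user,
            "agent": delegation.agent,
            "tool": tool_name,
            "allowed": allowed,
        })
        return allowed

delegation = Delegation("jdoe", "assistant-01", frozenset({"search_web", "query_hr_db"}))
gateway = ToolGateway()
print(gateway.authorize(delegation, "search_web"))      # prints True
print(gateway.authorize(delegation, "delete_records"))  # prints False
```

The design choice worth noticing is deny-by-default: a tool that is not explicitly delegated is refused, and the refusal is logged alongside the approvals.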


• Emergence: The capability gains that appear when next-token prediction runs at sufficient scale — reasoning, writing, code — without being explicitly trained. The current frontier extends these gains through post-training techniques, with raw parameter count delivering diminishing returns, though the ceiling on pre-training scale remains a live debate, not a settled question.


IDAM Concept Mapping: The Frozen Directory

Closest IDAM analogy: a directory that was populated once and never synced again.

Everything in an LLM's training corpus reflects the state of human language up to a specific cutoff date. Nothing after that date exists in the model's knowledge — not a new agency reorganization, not a policy change, not a breach that happened last quarter. In this sense, the model's knowledge state resembles a stale directory: bounded, fixed, and aging from the moment training ends.

The analogy holds there. It breaks hard on the next part.

A directory can be queried precisely and audited completely. You can ask for every user in a group and get a deterministic answer. You can pull an access log and trace any claim back to its source. An LLM's knowledge doesn't work this way. You can ask a model what it knows about a specific agency's procurement history, and it will produce a confident, fluent answer — but you cannot verify whether that answer reflects something in the training data, a confabulation, or a blend of both. The knowledge is distributed across billions of parameters in a form that's not inspectable. There's no audit log. There's no query interface that returns ground truth. The model doesn't know what it doesn't know, and neither do you.
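
For contrast, this is what "queried precisely" looks like on the directory side. The sketch uses a made-up in-memory directory with hypothetical users and groups; a real directory would be queried over LDAP or an API, but the property that matters is the same: identical query, identical answer, every time.

```python
# Hypothetical directory: user -> set of group memberships.
directory = {
    "alice": {"admins", "finance"},
    "bob": {"finance"},
    "carol": {"admins"},
}

def members_of(directory, group):
    """Return every user in `group`, sorted so the result is fully deterministic."""
    return sorted(user for user, groups in directory.items() if group in groups)

print(members_of(directory, "admins"))  # prints ['alice', 'carol']
```

There is no equivalent function for a model's parameters. You cannot enumerate "everything the model learned about agency X" the way you can enumerate a group's members.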

When You'll Need This

The scenario that comes up most often in public sector accounts isn't a technical deep-dive. It's a CIO or procurement officer asking some version of: "What does this model know about our agency?" or "Can it access our data?"

Both questions reveal a directory-like mental model. The buyer is imagining a system with a defined, queryable knowledge state: something you could scope and audit. The correct answer is more uncomfortable: the model's knowledge is whatever was in its training corpus, distributed across its parameters in a way that can't be inventoried. If your agency's public documents were on the web before the training cutoff, the model may have learned from them. You can't check which ones. You can't reliably remove specific knowledge without retraining the model.

This matters differently depending on the deployment context. A model accessed via API, with no retrieval augmentation, has a fixed knowledge state that ends at its training cutoff — the frozen directory. A model with tool use can reach outside that state, which is where the identity questions begin: what credentials is it using, what systems can it reach, and who authorized that access?

The useful move in that CIO conversation is to separate two questions that buyers routinely conflate: what the model knows (training corpus, fixed, opaque) versus what the model can access (tool use, retrieval, API calls — which is where your IDAM conversation actually lives). The first question doesn't have a clean answer. The second one does, and it's yours to own.


• Knowledge cutoff: The point in time after which nothing exists in an LLM's training data. Unlike a stale directory, this boundary can't be queried or audited — the model's knowledge state is opaque in a way that directory state is not. Separate what the model knows from what it can access, because only the second question has an answer you can work with in a procurement conversation.