You know models. RBAC, ABAC, MAC. You select them based on requirements, configure them to match policy, audit them against known rules. When a buyer says "we need to evaluate our access model," you know every moving part of that conversation.
Now the buyer says "we're evaluating models" and means a completely different object. Same word. Different thing. And the difference changes how you should think about evaluation, procurement, and governance in ways that aren't obvious until they've already cost you credibility in a meeting.
The mechanism
A large language model is a next-token prediction system. That's the whole trick. Given a sequence of text, the model predicts what comes next, one token at a time.
In English text, a token averages roughly three-quarters of a word, though the exact split depends on the model's tokenizer. "Authentication" might be two tokens. The period after it is a third.
The predictions come from running input through a neural network architecture called a Transformer. Inside the Transformer are numerical parameters, called weights, typically in the hundreds of billions for current frontier models and climbing past a trillion in some cases. (These numbers shift; the relationship matters more than the specific figures.) Nobody wrote these weights. They were learned during training, when the model processed enormous quantities of text and adjusted its parameters to get incrementally better at predicting the next token. OpenAI's GPT-4 technical report describes it plainly: "a Transformer-based model pre-trained to predict the next token in a document."
When you prompt the model, you're feeding a sequence into a statistical function. It returns a probability distribution over possible next tokens, picks one, appends it to the sequence, and repeats. Feed it "The user's session token has" and it might predict "expired." It has no concept of session management. That word just follows those words often enough in the training data to rank highest in the distribution. Inside, there are only parameters encoding statistical patterns in language.
Your ABAC intuition needs recalibration here. An ABAC policy engine evaluates attributes against conditions and returns a deterministic decision. An LLM evaluates a token sequence against learned statistical weights and returns a probability distribution. The ABAC engine will give you the same answer every time for the same input. The LLM might not.
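To make the contrast concrete, here is a minimal sketch of the predict-pick-append step. The scores in `TOY_LOGITS` are invented for illustration and the names are hypothetical; a real model computes its scores from billions of learned weights, not from a lookup table.

```python
import math
import random

# Hypothetical toy "model": maps a context to raw scores (logits) for a few
# candidate next tokens. The numbers are made up for illustration only.
TOY_LOGITS = {
    "The user's session token has": {" expired": 3.0, " been": 2.0, " changed": 0.5},
}

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def next_token(context, rng):
    """Sample one next token according to the toy model's distribution."""
    probs = softmax(TOY_LOGITS[context])
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
context = "The user's session token has"
probs = softmax(TOY_LOGITS[context])

# The top-ranked continuation is stable...
print("most likely:", max(probs, key=probs.get))

# ...but sampling the same input repeatedly yields different tokens.
samples = [next_token(context, rng) for _ in range(1000)]
print({tok: samples.count(tok) for tok in probs})
```

The same input always produces the same distribution. What varies from run to run is which token gets drawn from it, and that is the part with no analogue in a policy engine.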
NIST SP 800-218A defines an AI model as "a component of an information system that implements AI technology and uses computational, statistical, or machine-learning techniques to produce outputs from a given set of inputs." Your federal buyers will encounter this language. Read it carefully: statistical techniques, outputs from inputs. No mention of rules. No mention of configuration. A fundamentally different object than the models you've spent your career working with.
How it got capable
The first wave of LLM capability came from a brute empirical finding: make the model bigger, train it on more data, spend more compute, and performance improves predictably.
In 2020, OpenAI published scaling laws research showing that prediction error follows a power-law relationship with model size, dataset size, and compute budget. The relationship held across seven orders of magnitude. Bigger was reliably better.
Two years later, DeepMind's Chinchilla paper refined the picture. Most large models were undertrained. They had too many parameters relative to the data they'd seen. A 70-billion-parameter model trained on four times more data outperformed models with 175 to 530 billion parameters. Balanced scaling mattered more than raw size. The precise exponents are still debated (an Epoch AI replication found statistical fitting issues with Chinchilla's estimates), but the directional finding is robust. Scale data and parameters together.
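A rough way to see why this counts as "balanced" rather than "bigger": compare training compute. The sketch below assumes the common approximation that training cost is about 6 FLOPs per parameter per token, which is not from this article, and uses the publicly reported sizes for Chinchilla and Gopher, the 280-billion-parameter DeepMind model it was measured against (not named above).

```python
# Assumed approximation (not from this article):
# training compute in FLOPs ~= 6 * parameters * training tokens.
def training_flops(parameters, tokens):
    return 6 * parameters * tokens

# Publicly reported figures for the two models.
gopher = training_flops(280e9, 300e9)      # 280B parameters, ~300B tokens
chinchilla = training_flops(70e9, 1.4e12)  # 70B parameters, ~1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"ratio:      {chinchilla / gopher:.2f}")
```

Under that approximation the two training runs cost about the same. The smaller model won by spending a similar budget differently.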
These findings were real and operationally validated. OpenAI used scaling laws to predict GPT-4's performance from models trained with a thousand to ten thousand times less compute, before the full training run finished.
But power-law improvements have a specific, punishing shape: every further fixed-percentage drop in error costs a multiple of the resources you have already spent, so steady gains demand exponentially growing budgets. At some point the cost of the next increment of pretraining scale exceeds the value of the improvement. Chinchilla's own authors flagged the downstream problem: if optimal training requires roughly 20 tokens of data per parameter, trillion-parameter models start approaching the internet's supply of quality text. The field has largely hit both walls, cost and data.
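The arithmetic behind both walls fits in a few lines. This is a sketch under stated assumptions: the power-law form and the exponent `ALPHA` are illustrative choices, not figures from this article. Only the 20-tokens-per-parameter rule of thumb comes from the text above.

```python
# Assumed power law: loss = a * compute ** (-ALPHA).
# ALPHA is an illustrative value, not a figure taken from this article.
ALPHA = 0.05

def compute_multiplier(loss_ratio, alpha=ALPHA):
    """Factor by which compute must grow to shrink loss to loss_ratio of today's value."""
    return loss_ratio ** (-1.0 / alpha)

for ratio in (0.95, 0.90, 0.80):
    print(f"loss x{ratio:.2f} -> compute x{compute_multiplier(ratio):,.1f}")

# Data wall: the rule of thumb quoted in the text.
TOKENS_PER_PARAM = 20
params = 1_000_000_000_000  # a one-trillion-parameter model
tokens_needed = params * TOKENS_PER_PARAM
print(f"{tokens_needed:,} training tokens")
```

With that assumed exponent, a 5% drop in loss costs roughly 3x the compute, a 10% drop roughly 8x, and a 20% drop roughly 87x. The one-trillion-parameter model would want 20 trillion training tokens.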
Where capability comes from now
The action has moved to post-training: techniques applied after the base model is trained.
RLHF (Reinforcement Learning from Human Feedback) is the most established. The pipeline works in three stages. First, the base model is fine-tuned on high-quality examples of desired responses. Second, human annotators rank multiple model outputs, and those rankings train a reward model that can score responses automatically. Third, the language model is optimized against that reward model using reinforcement learning, with a constraint that prevents it from drifting too far from its original behavior.
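Stages two and three each reduce to a small formula. The sketch below uses a common textbook formulation (a pairwise preference loss and a per-response drift penalty). It is an assumption about how such a pipeline might be built, not a description of any vendor's implementation, and `beta` and all the numbers are hypothetical.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """Stage 2: small when the reward model scores the human-preferred response higher."""
    margin = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-margin))  # same as -log(sigmoid(margin))

def penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Stage 3: reward minus a penalty that grows as the tuned model rates this
    response as more likely than the original (reference) model did."""
    return reward - beta * (logprob_policy - logprob_reference)

# Reward model agrees with the annotators: low loss.
print(pairwise_preference_loss(2.0, 0.5))
# Reward model disagrees: higher loss, so training pushes the scores apart.
print(pairwise_preference_loss(0.5, 2.0))

# No drift from the reference model: the reward passes through unchanged.
print(penalized_reward(1.0, logprob_policy=-5.0, logprob_reference=-5.0))
# Large drift: the same raw reward is worth less.
print(penalized_reward(1.0, logprob_policy=-2.0, logprob_reference=-8.0))
```

The `beta` knob is the constraint in code form: raise it and the model stays closer to its pretrained behavior, lower it and the reward model gets more say.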
RLHF reshapes which outputs the model favors. It makes the model more helpful, less harmful, better at following instructions. It does not teach the model new facts or new language. The knowledge comes from pretraining. RLHF steers how that knowledge gets expressed.
The GPT-4 technical report notes that the base pre-trained model is "highly calibrated" (its confidence roughly matches its accuracy), but RLHF reduces that calibration. The model becomes more helpful and more confidently wrong at the same time.
When a vendor tells you their model is "aligned," that's a real thing, but it comes with this trade-off that doesn't appear in the marketing.
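If "calibration" feels abstract, this is the simplest possible version of the check. The confidence values and right/wrong outcomes below are invented for illustration; real calibration studies bucket thousands of answers by confidence level.

```python
def calibration_gap(confidences, correct):
    """Mean stated confidence minus actual accuracy.
    Near 0 means well calibrated; positive means overconfident."""
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy

# Hypothetical data: both models get the same three of four answers right.
outcomes = [1, 0, 1, 1]
base_confidence = [0.9, 0.6, 0.8, 0.7]      # confidence tracks accuracy
tuned_confidence = [0.95, 0.9, 0.95, 0.9]   # same accuracy, more certainty

base_gap = calibration_gap(base_confidence, outcomes)
tuned_gap = calibration_gap(tuned_confidence, outcomes)
print(f"base model gap:  {base_gap:+.3f}")
print(f"tuned model gap: {tuned_gap:+.3f}")
```

Same accuracy, bigger gap. That is the shape of the trade-off to ask a vendor about.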
Reasoning chains are newer. Models trained or prompted to externalize their intermediate reasoning steps perform measurably better on complex tasks. DeepSeek's R1 showed that sophisticated reasoning behavior, including self-correction and backtracking, could emerge through reinforcement learning alone. It used a technique called GRPO, which is cheaper than traditional approaches because it drops the separate critic model and estimates baselines from a group of sampled outputs instead. "Reasoning models" became a category in 2025 for exactly this reason. The capability gain came entirely from post-training technique.
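The "estimates baselines from a group" idea is easy to sketch. Assume the model answers the same prompt several times and each answer gets a reward; the rewards below are hypothetical, and normalizing by the population standard deviation is one reasonable choice, not a claim about DeepSeek's exact code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer against its own group: the group mean acts as
    the baseline, so no separately trained critic model is needed."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]

# Hypothetical rewards for five sampled answers to one prompt
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward {reward:.1f} -> advantage {advantage:+.2f}")
```

Answers that beat the group average get a positive advantage and are reinforced; the rest are discouraged. The saving comes from replacing a second large model with a mean and a standard deviation.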
Tool use extends the model beyond text generation. A model with tool-use capability can invoke external functions: a calculator, a search engine, a code interpreter, an API. The model generates a structured function call; the system executes it; the result feeds back into the model's context. Every one of those invocations is a boundary crossing that involves authentication, authorization, and audit. (That's a later article.) Tool use is a post-training capability layered onto the base model.
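Here is what that boundary crossing might look like from the system's side. Everything in this sketch is hypothetical: the tool registry, the `agent-7` principal, the allow-list, and the JSON shape of the call are stand-ins, not any vendor's actual format.

```python
import json
from datetime import datetime, timezone

# Hypothetical tool registry, allow-list, and audit trail.
TOOLS = {"add": lambda a, b: a + b}
ALLOWED = {"agent-7": {"add"}}
AUDIT_LOG = []

def handle_tool_call(principal, raw_call):
    """Parse a model-generated call, authorize it, record it, then run it."""
    call = json.loads(raw_call)
    name, args = call["name"], call["arguments"]
    allowed = name in TOOLS and name in ALLOWED.get(principal, set())
    # Every attempt is logged, whether or not it is permitted.
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "principal": principal,
        "tool": name,
        "allowed": allowed,
    })
    if not allowed:
        return json.dumps({"error": f"{principal} may not call {name}"})
    # The result goes back into the model's context as text.
    return json.dumps({"result": TOOLS[name](**args)})

# The model emits text that happens to be a structured call.
print(handle_tool_call("agent-7", '{"name": "add", "arguments": {"a": 2, "b": 3}}'))
print(handle_tool_call("agent-9", '{"name": "add", "arguments": {"a": 2, "b": 3}}'))
print(len(AUDIT_LOG), "audit entries")
```

The model never executes anything. It proposes; the surrounding system decides, which is why the identity questions land on the system and not on the model.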
Across all three techniques, the same dynamic holds: the base model provides raw capability, and post-training shapes, steers, and extends it. When someone tells you a newer model is "better," the improvement increasingly comes from how it was post-trained, and that matters for procurement conversations. The vendor's post-training choices deserve at least as much scrutiny as which base model they started with.
Where the bridge holds and where it breaks
Your IDAM background gives you real advantages in AI conversations. It also sets specific traps. Knowing which is which is the whole game.
Governance transfers. Who authorized this model to access that data? What can it do with the access it has? How do you audit what it did? These are identity questions. Least privilege, separation of duties, audit trails. Your instinct to ask "who decided this system could do that?" is exactly the right instinct, and most AI conversations don't have a good answer for it yet. That's your opening.
Procurement skepticism transfers. You know how to evaluate a vendor's claims about capability. You know that a feature on a roadmap is not a feature in production. You know to ask for evidence. All of that applies, and frankly the AI vendor ecosystem needs it more than IDAM ever did.
Testability breaks down. An RBAC model is enumerable. You can list every role, every permission, every assignment. You can test whether a specific user can access a specific resource and get a deterministic yes or no. An LLM is none of those things. You cannot list everything it will do. The same input can produce different outputs on different runs. Evaluation is statistical: you run hundreds of test cases and measure aggregate performance. When a buyer asks "how do we test this model?" and you hear "model" and reach for your testing frameworks, you'll reach for the wrong tools.
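"Statistical" has a concrete meaning here: a pass rate plus an uncertainty range. The sketch below uses the standard Wilson score interval; the 460-of-500 result is an invented example, not a benchmark figure.

```python
import math

def wilson_interval(passes, trials, z=1.96):
    """Approximate 95% confidence interval for a pass rate (Wilson score interval)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width

# Hypothetical evaluation: the model passed 460 of 500 test cases.
low, high = wilson_interval(460, 500)
print(f"observed pass rate: {460 / 500:.1%}")
print(f"plausible range:    {low:.1%} to {high:.1%}")

# The same pass rate from only 50 cases is far less certain.
low_small, high_small = wilson_interval(46, 50)
print(f"with 50 cases:      {low_small:.1%} to {high_small:.1%}")
```

An RBAC test returns yes or no. This returns "somewhere in this range, given how many cases we ran," and the range only narrows as the test set grows.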
Configurability breaks down too. You configure an ABAC policy by defining attributes, conditions, and rules. The engine enforces those rules as written. An LLM has no rules to write. You shape its behavior through prompts, fine-tuning, and guardrails, but none of these are rule sets in the sense you're used to. A prompt suggests behavior to a statistical system. The model treats the prompt as input to a probability calculation, which means it will probably follow the instruction. Probably. That gap between "probably will" and "definitely will" is where your IDAM instincts will scream, and they should. The scream is useful. It tells you where the hard governance problems actually live.
Evaluation criteria are different too. When you evaluate an access control model, you're comparing frameworks: does RBAC or ABAC better fit this organization's policy requirements? When a buyer evaluates an AI model, they're comparing statistical artifacts: which one performs better on their tasks, at what cost, with what failure modes? Different criteria. Different procurement logic. The governance questions carry over. The object being governed is entirely different.
Hold onto that last distinction. The governance instinct transfers cleanly. The mechanical intuition about how models behave will get you in trouble. Knowing which one you're running on in a given moment is what keeps you credible in the room. And when the mechanical questions get deep enough that you're not sure which intuition applies, that's when you bring in your SE. Knowing that boundary is its own form of expertise.
Things to follow up on...
- NIST's AI vocabulary effort: NIST AI 100-3, "The Language of Trustworthy AI," is published but explicitly not finalized yet, which means the canonical definitions your federal buyers cite may shift before procurement language solidifies around them.
- Reasoning models hide reasoning: Anthropic's own research found that Claude 3.7 Sonnet mentioned the hints it had actually used only 25% of the time, which means the chain-of-thought traces buyers see in model outputs are not reliable windows into how the model actually arrived at its answer.
- RLHF's reward hacking problem: The technique that makes models helpful also creates a risk where models learn to exploit the reward signal rather than genuinely improve, producing outputs that score well on preference metrics while being confidently wrong. That dynamic is worth understanding before your next evaluation conversation.
- Post-training cost is climbing: Anthropic more than doubled its estimate of per-developer token costs for Claude Code after switching to Opus 4.7, partly because the new tokenizer produces up to 35% more tokens for the same input text. It is a reminder that "same price per token" and "same cost per task" are very different claims.

