The statistical process that produces a correct answer and the statistical process that produces a hallucinated one are the same process. Same mechanism, same math, same fluency in the output. The model has no internal state that distinguishes "I know this" from "I'm guessing." There is no confidence flag, no uncertainty register, no moment where the system switches from retrieval to fabrication. The generation mechanism runs one way. Sometimes the output is true. Sometimes it isn't. The model cannot tell which is which, and neither can you, by looking at the output alone.
This is the single most operationally important fact about language models in any environment where the answer matters.
Where hallucinations come from
You already know from earlier in this series that language models predict the next token based on statistical patterns learned during training. A 2025 paper by Kalai et al. formalizes why this prediction process inevitably produces hallucinations. The authors are principal researchers at OpenAI and Georgia Tech, and their argument is precise enough to be useful: hallucinations arise as errors in binary classification. When a model encounters a question where the correct answer and a plausible-but-wrong answer are statistically close together, the model will sometimes pick wrong.
The training data doesn't need to be dirty. The model doesn't need to be small. The statistical objective driving pretraining simply cannot perfectly separate rare truths from plausible fictions.
The paper gives a quantitative anchor worth holding onto. If 20% of facts in a given category appear only once in the training corpus, expect the model to hallucinate on at least 20% of questions in that category. That's a mathematical lower bound, not an average. A peer-reviewed study in PNAS confirmed the relationship empirically: the "monofact rate" (facts appearing exactly once in training data) directly predicts hallucination frequency.
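To make the 20% figure concrete, here is a minimal Python sketch of the counting behind it. It assumes the simplified reading used above (the share of distinct facts that appear exactly once in the corpus); the paper's formal definition may differ in detail. The function name `monofact_rate` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def monofact_rate(fact_mentions):
    """Share of distinct facts that appear exactly once in a list of mentions.

    Simplified illustration (an assumption of this sketch), not the paper's
    formal estimator.
    """
    counts = Counter(fact_mentions)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy corpus: five distinct facts, one of which ("fact_e") is mentioned once.
toy_mentions = [
    "fact_a", "fact_a",
    "fact_b", "fact_b", "fact_b",
    "fact_c", "fact_c",
    "fact_d", "fact_d",
    "fact_e",
]

rate = monofact_rate(toy_mentions)
print(f"Monofact rate: {rate:.0%}")
```

On this toy data one of the five facts is a singleton, so the rate is 20%. Under the paper's argument, that number is a floor on the expected hallucination rate for questions in this category, not a prediction of the exact rate.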
That's pretraining. Post-training makes it worse in a specific, almost perverse way.
When models are fine-tuned and evaluated on benchmarks, the grading system rewards answering over abstaining. Kalai et al. put it directly: "language models are optimized to be good test-takers, and guessing when uncertain improves test performance." A model that says "I don't know" scores lower than a model that guesses confidently and gets it right 70% of the time. So the optimization that makes models useful, the thing that makes them answer your question instead of staring at you blankly, is the same optimization that makes them guess when they shouldn't.
One incentive, two faces. Helpfulness and hallucination come from the same training pressure.
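The incentive is easy to see with a little arithmetic. The sketch below compares the expected score of guessing against abstaining under two grading schemes: binary grading (1 point for a right answer, 0 otherwise) and a hypothetical scheme that subtracts a point for a wrong answer. The function names and the penalty value are assumptions made for illustration, not taken from any specific benchmark.

```python
def expected_guess_score(p_correct, wrong_penalty=0.0):
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


ABSTAIN_SCORE = 0.0  # "I don't know" earns nothing under either scheme


def better_choice(p_correct, wrong_penalty=0.0):
    """Return whichever option has the higher expected score."""
    if expected_guess_score(p_correct, wrong_penalty) > ABSTAIN_SCORE:
        return "guess"
    return "abstain"


for p in (0.7, 0.2):
    print(
        f"p(correct)={p}:",
        "binary grading ->", better_choice(p),
        "| wrong-answer penalty of 1 ->", better_choice(p, wrong_penalty=1.0),
    )
```

Under binary grading, guessing beats abstaining at any nonzero chance of being right, even 20%. Abstaining only becomes the better choice for the low-confidence case once wrong answers cost something.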
A note on this source: Kalai et al. is an arXiv preprint, not yet peer-reviewed as of this writing. The mathematical framework builds on prior published work by Kalai and Vempala (STOC 2024), and the core claims have been corroborated in subsequent peer-reviewed work. It is the best formal explanation available for why hallucination is structural. Treat it as strong theoretical grounding that hasn't yet reached settled consensus.
Recap:
- Structural origin: Hallucinations arise because next-token prediction cannot always distinguish rare truths from plausible fictions, even with clean training data and large models.
- Training incentive: Post-training optimization rewards guessing over abstaining, so the same pressure that makes models helpful makes them confidently wrong.
The confidence problem
A hallucinated answer and a correct answer arrive with identical fluency. No stutter. No hedge. No tell. The model doesn't know it's wrong, so it can't signal that it's wrong.
NIST AI 600-1, the federal generative AI risk profile, names this phenomenon "confabulation." The vocabulary choice is deliberate: NIST avoided the clinical baggage of "hallucination" in favor of a term that describes the behavior more precisely. Both words refer to the same thing in this context, and your buyers may use either depending on whether they're reading policy documents or vendor literature. What matters is NIST's characterization of the risk: it arises "when users believe false content due to the confident nature of the response." NIST goes further, flagging that models sometimes generate fabricated citations and logical reasoning steps that appear to justify an incorrect answer. The model states the wrong thing and then builds a scaffold of fake evidence around it.
Benchmark data makes the scale concrete. On open-ended factual queries (the PersonQA benchmark), OpenAI's o3 reasoning model hallucinated 33% of the time, more than double the 16% rate of its predecessor o1. The smaller o4-mini hit 48%. On bounded summarization tasks with source documents, the same class of models hallucinated less than 1% of the time.
A note on these numbers: The PersonQA figures for o3 and o4-mini are reported across multiple secondary sources citing OpenAI's evaluation benchmarks and the Vectara HHEM leaderboard. They are directionally consistent but drawn from secondary compilations, not primary model cards. Specific numbers will shift as models improve. The pattern they illustrate will not.
And the pattern is counterintuitive. Models optimized for deeper chain-of-thought reasoning performed worse on factual queries without source documents. The Kalai et al. framework offers an explanation: chain-of-thought reasoning doesn't change the underlying binary classification problem at the generation step. It gives the model more tokens in which to build a convincing path to the wrong answer. The model gets better at constructing plausible justifications and is still unable to flag when it doesn't know. If someone in a buyer conversation says "we're using a reasoning model, so hallucination isn't a concern," this is the data point that matters.
The difference across all these benchmarks comes down to whether the model has something to check against. Open-ended generation without source grounding produces materially higher error rates than grounded tasks, and reasoning capability alone does not fix this. The model can't signal its own uncertainty, so the mitigation has to come from outside the model.
Recap:
- Indistinguishable delivery: Hallucinated outputs carry the same confidence, fluency, and apparent reasoning as correct outputs. There is no built-in signal that the model is wrong.
- Task-dependent rates: Hallucination rates range from sub-1% on grounded summarization to 16–48% on open-ended factual queries, making the presence or absence of grounding the dominant variable.
What grounding fixes and what it doesn't
Grounding is the family of techniques that give a model something external to check against at inference time, rather than relying solely on what it learned during training. Four members of this family matter:
RAG (retrieval-augmented generation) retrieves relevant documents and feeds them to the model alongside the user's query. You encountered the mechanics in Lesson 3. What matters here is the hallucination-reduction effect: a clinical study using self-reflective RAG lowered hallucination rates to 5.8% in a medical decision-support context. Material improvement. The hallucination rate drops but never hits zero.
Citations and attribution work differently. Instead of just retrieving source material, the model is constrained to tag its output with references to the specific passages it drew from. The model can still misrepresent a source, but citations give a reviewer a direct path to check: here's what the model said, here's the passage it claims to be based on, do they match? Google's Vertex AI documentation calls this "auditability" and lists it as a distinct benefit alongside hallucination reduction. Citations make the answer checkable, which is the actual value. Correctness still requires someone doing the checking.
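A minimal sketch of what "checkable" can mean in practice: given the passage a model claims to be quoting, confirm that the passage actually exists in the named source. Everything here is hypothetical (the `check_citations` function, the dictionary keys, the sample policy text), and the check is deliberately narrow. It catches fabricated or misattributed quotes; it does not confirm that the model's claim represents the passage fairly, which still takes a reviewer.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())


def check_citations(cited_claims, sources):
    """Flag citations whose quoted passage cannot be found in the named source.

    cited_claims: list of dicts with hypothetical keys "claim", "source_id", "quote".
    sources: dict mapping source_id -> full source text.
    """
    results = []
    for item in cited_claims:
        source_text = sources.get(item["source_id"])
        quote = normalize(item["quote"])
        if source_text is None:
            status = "source_missing"
        elif quote and quote in normalize(source_text):
            # The passage exists; whether the claim matches it is a human call.
            status = "quote_found_needs_human_review"
        else:
            status = "quote_not_found"
        results.append({"claim": item["claim"], "status": status})
    return results


# Hypothetical source text and model output, for illustration only.
sources = {
    "policy_doc": "Accounts are reviewed every 90 days.  Privileged accounts are reviewed monthly.",
}
cited_claims = [
    {"claim": "Privileged accounts get a monthly review.",
     "source_id": "policy_doc", "quote": "privileged accounts are reviewed monthly"},
    {"claim": "All accounts are reviewed weekly.",
     "source_id": "policy_doc", "quote": "accounts are reviewed weekly"},
    {"claim": "An audit memo confirms this.",
     "source_id": "audit_memo", "quote": "reviews were completed on schedule"},
]

for result in check_citations(cited_claims, sources):
    print(result["status"], "-", result["claim"])
```

The best status this check can return is "needs human review." That is the point of citations: they shorten the path to the check without replacing it.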
Tool-use is the most mechanically distinct member of the family. Rather than generating an answer from memory or retrieved text, the model calls an external system at inference time: querying a database, hitting an API, running a calculation. When a model calls a personnel database to look up an employee's clearance level instead of generating the answer from training patterns, it's using a tool. The answer comes from the system of record, not from next-token prediction. Tool-use sidesteps the generation mechanism entirely for the specific fact being looked up, which is why it's the strongest grounding technique for verifiable, structured data.
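The clearance example can be sketched in a few lines. This is an illustration under stated assumptions, not a real integration: `PERSONNEL_DB`, the employee ID, and both function names are hypothetical stand-ins for a system of record and the tool a model would call. The design point is that the fact is read from the record, and a missing record produces an explicit "not found" rather than a plausible-sounding answer.

```python
# Hypothetical system of record, standing in for a real personnel database.
PERSONNEL_DB = {
    "E1001": {"name": "A. Example", "clearance": "Secret"},
}


def lookup_clearance(employee_id, db=PERSONNEL_DB):
    """Tool the model calls instead of generating the answer from memory."""
    record = db.get(employee_id)
    if record is None:
        return {"found": False, "employee_id": employee_id}
    return {
        "found": True,
        "employee_id": employee_id,
        "clearance": record["clearance"],
        "source": "personnel_db",
    }


def answer_clearance_question(employee_id, db=PERSONNEL_DB):
    """Build the response from the tool result only, never from a guess."""
    result = lookup_clearance(employee_id, db)
    if not result["found"]:
        return f"No record found for {employee_id}."
    return (
        f"{employee_id} holds {result['clearance']} clearance "
        f"(source: {result['source']})."
    )


print(answer_clearance_question("E1001"))
print(answer_clearance_question("E9999"))
```

The model can still misreport what the tool returned when it writes the final response, so the source tag matters: it tells a reviewer exactly where to look.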
Verification workflows are the human-process layer. They sit outside the model and define who checks what, against which source, before the output becomes a decision. More on this below.
Every major provider offers grounding capabilities, and every one of them describes the benefit as "reducing" hallucination, not "eliminating" it. The wording is accurate: grounding does not eliminate hallucination.
The ReDeEP paper found that "even with accurate and relevant retrieved content, RAG models can still produce hallucinations by generating outputs inconsistent with retrieved content." The RAGTruth corpus documented the same pattern. A model given the right document can still misrepresent what the document says. The Kalai et al. framework explains why: the binary classification problem exists at the generation step, not just the knowledge step. Better knowledge helps. It doesn't change the generation mechanism.
Grounding without verification produces a better guess. Acting on it still requires checking.
Recap:
- Grounding reduces, doesn't eliminate: RAG, citations, tool-use, and verification workflows materially lower hallucination rates by giving the model external information to anchor against, but the generation mechanism can still produce errors even with correct retrieved content.
- Error profile shift: Grounding changes the dominant failure mode from fabricated facts to misrepresented sources, a meaningful improvement because the second type is auditable.
Your IDAM instinct is exactly right here: never trust an assertion without verifying it. A SAML assertion is signed and validated against the IdP's certificate; an OAuth token is verified against the authorization server. That instinct maps directly to AI outputs. Don't trust the model's answer without a verification path. The analogy breaks at the mechanism level: in IDAM, verification is automated and binary (a signature either validates or it doesn't). With AI outputs, there is no digital signature that proves factual accuracy. Verification has to happen at the content level, with a human or system checking the output against a source, and it cannot be reduced to a protocol-level check. Your IDAM instinct gives you the right posture. It does not give you the right mechanism.
The calibrated trust model
The practical landing point: never delegate a consequential decision to a model without a verification path in the workflow.
Federal policy already requires this. OMB M-25-21 defines "High-Impact AI" as AI whose output serves as a principal basis for decisions with legal, material, or safety effects, and mandates seven minimum risk management practices including pre-deployment testing, ongoing monitoring, and human oversight. GSA's implementation guidance requires continuous monitoring, human-in-the-loop validation, and annual re-registration for every approved AI system. The Federal Reserve Board prohibits any high-impact AI use without explicit CAIO approval.
Your buyers are building these verification workflows right now. When a CAIO tells you "we're piloting an LLM for internal knowledge management" or "we're evaluating generative AI for procurement support," the question that earns you the room is: what does the verification path look like? How are they confirming the output before it becomes a decision? That question signals you understand the structural problem underneath the product category.
The calibrated trust model has three tiers, and they map to the benchmark data above:
Low-stakes, grounded tasks (summarizing a document the user can check): Model output is useful as a draft. Verification is lightweight. Skim the source, confirm the summary tracks.
Medium-stakes, partially grounded tasks (synthesizing across multiple sources, generating analysis): Model output is a starting point that still needs validation. Think: drafting a comparative analysis of vendor security architectures for a procurement evaluation, or generating a risk summary across multiple ATO packages. Verification requires checking claims against sources and flagging unsupported assertions before anyone acts on them.
High-stakes, consequential decisions (anything M-25-21 would classify as high-impact): Model output feeds a human decision. The human still makes it. The verification path must be designed into the workflow before deployment, not bolted on after.
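The three tiers can be written down as a small lookup, which is roughly how they end up encoded in a workflow tool. This is a sketch for illustration only: the `Stakes` enum, the wording of each requirement, and the `may_proceed` gate are hypothetical and are not drawn from M-25-21 or any agency's implementation.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "low"        # grounded task the user can check, e.g. a summary
    MEDIUM = "medium"  # partially grounded synthesis or analysis
    HIGH = "high"      # output is a principal basis for a consequential decision


# Hypothetical wording of the verification each tier calls for.
VERIFICATION_BY_TIER = {
    Stakes.LOW: "Skim the source and confirm the summary tracks.",
    Stakes.MEDIUM: "Check claims against sources and flag unsupported assertions.",
    Stakes.HIGH: "Human makes the decision; verification path designed in before deployment.",
}


def may_proceed(stakes, verification_done, human_decides):
    """Gate an output: every tier needs its verification; high stakes also needs a human decision-maker."""
    if stakes is Stakes.HIGH:
        return verification_done and human_decides
    return verification_done


for tier in Stakes:
    print(f"{tier.value}: {VERIFICATION_BY_TIER[tier]}")

# A verified high-stakes output still cannot proceed without a human making the call.
print(may_proceed(Stakes.HIGH, verification_done=True, human_decides=False))
```

No tier has "no verification" as an option. What changes across tiers is how much effort the check takes and who owns the decision.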
M-25-21 calls this "human oversight," which is a generous phrase for what's actually required. NIST AI 600-1 identifies the failure mode: a human in the loop who rubber-stamps outputs functions as a compliance checkbox, one that doesn't reduce risk. What's required is a workflow where the model's output and its sources are both visible to the person making the decision, and where that person has the time and the context to evaluate the gap between them. If the workflow doesn't provide both, you have a human in the loop in name only.
Recap:
- Calibrated trust: Match verification effort to the stakes. Low-stakes grounded tasks need a skim. High-stakes decisions need a designed verification workflow with source visibility and human judgment.
- Policy alignment: OMB M-25-21 already mandates human oversight and ongoing monitoring for high-impact AI in federal agencies. Your buyers are building these workflows now, and understanding why they're structurally necessary is the conversation worth having.
Things to follow up on...
- NIST's preferred vocabulary: NIST AI 600-1 uses "confabulation" instead of "hallucination" and maps it to specific Measure and Manage functions in the AI Risk Management Framework, which your public sector buyers are likely already referencing.
- RAG faithfulness failures: The ReDeEP paper uses mechanistic interpretability to show exactly how models generate outputs that contradict the documents they just retrieved, a useful reference when a buyer assumes RAG solves the problem.
- DHS compliance implementation: DHS's published compliance plan for M-25-21 gives Component Senior AI Officers explicit authority to approve or retire AI use cases, showing how the verification-path requirement translates into actual agency governance.
- Monofact rate research: A peer-reviewed PNAS study empirically confirmed that manipulating training data frequency distributions directly controls hallucination rates, providing the strongest experimental support for the Kalai et al. theoretical framework.

