Why Model Failures Don't Throw Errors

Software failures announce themselves; model confabulation doesn't. How to use NIST's preferred vocabulary and reframe the risk for buyers.

By Leigh Garrity, May 9, 2026

Software Failure

What it is: A deterministic system producing an unintended result due to a defect in code, configuration, or infrastructure.

What it does: Software failures show up as crashes, exceptions, error codes, degraded performance, or incorrect outputs that are detectable through logging, monitoring, and testing. When token validation fails, the system returns an error. When a database query times out, the application logs it. The failure and the signal are the same event.

Where it comes from: Decades of engineering practice have built an entire discipline around failure detection: structured error handling, observability stacks, incident response frameworks, SLAs with measurable uptime. Every piece of that infrastructure assumes one thing: failures are observable.

What makes it distinct: Software failures are self-announcing. The system knows something went wrong, even if nobody has looked at the log yet. This is so fundamental to how we think about reliability that you don't notice the assumption until you encounter a system that breaks it.

Model Confabulation

What it is: A generative AI model producing false or unsupported content as a natural consequence of how it generates output.

What it does: The model produces text that reads as authoritative and coherent but contains fabricated facts, invented citations, or incorrect reasoning. NIST AI 600-1 defines confabulation as "erroneous or false content" that can include fabricated logic or citations "which may further mislead humans into inappropriately trusting the system's output." The output doesn't look broken. It looks polished.

Where it comes from: Language models are trained to predict the most statistically plausible next token. The training objective rewards coherence and fluency without explicitly penalizing factual inconsistency (preprint). The model optimizes for "sounds correct," and whether the output is correct depends on whether the plausible sequence happens to match reality. Confabulation lives in that gap. Recent theoretical work (preprint) frames this as "a statistical necessity" for calibrated language models, not a defect to be eliminated.

NIST chose "confabulation" partly because commenters noted that "hallucination" anthropomorphizes the system, implying the model perceives something that isn't there. It doesn't perceive anything. It generates plausible sequences. For government buyers writing authority-to-operate packages, the distinction between "the system perceived incorrectly" and "the system generated plausible nonsense" changes how you scope the mitigation. The first framing invites a recalibration fix. The second demands architectural mitigation. They lead to very different procurement conversations.

What makes it distinct: Confabulation is silent. The model has no internal "I don't know" signal unless one has been specifically trained in, and even then the signal is unreliable. OpenAI's GPT-4 System Card acknowledges that hallucinations "can become more dangerous as models become more truthful, as users build trust in the model when it provides truthful information in areas where they have some familiarity." The better the model gets at being right, the harder it becomes to catch when it's wrong, because you stop checking.

Okta Concept: Authentication Errors vs. Confabulation

Your IDAM instinct says if the system accepted it, it's valid. That's exactly wrong for model output — a confabulated answer passes every surface-level check with no "invalid token" equivalent.
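
The contrast can be sketched in a few lines of Python. This is a minimal illustration, not a real identity or model API: `InvalidTokenError`, `validate_token`, `stub_model`, and `passes_surface_checks` are all hypothetical names, and the "model" is a lookup table of canned sentences.

```python
class InvalidTokenError(Exception):
    """Hypothetical error type for a failed token check."""


def validate_token(token: str, valid_tokens: set[str]) -> str:
    # Deterministic check: a bad token produces an explicit error that
    # logging and monitoring can pick up.
    if token not in valid_tokens:
        raise InvalidTokenError(f"token rejected: {token!r}")
    return token


def stub_model(question: str) -> str:
    # Stand-in for a generative model, using canned text (hypothetical).
    # Every branch returns fluent prose; nothing in the return value
    # distinguishes a true answer from a fabricated one.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        # False on purpose: the capital of Australia is Canberra.
        "What is the capital of Australia?": "The capital of Australia is Sydney.",
    }
    return canned.get(question, "The answer is well documented in the literature.")


def passes_surface_checks(answer: str) -> bool:
    # Checks a pipeline can run without a source of truth:
    # non-empty, starts with a capital letter, ends with a period.
    return bool(answer) and answer[0].isupper() and answer.endswith(".")


if __name__ == "__main__":
    try:
        validate_token("expired-123", {"good-456"})
    except InvalidTokenError as err:
        print("software failure announced:", err)

    for q in ("What is the capital of France?", "What is the capital of Australia?"):
        answer = stub_model(q)
        print(passes_surface_checks(answer), answer)
```

The rejected token raises an exception. The false answer about Australia passes the same surface checks as the true answer about France, so only a comparison against an independent source would separate them.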

Five Risk Dimensions, Compared

Five dimensions matter here: visibility, reproducibility, traceability, incident response, and failure frequency. The first three are detection properties. The last two are operational consequences. All five map to how a CISO or CAIO actually evaluates risk, and each one marks a point where twenty years of software-trained intuition goes quiet.

Dimension	Software Failure	Model Confabulation
Visibility	Self-announcing. Errors produce signals: exceptions, status codes, log entries, alerts. The system knows it failed.	Silent. Output looks identical whether correct or fabricated. No error state triggers. Detection requires checking the output against an independent source of truth.
Reproducibility	Generally deterministic. Same input, same config, same failure. You can reproduce it in staging and write a regression test.	Probabilistic. The same prompt may produce a correct answer on one run and a fabrication on the next. Temperature settings, context window contents, and model version all shift the output. Regression testing doesn't transfer cleanly.
Traceability	Full audit trail. Stack traces, request logs, and configuration state let you reconstruct what happened and why.	Opaque. The model doesn't expose why it generated a particular output. Semantic entropy methods (techniques that sample several answers to the same question and measure how much those answers differ in meaning) can detect when a model is uncertain, but they cannot detect high-confidence wrong answers. Those are the dangerous ones.
Incident response	Established playbooks. Find the error, identify root cause, deploy fix, verify, write postmortem.	No established playbook. You can't "fix" a confabulation the way you fix a bug. The behavior emerges from the model's training distribution, not from a discrete defect. Mitigation is architectural: retrieval-augmented generation, human review, output validation against authoritative sources.
Failure frequency	Failures are exceptions. A well-engineered system operates correctly most of the time, and failures indicate something is broken.	Confabulation is continuous. The model is always generating probabilistically. The variable is whether a particular output happens to align with reality.
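
The Traceability row's point about semantic entropy can be made concrete with a toy calculation. The sketch below rests on a loud simplifying assumption: two answers count as "the same meaning" if they match after lowercasing and trimming, whereas the published method groups answers by meaning using a separate language model. The function names `meaning_key` and `semantic_entropy` are illustrative, not from any library.

```python
import math
from collections import Counter


def meaning_key(answer: str) -> str:
    # Simplifying assumption: answers share a meaning if they match after
    # lowercasing and trimming whitespace and a trailing period.
    return answer.strip().rstrip(".").strip().lower()


def semantic_entropy(answers: list[str]) -> float:
    # Shannon entropy (in nats) over the meaning clusters of the samples.
    clusters = Counter(meaning_key(a) for a in answers)
    total = sum(clusters.values())
    return sum(-(n / total) * math.log(n / total) for n in clusters.values())


if __name__ == "__main__":
    # Samples that disagree with each other: entropy is positive.
    uncertain = ["Canberra", "Sydney", "Melbourne", "canberra."]
    # Samples that agree on a wrong answer: entropy is zero.
    confident_wrong = ["Sydney", "sydney.", "Sydney", " Sydney "]
    print(semantic_entropy(uncertain))
    print(semantic_entropy(confident_wrong))
```

When the sampled answers scatter, the score is positive and the output can be flagged. When every sample repeats the same wrong answer, the score is zero, which is indistinguishable from a model that is consistently right.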

The scale is worth grounding in specifics. In a peer-reviewed Stanford study of legal research, LLMs confabulated legal citations at least 58% of the time. Even RAG-grounded legal tools from Lexis and Westlaw still showed 17–33% error rates in a follow-up study. (These rates are from specific studies testing specific model generations; confabulation rates vary by model, domain, and grounding approach, but the structural pattern holds.) And the most-cited real-world consequence, Mata v. Avianca, happened because an attorney asked ChatGPT to confirm its own fabricated citations and the model confirmed they were real. Same confidence. Same tone. Same formatting. Completely invented.
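
Rates like these are estimated over many runs with known correct answers; a single passing run says little. The sketch below shows the idea with a hypothetical `stub_model` that fabricates with a fixed probability. The probability, the seed, and the function names are assumptions for illustration, not measurements of any real model.

```python
import random

CORRECT = "Canberra"


def stub_model(rng: random.Random, fabrication_rate: float) -> str:
    # Hypothetical model: returns the correct answer most of the time and
    # a fluent wrong answer with probability `fabrication_rate`.
    return "Sydney" if rng.random() < fabrication_rate else CORRECT


def observed_fabrication_rate(n_runs: int, fabrication_rate: float, seed: int = 0) -> float:
    # Ask the same question n_runs times and count answers that differ
    # from the known-correct one.
    rng = random.Random(seed)
    wrong = sum(1 for _ in range(n_runs) if stub_model(rng, fabrication_rate) != CORRECT)
    return wrong / n_runs


if __name__ == "__main__":
    # One run is a coin flip weighted toward "pass"; many runs give a rate.
    print(observed_fabrication_rate(n_runs=1, fabrication_rate=0.1))
    print(observed_fabrication_rate(n_runs=1000, fabrication_rate=0.1))
```

The design choice worth noting is that the harness needs the correct answer in advance. Without that independent ground truth, repeated sampling only measures how consistent the model is, not how often it is right.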

Okta Concept: Logging and Observability

You can log a model's prompt and output, but not why it chose those words. The incident response muscle you've built around log analysis doesn't transfer here.

Okta Concept: Certificate Expiration vs. Model Drift

Model behavior drifts invisibly — no expiration date, no alerts. When a buyer asks "how do we know the AI is still working correctly," they're asking about continuous validation, not point-in-time certification.
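
One way to picture continuous validation is a scheduled check against a small set of questions with authoritative answers. The sketch below is a minimal illustration under stated assumptions: `pass_rate`, `needs_review`, the golden set, the 0.05 tolerance, and the two stub models are all hypothetical, and substring matching stands in for real answer grading.

```python
from typing import Callable


def pass_rate(ask_model: Callable[[str], str], golden: dict[str, str]) -> float:
    # Fraction of golden questions whose answer contains the expected text.
    # Substring matching is a crude stand-in for real answer grading.
    if not golden:
        raise ValueError("golden set must not be empty")
    hits = sum(
        1 for question, expected in golden.items()
        if expected.lower() in ask_model(question).lower()
    )
    return hits / len(golden)


def needs_review(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    # Flag the deployment when accuracy drops more than `tolerance`
    # below the rate recorded at approval time.
    return current < baseline - tolerance


GOLDEN = {
    "Capital of France?": "Paris",
    "Capital of Australia?": "Canberra",
}


def model_at_approval(question: str) -> str:
    return {"Capital of France?": "Paris.", "Capital of Australia?": "Canberra."}[question]


def model_after_drift(question: str) -> str:
    return {"Capital of France?": "Paris.", "Capital of Australia?": "Sydney."}[question]


if __name__ == "__main__":
    baseline = pass_rate(model_at_approval, GOLDEN)
    current = pass_rate(model_after_drift, GOLDEN)
    print(baseline, current, needs_review(current, baseline))
```

Nothing in the drifted model raises an error. The drop is visible only because the check is rerun against answers that were verified outside the model.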

How to Say This in the Field

Don't say	Do say	Why it matters
"Hallucination is a known bug that vendors are fixing."	"Confabulation is structural to how these models generate output. There's no patch coming for it. Probabilistic generation requires architectural mitigation."	Buyers who hear "bug" expect a fix in the next release; confabulation doesn't work that way.
"The model hallucinated."	"The model confabulated — that's NIST AI 600-1's preferred term. The model generates plausible text sequences. 'Hallucination' implies perception, which mischaracterizes the mechanism."	Government buyers writing ATOs need precise risk language, and NIST vocabulary carries weight.
"AI makes mistakes sometimes, just like people do."	"Models produce false output with the same confidence as true output. There's no built-in uncertainty signal. Detection requires external verification."	Anthropomorphizing the failure mode leads to anthropomorphized mitigations. Those don't work.
"We can test for that."	"You can test for known failure cases, but confabulation doesn't reproduce deterministically. The same prompt can produce correct output nine times and a fabrication on the tenth."	Buyers from software backgrounds assume regression testing applies. It doesn't transfer cleanly.
"RAG solves hallucination."	"Retrieval-augmented generation reduces confabulation rates but doesn't eliminate them. Grounded legal research tools still show 17–33% error rates in peer-reviewed testing."	RAG is a mitigation, not a fix. Overselling it destroys your credibility when the buyer reads the research.
"The newer models are much better."	"Confabulation rates have improved with each generation, but OpenAI's own system card notes that better models can be more dangerous because users stop verifying."	Improvement is real, and so is the trust escalation problem.
"We need to monitor AI outputs."	"We need external validation against authoritative sources. 'Monitoring' implies the system will signal when something's wrong. It won't."	"Monitoring" carries connotations of log-based detection that don't apply here.
"That's an edge case."	"In a peer-reviewed Stanford study, legal LLMs confabulated citations at least 58% of the time. 58% is a base rate for ungrounded generation in specialized domains."	Concrete numbers from peer-reviewed research shut down the "edge case" dismissal.
"AI is just a tool. Garbage in, garbage out."	"The input can be perfectly clear and the output can still be fabricated. GIGO assumes deterministic processing. Probabilistic generation produces confabulations from clean inputs."	GIGO is a comforting mental model from deterministic systems that doesn't explain confabulation.
"We'll add a human in the loop."	"Human review is the strongest mitigation available, but it only works if the reviewer has independent domain expertise. Someone who can't independently verify the output is just reading confidently written text."	"Human in the loop" sounds like a checkbox. In practice, it's an expertise requirement.

Your buyer has probably spent twenty years in a world where system output is trustworthy unless the system tells you otherwise. Every monitoring tool, every audit framework, every incident response playbook they've built assumes failures are detectable events. Confabulation breaks that assumption. The entire concept of "error" as something the system knows about and reports doesn't map to a system that generates plausible text whether or not that text is true.

Things to follow up on...

NIST's twelve risk categories: NIST AI 600-1 places confabulation within a broader framework of twelve generative AI risk categories including data privacy, information integrity, and CBRN information access, several of which compound when confabulated output enters downstream systems.
Semantic entropy detection limits: The Oxford team behind the Nature paper on semantic entropy showed the method can flag when a model is uncertain, but it carries 5–10x computational overhead and still cannot catch high-confidence wrong answers.
Johnson v. Dunn escalation: In July 2025, a federal court in Alabama disqualified attorneys from a case after AI-generated hallucinated citations appeared in a motion, declaring that monetary sanctions alone are proving ineffective at deterring the problem.
GAO's generative AI inventory: GAO report GAO-25-107653 found federal generative AI use cases grew nine-fold from 32 to 282 between 2023 and 2024, while agencies' risk assessments for those deployments remain incomplete.