The Determinism Gap

By Leigh Garrity, May 8, 2026

Traditional software does exactly what a programmer told it to do. That is a design property, with specific consequences for how you test it, audit it, and fix it when it breaks. Machine learning systems don't work that way, and the difference isn't cosmetic. It's the thing your CISO buyer is about to discover when their existing incident response playbook meets their first AI deployment.

The shift: traditional software is deterministic. ML systems are probabilistic. Those words get used interchangeably with "old" and "new," which is wrong and will get you into trouble. The distinction is about how behavior is specified — and that distinction runs all the way down to how you respond when something goes wrong.

What Deterministic Actually Means

A deterministic system produces the same output every time you give it the same input. Always. Not usually — always. The behavior is fully specified by the code a programmer wrote. If you want to know why the system did what it did, you read the code. If you want to change what it does, you change the code. The rule is the artifact, and the artifact is auditable.

Your buyers' IT organizations were built around this assumption. Every access control policy, every SIEM rule, every firewall ACL operates on it. The system does what the rule says. When it doesn't, something is broken and you find the break.

A probabilistic system produces outputs that are right most of the time, based on patterns the system learned from data. The behavior isn't specified by rules a programmer wrote — it's inferred from examples. Given enough examples of spam email, an ML system learns to recognize spam. Given enough examples of fraudulent transactions, it learns to flag fraud. The "rule" is never written down. It lives, distributed and implicit, in the statistical relationships the system extracted from its training data.

Same input, same output? Usually: a trained model with fixed weights will typically return the same answer for the same input. But whether that answer is right is a property of how well the training data represented the problem, not a property of whether the code is correct. That's the gap.

How the Pattern-Learning Actually Works

A programmer building a traditional spam filter might write: if the subject line contains "FREE MONEY" and the sender domain is less than 30 days old, mark as spam. The rule is explicit. You can read it, test it, and explain any specific decision by pointing to which conditions fired.
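As a minimal sketch of that rule, assuming a hypothetical Email record with just the two fields the rule needs (the class, function, and condition names are illustrative, not from any real filter):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    # Hypothetical fields: only what the example rule inspects.
    subject: str
    sender_domain_age_days: int


def classify_by_rule(email):
    """Return a verdict plus the names of the conditions that fired."""
    fired = []
    if "FREE MONEY" in email.subject:
        fired.append("subject_contains_free_money")
    if email.sender_domain_age_days < 30:
        fired.append("sender_domain_younger_than_30_days")
    # The rule from the text: both conditions must hold.
    verdict = "spam" if len(fired) == 2 else "not_spam"
    return verdict, fired


msg = Email(subject="Claim your FREE MONEY today", sender_domain_age_days=12)
print(classify_by_rule(msg))
```

Every verdict comes back with the list of conditions that produced it. That list is the explanation, and it is available for every decision the filter ever makes.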

An ML spam filter is trained on, say, 50 million labeled emails — messages humans have already classified as spam or not-spam. The system processes those examples and builds an internal model of what distinguishes one class from the other. It might learn that certain word combinations, sender patterns, and timing signals correlate with spam in ways no programmer anticipated and no single rule could capture.
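To make "builds an internal model" concrete, here is a toy sketch that swaps in a tiny naive Bayes classifier written in plain Python. The six training messages are invented for illustration, and production filters use far larger data and more capable models; the point is only that nobody writes a rule.

```python
import math
from collections import Counter

# Hypothetical toy training set; a real filter would see millions of labeled messages.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap loans free money", "spam"),
    ("meeting agenda for monday", "ham"),
    ("budget review meeting notes", "ham"),
    ("lunch on monday", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def spam_probability(text, model):
    """Naive Bayes with add-one smoothing; returns an estimate of P(spam | text)."""
    word_counts, label_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize the two log scores into a probability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


model = train(TRAINING)
print(round(spam_probability("free money", model), 3))
print(round(spam_probability("monday meeting", model), 3))
```

Look for the rule and you won't find one. The behavior lives in the word counts, and changing a decision means changing the data those counts came from.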

The result: a system that correctly classifies roughly 99% of new emails it's never seen before. That number sounds impressive. It is impressive. It's also the entire point of the probabilistic framing — 99% accuracy means 1% of the time, the system is wrong, and that error rate is a feature of the data distribution, not a bug in the code.

In practice, if an agency processes 10 million emails a month, a 99% accurate spam filter misclassifies about 100,000 of them. Some of those are spam that gets through. Some are legitimate messages that get blocked. There is no line of code responsible for any specific misclassification. The error is a property of the model's learned patterns encountering an input that sits near the edge of what the training data covered well.

No bug to find here. The error is a property to manage, not a defect to patch.
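The arithmetic behind the 100,000 figure is worth having at your fingertips, because buyers will ask what a different volume or accuracy implies. A minimal sketch (the function name is illustrative):

```python
def expected_misclassifications(monthly_volume, accuracy):
    """Expected number of wrong calls per month at a given accuracy."""
    return round(monthly_volume * (1 - accuracy))


# The example from the text: 10 million emails a month at 99% accuracy.
print(expected_misclassifications(10_000_000, 0.99))
```

This is an expected value, not a guarantee. It also says nothing about how the errors split between spam that gets through and legitimate mail that gets blocked.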

IDAM Concept Mapping

Closest analog: the policy engine. Access control policy engines — OPA, XACML, Okta's expression language — are deterministic systems that make permit/deny decisions based on explicit rules. The analogy holds in one direction: both policy engines and ML classifiers take inputs and produce decisions. Where it breaks: a policy engine's decision is always traceable to a specific rule you can read, modify, and test in isolation. An ML model's decision is traceable to training data and learned weights, neither of which you can edit the way you edit a policy rule. You can change an ML model's behavior by retraining it on different data, but you can't change a single decision without potentially affecting thousands of others. The governance model for "this rule is wrong" doesn't transfer. Build a new mental model before you try to apply your policy-engine intuitions to ML audit requirements.

What This Means for the Conversations You're Already Having

Three topics come up in almost every public sector AI conversation right now: auditability, incident response, and the question of what happens when the system gives a wrong answer. The determinism gap is the reason all three are harder than your buyers' existing playbooks assume.

Auditability. Federal compliance frameworks — FISMA, FedRAMP, the emerging AI-specific overlays — require audit trails. Your buyers know how to build audit trails for traditional software: log the inputs, log the outputs, log which rule fired. That works because the rule is readable. You can reconstruct any decision by replaying the inputs against the policy.
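A sketch of what that looks like for a rule-based system, assuming a hypothetical three-rule access policy. The rule names and request fields are invented for illustration and are not taken from any real policy engine.

```python
import json


def policy(request):
    """Hypothetical access policy: the first matching rule wins."""
    if request["role"] == "admin":
        return {"decision": "permit", "rule": "admins_permitted"}
    if request["resource"].startswith("public/"):
        return {"decision": "permit", "rule": "public_resources_permitted"}
    return {"decision": "deny", "rule": "default_deny"}


def audit_record(request):
    """Log the input, the output, and which rule fired."""
    result = policy(request)
    return json.dumps({"input": request, **result}, sort_keys=True)


def replay_matches(record_json):
    """Re-run the logged input against the policy and compare with the log."""
    record = json.loads(record_json)
    replayed = policy(record["input"])
    return (replayed["decision"], replayed["rule"]) == (record["decision"], record["rule"])


record = audit_record({"role": "analyst", "resource": "public/report.pdf"})
print(record)
print(replay_matches(record))
```

Replay works here because the policy is the whole story: the same input against the same rules reproduces both the decision and the reason for it.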

ML audit trails work differently. You can absolutely log every input and every output; that part is the same. What you can't always produce is a human-readable explanation of why a specific output was generated. The "rule" is distributed across the model's parameters in a way that doesn't reduce to a sentence. That is a genuine compliance challenge, and the honest answer to a CISO who asks "can we audit this?" is: you can log everything, but the explanation layer requires additional tooling and has real limits. The spec for what "explainable AI" means in a FedRAMP context is still being written. Don't pretend otherwise.
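Here is a sketch of the part you can log, assuming a hypothetical fraud-scoring model. Every field name is illustrative, and the model hash is an assumption about what a careful deployment might record, not a requirement from any framework.

```python
import hashlib
import json


def ml_audit_record(features, score, threshold, model_bytes):
    """Build a log entry for one model decision (all field names are hypothetical)."""
    return {
        "input": features,
        "score": score,
        "threshold": threshold,
        "decision": "flag" if score >= threshold else "pass",
        # Hash of the serialized model, so the record pins the exact version used.
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
    }


record = ml_audit_record(
    features={"amount": 9800, "country": "US"},
    score=0.91,
    threshold=0.80,
    model_bytes=b"stand-in for serialized model weights",
)
print(json.dumps(record, indent=2))
```

Notice what is missing: there is no "rule" field to fill in. The record tells you what the model saw, what it said, and which model said it. It does not tell you why.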

Incident response. When a traditional system does something wrong, the IR playbook is: identify the anomalous behavior, trace it to the code path responsible, determine whether it's a bug or an exploit, patch it, deploy. The timeline is measured in hours to days. The fix is discrete.

When an ML system does something wrong, that playbook doesn't map cleanly. "Wrong answer" in ML has a different structure. The training data may not have represented this type of input well. The deployment context may have drifted from what the model was trained on — the world changed, the model didn't. The input may have been crafted specifically to exploit the model's learned patterns (this is a real attack class; it's called adversarial input, and it has no direct analog in traditional software exploitation).

None of these are fixed by patching code. Remediation involves retraining on better data, adjusting deployment constraints, or architectural changes that take weeks or months. And in the immediate aftermath of an incident, you often can't tell which category you're in. The IR team that expects a bug ticket is going to have a bad time.
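Drift, at least, is something you can watch for before it becomes an incident. One common approach is to compare the distribution of live inputs or scores against the training-time baseline. The sketch below uses the population stability index (PSI), swapped in here as one illustrative technique; the sample values and bin edges are invented.

```python
import math


def population_stability_index(expected, actual, edges):
    """PSI between a baseline sample and a live sample, using shared bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value meets or exceeds.
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical model scores at training time versus in production.
baseline = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
live = [0.3, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 0.95]
edges = [0.25, 0.5, 0.75]

print(population_stability_index(baseline, baseline, edges))
print(population_stability_index(baseline, live, edges))
```

A PSI of zero means the two samples fall into the bins identically, and larger values mean the live data has moved away from the baseline. What counts as "too far" is a threshold each deployment has to set for itself, and a drift alarm tells you something changed, not what to do about it.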

"It gave the wrong answer." In traditional software, this phrase means the code didn't implement the intended behavior. Find the code, fix the code. In ML, it means the model's learned patterns produced an output that was incorrect for this specific input. Maybe the training data was biased. Maybe the input was unusual. Maybe the model is working exactly as designed and this is the 1% case.

The remediation path depends entirely on which of those is true, and determining which is true requires investigation that looks nothing like traditional debugging. Your buyers need to understand this before they deploy AI into any workflow where a wrong answer has consequences — which, in government, is most of them.

The Practical Upshot

ML systems can be trusted and can be deployed in government. They require a different trust model — one built around statistical accuracy, distributional coverage, and monitoring for drift rather than code correctness and rule verification. The agencies that are deploying AI well are the ones that understood this distinction before they wrote their ATO packages, not after.

When a CISO asks whether an AI system can be audited, a qualified answer serves better than yes or no: what you can log, what the explanation layer can and can't tell you, and what an IR playbook needs to look like for a probabilistic system. That answer requires the determinism gap to be clearly in your head before you walk into the room.

The gap is structural — the price of a system that can do things no programmer could write rules for. Understanding it is what separates a conversation that builds trust from one that creates liability.

Recap

Deterministic: Traditional software executes explicit rules a programmer wrote; given the same input, you always get the same output, and any deviation is a bug with a traceable cause.
Probabilistic: ML systems infer patterns from training data and produce outputs that are statistically correct most of the time; accuracy is a property of the data distribution, not the code, and errors don't have a single traceable cause.
The gap: Auditability, incident response, and "wrong answer" remediation all work differently in probabilistic systems — your buyers' existing playbooks assume determinism, and that assumption will break.