A policy engine is a promise. Same inputs, same output. Every time. Not because the system is trying to be consistent, but because it's executing a rule a human wrote, and rules don't change between evaluations.
Machine learning systems broke that on purpose.
ML systems trade the determinism guarantee for the ability to handle inputs no human ever wrote a rule for. That trade was worth making, for certain problems, and those problems are now showing up in your accounts. What matters is what gets traded away, and whether your buyers understand what they're giving up.
What Probabilistic Output Actually Means
When a traditional software system evaluates an access request, it runs through a rule set. The rules were written by a human. They're stored somewhere you can read. The system finds the applicable rule, applies it, and returns a result. The result is binary: permit or deny. Run the same request tomorrow and you get the same result, because the rules haven't changed.
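A minimal sketch of that rule-lookup behavior, to make the determinism concrete. The rule names, request fields, and first-match-wins ordering here are illustrative assumptions, not any particular product's policy model:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    applies: Callable[[dict], bool]  # condition a human wrote
    decision: str                    # "permit" or "deny"


# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    Rule("deny-unmanaged-device", lambda r: not r["managed_device"], "deny"),
    Rule("permit-finance-group", lambda r: "finance" in r["groups"], "permit"),
]


def evaluate(request: dict) -> tuple[str, str]:
    """Return (decision, name of the rule that fired)."""
    for rule in RULES:
        if rule.applies(request):
            return rule.decision, rule.name
    return "deny", "default-deny"  # nothing matched, so fall back to deny


request = {"user": "alice", "groups": ["finance"], "managed_device": True}
print(evaluate(request))  # ('permit', 'permit-finance-group')
```

Calling `evaluate` on the same request a second time returns the same pair, and the second element names the rule an auditor would go read.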
An ML system doesn't have a rule set. It has a model: a mathematical structure that encodes associations learned from training data. When the system receives an input, it doesn't look up a rule. It computes a probability distribution across possible outputs and returns the highest-probability answer.
Put more plainly: the answer is really "yes, with 94% confidence" or "no, with 87% confidence," and the system reports the winner without showing you the score. The system is usually right. It is not always right. And it cannot tell you, in advance, which inputs will produce wrong answers.
Wrong answers here are structural, not incidental. A bug is a rule written incorrectly — something you can find, fix, and eliminate. Probabilistic wrongness comes with the territory. Replacing explicit rules with inferred associations means the associations generalize well across most inputs and fail on some. The system doesn't know which some.
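The "compute a distribution, return the winner" step can be sketched in a few lines. This is a toy illustration, not a real model: the raw scores are made-up numbers standing in for whatever a trained model would compute from its weights, and the function names are hypothetical.

```python
import math

LABELS = ["permit", "deny"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return (winning label, its probability)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


def decide(scores):
    """What the caller usually sees: the winner, with the score dropped."""
    label, _ = classify(scores)
    return label


raw_scores = [2.75, 0.0]      # assumed model output for one request
print(classify(raw_scores))   # "permit" with a probability of about 0.94
print(decide(raw_scores))     # permit
```

Nothing in `decide` distinguishes a 94% winner from a 51% winner; both come back as the same bare label.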
• Probabilistic output: An ML system returns the highest-probability answer from a distribution of learned associations, not the result of executing a rule. The output is usually correct; the system cannot identify in advance which inputs will produce incorrect results.
How It Behaves
"94% accurate" at scale: a system processing 50,000 classification decisions per day gets 47,000 right and 3,000 wrong. Every day. The system doesn't flag the 3,000. It returns them with the same confidence it returns the 47,000. From the outside, correct outputs and incorrect outputs look identical.
Two practical consequences follow from this.
First: you can't test your way to certainty. With a deterministic system, you can enumerate the input space, or at least the meaningful edge cases, and verify the output for each. If the system passes your test suite, you have a reasonable basis for confidence. With a probabilistic system, the input space is effectively infinite, and the model's behavior on inputs it hasn't seen is a function of how well its learned associations generalize. Testing tells you how the system behaves on the inputs you tested. It tells you considerably less about inputs you haven't.
Second: when the system is wrong, there's no rule to inspect. With a deterministic system, incident response has a clear first step: find the rule that fired. Pull the policy, read it, trace the decision. The logic is there; you just have to find it. With a probabilistic system, the "logic" is distributed across numerical weights that encode learned associations. You can observe that the system produced a wrong output. You can probe the system with similar inputs to understand the pattern of its failures. You cannot read the reasoning.
That's a property of how the system works, not a temporary limitation waiting for better tooling.
• Distribution: Every ML output is the top choice from a probability distribution across possible answers; the system returns the highest-probability option without surfacing the score. At scale, even a high-accuracy system produces a predictable volume of wrong answers and cannot identify which ones they are.
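The volume figures in this section are simple arithmetic, and it is worth being able to redo them for a buyer's own numbers. A quick sketch, assuming the stated accuracy holds evenly across production traffic (real error counts will vary from day to day around this expectation):

```python
decisions_per_day = 50_000
accuracy = 0.94

# Expected counts, rounded to whole decisions.
correct = round(decisions_per_day * accuracy)
wrong = decisions_per_day - correct

print(correct, wrong)  # 47000 3000
print(wrong * 30)      # 90000 expected wrong answers over a 30-day month
```

Swapping in a buyer's actual daily volume and the vendor's claimed accuracy turns an abstract percentage into a count of decisions someone may have to review.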
Okta Concept Mapping
The closest IDAM analog to an ML model is the policy engine: both evaluate inputs against criteria and produce decisions. The analogy holds at that level of abstraction and breaks on auditability. A policy engine exposes its logic; you can pull the rule that fired, read it, and trace the decision. An ML model's "policy" was inferred from data and lives in numerical weights you cannot read, which means every audit requirement and incident response playbook your buyers operate under needs to be rethought for this class of system.
When You'll Need This
Federal civilian agencies are in the middle of figuring out what their existing compliance frameworks mean for AI systems. NIST SP 800-53 was written for deterministic software. FedRAMP was written for deterministic software. The continuous monitoring requirements, the audit logging requirements, the incident response requirements: all of them assume a system that executes rules and can tell you which rule it executed.
When a CAIO asks "how do we satisfy our continuous monitoring requirements for this AI system?", they're not asking rhetorically. They genuinely don't know, and neither does most of the vendor community. The honest answer involves understanding that continuous monitoring for a probabilistic system looks different. You're not checking whether the rules are configured correctly. You're checking whether the model's behavior has drifted from its baseline. Different tooling, different thresholds, different escalation paths.
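One simple way to picture "has the behavior drifted from its baseline" is to compare how often the model flags events now against how often it flagged them during a reference period. The sketch below uses a basic z-score on the flag rate; the function name, the threshold of 3.0, and the choice of flag rate as the monitored signal are all assumptions for illustration, and production monitoring tools use a range of other statistics.

```python
import math


def flag_rate_drift(baseline_rate, recent_flags, recent_total, z_threshold=3.0):
    """Return (z_score, drifted) comparing the recent flag rate to the baseline."""
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    if recent_total <= 0:
        raise ValueError("recent_total must be positive")
    recent_rate = recent_flags / recent_total
    # Standard error of a proportion, assuming the baseline rate still holds.
    std_err = math.sqrt(baseline_rate * (1.0 - baseline_rate) / recent_total)
    z = (recent_rate - baseline_rate) / std_err
    return z, abs(z) > z_threshold


# Baseline: the model flagged 2% of events during the reference period.
print(flag_rate_drift(0.02, 1_000, 50_000))  # rate unchanged, not drifted
print(flag_rate_drift(0.02, 1_300, 50_000))  # rate rose to 2.6%, drifted
```

Note what this check does not do: it says the model is behaving differently than before, not why, and not which individual decisions were wrong. That is the shift from verifying configuration to watching behavior.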
The conversation you're likely to encounter: a technical buyer is evaluating an AI-assisted access review system. The system uses an ML model to flag anomalous access patterns for human review. The buyer's security team asks, "If the system flags an access event incorrectly, how do we audit the decision?" They're expecting an answer that sounds like "you pull the log, see that policy X fired on condition Y, and determine whether condition Y was correctly evaluated." That's the answer they'd get for a deterministic system.
The actual answer is: you can see that the system flagged the event. You can see the input the system received. You cannot inspect the reasoning that produced the flag. You can only observe the output and compare it to the model's baseline behavior. A buyer who understands this before signing can design compensating controls: human review thresholds, output logging, behavioral monitoring. A buyer who discovers it after deployment has a much harder problem.
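Two of those compensating controls, output logging and a human review threshold, can be sketched together. This assumes the vendor's system exposes a confidence score and a model version alongside each output, which not every product does; the field names and the 0.90 threshold are hypothetical.

```python
import json
import time

REVIEW_THRESHOLD = 0.90  # hypothetical cutoff; below this, a human looks at it


def record_decision(event, label, score, model_version, log):
    """Log everything observable about a decision; return True if it needs review."""
    needs_review = score < REVIEW_THRESHOLD
    entry = {
        "timestamp": time.time(),
        "model_version": model_version,  # which model produced the output
        "input": event,                  # what the system received
        "output": label,                 # what it returned
        "score": score,                  # how confident it was
        "needs_human_review": needs_review,
    }
    log.append(json.dumps(entry, sort_keys=True))
    return needs_review


audit_log = []
event = {"user": "alice", "resource": "payroll-db", "hour": 3}
print(record_decision(event, "anomalous", 0.62, "model-v1", audit_log))  # True
print(record_decision(event, "normal", 0.97, "model-v1", audit_log))     # False
```

The log cannot capture the reasoning, because there is none to capture in readable form. It captures the input, the output, the score, and the model version, which is the evidence an investigator will actually have to work with.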
• Audit gap: When a probabilistic system produces a wrong answer, there's no rule to inspect, only behavior to observe. Incident response for ML systems requires different tooling, different baselines, and a different definition of "explain this decision" than your buyers' existing playbooks assume.
None of this means AI systems are unreliable. Most of them, on most inputs, are quite good. The harder adjustment is that "reliable" means something different when the system's logic was inferred rather than written. A policy engine earns trust because you can read it. An ML model earns trust through observed behavior across enough inputs to have confidence in its generalization, backed by monitoring that catches when that generalization starts to fail.
Different kind of trust. Different evidence required. And your buyers will need to update playbooks written for a world where the software did exactly what it was told, every time, and you could prove it.

