Your Buyer's Playbook Assumes Deterministic Software. ML Isn't.

ML systems produce probabilistic outputs, not deterministic ones. Your testing, auditing, and incident response instincts need recalibration.

By Leigh Garrity, May 9, 2026

Every policy engine you've sold has a property so fundamental you've probably never had to name it: determinism. It evaluates an access request, checks attributes against rules someone wrote, and returns allow or deny. Same input, same policy, same result. If something breaks, you pull the log and find the rule that fired. Authorization works this way on purpose, because the alternative would be genuinely unacceptable.

Machine learning is the alternative. And your buyers are adopting it.

(A quick acknowledgment for the reader who's already thinking it: yes, risk-based authentication and adaptive MFA already use ML for risk scoring. Probabilistic systems are not entirely foreign to the identity stack. But those systems operate in a narrow, bounded role. What's changing is the scope. Your buyers are now building workflows where probabilistic outputs drive decisions far beyond a risk score.)

How ML Systems Get Their Rules

The mechanical difference matters precisely because everything else follows from it.

Traditional software executes logic a programmer specified. You can read the code. You can trace a decision to a line number. An ML model also takes inputs and produces outputs, but the function it implements was derived statistically from patterns in training data. The "rules" exist as numerical weights distributed across millions or billions of parameters. Nobody wrote them. Often, nobody can read them. The model outputs the most probable prediction given what it learned, and "most probable" is load-bearing. A probability machine through and through.

This has a concrete consequence, one the model providers themselves will tell you about if you read their documentation instead of their marketing.

OpenAI's API docs state it plainly: "Chat Completions are non-deterministic by default." Send the same prompt in two separate requests and you can get two different outputs. They offer controls to narrow the variance. A seed parameter. Temperature set to zero. But Azure's documentation on reproducible output concedes the limit:

"It's currently not uncommon to still observe a degree of variability in responses."

That holds even when the seed parameter and system_fingerprint match across calls. And the system_fingerprint, which tracks the model's backend configuration, can change when the provider updates infrastructure, without notifying the API caller. The application code didn't change. The outputs did anyway.

If that feels like it should be a bigger deal than anyone's making it, you're right. Hold that feeling. It's useful.
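You can measure this variability instead of taking it on faith. The sketch below is a minimal, hypothetical probe: it repeats one request several times and reports how many distinct outputs came back and whether the backend fingerprint moved between calls. The names Reply, probe_reproducibility, and make_stub are illustrative assumptions, not part of any provider's SDK. In a real evaluation the callable would wrap an actual request with seed and temperature held fixed, and would return the response text together with its system_fingerprint.

```python
from collections import Counter
from typing import Callable, NamedTuple


class Reply(NamedTuple):
    text: str
    fingerprint: str  # stand-in for the provider's system_fingerprint value


def probe_reproducibility(call: Callable[[], Reply], runs: int = 10) -> dict:
    """Issue the same request `runs` times and summarize how much replies vary."""
    replies = [call() for _ in range(runs)]
    texts = Counter(r.text for r in replies)
    fingerprints = {r.fingerprint for r in replies}
    most_common_count = texts.most_common(1)[0][1]
    return {
        "runs": runs,
        "distinct_outputs": len(texts),
        # Share of runs that returned the single most frequent output.
        "agreement_rate": most_common_count / runs,
        "fingerprint_changed": len(fingerprints) > 1,
    }


def make_stub() -> Callable[[], Reply]:
    """Hypothetical stand-in for a real API call; returns canned replies."""
    canned = iter([
        Reply("allow", "fp_a"),
        Reply("allow", "fp_a"),
        Reply("deny", "fp_a"),   # same fingerprint, different output
        Reply("allow", "fp_b"),  # fingerprint changed mid-run
    ])
    return lambda: next(canned)


report = probe_reproducibility(make_stub(), runs=4)
print(report)
```

A deterministic policy engine would report one distinct output and an agreement rate of 1.0 on every run. Anything less is a number your buyer should see before they sign off on a design.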

Where Your IDAM Intuition Transfers and Where It Breaks

Experience with identity systems gives you a clear understanding of what determinism buys you. That understanding is an asset in AI conversations. But the instincts built on top of it don't transfer cleanly to probabilistic systems, and the failure points are specific enough to name.

Testing. In deterministic software, you test whether the system produces the correct output for a given input. Pass or fail. ML shifts the question. You're characterizing performance bounds — how often the system gets it right, under what conditions, how badly it fails when it doesn't. NIST's AI Risk Management Framework names this gap:

"Difficulty in performing regular AI-based software testing, or determining what to test, since AI systems are not subject to the same controls as traditional code development."

The second half of that sentence is the quiet part. The field hasn't settled what "testing" means for these systems.

The DoD is further along than most. Their DT&E Guidebook for AI-Enabled Systems (February 2025) acknowledges that ML systems "can continue to learn and change during operation" and that evaluation "will typically need to focus extra attention on potential unacceptable or worst-case performance, in addition to characterizing average or typical performance." The test-and-certify-once model your buyer relies on doesn't survive contact with a system whose behavior drifts after certification.
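To make "characterizing performance bounds" and "worst-case performance" concrete, here is a minimal sketch. It assumes you have a labeled evaluation set in which each case is tagged with a slice, such as a user population. It computes a Wilson score interval per slice (a standard way to put a confidence range around a success rate) and picks the slice with the weakest lower bound. The slice names and counts are invented for illustration, and accuracy_by_slice is a hypothetical helper, not an established API.

```python
import math
from collections import defaultdict


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        raise ValueError("need at least one evaluated case")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def accuracy_by_slice(results):
    """results: iterable of (slice_name, was_correct) pairs."""
    tally = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for name, ok in results:
        tally[name][0] += int(ok)
        tally[name][1] += 1
    return {name: (c, n, wilson_interval(c, n)) for name, (c, n) in tally.items()}


# Invented evaluation results: 95/100 correct for one slice, 16/20 for another.
results = (
    [("employee", True)] * 95 + [("employee", False)] * 5
    + [("contractor", True)] * 16 + [("contractor", False)] * 4
)

report = accuracy_by_slice(results)
overall = sum(c for c, _, _ in report.values()) / sum(n for _, n, _ in report.values())
# Worst case = the slice whose lower confidence bound is lowest.
worst = min(report, key=lambda name: report[name][2][0])

print(f"overall accuracy: {overall:.3f}")
for name, (c, n, (lo, hi)) in report.items():
    print(f"{name}: {c}/{n}, 95% interval {lo:.3f} to {hi:.3f}")
print(f"worst slice by lower bound: {worst}")
```

In this made-up data the headline accuracy is 92.5%, but the smaller slice's lower bound sits near 58%. That gap between average and defensible worst case is what a single pass/fail test never shows.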

Auditing. The instinct here: pull the audit log, find the rule that fired, trace the decision. In an ML system, the decision emerged from a statistical inference across billions of parameters. No single rule fired. No line number to point at. Newer "reasoning" models produce a chain of thought, a visible trace that reads like an explanation. How faithful that trace is to the model's actual reasoning process is worth examining.

Anthropic published research in May 2025 testing exactly that. They fed models hidden hints and checked whether the chain of thought acknowledged using them. Claude 3.7 Sonnet was faithful 25% of the time. DeepSeek R1 hit 39%. (These numbers are model-version-specific and will shift as providers iterate.) Training models to be more transparent about their reasoning plateaued quickly, hitting a ceiling around 28% on some evaluations.

A fair caveat: METR, an independent evaluation organization, published a counterpoint arguing that for complex multi-step tasks, the chain of thought is more informative than those numbers suggest. The distinction matters. For simple, high-volume decisions (the kind most likely to face audit scrutiny), the faithfulness problem stands. For extended reasoning tasks, the trace may carry more signal. Neither case gives you the clean audit trail your buyer's compliance framework was built around.

Incident response. When a deterministic system makes a bad decision, you find the bad rule and fix it. When a probabilistic system makes a bad decision, you're asking something else: was this output within the expected distribution, or did something change? Maybe the input was out-of-distribution, something the model never encountered in training. Maybe performance degraded because the world shifted underneath the model.

This is called drift, and it's a term you'll hear in every ML evaluation conversation. Practitioners distinguish data drift, where the inputs change, from concept drift, where the relationship between inputs and correct outputs changes. Either way, a system that passed evaluation last quarter can fail this quarter without a single code change. The data the model was trained on no longer represents the environment it's operating in. The deterministic incident response playbook assumes the cause is locatable and the fix is a rule change. In probabilistic systems, the cause might be diffuse, the fix might be retraining, and the timeline might be weeks.
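One common way teams watch for this is to compare the distribution of a model input or score at evaluation time against what the system sees in production. The sketch below uses the Population Stability Index (PSI), a technique swapped in here for illustration; the article's sources don't prescribe it. It only detects a shift in the inputs, which is one observable symptom. Catching a change in what the correct answer is requires fresh labeled outcomes. The bin edges, sample values, and the 0.25 alert threshold are assumptions, the last being a conventional rule of thumb and not a standard.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges.

    Values outside [edges[0], edges[-1]] are ignored in this sketch.
    """
    def shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                last = i == len(counts) - 1
                # Bins are half-open, except the last one includes its upper edge.
                if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("no values fall inside the bin edges")
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / total, eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Invented risk scores: at evaluation time vs. in production this quarter.
edges = [0.0, 0.25, 0.5, 1.0]
baseline = [0.1] * 50 + [0.3] * 30 + [0.7] * 20
current = [0.1] * 20 + [0.3] * 30 + [0.7] * 50

stable = psi(baseline, baseline, edges)
drifted = psi(baseline, current, edges)
ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, tune per system

print(f"same data: {stable:.3f}")
print(f"shifted data: {drifted:.3f}, alert: {drifted > ALERT_THRESHOLD}")
```

Nothing in the application changed between the two samples. Only the mix of inputs did, and that alone is enough to move the system outside the conditions it was evaluated under.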

The Conversation This Prepares You For

A buyer says: "We need this to be reliable."

You need to hear which word is doing the work. Do they mean deterministic-reliable, same output every time, traceable to a written rule? Or statistically-reliable, right 97% of the time, with known failure modes and monitoring to catch the rest?

A buyer who means the first and builds for the second gets a system that occasionally makes decisions nobody can fully explain. A buyer who means the second and demands the first will never ship, because ML systems cannot make that guarantee. The providers say so in their own docs.

The question worth asking

"When you say reliable, do you mean the system produces the same output every time for the same input, or that it produces acceptable outputs within a known error rate you can monitor?" The first is a deterministic requirement. The second is a statistical one. They need to know which one they're specifying before anyone builds anything.

Your job here is to surface the ambiguity.

A career in IDAM has trained you to think in deterministic terms. That training is genuinely valuable here, because it means you understand exactly what's being given up when a system moves to probabilistic outputs. You can articulate the gap to a buyer who hasn't thought it through. Just know that the old playbooks describe a world where the rule is written down somewhere. Increasingly, it isn't.

Things to follow up on...

NIST's AI evaluation rethink: NIST's ARIA program is developing a scenario-based approach to AI evaluation that pairs human testers with live AI applications to estimate real-world risks, acknowledging that traditional benchmarks don't capture how these systems actually behave in deployment.
DoD formalizing AI test policy: As of late 2025 the DoD still lacked a formally approved DT&E policy for AI systems, though the CDAO's interim guidebook and frameworks are what your public sector buyers are actually working from right now.
Anthropic's interpretability research: Anthropic's mechanistic interpretability team published work in March 2025 showing they can trace a model's actual internal reasoning and distinguish genuine computation from confabulated explanations, which has direct implications for whether AI audit trails can ever be trustworthy.
Stochasticity and trust frameworks: A January 2025 paper explores why eliminating user-facing randomness and letting users control stochasticity tolerances are both insufficient strategies for building trust in probabilistic systems, a framing that maps directly onto the "reliable means what, exactly?" conversation.