Why "How Did It Decide?" Has Two Completely Different Answers

Neural networks are layered math, not inspectable rules. Your IDAM audit intuition breaks here, and knowing exactly where matters.

By Leigh Garrity, May 9, 2026

Your buyer asks: "Can we see how it decided?"

You know this question. You've answered it hundreds of times. In IDAM, the answer is a chain you can trace: policy rule, attribute match, timestamp, decision. Deterministic. Logged. Replayable. The explanation is the mechanism.

When the same buyer asks that question about an AI system, the words are identical. The answer is a fundamentally different category of thing. To understand why, you need to know what's actually happening inside a neural network when it produces an output.

A neural network is a stack of math

The name "neural network" suggests biology. The original designers drew loose inspiration from neurons, but what survived that inspiration is a mathematical structure: linear algebra at scale. Knowing this matters because vendors and buyers will casually say "it works like a brain," and the honest response is: it multiplies matrices.

Each layer in the stack does three things:

Multiply inputs by a matrix of weights. Every input value gets multiplied by a learned weight, the results are summed, and a bias term is added. This is matrix multiplication and vector addition.
Apply a nonlinear function. The summed result passes through a nonlinear activation function (ReLU, sigmoid, others). This step is easy to overlook and does the most important work.
Pass the output to the next layer. Repeat.

That's it. Multiply, add bias, apply nonlinearity, hand off. The entire network is this operation, stacked.
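Here is a minimal sketch of that loop in plain Python, with no ML library. Everything in it is hypothetical and chosen only for illustration: the network shape (2 inputs, 3 hidden units, 1 output), the weight values, and the function names `dense_layer` and `forward`. A trained model would have learned its weights rather than had them typed in. For simplicity, ReLU is applied on every layer here, including the output layer, which real networks often skip.

```python
def relu(x):
    """Nonlinear activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def dense_layer(inputs, weights, biases, activation=relu):
    """One layer: multiply by weights, add bias, apply the nonlinearity.

    weights[j][i] is the weight from input i to output unit j.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias  # weighted sum plus bias
        outputs.append(activation(total))
    return outputs


def forward(inputs, layers):
    """Hand each layer's output to the next layer, in order."""
    x = inputs
    for weights, biases in layers:
        x = dense_layer(x, weights, biases)
    return x


# Hypothetical toy network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[0.5, -1.0], [1.5, 0.25], [-0.5, 0.75]], [0.1, 0.0, -0.2]),  # hidden layer
    ([[1.0, -0.5, 2.0]], [0.05]),                                  # output layer
]

print(forward([1.0, 2.0], layers))
```

Production frameworks run the same per-layer arithmetic as vectorized matrix operations on GPUs. The only difference is scale.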

"Deep" means many layers stacked. Two hidden layers is shallow. Dozens or hundreds of hidden layers is deep. Modern large language models run many dozens of layers, each containing millions or billions of individual weight values.

The nonlinear step deserves more attention than it usually gets. Without it, stacking layers accomplishes nothing. Multiply one matrix by another matrix and you just get one matrix. A hundred linear layers collapse into a single linear transformation (strictly, a single affine one once bias terms are included). You could replace the whole stack with one layer and get the same result. The nonlinearity between layers prevents this collapse. It's what makes depth meaningful. It's also what makes the system resistant to simple inspection, because the interactions between layers are no longer a clean linear chain you can decompose into parts.
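You can check the collapse argument directly. This sketch uses plain Python and hypothetical 2x2 weights. Biases are omitted for brevity; with biases, the stack still collapses, just into one affine layer. The code applies two linear layers in sequence and compares the result against a single layer whose matrix is their product. It then puts a ReLU between the same two layers and runs the comparison again.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def linear_layer(m, v):
    """A linear layer with no bias: multiply vector v by matrix m."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v):
    """Elementwise ReLU: clamp negative entries to zero."""
    return [max(0.0, x) for x in v]


def close(u, v, tol=1e-12):
    """Compare two vectors, allowing for floating-point rounding."""
    return len(u) == len(v) and all(abs(p - q) <= tol for p, q in zip(u, v))


W1 = [[1.0, -2.0], [0.5, 1.0]]   # hypothetical layer-1 weights
W2 = [[2.0, 1.0], [-1.0, 3.0]]   # hypothetical layer-2 weights
x = [1.0, 1.0]

stacked = linear_layer(W2, linear_layer(W1, x))   # two layers in sequence
collapsed = linear_layer(matmul(W2, W1), x)       # one layer using the product matrix
print(close(stacked, collapsed))   # True

nonlinear = linear_layer(W2, relu(linear_layer(W1, x)))  # same weights, ReLU in between
print(close(stacked, nonlinear))   # False
```

A single counterexample is enough to break the shortcut. For inputs where every hidden value is already positive, ReLU passes the values through unchanged and the two versions agree. For inputs like this one, they disagree. That means no single matrix can stand in for the pair across all inputs.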

And here's what most vendors won't volunteer: nobody fully understands why stacking many nonlinear layers works as well as it does. The field can describe what happens mathematically. It can demonstrate that deep networks outperform shallow ones on complex tasks. But the theoretical understanding of why depth produces such dramatic capability gains is incomplete. "We trained it and it works" is closer to the operational truth than any vendor's architecture slide will suggest.

Where your IDAM intuition applies

A policy engine receives attributes. It evaluates them against rules. Rules fire or don't. A decision is produced. Every step is discrete, inspectable, logged. You can answer "why was access granted?" by pointing to the specific rule, the specific attributes, the specific moment. The explanation and the mechanism are the same object.
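For contrast, a policy engine's explanation falls out of its own control flow. The sketch below is hypothetical. The rule names, attributes, and first-match ordering are invented for illustration, and real engines offer several rule-combining strategies. It shows two things. First, the decision and the rule that produced it come back together. Second, replaying the same attributes against the same rules always yields the same answer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    effect: str                        # "permit" or "deny"
    condition: Callable[[dict], bool]  # predicate over the request's attributes


def evaluate(rules, attributes):
    """First-match evaluation: return (decision, id of the rule that produced it)."""
    for rule in rules:
        if rule.condition(attributes):
            return rule.effect, rule.rule_id
    return "deny", "default-deny"  # nothing matched


# Hypothetical rules, checked in order.
rules = [
    Rule("R1-contractor-after-hours", "deny",
         lambda a: a["employment"] == "contractor" and not 8 <= a["hour"] < 18),
    Rule("R2-finance-read", "permit",
         lambda a: a["department"] == "finance" and a["action"] == "read"),
]

request = {"employment": "staff", "department": "finance", "action": "read", "hour": 21}
print(evaluate(rules, request))  # ('permit', 'R2-finance-read')
```

The answer to "why was access granted?" is the returned rule ID. No further analysis is needed.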

This intuition, that a system's decision can be explained by showing which rules fired, is correct for policy engines. It's about to stop being useful.

Where it breaks

A neural network has weights that were learned during training. No rules fire.

During training, an optimization algorithm adjusted billions of individual weight values, incrementally, across millions of iterations, to minimize prediction errors. No human wrote those weights. No human specified what pattern any particular weight should encode. The weights emerged from the training process the way a riverbed emerges from water flow: shaped by what passed through, with no engineer specifying the result.
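Here is a toy version of that optimization, assuming the simplest possible case: one weight and one bias, fit by plain gradient descent on mean squared error. Production training uses vastly larger models, stochastic mini-batches, and more sophisticated optimizers. Treat this as an illustration of the idea, not of any vendor's pipeline. The data is hypothetical, generated from y = 3x - 1, and the training loop is never told the 3 or the -1.

```python
import random


def train(data, steps=2000, lr=0.05, seed=0):
    """Fit y ≈ w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # arbitrary starting weights
    n = len(data)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to each weight.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Nudge each weight a small step downhill.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical training data generated from y = 3x - 1.
data = [(x, 3 * x - 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, b = train(data)
print(round(w, 3), round(b, 3))  # 3.0 -1.0
```

Nobody wrote `w = 3.0`. The value emerged from repeated small corrections. Scale that up to billions of weights interacting through nonlinear layers, and no single learned number corresponds to a readable rule.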

When the trained network processes a new input, every weight in every layer participates. The output is the aggregate effect of billions of learned statistical associations interacting across dozens of nonlinear transformations. There is no single weight you can point to and say "this is why the model produced that output."

In a public essay on the urgency of interpretability research (April 2025), Anthropic's CEO Dario Amodei wrote directly:

"We have no idea, at a specific or precise level, why it makes the choices it does."

That's the CEO of a major model provider arguing publicly that his own company's models need significantly more interpretability work, even as deployment continues.

This is the analogy break. When your buyer asks "can we see how it decided," the IDAM version of that answer and the AI version are not two points on the same spectrum. They are different kinds of explanation entirely. The IDAM answer traces a deterministic path through inspectable rules. The AI answer would require understanding how billions of learned parameters interact across dozens of nonlinear layers to produce a probabilistic output. One is an engineering artifact you can pull up on a screen. The other is an open scientific question.

Interpretability as a research frontier

The field working on this is called mechanistic interpretability. It's real, funded, and making genuine progress, and its leading researchers are candid about how far it has to go.

Anthropic published results in 2024 identifying millions of human-interpretable concepts inside one of their production models: features corresponding to things like "the Golden Gate Bridge" or "conflict between individuals." In early 2025, they extended this to tracing computational circuits, showing that when the model reasons about "the capital of the state containing Dallas," there's an internal step where it represents "Texas" before producing "Austin."

Progress worth taking seriously, with limits worth taking equally seriously. Their researchers stated that attribution methods provide satisfying insight for "about a quarter of the prompts we've tried." The published examples are success cases. Even on short, simple prompts, understanding the circuits takes hours of human effort. Scaling these techniques to full production models would require compute that "vastly exceeds the compute used to train the model in the first place."

Anthropic themselves called mechanistic interpretability a "pre-paradigmatic field" still in search of the right abstractions. Their stated goal is to reach the point where interpretability can reliably detect most model problems by 2027. That's a research target. No vendor has committed to shipping interpretability tooling on that timeline.

No major model provider currently offers enterprise tooling that lets a buyer inspect the internal reasoning of a large language model at the mechanistic level. What providers offer is output logging, safety classifications, model cards, and documentation. These answer "what happened" and "what the system is certified for." They do not answer "how did the network arrive at this specific output." Federal procurement guidance calls this "explainability," which is a generous word for what's actually available.

The NIST vocabulary your buyer is using

NIST draws a distinction your buyer will likely reference. The AI Risk Management Framework separates three concepts that sound similar but aren't:

Transparency answers "what happened in the system"
Explainability answers "how a decision was made"
Interpretability answers "why a decision was made and what it means to the user"

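The framework treats them as distinct. Transparency sits under the "accountable and transparent" characteristic, while explainability and interpretability are paired in a separate "explainable and interpretable" characteristic. They are not the same requirement.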

NIST acknowledges the gap directly, identifying "inscrutability" as a challenge to AI risk measurement resulting from "the opaque nature of AI systems." IDAM audit logs satisfy transparency cleanly. For AI systems, explainability and interpretability are where the ground gets soft.

Worth knowing for the call: current federal AI procurement guidance (OMB M-25-22, dated April 2025; verify currency before citing in a meeting, as federal AI guidance has been turning over rapidly) defines explainability as "an agency's ability to provide evidence or reasons for system output" and directs agencies to prioritize "documentation that facilitates transparency and explainability." Documentation-based. Model cards and impact statements. The weights stay opaque.

Two meanings, one word

When your buyer says "we need explainability," they may mean the OMB definition, which is achievable with current tooling, or they may mean the IDAM-style definition, which requires solving an open research problem. Knowing which one they mean changes the entire conversation.

What you carry into the call

For a policy engine: yes, you can see how it decided. Completely. Down to the individual rule and attribute match.

For a neural network: partially, through documentation and output logging. The internal mechanism is an active research problem. The best labs in the world are working on it with real but limited results. The theory of why deep networks work so well is itself incomplete. NIST recognizes this gap explicitly. Federal procurement guidance operationalizes explainability as documentation, not model internals.

You don't need to solve interpretability. You need to know the problem exists, know that the opacity is an inherent property of how these systems work rather than a vendor failure or a missing feature, and know the NIST language that frames it. That's enough to hold the conversation honestly.

Things to follow up on...

Anthropic's 2027 interpretability target: Dario Amodei's essay on the urgency of interpretability lays out Anthropic's public goal of reliably detecting most model problems through interpretability by 2027, and is worth reading for how a major provider frames its own limitations.
NIST AI 600-1 confabulation language: The Generative AI Profile defines twelve risk categories specific to generative AI, including confabulation, and uses vocabulary your buyer's CAIO office is likely adopting.
Reasoning models hiding their work: Anthropic's own 2025 research found that Claude 3.7 Sonnet mentioned the hints it had actually used only 25% of the time. Even the "chain of thought" output, in other words, isn't a reliable window into what the model computed.
OMB's documentation-based explainability standard: OMB M-25-22 defines federal explainability requirements around documentation and impact statements rather than model internals, which is the operational definition your buyer's procurement office is working from.