If the Model Starts Behaving Differently, How Will You Know?

By Carey Whitten, May 5, 2026

Most production AI incidents are not model errors. The model didn't hallucinate its way into an outage. What actually happened: a tool call returned a malformed response, the agent didn't handle it, and the loop ran until someone noticed the bill. Or a prompt template got "improved" by someone who didn't realize they'd changed the output format that three downstream systems were parsing.

This is the operational reality that shapes what instrumentation needs to do. The goal is not general-purpose logging, and it is not capturing every token for posterity. It is catching the specific failure modes that production AI systems exhibit, and catching them before a CISO asks why nobody noticed.

What Observability Means Here

Logging is recording what happened. Observability is the property of a system that lets you ask arbitrary questions about its internal state from the outside, without having to predict in advance what you'd want to know.

For a production AI system, that distinction matters because the failure modes are not fully enumerable before you encounter them. You can log every request and response and still be completely blind to quality drift — the slow degradation in output coherence that follows a provider's silent model update, or the subtle shift in tool-call behavior after a framework version bump. Logs tell you what the system did. Observability tells you whether what it did was right.

The instrumentation layer that makes this possible has four components, and they're interdependent.

The Mechanism

OpenTelemetry GenAI semantic conventions are the emerging standard for the schema of AI telemetry: what a span describing a model call or a tool call should contain. OpenTelemetry itself is not new; it's the CNCF-hosted observability framework that platform engineering teams already use for distributed tracing. The GenAI semantic conventions extend it with AI-specific span attributes: which model was called, how many tokens were consumed, what the finish reason was, and whether a tool was invoked and what it returned. A common schema means portability: if you're running Azure OpenAI in one environment and a self-hosted model in another, your dashboards and alerts shouldn't need to be rewritten for each. These conventions are still maturing. Adoption is uneven across providers and gateways, and the spec has moved faster than the tooling in places, so treat this as a direction of travel rather than a settled standard.
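To make the schema point concrete, here is a minimal sketch of the attribute sets for one model call and one tool call. The attribute keys follow the GenAI semantic conventions as I understand them at the time of writing; the conventions are still marked experimental and names have been renamed between releases (the provider attribute in particular), so treat every key as an assumption to check against the current spec. The helper names (chat_span_attributes, tool_span_attributes, unseen_response_models) are hypothetical, and plain dicts are used so the example runs without the OpenTelemetry SDK. In a real service these key/value pairs would be set on a span with span.set_attribute.

```python
from __future__ import annotations

from typing import Any


def chat_span_attributes(provider: str, request_model: str, response_model: str,
                         input_tokens: int, output_tokens: int,
                         finish_reasons: list[str]) -> dict[str, Any]:
    """Attributes for one model call (key names assumed from the GenAI conventions)."""
    return {
        "gen_ai.operation.name": "chat",
        # Assumption: earlier versions of the conventions called this gen_ai.system.
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
    }


def tool_span_attributes(tool_name: str, call_id: str) -> dict[str, Any]:
    """Attributes for one tool invocation made by the agent."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
        "gen_ai.tool.call.id": call_id,
    }


def unseen_response_models(span_attrs: list[dict[str, Any]], known: set[str]) -> set[str]:
    """Return response model identifiers that are not in the known set."""
    seen = {a["gen_ai.response.model"] for a in span_attrs if "gen_ai.response.model" in a}
    return seen - known
```

Because the keys are the same regardless of provider, a check like unseen_response_models can run unchanged against spans from a hosted API and a self-hosted model. A new identifier appearing in the response-model attribute is one cheap signal that the provider has changed what is serving your requests.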

Online sampled LLM-as-judge evaluation is how you get signal on output quality at runtime without a human reviewing every response. A second model — typically more capable than the production model, or at minimum a different one — scores a sample of production outputs against a rubric: relevance, groundedness, tone, task completion. "Sampled" is doing real work in that phrase. Scoring every response is cost-prohibitive and adds latency; scoring 5 to 10 percent of traffic gives you a statistical signal that's actionable without being ruinous. The reliability of LLM-as-judge scoring is genuinely contested in the research community — correlated failure modes between the judge and the judged model are a real concern, and calibration varies significantly by task type. Treat it as a drift detector, not a ground truth.
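A rough sketch of the sampling and drift-detection halves of that idea follows. Everything here is an assumption made for illustration: the function names, the hash-based sampling scheme, and the simple mean comparison are one reasonable way to do it, not a prescribed method. The judge is deliberately left as a callable you supply, since how it is prompted and which model backs it depends on your rubric.

```python
import hashlib
from statistics import mean
from typing import Callable, Iterable


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def score_sampled(records: Iterable[tuple[str, str, str]],
                  judge: Callable[[str, str], float],
                  rate: float) -> list[float]:
    """Score only sampled (request_id, prompt, output) records with the supplied judge."""
    return [judge(prompt, output)
            for request_id, prompt, output in records
            if in_sample(request_id, rate)]


def drift_detected(baseline: list[float], recent: list[float], max_drop: float) -> bool:
    """Flag drift when the recent mean score falls more than `max_drop` below baseline."""
    if not baseline or not recent:
        return False  # not enough data to say anything
    return mean(baseline) - mean(recent) > max_drop
```

Hashing the request ID rather than calling a random number generator means the same request is always in or out of the sample, which makes a flagged score reproducible during an investigation. The drift check compares means only; it says that scores moved, not why, which fits the role of a drift detector rather than a ground truth.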

Deploy-gated regression suites are what prevent a prompt template change from reaching production without evidence it doesn't break anything. The mechanism is CI/CD applied to prompts: before a modified template is promoted, it runs against a fixed eval set — a curated collection of inputs with known-good outputs or quality thresholds. If the new template scores below threshold, the deploy is blocked. The fixed eval set is the key artifact here; it needs to be maintained, extended when new failure modes are discovered, and treated as a first-class engineering asset rather than a one-time setup.
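The gate itself can be quite small. The sketch below assumes a hypothetical candidate callable that renders the new template, calls the model, and returns the output text; EvalCase, GateResult, run_gate, and is_status_json are illustrative names, and the JSON check stands in for whatever "known-good" means for your downstream parsers.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    check: Callable[[str], bool]  # True when the output is acceptable


@dataclass(frozen=True)
class GateResult:
    passed: bool
    pass_rate: float
    failed_case_ids: list


def is_status_json(output: str) -> bool:
    """Example check: downstream systems expect a JSON object with a 'status' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "status" in parsed


def run_gate(candidate: Callable[[str], str], cases: list, min_pass_rate: float) -> GateResult:
    """Run every case through the candidate template and compare to the threshold."""
    if not cases:
        raise ValueError("eval set is empty; refusing to approve a deploy")
    failed = [c.case_id for c in cases if not c.check(candidate(c.input_text))]
    pass_rate = (len(cases) - len(failed)) / len(cases)
    return GateResult(pass_rate >= min_pass_rate, pass_rate, failed)


# A tiny fixed eval set; a real one would hold the edge cases found in production.
EVAL_SET = [
    EvalCase("basic-request", "Summarize ticket 1", is_status_json),
    EvalCase("empty-input", "", is_status_json),
]
```

In a CI job, the pipeline step would call run_gate and exit non-zero when passed is False, which is what blocks the promotion. Refusing to pass on an empty eval set is a deliberate choice: a gate that approves when it has nothing to check is not a gate.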

Prompt templates as versioned code is the discipline that makes the rest of this work. A prompt template is not a config value someone tweaks in a UI. It has an owner, a changelog, a review process, and a rollback path. When a template change causes a regression — and it will — you need to know exactly what changed, when, and who approved it. The operational discipline looks like any other code review workflow. That's intentional: it removes the category of incident where "someone updated the prompt" is the entire postmortem.
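In practice this is usually a git repository plus the team's normal review tooling, not custom code. The in-memory sketch below only illustrates the record that discipline should produce: each version carries its author, approver, and change note, the author cannot approve their own change, and rollback is a first-class operation. TemplateVersion and TemplateHistory are hypothetical names invented for this example.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateVersion:
    version: int
    body: str
    author: str
    approver: str
    change_note: str

    @property
    def digest(self) -> str:
        """Short content hash, useful for tagging telemetry with the exact template used."""
        return hashlib.sha256(self.body.encode("utf-8")).hexdigest()[:12]


class TemplateHistory:
    def __init__(self, name: str, owner: str) -> None:
        self.name = name
        self.owner = owner
        self._versions: list[TemplateVersion] = []
        self._active_index = -1  # nothing published yet

    def publish(self, body: str, author: str, approver: str, change_note: str) -> TemplateVersion:
        if author == approver:
            raise ValueError("a change must be approved by someone other than its author")
        version = TemplateVersion(len(self._versions) + 1, body, author, approver, change_note)
        self._versions.append(version)
        self._active_index = len(self._versions) - 1
        return version

    def rollback(self) -> TemplateVersion:
        """Make the previous version active again and return it."""
        if self._active_index <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active_index -= 1
        return self.active

    @property
    def active(self) -> TemplateVersion:
        if self._active_index < 0:
            raise RuntimeError("no version has been published")
        return self._versions[self._active_index]
```

Recording the digest of the active template on each request's telemetry is what lets a postmortem tie a quality regression to a specific change, its author, and its approver.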

The Conversation You're Going to Have

A platform engineering lead at a federal agency asks: how will we know if the model starts behaving differently after the provider pushes an update? This is not a hypothetical. Providers update foundation models on their own schedules, and the changes are not always announced in ways that make the operational impact obvious.

The answer that holds up in that room: span-level telemetry capturing every tool call — parameters, response, finish reason, latency — in a schema that's portable across providers. Sampled quality scoring running continuously against a rubric the agency helped define. Regression gates on every prompt template change, with a fixed eval set that includes the edge cases the agency has actually encountered. And a version history for every prompt template currently in production.

"We log everything" does not hold up to the follow-up question.

When a CISO asks what the audit trail looks like for a tool-call failure, they're asking whether the agency can reconstruct exactly what the agent did, in what order, with what inputs and outputs, and at what point it went wrong. That's a distributed tracing problem, and it has a distributed tracing answer — which is why OpenTelemetry is the right foundation rather than a bespoke logging schema someone built in a weekend.
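As an illustration of why this is a tracing problem, the sketch below reconstructs one agent run from exported span records and finds the point where it first went wrong. SpanRecord is a simplified, hypothetical stand-in for whatever your trace backend returns; the field names are assumptions, and a real export would also carry parent span IDs, end times, and the full attribute set.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    trace_id: str
    span_id: str
    name: str
    start_ns: int           # start time in nanoseconds since epoch
    status: str             # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)


def reconstruct(spans: list[SpanRecord], trace_id: str) -> list[SpanRecord]:
    """Return the spans for one trace in the order the agent executed them."""
    return sorted((s for s in spans if s.trace_id == trace_id), key=lambda s: s.start_ns)


def first_failure(timeline: list[SpanRecord]) -> Optional[SpanRecord]:
    """Return the earliest span that ended in an error, or None if all succeeded."""
    return next((s for s in timeline if s.status == "ERROR"), None)


def describe(timeline: list[SpanRecord]) -> list[str]:
    """One line per step: position, span name, and status."""
    return ["%d. %s [%s]" % (i, s.name, s.status) for i, s in enumerate(timeline, start=1)]
```

The trace ID is what makes the reconstruction possible: every model call and tool call in one agent run shares it, so ordering and the first point of failure fall out of a filter and a sort rather than a search through unrelated log lines.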

Okta Concept Mapping

The closest IDAM analogue is behavioral anomaly detection — the difference between having Okta System Log events and having Identity Threat Protection running continuous risk scoring against those events. Both are about detecting drift from a baseline, not just recording what happened. The analogy holds in that both require you to define what "normal" looks like before you can detect deviation from it. Where it breaks: in IDAM, the behavior space is bounded. A user authenticates, accesses resources, changes credentials — the event types are enumerable and the risk signals are relatively stable. In a production AI system, the behavior space is the model's output distribution across an open-ended input space. It's continuous, high-dimensional, and shifts with every provider update. There is no rule-based equivalent of "quality drift." LLM-as-judge is a probabilistic approximation of something that has no clean signature — which is why it's sampled, why it's imperfect, and why regression suites exist alongside it.