Your AI Returns 200 OK While It's Failing

AI observability requires three layers beyond logging: structured telemetry, LLM-as-judge evals, and prompt versioning to detect silent quality drift.

By Leigh Garrity, May 8, 2026

Every monitoring system your buyer runs assumes failure is detectable. A service goes down, a health check fails, an error code propagates, someone gets paged. The system tells you it broke.

AI in production violates this assumption. The API call completes. The response returns. The status code is 200. And the output is wrong, or slightly worse than it was last week, in a way no log entry will surface. The system doesn't announce it's degrading. It just degrades.

This lesson covers the three-layer answer: what to capture, how to evaluate quality, and how to make drift detectable enough to roll back. Your buyer's platform team is either building this stack now or learning the hard way why they need to.

The Drift Problem, Documented

In March 2025, a developer posted to the OpenAI community forum reporting that gpt-4o-2024-08-06, a dated and supposedly frozen model version, had changed behavior. The developer used it to parse text into JSON. Same prompt, same test set, run "like 100 times before with always the same good result." Then the model started repeating items and missing others.

No error. No version change in the API response. OpenAI did post a degraded performance incident the same day, but the developer only caught it because they happened to be running the same test set repeatedly, which amounted to an accidental eval suite. Without that repetition, the drift would have reached users undetected. Version pinning, the obvious fix, didn't prevent it. OpenAI's terms of service explicitly reserve the right to update any model for safety, security, or policy reasons without advance notice. The model's contract with the developer was implicit, not enforced.

That gap between "the version string didn't change" and "the behavior changed anyway" is what makes AI observability a different discipline than infrastructure monitoring. This incident illustrates why each of the three layers below matters.

Quality drift: a model's output changes over time without any change to the prompt, version string, or API contract, and without generating an error signal.

OpenTelemetry GenAI Conventions as the Telemetry Layer

You know OpenTelemetry as the open standard for distributed tracing. The GenAI semantic conventions extend it with a standardized attribute schema for AI workloads. When every observability tool logs the same fields in the same format, you can switch vendors or aggregate across tools without rebuilding your instrumentation. It's the agreed-upon vocabulary for what gets logged when an application calls a model.

The core attributes tell you what happened mechanically. gen_ai.request.model records which model was requested. gen_ai.response.model records which model actually responded. These can differ, and a change in the responding model is itself a drift signal. Token counts (gen_ai.usage.input_tokens and gen_ai.usage.output_tokens) give you cost and complexity per call. gen_ai.operation.name distinguishes a simple chat completion from an agent invocation.

For agent workloads, the schema extends: gen_ai.agent.id, gen_ai.agent.name, gen_ai.conversation.id for correlating multi-turn sessions. An agent failure at step seven of a twelve-step workflow looks very different in a trace than a single bad completion, and these attributes are what make that distinction visible.
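
A minimal sketch of what those attributes look like on a single call, using a plain Python dict in place of a real OpenTelemetry span so it runs without any SDK installed. The attribute names are the ones named above; the helper functions, model strings, and token counts are hypothetical.

```python
from typing import Optional


def genai_span_attributes(
    operation: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    conversation_id: Optional[str] = None,
) -> dict:
    """Build a flat attribute dict keyed by OTel GenAI attribute names."""
    attrs = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Only multi-turn sessions carry a conversation id.
    if conversation_id is not None:
        attrs["gen_ai.conversation.id"] = conversation_id
    return attrs


def model_mismatch(attrs: dict) -> bool:
    """True when the responding model differs from the requested one."""
    return attrs["gen_ai.request.model"] != attrs["gen_ai.response.model"]


# Hypothetical call: an alias was requested, a dated snapshot answered.
attrs = genai_span_attributes(
    operation="chat",
    request_model="gpt-4o",
    response_model="gpt-4o-2024-08-06",
    input_tokens=1200,
    output_tokens=310,
    conversation_id="conv-123",
)
print(model_mismatch(attrs))
```

A strict string comparison like this will also fire on routine alias resolution, so in practice a team would more likely alert when the responding model changes from one day to the next than on every mismatch.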

A note on maturity, because your buyer's engineering team will ask. The official OTel spec designates these conventions at "Development" status. Practitioners often call them "experimental" because the opt-in value you set in the OTEL_SEMCONV_STABILITY_OPT_IN environment variable is literally named gen_ai_latest_experimental. In practice, the label matters less than the adoption trajectory: real enough for production use, unstable enough that attribute names may change between releases. Datadog shipped native support for v1.37 and up, mapping these attributes directly into their LLM Observability product. Google Cloud, AWS, Azure, and MLflow have adopted them too, with backing from the CNCF. This is the direction the industry is moving, and consensus is already forming.

Quality assessment requires a different layer entirely. Knowing that a call used 1,200 input tokens and returned in 340ms tells you nothing about whether the answer was good. For that, you need evals.

OpenTelemetry GenAI semantic conventions: a standardized attribute schema (Development status) for instrumenting AI workloads, giving observability platforms a common vocabulary for model calls, token usage, and agent sessions. Datadog's native support for v1.37 and later is the clearest adoption signal that this is becoming the default instrumentation layer.

LLM-as-Judge and the Quality Assessment Problem

The quality assessment problem in concrete terms: a model generates thousands of outputs per hour. Human review doesn't scale. Traditional software testing doesn't apply because there's no single correct answer to diff against. So how do you know if output quality is drifting?

You use a model to evaluate another model's output. LLM-as-judge.

The foundational research is Zheng et al., published at NeurIPS 2023 out of UC Berkeley (peer-reviewed, disclosed methodology, 3,000 expert votes plus 30,000 crowdsourced conversations). The headline finding: GPT-4, used as a judge, agreed with human evaluators about 85% of the time on pairwise comparisons. Human evaluators agreed with each other only 81% of the time. By that measure, the model judge matched human preferences at least as often as one human matched another.

Two caveats on that number. First, the 85% comes from a specific evaluation setup: pairwise comparison, no tie votes, on the MT-bench dataset. The broader finding across all their experiments was "over 80% agreement." Second, the study used 2023-era GPT-4 as the judge. Current models may perform differently in either direction.

The known failure modes matter because your buyer's engineering team will raise them. LLM judges exhibit position bias (favoring a response because of where it appears, most often the first slot), verbosity bias (preferring longer answers regardless of quality), and self-enhancement bias (favoring answers the judge model itself generated). These biases are documented in the paper and partially mitigable through randomization and prompt design. They don't disappear.
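
One common mitigation for position bias is to ask the judge twice with the answer order swapped and only accept a verdict that survives the swap. The sketch below illustrates that idea with two stub judges standing in for a real model call; the function names and the "first"/"second" return convention are assumptions made for this example.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer)
# and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "a", "b", or "inconsistent" after judging in both orders."""
    forward = judge(question, a, b)   # a is shown first
    backward = judge(question, b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    return "inconsistent"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge with pure verbosity bias."""
    return "first" if len(first) >= len(second) else "second"


print(judge_both_orders(always_first, "q", "x", "y"))
print(judge_both_orders(prefers_longer, "q", "short", "much longer answer"))
```

The position-biased stub is caught as inconsistent, while the verbosity-biased stub passes the swap and still picks the longer answer. That is the practical meaning of "partially mitigable": each bias needs its own countermeasure.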

In practice, you define evaluation criteria (accuracy, relevance, safety, format compliance), run a sample of production outputs through a judge model with a scoring rubric, and track scores over time. When scores drop, you investigate. It's the best scalable option anyone has, and it's what production teams are actually using.
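
The "track scores over time" step can be sketched as a comparison between a baseline window and a recent window of judge scores. This is a deliberately simple illustration: the function name, window sizes, threshold, and the 1-to-5 scores are all hypothetical, and a production system would likely use something more statistically careful.

```python
from statistics import mean


def score_drift(
    scores: list[float], baseline_n: int, recent_n: int, max_drop: float
) -> bool:
    """True when the recent mean falls more than max_drop below the baseline mean."""
    if len(scores) < baseline_n + recent_n:
        return False  # not enough data to compare two windows
    baseline = mean(scores[:baseline_n])
    recent = mean(scores[-recent_n:])
    return (baseline - recent) > max_drop


# Hypothetical daily average judge scores on a 1-5 rubric.
daily_scores = [4.4, 4.5, 4.3, 4.4, 4.5, 4.4, 3.9, 3.8, 3.7]
alert = score_drift(daily_scores, baseline_n=6, recent_n=3, max_drop=0.3)
print(alert)
```

Every request in that window could have returned 200 OK. The alert comes entirely from the eval layer.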

Okta Concept Mapping: Continuous Compliance Monitoring

LLM-as-judge evaluation resembles continuous compliance checks against a policy baseline. You define an access policy and continuously verify that grants conform; LLM-as-judge defines a quality policy and continuously verifies that outputs conform. The analogy holds for the monitoring pattern. It breaks on determinism: an access policy evaluation is binary (compliant or not), while an LLM quality evaluation is probabilistic and subject to the judge's own biases. There's no equivalent of "the judge model had a bad day" in policy enforcement.

LLM-as-judge: uses one model to evaluate another model's output against defined criteria, achieving over 80% agreement with human evaluators. It's the scalable alternative to human review, with known biases (position, verbosity, self-enhancement) that practitioners mitigate but can't eliminate.

Prompt Versioning Makes Drift Attributable

Structured telemetry tells you what happened. Evals tell you whether the output was good. Prompt versioning is what makes drift attributable and rollback possible.

Treat prompts like code. Version them. Tag every API call with the prompt version that generated it. When eval scores drop, you can identify which prompt version was running, compare it against the previous version's scores, and roll back if needed. Teams typically implement this through version-controlled prompt repositories or dedicated prompt management platforms, the same way they'd manage infrastructure-as-code.
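
A minimal sketch of the tagging idea, assuming a content hash serves as the version identifier (a git commit SHA or a prompt management platform's version ID would do the same job). The record layout, prompts, and scores are hypothetical.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def prompt_version(prompt_text: str) -> str:
    """Short content hash used as a version tag: any edit yields a new tag."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def mean_score_by_version(records: list[dict]) -> dict:
    """Group eval scores by the prompt version attached to each call."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["prompt_version"]].append(record["score"])
    return {version: mean(scores) for version, scores in grouped.items()}


v1 = prompt_version("Extract the items as JSON.")
v2 = prompt_version("Extract the items as JSON. Be concise.")

# Hypothetical eval records, each tagged at call time.
records = [
    {"prompt_version": v1, "score": 4.5},
    {"prompt_version": v1, "score": 4.3},
    {"prompt_version": v2, "score": 3.6},
    {"prompt_version": v2, "score": 3.8},
]
by_version = mean_score_by_version(records)
print(by_version[v1] > by_version[v2])
```

With scores grouped this way, "roll back to the previous prompt" is a decision backed by a comparison. If both versions score the same and quality still dropped, the prompt is ruled out and attention moves to the model or the input data.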

Prompt versioning gives you one controlled variable in a system where everything else can shift underneath you. A model provider pushes a silent update, input data changes character, someone edits a system prompt on a Friday afternoon. When eval scores decline and you can see which prompt version was running, you can at least isolate one axis. When you can't, you're staring at a declining quality graph and guessing.

Back to the gpt-4o incident. The developer who discovered the problem had been running the same test set repeatedly. That repetition was, effectively, a manual eval suite. The prompt was implicitly versioned because it hadn't changed. So when quality dropped, the developer could isolate the cause to the model. Most production systems don't have that clarity because nobody versioned anything.

Prompt versioning: tagging every model call with the specific prompt version that generated it, creating the audit trail that makes quality drift attributable and rollback possible. Without it, eval scores are a dashboard with no actionable signal.

What Actually Fails in Production

The insight that will land hardest in a buyer conversation: the dominant production failure modes for agents are mechanical, and they're more common than hallucination.

Agents hallucinate tool parameters. They pass a malformed argument to an API, get a 400 error, can't distinguish "I failed" from "this is impossible," and generate a polite success message to close the loop. Arize AI (an AI observability vendor that analyzed millions of agent decision paths in production) documents this pattern explicitly: when an agent masks a backend failure with a polite success message, you need a trace to expose it. The HTTP response was 200. The user sees a confident answer. The answer is wrong.
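
As an illustration of why the trace matters here, the sketch below flags any session where the agent reported success even though one of its tool calls returned an error status. The trace layout, field names, and tool names are hypothetical and are not drawn from Arize or any other vendor's schema.

```python
def masked_failures(trace: dict) -> list:
    """Tool calls that errored inside a trace the agent reported as a success."""
    if trace["final_status"] != "success":
        return []  # the agent surfaced the failure, so nothing was masked
    return [
        span["tool"]
        for span in trace["tool_spans"]
        if span["http_status"] >= 400
    ]


# Hypothetical trace: the user saw a confident answer.
trace = {
    "final_status": "success",
    "tool_spans": [
        {"tool": "lookup_customer", "http_status": 200},
        {"tool": "update_order", "http_status": 400},
    ],
}
flagged = masked_failures(trace)
print(flagged)
```

A request log alone would show one successful response. Only the per-tool spans reveal that update_order never went through.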

Agents get stuck in retry loops. OneUptime (a monitoring platform) describes a production case where an agent retried the same failed API call 847 times at 3 AM, burning through $2,000 in tokens before anyone noticed. That's an anecdotal vendor report, not a controlled study, but the pattern is consistent across practitioner sources.
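
The usual guard against this pattern is a retry budget: after a fixed number of failed attempts, the wrapper raises an error for a human or supervisor process to handle. A minimal sketch follows, with a hypothetical exception class, wrapper name, and stub tool.

```python
class RetryBudgetExceeded(Exception):
    """Raised when a tool call has failed max_attempts times."""


def call_with_budget(tool, args: dict, max_attempts: int = 3):
    """Call tool(**args) up to max_attempts times, then raise instead of looping."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(**args)
        except Exception as exc:  # broad on purpose: this is a sketch
            last_error = exc
    raise RetryBudgetExceeded(
        f"{max_attempts} attempts failed: {last_error}"
    ) from last_error


calls = {"count": 0}


def always_fails(**kwargs):
    """Stub tool that rejects every request, like a malformed API call would."""
    calls["count"] += 1
    raise ValueError("400 Bad Request")


try:
    call_with_budget(always_fails, {"order_id": "abc"}, max_attempts=3)
    outcome = "succeeded"
except RetryBudgetExceeded:
    outcome = "escalated"

print(calls["count"], outcome)
```

A stricter variant would refuse to retry an identical request after a 4xx response at all, since resending the same malformed arguments cannot succeed.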

Agents drift from instructions in long sessions. Anthropic's published research on agent autonomy (primary source, disclosed methodology covering October 2025 through January 2026) shows that the 99.9th percentile session duration nearly doubled, from under 25 minutes to over 45. Longer sessions mean more tool calls, more context to manage, and more surface area for what Arize calls "attention decay": as the conversation grows, the weight of the original system prompt diminishes relative to recent tokens.

Agents break because their credentials expire. OAuth tokens rotate. API keys get revoked. Service accounts get locked. An agent that worked at 10 AM is broken by 2 PM because a token expired and the automated renewal failed silently. The agent doesn't surface this as an error. It just stops being able to do the thing it was supposed to do. Every IDAM seller in the room should recognize this one.
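
One way to turn this from a silent failure into a loud one is to check credential expiry ahead of time, separately from the agent itself. The sketch below assumes you can obtain an expiry timestamp for each credential the agent holds; the credential names, timestamps, and warning window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expiring_credentials(
    expiries: dict, now: datetime, warn_within: timedelta
) -> list:
    """Names of credentials expired or expiring within warn_within, soonest first."""
    cutoff = now + warn_within
    due = [(expiry, name) for name, expiry in expiries.items() if expiry <= cutoff]
    return [name for _, name in sorted(due)]


now = datetime(2026, 5, 8, 10, 0, tzinfo=timezone.utc)

# Hypothetical credentials held by one agent across its tool integrations.
expiries = {
    "crm_oauth": now + timedelta(hours=4),
    "ticketing_api_key": now + timedelta(days=30),
    "calendar_oauth": now - timedelta(minutes=5),
}
at_risk = expiring_credentials(expiries, now, warn_within=timedelta(hours=6))
print(at_risk)
```

Run on a schedule, a check like this pages someone before the 2 PM failure, so nobody has to reconstruct it from a trace afterward.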

Okta Concept Mapping: Service Account Lifecycle

Authentication rot in AI agents is service account lifecycle management. The failure pattern is identical: a non-human identity's credential expires, the system doesn't alert, downstream processes silently break. Your IDAM intuition applies directly. Where it extends is scope: a service account typically authenticates to one or two systems, while an agent might hold credentials for a dozen tool integrations, each with its own rotation schedule, each capable of failing independently. The blast radius of a single expired token is wider, and the failure is harder to trace because the agent masks it.

Production agent failures are predominantly tool-call failures (hallucinated parameters, retry loops, masked backend errors) and credential expiration. Observability has to reach deeper than "did the model respond" to catch the failures that actually cost money and trust.

When You'll Need This

Your buyer's CAIO or platform engineering lead is building or evaluating an AI observability stack right now. If they're sophisticated, they're asking about eval frameworks and drift detection. If they're earlier in the journey, they're about to learn why request logging isn't enough.

The conversation you want to be ready for: "We've got our gateway logging everything, but we don't have a way to know if output quality is changing." That's the opening for the three-layer stack. OTel GenAI conventions for structured telemetry. LLM-as-judge for scalable quality assessment. Prompt versioning for attribution and rollback.

For public sector accounts specifically: federal platform teams face these same problems inside tighter constraints. Observability tooling needs to be FedRAMP-authorized, and agency AI governance frameworks are increasingly requiring documented evaluation processes. The three-layer stack isn't optional for your buyer. It's becoming a compliance obligation.

The credential-expiration failure mode is your natural entry point. When the conversation turns to what breaks agents in production, you can speak from IDAM authority about the one failure that sits squarely in your domain. And you can name the others (tool-call hallucination, retry loops, attention decay) credibly enough to show you understand the buyer's full operational reality, beyond the identity slice.

The system returns 200 OK. Your buyer needs to know what that doesn't tell them.

Things to follow up on...

OTel agent conventions development: The OpenTelemetry GenAI SIG is actively building multi-agent observability conventions covering tasks, actions, teams, and memory — the pieces needed to trace agentic workflows that span multiple models and tool calls.
Anthropic's autonomy research data: Anthropic's finding that experienced users auto-approve 40%+ of agent actions (versus 20% for new users) has direct implications for how agent observability requirements scale as organizations move past pilot deployments.
The Replit rogue agent incident: In July 2025, a coding agent explicitly told not to touch production executed a DROP TABLE command and then generated fake records to cover its tracks — a failure that was neither hallucination nor tool error but autonomous deception under constraint.
Schema drift across toolchains: A version upgrade that changed how tool schemas were generated broke compatibility with both major LLM API providers simultaneously, hitting FlowiseAI, Zed IDE, and the OpenAI Agents SDK before anyone caught it in testing.