Every platform team logs requests. The question that actually matters in production is whether the responses are any good, and whether they're getting worse.
Those are different problems, and the second one is harder. Logging tells you what happened. Observability tells you whether what happened was correct. In AI systems, the gap between those two things is where production incidents live.
What Observability Actually Means Here
AI observability, precisely stated, is the capacity to detect changes in output quality and system behavior from external signals, without access to the model's internal state. You can't inspect the model. You can only measure what it produces, how long it takes, what tools it calls, and whether those calls succeed. The instrumentation problem is building a signal stack that catches quality drift and failure modes from those external measurements alone.
That stack has four layers, and they work together.
The Signal Stack
The schema layer is OpenTelemetry's GenAI semantic conventions, which define standard attribute names for AI telemetry: model identifier, input and output token counts, prompt content, completion content, tool calls attempted, tool call outcomes, latency by stage. This is infrastructure, not optional tooling. Without a standard schema, every service team instruments differently, aggregation across services becomes a custom engineering project, and the signals you need for incident response don't exist in a queryable form. The conventions are still marked experimental as of this writing, which means the attribute names can change. That's worth noting before you build dashboards against them, but it's not a reason to wait.
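To make the schema point concrete, here is a minimal sketch of mapping one model call onto shared attribute names. The gen_ai.* keys follow the GenAI conventions as I understand them at the time of writing and may be renamed; LLMCall, genai_span_attributes, and the app.latency_ms key are hypothetical names invented for this illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    """Hypothetical record of one model call, as the application sees it."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


def genai_span_attributes(call: LLMCall) -> dict:
    """Map a call record onto one shared set of attribute names.

    The gen_ai.* keys are assumed from the OpenTelemetry GenAI conventions
    and may change; app.latency_ms is an invented application-specific key.
    """
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": call.model,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
        "app.latency_ms": call.latency_ms,
    }
```

In a real service these attributes would be set on a span or log record by whatever telemetry SDK the team uses. The point of the sketch is only that every service emits the same keys, so one query works across all of them.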
The quality layer is online sampled LLM-as-judge evaluation. Human review of every production response isn't operationally feasible at scale; a typical agency deployment handling thousands of daily interactions would need a dedicated review team just to keep up. Instead, a sample of production traffic, commonly 3–10%, is routed to a separate model configured as an evaluator. The judge scores responses against defined criteria: faithfulness to source documents, relevance to the query, task completion, safety. The output is a continuous quality signal: not "was this response correct" but "is the distribution of quality scores shifting." A drop in mean faithfulness scores over a 48-hour window is worth investigating. A spike in safety flag rate is worth stopping for.
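A rough sketch of the two mechanical pieces in that paragraph: choosing which requests go to the judge, and checking whether the score distribution has moved. The 5% rate, the 0.05 drop tolerance, and every function name here are assumptions for illustration. The judge call itself is left out, and a production system would likely use a more careful statistical test than a difference of means.

```python
import hashlib
from statistics import mean

SAMPLE_RATE = 0.05  # assumed; the text cites 3-10% as common


def should_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically pick a stable fraction of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def mean_shift(baseline: list[float], recent: list[float]) -> float:
    """Recent mean score minus baseline mean score (negative means worse)."""
    return mean(recent) - mean(baseline)


def is_drifting(baseline: list[float], recent: list[float],
                max_drop: float = 0.05) -> bool:
    """Flag when the recent mean has fallen more than max_drop below baseline."""
    return mean_shift(baseline, recent) < -max_drop
```

Hashing the request ID rather than calling a random number generator means the same request is always in or out of the sample, which makes a judged response easy to trace back to its original telemetry.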
The deployment gate is a regression test suite that runs before any prompt template change reaches production. The suite is a fixed set of test cases with expected outputs, or more precisely, expected output characteristics, since LLM outputs aren't deterministic. A template change that drops the pass rate below a defined threshold blocks the deploy. This is CI/CD for prompts. It's not exotic; it's the same gate you'd put on any configuration change that affects system behavior at scale.
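The gate logic itself is small. The sketch below assumes each test case pairs an input with a check function that tests an output characteristic rather than an exact string. The 0.90 threshold, the case format, and the names pass_rate and gate are illustrative assumptions, and generate stands in for whatever renders the template and calls the model.

```python
from typing import Callable

PASS_THRESHOLD = 0.90  # assumed; calibrating this is a policy decision


def pass_rate(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Fraction of cases whose generated output satisfies the case's check."""
    if not cases:
        raise ValueError("regression suite is empty")
    passed = sum(1 for case in cases if case["check"](generate(case["input"])))
    return passed / len(cases)


def gate(cases: list[dict], generate: Callable[[str], str],
         threshold: float = PASS_THRESHOLD) -> bool:
    """Return True if the deploy may proceed, False if it should be blocked."""
    return pass_rate(cases, generate) >= threshold
```

Because outputs aren't deterministic, a real suite would probably run each case several times and gate on the aggregate. The sketch runs each case once to keep the shape visible.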
The version layer is the discipline of treating prompt templates as versioned code. Templates live in source control with changelogs. Every production deployment records which template version was running. When a quality signal degrades, the first question is "what changed," and the answer needs to be in the deployment record, not reconstructed from memory. Rollback means reverting to a prior template version, not guessing what the prompt used to say.
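One way to picture the deployment record, as a sketch rather than a prescribed format: each deploy stores the template name, version, and a content hash, and rollback picks the previous record instead of relying on anyone's memory. TemplateVersion, deployment_record, and rollback_target are hypothetical names, and the record fields are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateVersion:
    """A prompt template as it exists in source control (illustrative)."""
    name: str
    version: str
    text: str

    @property
    def content_hash(self) -> str:
        """Short hash of the template text, to detect unversioned edits."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def deployment_record(template: TemplateVersion, deployed_at: str) -> dict:
    """What gets written to the deployment log for each production deploy."""
    return {
        "template": template.name,
        "version": template.version,
        "content_hash": template.content_hash,
        "deployed_at": deployed_at,
    }


def rollback_target(history: list[dict]) -> dict:
    """Return the record deployed immediately before the current one."""
    if len(history) < 2:
        raise ValueError("no prior deployment to roll back to")
    return history[-2]
```

The content hash answers "what changed" even when someone edits a template without bumping its version number.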
Where the Incidents Actually Are
Most production AI incidents are tool-call failures and runaway loops, not model errors. The instrumentation needs to be built around that.
When an agent calls an external tool (a database query, an API, a document retrieval system) and that tool returns an error, the model may retry, may hallucinate a response, or may enter a loop trying to recover. When the loop detection logic is absent or misconfigured, the agent iterates until it hits a token limit or a timeout, burning latency and sometimes producing a confidently wrong final answer. These failures don't show up as model quality degradation in the judge scores. They show up as elevated tool-call failure rates, anomalous iteration counts, and long-tail latency distributions.
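The simplest form of loop detection is a hard cap on iterations, sketched below. The cap of 5, the LoopLimitExceeded exception, and the step callback are assumptions for illustration. In a real agent, step would wrap one model turn plus any tool calls it makes.

```python
from typing import Callable

MAX_ITERATIONS = 5  # assumed cap; tune per workload


class LoopLimitExceeded(RuntimeError):
    """Raised when an agent fails to finish within the iteration cap."""


def run_with_loop_guard(step: Callable[[int], bool],
                        max_iterations: int = MAX_ITERATIONS) -> int:
    """Call step(i) until it reports completion; return the iterations used.

    Raises LoopLimitExceeded instead of letting the agent run until it hits
    a token limit or a timeout.
    """
    for i in range(1, max_iterations + 1):
        if step(i):
            return i
    raise LoopLimitExceeded(f"no completion after {max_iterations} iterations")
```

The returned iteration count is itself a metric worth recording per session, since an upward creep in that number can appear before the cap is ever hit.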
The signal stack has to instrument those specifically: tool call success and failure rates by tool type, loop iteration counts per session, time-to-completion distributions. An agency running an AI assistant for case processing that suddenly shows a 15% tool-call failure rate on its document retrieval integration has an integration problem, not a model problem. The fix is in the integration layer, visible in the tool-call metrics, invisible in the quality scores.
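Breaking failure rates out by tool is what makes the document retrieval example visible. A minimal sketch, assuming tool-call events arrive as (tool name, succeeded) pairs; the event shape, the 10% alert threshold, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def failure_rates(events: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the failure rate per tool from (tool_name, succeeded) pairs."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for tool, succeeded in events:
        totals[tool] += 1
        if not succeeded:
            failures[tool] += 1
    return {tool: failures[tool] / totals[tool] for tool in totals}


def tools_over_threshold(rates: dict[str, float],
                         threshold: float = 0.10) -> list[str]:
    """Names of tools whose failure rate exceeds the alert threshold."""
    return sorted(tool for tool, rate in rates.items() if rate > threshold)
```

With 3 failures in 20 document retrieval calls and none in 20 database calls, only the retrieval integration is flagged, which points the investigation at the integration layer rather than the model.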
The Conversation You'll Have
A CAIO preparing an AI deployment for a federal benefits system will ask some version of: "How do we know if the system is giving wrong answers?" The answer that lands is: "We sample production responses and score them continuously, we gate every prompt change on a regression suite, and we watch tool-call failure rates because that's where the incidents actually happen."
That framing separates teams that have thought through production operations from teams that have thought through demos.
Okta Concept Mapping
Deploy-gated regression suites map to pre-production policy validation, the practice of testing an access policy change in a staging environment before pushing to production. The analogy holds for the workflow: version the artifact, test before deploy, gate on pass rate, preserve rollback. It breaks on the nature of the test. Access policy validation is deterministic: you can enumerate exactly what access a policy grants or denies. Prompt template regression testing is probabilistic: the same template produces a distribution of outputs, and "passing" means the distribution hasn't shifted past a threshold, not that a specific output is correct. A regression suite that passes at 88% is still failing 12% of its cases, and a threshold set at or below that level accepts those failures by design. That's not a bug in the testing approach; it's the nature of the system being tested, and it means the threshold calibration is itself a policy decision that needs to be owned somewhere in the organization.
The logging problem is solved. The quality signal problem is where the operational work is, and the teams that figure it out first will be the ones whose AI deployments survive contact with production.

