Lesson 6: Observability and Evals in Production

By Leigh Garrity, May 6, 2026

Production observability for AI systems is the practice of knowing, in real time, whether your deployed model is doing what you think it's doing — not just whether it's responding. Logging requests tells you the system is alive. It does not tell you whether the outputs are coherent, whether tool calls are completing correctly, or whether a prompt change you shipped on Tuesday quietly broke something by Thursday. Those failures don't throw errors. They accumulate.

The four mechanisms that actually catch them are: a standardized telemetry schema, continuous output evaluation, deploy-gated regression testing, and prompt versioning with rollback. Each addresses a different failure mode. Together they constitute the operational floor for any AI system you'd stake a production SLA on.

How It Works

OpenTelemetry GenAI semantic conventions are the emerging standard for what AI telemetry looks like. OpenTelemetry itself isn't new: it's the CNCF-hosted observability framework that most platform teams already use for distributed tracing. The GenAI semantic conventions extend it with standardized attribute names specific to language model interactions: model name, prompt and completion token counts, tool calls invoked, finish reason, latency per generation. Without a shared schema, every team invents its own telemetry and you can't aggregate across services, compare across model versions, or hand data to a centralized platform team that doesn't know your application. The conventions are still maturing (the 1.0 stable release landed in late 2025), but the major observability vendors have already aligned on them, which means the ecosystem is converging faster than most standards do.
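
To make the schema concrete, here is a minimal sketch of the attributes a single generation might carry. The `gen_ai.*` keys follow attribute names documented in the OpenTelemetry GenAI semantic conventions, but names have shifted between revisions, so check them against the spec version your SDK targets. The helper `genai_span_attributes` and the model name `example-model` are hypothetical, and the sketch deliberately uses only the standard library rather than the OpenTelemetry SDK.

```python
from typing import Dict, List, Union

AttrValue = Union[str, int, List[str]]


def genai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: List[str],
    operation: str = "chat",
) -> Dict[str, AttrValue]:
    """Build a flat attribute dict for one model call (hypothetical helper).

    Key names are assumed from the GenAI semantic conventions; verify them
    against the spec version you deploy.
    """
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


# Illustrative values only; "example-model" is not a real model name.
attrs = genai_span_attributes("example-model", 812, 164, ["stop"])
```

In a real service you would attach this dictionary to the span your tracing SDK records for the model call, so every team emits the same keys.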

LLM-as-judge is online, sampled evaluation. You route some percentage of live completions — typically 5 to 10 percent — to a second model that scores them against a rubric: groundedness, relevance, task completion, safety. The scores stream into your observability platform alongside your latency metrics. When the rolling average on "task completion" drops four points over 48 hours, you have a signal. When it drops twelve points in six hours after a prompt deploy, you have an incident. The mechanism is imperfect — the judge model has its own biases, and rubric design is genuinely hard — but it catches the class of failure that your error rate and p99 latency will never surface, because the model is responding fine. It's just responding badly.
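
The sampling and rolling-average pieces of that loop can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `should_sample` and `RollingScore` are hypothetical names, the judge call itself is omitted, and the window here counts scores rather than hours. Sampling is hash-based so the same request ID always gets the same decision.

```python
import hashlib
from collections import deque
from statistics import mean


def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of requests for judging."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


class RollingScore:
    """Track the last `window` judge scores and flag a drop from baseline."""

    def __init__(self, window: int, baseline: float, max_drop: float):
        self.scores = deque(maxlen=window)
        self.baseline = baseline
        self.max_drop = max_drop

    def add(self, score: float) -> bool:
        """Record a score; return True once the window is full and the
        rolling mean sits at least `max_drop` points below baseline."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        return (self.baseline - mean(self.scores)) >= self.max_drop


# Assumed numbers: baseline of 90 on "task completion", alert on a 4-point drop.
monitor = RollingScore(window=3, baseline=90.0, max_drop=4.0)
alerts = [monitor.add(s) for s in (90, 90, 90, 70)]
```

A production version would window by time, keep one monitor per rubric dimension, and use a second, tighter threshold to separate a slow drift from a post-deploy incident.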

Deploy-gated regression suites are the pre-production complement to online evaluation. Before a new prompt template or model version ships, you run it against a fixed evaluation set — curated examples with known-good expected outputs — and gate the deploy on the scores. If the new version drops below threshold on your regression suite, the pipeline stops. This is CI/CD for prompts. It catches regressions before users see them, which is the only time catching them is cheap.
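
A gate of this kind can be as small as the sketch below. Everything named here is an assumption for illustration: `run_gate`, the two-case evaluation set, the 0.9 threshold, and the exact-match scorer, which stands in for whatever rubric or judge your suite actually uses. The dictionaries play the role of a candidate model so the example runs without network access.

```python
from statistics import mean
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (prompt, expected output)


def exact_match(output: str, expected: str) -> float:
    """Stand-in scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_gate(
    cases: List[EvalCase],
    generate: Callable[[str], str],
    threshold: float,
    score: Callable[[str, str], float] = exact_match,
) -> Tuple[bool, float]:
    """Score a candidate on a fixed set; pass only if the mean clears threshold."""
    scores = [score(generate(prompt), expected) for prompt, expected in cases]
    average = mean(scores)
    return average >= threshold, average


CASES = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]

# Fake candidates: lookup tables standing in for real model calls.
GOOD = {"What is 2 + 2?": "4", "Capital of France?": "paris"}
BAD = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}

passed, avg = run_gate(CASES, GOOD.__getitem__, threshold=0.9)
bad_passed, bad_avg = run_gate(CASES, BAD.__getitem__, threshold=0.9)
```

In a pipeline, the script would exit non-zero when the gate fails, which is what actually stops the deploy.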

Prompt template versioning is the discipline that makes rollback possible. Prompts are code. They need version control, changelogs, and the operational ability to revert in minutes. A prompt change that reduces task completion rate by eight percent needs to be undoable before the support queue fills up — not after two hours of archaeology through deployment logs and Slack threads trying to identify what changed. Teams that treat prompts as configuration rather than code discover this the hard way. Store prompt templates in version control, tag releases, and wire rollback into your deployment tooling the same way you'd wire it for any other artifact.
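
As a rough sketch of what "rollback in minutes" requires, the registry below keeps every published template and an activation history, so reverting is a single call rather than an investigation. `PromptRegistry` and its methods are hypothetical; in practice the storage layer would be your version control system and release tags, not an in-memory dictionary.

```python
from typing import Dict, List, Tuple


class PromptRegistry:
    """In-memory stand-in for versioned prompt storage (illustrative only)."""

    def __init__(self) -> None:
        self._templates: Dict[str, str] = {}
        self._history: List[str] = []  # activation order; last entry is live

    def publish(self, version: str, template: str) -> None:
        """Store a template under an immutable version label."""
        if version in self._templates:
            raise ValueError(f"version {version!r} already published")
        self._templates[version] = template

    def activate(self, version: str) -> None:
        """Make a published version the live one."""
        if version not in self._templates:
            raise KeyError(f"unknown version {version!r}")
        self._history.append(version)

    def rollback(self) -> str:
        """Revert to the previously active version and return its label."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active(self) -> Tuple[str, str]:
        """Return (version, template) for the live prompt."""
        version = self._history[-1]
        return version, self._templates[version]


registry = PromptRegistry()
registry.publish("v1", "Answer using only the provided context: {question}")
registry.publish("v2", "Answer briefly: {question}")
registry.activate("v1")
registry.activate("v2")
restored = registry.rollback()  # v2 regressed, so revert
```

The design choice that matters is immutability: a version label always maps to the same template text, so the activation history doubles as the changelog of what was live and when.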

When You'll Need This

The CIO question that surfaces this conversation is usually some version of: "How do we know if the model is getting worse?" The honest answer, if you're only watching latency and error rates, is that you don't. Output quality drift is invisible to infrastructure monitoring. A tool-call loop that executes 40 times before timing out looks like a slow request in your latency histogram. A prompt regression that drops answer quality by 15 percent looks like nothing at all.
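
The tool-call loop is a good example of a failure that becomes visible once you count the right thing. The sketch below groups tool-call events by request and flags any request that exceeds a limit. The event shape (`trace_id`, `kind`), the function name `find_tool_loops`, and the limit of 10 are all assumptions for illustration; map them onto whatever your telemetry actually emits.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def find_tool_loops(
    events: Iterable[Mapping[str, str]], max_calls: int = 10
) -> Dict[str, int]:
    """Return {trace_id: tool_call_count} for requests over `max_calls`."""
    counts = Counter(
        event["trace_id"] for event in events if event["kind"] == "tool_call"
    )
    return {trace_id: n for trace_id, n in counts.items() if n > max_calls}


# Synthetic events: one request loops 40 times, another behaves normally.
events = (
    [{"trace_id": "req-a", "kind": "tool_call"}] * 40
    + [{"trace_id": "req-b", "kind": "tool_call"}] * 2
    + [{"trace_id": "req-b", "kind": "generation"}]
)
loops = find_tool_loops(events)
```

In a latency histogram, `req-a` is just a slow request; counted this way, it is unmistakably a loop.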

In public sector deployments, where model outputs may inform decisions with real downstream consequences, this isn't an abstract concern. Agencies procuring AI capabilities are increasingly asking vendors to demonstrate continuous evaluation infrastructure as a condition of deployment approval — not as a nice-to-have, but as an audit requirement. Build this stack before your first quality incident, not after.

Okta Concept Mapping

The closest IDAM analogue to AI production observability is UEBA — behavioral analytics layered on top of event logs to detect anomalies that raw logging misses. The parallel holds in one important way: both disciplines recognize that event logs tell you that something happened, not whether it should have. Where the analogy breaks is on ground truth. In UEBA, a flagged anomaly is eventually resolvable — you investigate, you determine whether the behavior was malicious, you have a fact of the matter. In LLM quality monitoring, the ground truth of "was this a good response?" is often permanently ambiguous. The judge model is always probabilistic. You're not detecting deviation from a known-good state; you're estimating quality against a rubric you also designed. That's a different epistemic situation, and it means your confidence intervals on AI quality signals are structurally wider than anything your SIEM produces.

Logging is where you start. Past that, you need instrumentation that can see what error rates can't — and the teams that build it after an incident always wish they'd built it first.
