Evals and Observability: How You Know It's Working

By Leigh Garrity, May 6, 2026

Lesson 6 of 9 — Patterns & Practice

Production AI quality assurance has two distinct disciplines. Offline evals tell you what a system does before you ship it. Online observability tells you what it's doing after. Buyers who conflate them tend to ship confidently and discover problems slowly. The primary finding that should reorient every conversation you have about instrumentation: most agent incidents trace to tool-call failures and runaway loops, not to model output quality. That single fact changes which questions are worth asking and which answers are worth trusting.

Offline Evals

What it is: A structured testing discipline that measures AI system behavior against known-good examples before deployment.

What it does: Offline evals run your agent or model against a curated dataset of inputs and expected outputs — the golden dataset — and score the results. The scoring can be deterministic (did the agent call the right tool? did the output match a reference string?) or probabilistic (does this response meet quality criteria, as judged by a human reviewer or, increasingly, another language model). When you run these suites against every new model version or prompt change, you have regression coverage: you know whether a change broke something that was previously working.
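The regression idea above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `GoldenCase`, `run_suite`, `regressions`, the two toy agents, and the tool names are all hypothetical, and the scorer is the simplest deterministic check (did the agent call the expected tool?).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    user_input: str
    expected_tool: str  # the tool a correct run should call


# Hypothetical agent interface: takes an input, returns the name of the tool it called.
Agent = Callable[[str], str]


def run_suite(agent: Agent, cases: list[GoldenCase]) -> dict[str, bool]:
    """Score each case deterministically: did the agent pick the expected tool?"""
    return {c.case_id: agent(c.user_input) == c.expected_tool for c in cases}


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases that passed on the baseline version but fail on the candidate."""
    return sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )


GOLDEN = [
    GoldenCase("refund-1", "I want my money back for order 123", "lookup_order"),
    GoldenCase("reset-1", "I forgot my password", "send_reset_link"),
]


def agent_v1(text: str) -> str:
    # Stand-in for the currently deployed version.
    return "lookup_order" if "order" in text else "send_reset_link"


def agent_v2(text: str) -> str:
    # Stand-in for a prompt change that broke password handling.
    return "lookup_order"


baseline = run_suite(agent_v1, GOLDEN)
candidate = run_suite(agent_v2, GOLDEN)
print(regressions(baseline, candidate))  # ['reset-1']
```

The useful output is the diff between versions rather than the raw pass rate: a candidate can score well overall and still break a case that used to work.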

LLM-as-judge comes up in buyer conversations, and the name is accurate in a slightly uncomfortable way. You are asking a language model to grade another language model's homework. The results are surprisingly useful — LLMs are reasonably good at assessing coherence, relevance, and policy compliance at scale — and the epistemics are exactly as uncomfortable as you'd expect. It works well enough to be widely adopted. It does not work well enough to be the only thing you're doing.
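One way to keep the judge honest is to check its verdicts against human labels on a sample of your own task. The sketch below uses Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The choice of statistic is an assumption made for illustration, and the label lists are invented example data, not results.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own label rates.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: eight responses labeled by a human reviewer and by a judge model.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement={raw_agreement:.2f}, kappa={kappa:.2f}")
```

In this invented sample the judge agrees with the human on 75% of cases, but kappa is noticeably lower (about 0.47) because both raters say "pass" most of the time. That gap is the reason to look past raw agreement before trusting a judge at scale.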

A golden dataset is a curated collection of representative inputs with verified correct outputs or quality labels. Building one is expensive. Maintaining one — updating it as the task definition evolves, as edge cases are discovered, as the deployment context changes — is more expensive. The quality of the eval is bounded by the quality of the dataset, and datasets go stale.

Where it comes from: Offline evals are a direct extension of software testing practice — unit tests, integration tests, regression suites — applied to a system whose outputs are probabilistic rather than deterministic. The discipline matured in ML research teams at large technology companies and has been productized into tools like Braintrust, LangSmith's eval layer, and Langfuse's experiment tracking. These are representative examples of a fast-moving landscape; the tooling will look different in eighteen months.

What makes it distinct: Offline evals are the only discipline that gives you a controlled, reproducible measurement of system behavior. You choose the inputs. You define the scoring criteria. You run the same test against every version. That control is what makes regression testing possible. It is also the limit: you can only test what you thought to put in the dataset.

Online Observability

What it is: Continuous instrumentation of a running AI system that captures what it actually does in production, in real time.

What it does: Online observability instruments the agent's execution trace: every step, every tool call, every model invocation, every response, all of it queryable. When an agent runs a multi-step task, observability reconstructs the full call graph: which tools were called, in what order, with what inputs, and what came back. It captures latency, token counts, error rates, and the content of intermediate steps that never surface in the final output.

Drift detection is the forward-looking layer. Once you've established a baseline of normal behavior — typical tool-call sequences, typical response latency, typical output distributions — you can alert when production behavior deviates from that baseline. A model update that shifts output tone. A tool that starts returning unexpected schemas. An agent that's entering a retry loop it can't exit. These are the signals that offline evals don't generate because they don't run continuously against live traffic.
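As a rough sketch of what "deviates from that baseline" can mean in practice, the example below compares the mix of tool calls in a recent window against a baseline mix using total variation distance. The metric, the 0.2 threshold, and the tool names are assumptions chosen for illustration; a real deployment would tune these against its own traffic.

```python
from collections import Counter


def distribution(tool_calls: list[str]) -> dict[str, float]:
    """Share of calls going to each tool."""
    if not tool_calls:
        raise ValueError("need at least one tool call")
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the summed absolute difference between two distributions (0 = identical, 1 = disjoint)."""
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)


def drifted(baseline_calls: list[str], window_calls: list[str], threshold: float = 0.2) -> bool:
    """True when the recent tool-call mix has moved further than the threshold from baseline."""
    distance = total_variation(distribution(baseline_calls), distribution(window_calls))
    return distance > threshold


# Hypothetical traffic: authentication calls jump from 10% to 50% of all tool calls.
baseline_calls = ["search"] * 60 + ["fetch"] * 30 + ["auth"] * 10
recent_calls = ["search"] * 30 + ["fetch"] * 20 + ["auth"] * 50

print(drifted(baseline_calls, recent_calls))    # True
print(drifted(baseline_calls, baseline_calls))  # False
```

A sudden rise in the share of authentication calls, as in this made-up window, is the kind of shift that produces no error by itself but often precedes a retry loop.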

Tracing is the foundational capability, and it differs from logging in a way that matters at agent scale. Logging captures events. Tracing reconstructs causality across a distributed, multi-step execution. For a single-turn LLM call, logging is probably sufficient. For an agent that makes five tool calls, hands off to a subagent, and synthesizes a response from three external data sources, you need tracing to understand what actually happened when something goes wrong.
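The difference shows up in the data model. A log is a flat list of events; a trace links each step to its parent, so the call graph can be rebuilt afterward. The sketch below assumes a simplified, hypothetical `Span` record (an ID, a parent ID, a name, a status) rather than any specific tracing library's schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root of the trace
    name: str
    status: str  # "ok" or "error"


def render_trace(spans: list[Span]) -> list[str]:
    """Rebuild the call tree from parent links and return it as indented lines."""
    children: dict[Optional[str], list[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append(f"{'  ' * depth}{span.name} [{span.status}]")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical agent run: a subagent's document fetch fails and the error propagates up.
spans = [
    Span("1", None, "agent.run", "error"),
    Span("2", "1", "tool.search", "ok"),
    Span("3", "1", "subagent.summarize", "error"),
    Span("4", "3", "tool.fetch_document", "error"),
    Span("5", "1", "llm.synthesize", "ok"),
]

for line in render_trace(spans):
    print(line)
```

With only a flat log, the same run would show "agent.run failed" and a handful of unrelated-looking events. The parent links are what let a post-mortem point at `tool.fetch_document` inside the subagent as the origin.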

Where it comes from: Online observability is applied ML monitoring practice, which itself descended from distributed systems observability — the OpenTelemetry lineage, the SRE discipline of measuring systems in production rather than in test. Arize and similar platforms brought this practice into ML model monitoring before the agent era; the current generation of tools extends it to the multi-step, tool-calling patterns that agents introduce. The discipline assumes that production behavior is the ground truth, and that your job is to instrument it well enough to understand it.

What makes it distinct: Online observability is the only discipline that sees what actually happens with real users, real data, and real tool responses. It catches failure modes that no golden dataset anticipated. It also catches the failure mode that dominates production agent incidents: tool-call failures and runaway loops, which are behavioral patterns that emerge from the interaction between the model and the environment, not from the model's output quality in isolation.

Comparison

Because offline evals and online observability are complementary rather than competing, the useful comparison is across dimensions that matter for buyer decisions, not a head-to-head. Five dimensions anchor it: timing, coverage, what each misses, cost model, and ownership.

Timing

Offline evals run before deployment and on a scheduled cadence thereafter — triggered by model version changes, prompt updates, or dataset additions. They are asynchronous with production traffic. Online observability runs continuously, in production, against live traffic. It is synchronous with what the system is actually doing.

Model behavior can change between eval runs. A vendor updates a model version. A connected API changes its response schema. A new class of user query emerges that the golden dataset doesn't cover. Offline evals won't catch any of these until the next scheduled run. Online observability catches them when they happen.

Coverage

Offline evals cover the failure modes you anticipated when you built the dataset. They are excellent at catching regressions — changes that break previously working behavior — and at measuring quality on known task types. They are the right tool for answering "did this model update make things worse?"

Online observability covers the failure modes that emerge from real-world conditions. Analysis of production agent deployments consistently shows that the majority of incidents trace to tool-call failures and loop conditions rather than to model output quality. An agent that calls an authentication endpoint with a malformed token, receives an error, retries indefinitely, and exhausts its context window is not having a model quality problem. It's having a tool integration and loop-detection problem. Offline evals rarely surface this because golden datasets rarely include the realistic error responses that production tools return. Online observability catches it because it's watching the actual calls.
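A basic loop detector over the recorded tool calls is enough to flag the scenario above. This is a minimal sketch under stated assumptions: calls are available as `(tool_name, arguments)` pairs, a "loop" means the same call with the same arguments repeated back to back, and the limit of three repeats is an arbitrary illustrative value.

```python
import json
from typing import Optional


def call_signature(tool: str, args: dict) -> str:
    """Stable string identity for a tool call: same tool and same arguments give the same signature."""
    return f"{tool}:{json.dumps(args, sort_keys=True)}"


def detect_loop(calls: list[tuple[str, dict]], max_repeats: int = 3) -> Optional[str]:
    """Return the signature of the first call repeated more than max_repeats times in a row."""
    streak = 0
    previous: Optional[str] = None
    for tool, args in calls:
        signature = call_signature(tool, args)
        streak = streak + 1 if signature == previous else 1
        previous = signature
        if streak > max_repeats:
            return signature
    return None


# Hypothetical run: one search, then the same failing auth call five times.
calls = [("search", {"q": "invoice 42"})] + [("auth", {"token": "bad"})] * 5

print(detect_loop(calls))  # auth:{"token": "bad"}
```

A check like this only sees exact repeats. Loops that vary their arguments slightly on each pass need a looser notion of similarity, which is one reason step budgets and token budgets are often used alongside it.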

What Each Misses

Offline evals miss production conditions. The golden dataset is a controlled approximation of reality. Real users ask questions the dataset didn't anticipate. Real tools return errors, timeouts, and schema variations the test harness didn't simulate. Real agent runs involve state accumulation across many steps in ways that unit-style evals don't capture.
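Part of that gap can be narrowed by putting failure responses into the test harness on purpose. The sketch below pairs stubbed tools that return production-style errors with a toy retry policy, then asserts on the behavior you want: stop and escalate on a credential error, and give up after a bounded number of attempts on a timeout. The stub names, status codes chosen, and the `run_with_retry_budget` policy are all hypothetical stand-ins for your own agent and tools.

```python
from typing import Callable

ToolResult = dict  # e.g. {"status": 200, "body": ...} or {"status": 401, "error": ...}


def failing_auth_tool(token: str) -> ToolResult:
    # Stub that mimics a rejected credential instead of a happy path.
    return {"status": 401, "error": "invalid_token"}


def timing_out_tool(token: str) -> ToolResult:
    # Stub that mimics an upstream timeout on every attempt.
    return {"status": 504, "error": "gateway_timeout"}


def run_with_retry_budget(
    tool: Callable[[str], ToolResult], token: str, max_attempts: int = 3
) -> dict:
    """Toy retry policy: bounded attempts, and no retry when the credential itself is rejected."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        result = tool(token)
        if result["status"] == 200:
            return {"outcome": "success", "attempts": attempts}
        if result["status"] in (401, 403):
            # Retrying with the same credential cannot succeed, so stop and escalate.
            return {"outcome": "escalated", "attempts": attempts, "reason": result["error"]}
    return {"outcome": "gave_up", "attempts": attempts}


# Golden cases expressed as (stubbed tool, expected outcome, maximum acceptable attempts).
ERROR_CASES = [
    (failing_auth_tool, "escalated", 1),
    (timing_out_tool, "gave_up", 3),
]

for tool, expected_outcome, attempt_cap in ERROR_CASES:
    result = run_with_retry_budget(tool, "malformed-token")
    passed = result["outcome"] == expected_outcome and result["attempts"] <= attempt_cap
    print(tool.__name__, "PASS" if passed else "FAIL")
```

Cases like these do not replace production tracing, since the stubs only cover failures someone thought to simulate, but they move at least the known error shapes into the regression suite.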

Online observability misses intentional regression coverage. It can tell you that something changed; it cannot tell you whether the change is a regression against a defined quality standard unless you've built that standard into your alerting. It also generates significant noise. Production traffic is messy. Without a well-maintained baseline and thoughtful alert thresholds, observability data becomes a log of everything that happened, which is not the same as a signal about what went wrong.

Cost Model

Offline evals cost compute and human time. Running a regression suite against a large golden dataset requires inference compute (more if you're using LLM-as-judge, which requires a separate model call per evaluation). Building and maintaining the golden dataset requires human judgment. The cost is episodic, spiking when you run evals and near-zero between runs.

Online observability costs infrastructure and storage, continuously. Every traced execution generates data. At scale, the storage and processing costs are non-trivial, and the tooling to make that data queryable adds to the bill. The cost is proportional to traffic volume and trace depth. Buyers sometimes treat observability as a nice-to-have because the cost is visible and continuous. The cost of not having it becomes visible only after the first production incident.

Ownership

Offline evals are typically owned by the team closest to the model — data scientists, ML engineers, or the prompt engineering function. The discipline requires judgment about what to put in the golden dataset and how to score outputs, which requires domain knowledge about the task.

Online observability is typically owned by platform or SRE teams — the people responsible for production reliability. It requires instrumentation of the runtime environment, integration with alerting infrastructure, and operational response procedures. In organizations that haven't yet sorted out who owns AI agents in production, ownership of observability is often genuinely unclear. That ambiguity is worth surfacing before an incident makes it urgent.

Field Language Guide

Don't say	Do say	Why it matters
"We tested it before launch, so we're confident."	"What does your regression suite cover, and how often does it run against new model versions or prompt changes?"	Pre-launch testing is offline evals. It doesn't cover production conditions. The follow-up surfaces whether they have a continuous eval posture or a one-time gate.
"The AI hallucinated."	"Was this a model output failure or a tool-call failure? Your tracing should tell you which."	Most production agent incidents are tool-call failures, not model errors. Accepting "hallucination" as the diagnosis skips the actual root cause and lets the real failure mode go uninstrumented.
"We have logging."	"Logging and tracing are different things. Tracing reconstructs the full call graph across a multi-step agent run. Do you have span-level visibility into each tool call?"	An agent that makes five tool calls and fails on the third one looks like a single event in a log. Tracing shows you the sequence. Without it, post-mortems are guesswork.
"We'll add observability once we're in production."	"Observability is harder to retrofit than to instrument from the start. What's the incident response plan for the first production failure before the instrumentation is in place?"	Retrofitting observability into a running production system requires downtime or shadow deployment. The first incident will happen before the retrofit is complete.
"Our evals are passing."	"What's in your golden dataset, and when was it last updated? Does it include realistic tool error responses?"	A golden dataset that doesn't include tool failure conditions won't catch the failure modes that dominate production incidents. Passing evals against a stale dataset is a false signal.
"We're using LLM-as-judge."	"What model are you using as the judge, and what criteria is it scoring against? Have you validated the judge's scores against human labels on your specific task type?"	LLM-as-judge is widely used and useful. It's also a source of systematic bias if the judge model has opinions about the task domain. Buyers should know whether they've calibrated it.
"What's your observability story?" (in procurement)	"What telemetry does your platform emit by default, and does it cover tool-call inputs and outputs, not just model invocations?"	Vendors often instrument model calls well and tool calls poorly. The gap is where most incidents live. Getting a specific answer to this question separates platforms with real observability from platforms with dashboards.
"We had an incident last month."	"What did your tracing tell you about where in the execution the failure originated? Was it a model output, a tool call, or a loop condition?"	In a post-mortem conversation, this question establishes whether the buyer has the instrumentation to do root cause analysis or is working from symptoms. The answer tells you what they need.
"We need a better model."	"Before changing the model, what does your eval suite show about where quality is degrading? And what does your production tracing show about whether the degradation is in model outputs or in tool-call behavior?"	"Better model" is the default diagnosis when instrumentation is absent. It's often wrong and expensive. The question redirects toward evidence.
"Drift detection" (undefined)	"Drift detection means your observability system alerts you when production behavior deviates from a baseline — output distributions, tool-call patterns, latency profiles. It's how you catch model updates or API changes that affect behavior without triggering an obvious error."	The term is used loosely. Defining it precisely in conversation establishes that you understand the discipline and gives the buyer a concrete capability to ask their vendor about.

Callout: The Access Review Analogy — Where It Holds and Where It Breaks

If you work in identity and access management (IDAM), you have a ready analogy for this two-discipline structure: periodic access certification reviews and real-time anomaly detection are complementary, not competing. You run access reviews on a schedule to certify that entitlements are still appropriate. You run continuous monitoring to catch anomalous behavior between reviews. Neither substitutes for the other. An access review won't catch a compromised credential being used at 2 AM. Continuous monitoring won't tell you that a user accumulated excessive entitlements gradually over eighteen months.

Offline evals and online observability map onto this structure cleanly. Evals are the scheduled audit. Observability is the continuous monitor. For an IDAM-fluent audience, the complementary posture is immediately intuitive.

The analogy has a limit, though. In access certification, the ground truth is relatively stable: does this person still need this role? The answer changes slowly, and a periodic review can reasonably approximate the current state. In AI evals, ground truth is contested and the system's behavior space is vast. A golden dataset is not a complete picture of what the system should do — it's a sample of representative cases. The gap between the sample and the full behavior space is where production incidents live. A buyer who thinks their eval suite is like an access review — a periodic check against a known-good state — will underestimate how quickly model behavior can drift between runs and how much of the failure space their dataset doesn't cover.

In practice, the access review framing gets the two-discipline structure across quickly. The place to extend it is the ground truth question. "In access reviews, you know what correct looks like. In AI evals, part of the work is defining and maintaining that definition as the system evolves." That's where the analogy earns its keep and where the real conversation starts.

Tooling references in this piece — LangSmith, Braintrust, Langfuse, Arize — are illustrative examples of the current landscape, not recommendations. The tooling layer in AI observability and evals is moving faster than any publication cycle. Treat vendor capabilities as a starting point for your own verification, not as a current-state assessment.