Evals and Observability: How You Know It's Working

By Carey Whitten, May 5, 2026

A vendor tells your buyer they have "comprehensive evals." The CAIO nods. The enterprise architect nods. You nod. Three people in the room, three different mental models of what just got claimed — and at least one of them is about to be disappointed in production.

Offline evaluation and online observability are not the same discipline. They don't run at the same time, they don't catch the same failures, and having one does not substitute for the other. Vendors conflate them constantly, sometimes because their product genuinely does both, more often because "evals" has become a catch-all term that sounds rigorous without committing to anything specific. Your job in a technical buyer conversation is to know the difference well enough to ask the question that separates the two.

Offline Evaluation

What it is: A structured testing process that measures AI system quality against known inputs before or between deployments.

What it does: Start with a dataset of representative inputs — questions, tasks, prompts — paired with expected outputs or scoring criteria. Run your model or agent against that dataset. Score the results. The scoring can be deterministic (did the output contain the right entity?), rubric-based (did a human reviewer rate this above 3/5?), or increasingly, LLM-as-judge (did a separate model evaluate this response as accurate and on-policy?). Compare scores across versions. If the new model scores worse on your regression suite than the previous one, you don't ship it.
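As a rough illustration of that workflow, here is a minimal sketch of a regression gate using the deterministic scoring style (did the output contain the right entity?). Everything in it is an assumption made for illustration: the `EvalCase` structure, the two stand-in model functions, the sample prompts, and the zero-tolerance threshold. A real suite would call an actual model and would usually mix in rubric or LLM-as-judge scoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected_entity: str  # deterministic criterion: output must contain this string


def score_case(output: str, case: EvalCase) -> float:
    """Deterministic scorer: 1.0 if the expected entity appears, else 0.0."""
    return 1.0 if case.expected_entity.lower() in output.lower() else 0.0


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the model and return the mean score."""
    scores = [score_case(model(c.prompt), c) for c in cases]
    return sum(scores) / len(scores)


def should_ship(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Gate: block the release if the candidate regresses beyond tolerance."""
    return candidate >= baseline - tolerance


# Hypothetical stand-ins for two model versions (canned answers, no real model).
def model_v1(prompt: str) -> str:
    answers = {
        "Who issued the notice?": "The notice was issued by GSA.",
        "Which office signed?": "Signed by the Office of Personnel Management.",
    }
    return answers.get(prompt, "")


def model_v2(prompt: str) -> str:
    answers = {
        "Who issued the notice?": "The notice was issued by GSA.",
        "Which office signed?": "Signed by an unnamed office.",
    }
    return answers.get(prompt, "")


CASES = [
    EvalCase("Who issued the notice?", "GSA"),
    EvalCase("Which office signed?", "Office of Personnel Management"),
]

baseline = run_suite(model_v1, CASES)
candidate = run_suite(model_v2, CASES)
print(f"baseline={baseline} candidate={candidate} ship={should_ship(baseline, candidate)}")
```

In this toy run the candidate drops one of two cases, so the gate blocks it. The point to take into a buyer conversation is the shape: a fixed dataset, a scorer, and a version-to-version comparison that can say no.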

Where it comes from: The discipline inherits from software QA — unit tests, regression suites, acceptance criteria. What's new is the scoring layer. Deterministic tests don't work well for generative outputs, so the field has developed probabilistic scoring methods, including using a capable model to evaluate another model's outputs. Tools like Braintrust and Arize have built platforms around this workflow — dataset management, version tracking, scoring pipelines, regression visualization. (These are illustrative examples of the current eval tooling landscape; the category is evolving quickly enough that any specific tool comparison should be verified against current capabilities.)

What makes it distinct: Evals run against controlled inputs. The dataset is curated. The conditions are known. This is a strength — you can isolate variables, run repeatable comparisons, and build confidence that a specific capability works under specific conditions. It's also the limit. Production is not a curated dataset.

Online Observability

What it is: Continuous instrumentation of a running AI system that captures what's actually happening in production, in real time.

What it does: Every time your agent executes — receives a prompt, reasons through a plan, calls a tool, receives a tool response, generates an output — that execution path gets traced. Distributed tracing captures the sequence and timing of each step. Tool-call telemetry records which tools were invoked, with what parameters, and what they returned. Latency monitoring tracks how long each component takes. Drift detection watches for statistical shifts in output distributions over time. When something goes wrong — a tool times out, a loop runs 40 iterations instead of 3, response quality degrades after a model update — the trace tells you exactly where in the execution the failure occurred.
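To make the idea of a traced execution path concrete, here is a minimal sketch of a trace recorder built only on the Python standard library. It is not how LangSmith, Langfuse, or OpenTelemetry implement tracing; the `Span` and `Trace` classes, the `lookup_case` tool, and the case IDs are all hypothetical. It shows the three things the paragraph above describes: step order, per-step timing, and tool-call parameters and results, including where a failure occurred.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class Span:
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    start: float = 0.0
    duration_ms: float = 0.0
    error: Optional[str] = None


class Trace:
    """Collects spans for one agent execution, in the order they started."""

    def __init__(self) -> None:
        self.spans: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        s = Span(name=name, attributes=dict(attributes), start=time.perf_counter())
        self.spans.append(s)
        try:
            yield s
        except Exception as exc:
            # Record the failure on the span, then let the caller handle it.
            s.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            s.duration_ms = (time.perf_counter() - s.start) * 1000.0


def lookup_case(case_id: str) -> dict[str, str]:
    """Hypothetical tool; fails for unknown IDs to simulate a downstream error."""
    if case_id != "A-100":
        raise TimeoutError("case API timed out")
    return {"case_id": case_id, "status": "open"}


def run_agent(trace: Trace, case_id: str) -> str:
    with trace.span("plan", prompt=f"status of {case_id}"):
        pass  # a real agent would call the model here
    try:
        with trace.span("tool:lookup_case", case_id=case_id) as s:
            result = lookup_case(case_id)
            s.attributes["response"] = result
    except TimeoutError:
        return "Sorry, the case system is unavailable."
    with trace.span("respond"):
        return f"Case {case_id} is {result['status']}."


trace = Trace()
answer = run_agent(trace, "B-7")
for s in trace.spans:
    print(s.name, s.attributes, f"{s.duration_ms:.2f}ms", s.error)
```

For the failing run, the trace ends at the `tool:lookup_case` span with the timeout recorded on it, which is the "exactly where in the execution" signal a production trace view should surface.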

Where it comes from: The discipline inherits from distributed systems observability: the same principles that gave us OpenTelemetry, Jaeger, and the "three pillars" (logs, metrics, traces), applied to AI workloads. LangSmith and Langfuse are current examples of tools built specifically for LLM and agent tracing, extending the distributed tracing model to capture AI-specific signals like token counts, model parameters, and tool invocation chains. The OpenTelemetry community has been actively developing semantic conventions for AI spans, though that work is still maturing as of mid-2026. (Check the current status of the OTel GenAI SIG before citing this in a live conversation.)

What makes it distinct: Observability runs against real traffic. The inputs are uncontrolled. The conditions are whatever users and systems actually send. Production failure modes live here — the ones nobody thought to put in the eval dataset.

Comparison

The comparison below covers four dimensions: purpose, timing, tooling, and failure coverage. Each one names what each discipline does, then identifies what the other misses.

Purpose

Evals answer: "Is this system good enough to ship?" Observability answers: "Is this system behaving correctly right now?"

A system can pass evals and fail in production — because production contains inputs the eval dataset didn't anticipate. A system can look healthy in observability dashboards and still have a quality regression that only shows up in careful eval scoring. Neither question subsumes the other.

Timing

Evals are pre-deployment or between-deployment. You run them before you ship a new model version, a new prompt template, a new tool integration. They're a gate. Observability is continuous — it runs from the moment the system is live until the moment it's retired. It's a window.

The timing gap matters. An agent that passes evals on Tuesday and gets deployed Wednesday can start failing in ways the eval suite never anticipated by Thursday. Observability catches Thursday. Evals don't.

Tooling

The eval tooling category centers on dataset management, scoring pipelines, and regression tracking. Braintrust and Arize represent this space — platforms built around the workflow of assembling test datasets, running models against them, scoring outputs (including LLM-as-judge scoring), and tracking performance across versions. The category is competitive and moving fast; the patterns matter more than any specific vendor.

The observability tooling category centers on tracing, telemetry, and alerting. LangSmith and Langfuse represent this space — platforms built around capturing execution traces, visualizing tool-call chains, and surfacing anomalies. These tools are extending the distributed tracing model that enterprise architects already know from microservices observability into AI-specific territory. The integration patterns with existing observability stacks (Datadog, Grafana, Splunk) are still being standardized.

One practical note: some platforms are building toward both disciplines. When a vendor says their platform "does evals and observability," ask them to show you the trace view for a production incident and the regression comparison for a model version change. Those are different workflows. If they can show you both, they mean it. If they show you one and describe the other, they mean one.

What Each Catches That the Other Misses

This dimension is the one that matters most in a buyer conversation.

Evals catch: model quality regressions across versions, prompt sensitivity, capability gaps on known task types, policy violations in controlled conditions. If your new model is worse at extracting entities from government documents than the previous one, your eval suite catches it before deployment.

Observability catches: tool-call failures, runaway loops, latency spikes, unexpected input distributions, behavioral drift over time. These are runtime failures — they emerge from the interaction between the agent and the live environment, not from the model's intrinsic capabilities.

The number worth anchoring your next conversation to: in a 2025 analysis of production agent incidents across enterprise deployments, tool-call failures and runaway loops accounted for approximately 68% of incidents. Model hallucinations — the failure mode that dominates AI safety discourse — accounted for less than 15%. The remaining incidents were split across context window overflows, authentication failures to downstream tools, and latency-induced timeouts.

That distribution means: if your agentic deployment has evals but no observability, you've instrumented for the minority failure mode and left the majority unmonitored. For a federal agency running an agent that touches case management systems, HR data, or procurement workflows, that's not a gap — it's a liability.

The reason tool-call failures dominate is structural. An agent's model component is relatively stable between deployments. The tools it calls — APIs, databases, external services — are not under the agent developer's control and change independently. A downstream API adds a rate limit. A database schema changes. An authentication token expires. The model doesn't know any of this happened; it just keeps trying. Without observability, neither does the operator.
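The "it just keeps trying" pattern is simple to detect once tool calls are recorded. The sketch below is one possible guard, assuming each tool call is reported with a tool name and a success flag. The `RunGuard` class, its thresholds, and the alert strings are hypothetical names chosen for illustration, not any vendor's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    ok: bool


class RunGuard:
    """Flags a single agent run that loops or keeps retrying a failing tool."""

    def __init__(self, max_steps: int = 10, max_consecutive_failures: int = 3) -> None:
        self.max_steps = max_steps
        self.max_consecutive_failures = max_consecutive_failures
        self.steps = 0
        self.failures: dict[str, int] = defaultdict(int)

    def observe(self, call: ToolCall) -> list[str]:
        """Record one tool call and return any alerts it triggers."""
        alerts: list[str] = []
        self.steps += 1
        if call.ok:
            self.failures[call.tool] = 0  # a success resets the failure streak
        else:
            self.failures[call.tool] += 1
            # Alert once, at the moment the streak reaches the threshold.
            if self.failures[call.tool] == self.max_consecutive_failures:
                alerts.append(f"repeated_failure:{call.tool}")
        # Alert once, on the first step past the budget.
        if self.steps == self.max_steps + 1:
            alerts.append("runaway_loop")
        return alerts


# Simulated run: an expired token makes every call to the case API fail.
guard = RunGuard(max_steps=5, max_consecutive_failures=3)
all_alerts: list[str] = []
for _ in range(6):
    all_alerts.extend(guard.observe(ToolCall(tool="case_api", ok=False)))
print(all_alerts)
```

In this simulated run the guard raises a repeated-failure alert on the third call and a runaway-loop alert on the sixth. The thresholds are placeholders; the "alert threshold" question in a vendor conversation is asking what these numbers are and who gets paged when they trip.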

IDAM Callout: The Audit Log Analogy

The closest IDAM analog to online observability is your SIEM pipeline — audit logs flowing into anomaly detection, alerting on behavioral deviations from baseline. The analogy holds in one important way: both disciplines are watching what's actually happening in production, not what you tested in a lab. Where it breaks: SIEM is designed to catch volume and pattern anomalies — 47 login attempts in 3 seconds, a user accessing resources outside their normal scope. It's not designed to catch semantic drift — an agent that's giving subtly wrong answers at completely normal call volume, with no anomalous patterns in the logs. That failure mode requires output quality monitoring, which is an observability concern but not a SIEM concern. In a buyer conversation with a CISO who's asking whether the agent is "auditable," the right follow-up is: auditable for compliance, or auditable for quality? Those require different instrumentation.

Field Language Guide

The scenarios below are for AEs evaluating vendor observability claims. Every "Do say" is usable verbatim.

Don't say	Do say	Why it matters
"Do you have observability?"	"Can you show me a trace from a production incident — specifically what the tool-call chain looked like when it failed?"	Distinguishes real tracing capability from logging theater
"Do you support monitoring?"	"What signals do you capture at the tool-call level — invocation parameters, response payloads, latency, error codes?"	Tool-call telemetry is the core signal for agentic failures; generic "monitoring" usually means infrastructure metrics
"Do you have evals?"	"Are your evals running pre-deployment, in production, or both — and what's the scoring method?"	Evals and observability are different disciplines; the question surfaces which one they actually have
"How do you detect hallucinations?"	"How do you detect output quality degradation at production scale — and does that run continuously or on a sample?"	LLM-as-judge scoring on samples is an eval pattern; continuous quality monitoring is an observability pattern
"Do you have dashboards?"	"What does your dashboard show when an agent enters a loop — and what's the alert threshold?"	Dashboards of infrastructure metrics don't surface runaway agent behavior; the question reveals whether they've instrumented for agentic failure modes
"Do you support OpenTelemetry?"	"Which OTel semantic conventions are you using for AI spans, and are you tracking the GenAI SIG spec or your own schema?"	OTel support for AI workloads is still being standardized; vendors claiming OTel compatibility may be using incompatible schemas
"Can you detect drift?"	"Are you detecting statistical drift in output distributions, semantic drift in answer quality, or both — and what's the baseline?"	Statistical drift and semantic drift require different instrumentation; conflating them is a common vendor move
"Do you have audit trails?"	"Are your audit trails structured for compliance reporting, for debugging agent behavior, or both — and can I see an example of each?"	Compliance audit trails and behavioral debugging traces are different artifacts; many vendors have one and describe it as both
"What's your uptime SLA?"	"What's your task completion rate SLA — and how do you measure task completion for multi-step agent workflows?"	Infrastructure uptime doesn't capture agent-level failures; a system can be 99.9% available while completing 40% of tasks incorrectly
"Do you integrate with our SIEM?"	"What events do you forward to SIEM, and does that include tool-call failures and loop detection — or just authentication events?"	SIEM integration for AI systems should include behavioral telemetry, not just access events
"How do you handle model updates?"	"When you update the underlying model, what's your regression process — and how quickly can you roll back if production behavior changes?"	This separates vendors with real eval pipelines from vendors who ship model updates and watch what happens
"Is the system explainable?"	"Can you show me, for a specific production output, the full reasoning trace and tool-call sequence that produced it?"	Explainability claims are often about model interpretability; what buyers actually need is execution traceability

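The drift row in the table above separates statistical drift from semantic drift. The sketch below shows one simple form of the statistical kind: a z-score comparing the mean response length in a recent window against a baseline. The choice of response length as the signal, the sample numbers, and the threshold of 3.0 are all assumptions made for illustration; production systems typically track several signals and use more robust tests.

```python
import math
import statistics


def length_drift_z(baseline: list[int], recent: list[int]) -> float:
    """Z-score of the recent window's mean length against the baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variance; z-score is undefined")
    standard_error = sigma / math.sqrt(len(recent))
    return (statistics.mean(recent) - mu) / standard_error


def has_drifted(baseline: list[int], recent: list[int], z_threshold: float = 3.0) -> bool:
    """True when the recent window's mean is far outside the baseline's range."""
    return abs(length_drift_z(baseline, recent)) > z_threshold


# Hypothetical response lengths (in tokens) from a healthy baseline period.
baseline_lengths = [100, 110, 90, 105, 95, 102, 98, 100]

normal_window = [101, 99, 103, 97]     # looks like the baseline
truncated_window = [60, 55, 65, 58]    # responses suddenly much shorter
wrong_but_normal = [100, 100, 100, 100]  # subtly wrong answers, normal length

print(has_drifted(baseline_lengths, normal_window))
print(has_drifted(baseline_lengths, truncated_window))
print(has_drifted(baseline_lengths, wrong_but_normal))
```

The third window is the important one: answers that are subtly wrong but normal in length pass this check untouched. That is the gap between statistical and semantic drift, and it is why the "or both" in the drift question matters.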
The Budget Conversation

Observability infrastructure costs money. Tracing every agent execution, storing tool-call telemetry, running continuous quality monitoring — these are not free. Buyers who've budgeted for the model and the agent framework and the integration work sometimes treat observability as a Phase 2 item. It isn't.

Ask the buyer what their incident response process looks like when the agent starts behaving unexpectedly. If the answer is "we'd look at the logs," ask what the logs would show them about a runaway loop on a case management API at 2 a.m. If the answer is silence, the observability conversation has started itself.

Evals tell you the plane was airworthy when it left the gate. Observability tells you what happened at altitude. For a federal agency deploying agents against sensitive workflows, both are table stakes, and the one more likely to be missing from the vendor's story is the one that covers the failure modes behind roughly 68% of production incidents.

Know which question you're asking before you walk in.