"How Do You Know It's Working?" Has Two Answers

By Carey Whitten, May 5, 2026

When that question lands in a procurement conversation — and it will, usually about thirty minutes in — the answer you need depends entirely on who's asking.

A data science lead asking it wants to know about your evaluation methodology: golden datasets, regression suites, how you detect when a model update degrades something that worked last week. A platform engineering lead asking the same question wants to know about your runtime telemetry: what you're tracing, how you're alerting on tool-call failures, what happens when an agent loop doesn't terminate cleanly.

Same question. Two different disciplines. The vocabulary that earns you credibility in both conversations: offline evals and online observability. They're not interchangeable, they don't have the same owner, and they don't come out of the same budget line. Conflating them in front of either audience signals that you haven't been in production with this.

Offline Evals

What it is: A pre-deployment quality gate that runs agent behavior against known inputs and expected outputs before any version ships.

What it does: The team builds a dataset of representative inputs — user queries, expected tool-call sequences, edge cases, adversarial prompts — and defines what "correct" looks like for each entry. Before every deployment, the agent runs against that dataset. Outputs get scored: sometimes by deterministic rules (did the agent invoke the right tool with the right parameters?), sometimes by an LLM-as-judge (does this response meet quality criteria on a defined rubric?). The score tells you whether the new version is better or worse than the previous one on the dimensions you've defined. If it regresses past a threshold, the deployment doesn't ship.
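As a rough illustration of that gate, here is a minimal sketch that uses only the deterministic scoring path. Every name in it (GoldenCase, toy_agent, the tool names, the 0.02 regression allowance, the 0.90 baseline) is a hypothetical chosen for the example, not taken from any particular eval product. An LLM-as-judge scorer would slot in where score_case sits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One golden-dataset entry: an input and the tool call we expect."""
    query: str
    expected_tool: str
    expected_params: dict


def score_case(case: GoldenCase, actual_tool: str, actual_params: dict) -> float:
    """Deterministic rule: 1.0 only if tool name and parameters both match."""
    matched = (actual_tool == case.expected_tool
               and actual_params == case.expected_params)
    return 1.0 if matched else 0.0


def run_suite(cases, agent) -> float:
    """Run the agent over every case and return the mean score."""
    scores = []
    for case in cases:
        tool, params = agent(case.query)
        scores.append(score_case(case, tool, params))
    return sum(scores) / len(scores)


def gate(candidate_score: float, baseline_score: float,
         max_regression: float = 0.02) -> bool:
    """Allow the deployment unless the score drops by more than the threshold."""
    return candidate_score >= baseline_score - max_regression


# Hypothetical stand-in for a real agent: maps a query to (tool, params).
def toy_agent(query: str):
    if "refund" in query:
        return "issue_refund", {"order_id": "A1"}
    return "search_kb", {"q": query}


GOLDEN = [
    GoldenCase("refund order A1", "issue_refund", {"order_id": "A1"}),
    GoldenCase("reset password", "search_kb", {"q": "reset password"}),
    GoldenCase("cancel order A1", "cancel_order", {"order_id": "A1"}),
]

candidate = run_suite(GOLDEN, toy_agent)
print(f"score={candidate:.2f} ship={gate(candidate, baseline_score=0.90)}")
# prints: score=0.67 ship=False
```

The toy agent misses the cancellation case, so the candidate scores two out of three and the gate blocks the release against the assumed 0.90 baseline.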

Who's behind it: The ML or data science team. They own the golden dataset, they define the scoring rubric, they set the regression threshold. In larger organizations, this is a dedicated ML platform function with its own tooling and pipeline. In smaller ones, it's whoever built the agent, which is a risk in itself.

What makes it distinct: Evals are retrospective by design. They test the agent against scenarios the team has already thought of. The golden dataset is a snapshot of the team's understanding of the problem at the time they built it — which is both its strength and its structural ceiling. Reproducible, auditable, comparable across versions. Also, by definition, unable to catch failure modes nobody anticipated when the dataset was assembled. Representative tooling in this space includes Braintrust, which has strong dataset management and version comparison workflows, and LangSmith, which covers eval pipelines alongside tracing. Langfuse includes eval capabilities as part of a broader observability platform.

Online Observability

What it is: Runtime monitoring of agent behavior in production — tracing execution paths, capturing tool-call telemetry, and detecting behavioral drift after deployment.

What it does: Every agent execution generates a trace: a structured record of what the agent did, in what order, with what inputs and outputs at each step. Tool calls are logged with their parameters, return values, latency, and error codes. Token consumption is tracked per execution. Alerts fire when error rates spike, when a loop exceeds a count threshold, when a tool call fails repeatedly against the same endpoint, when response latency climbs outside normal range. Drift detection compares current behavior distributions against a baseline — if the agent's tool-call patterns shift significantly without a deployment event, that's a signal worth investigating.
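To make the trace and alert ideas concrete, here is a small sketch under stated assumptions. The ToolCall and Trace shapes, the endpoint paths, and the threshold defaults (20 steps, 3 consecutive failures, 5000 ms) are all illustrative inventions. A production setup would more likely emit spans through a tracing library than keep records in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step in a trace: which tool ran, how long it took, how it ended."""
    tool: str
    endpoint: str
    latency_ms: float
    status: int  # HTTP-style status code; >= 400 counts as a failure


@dataclass
class Trace:
    """Structured record of a single agent execution."""
    execution_id: str
    calls: list = field(default_factory=list)
    tokens: int = 0

    def record(self, call: ToolCall, tokens: int = 0) -> None:
        self.calls.append(call)
        self.tokens += tokens


def check_alerts(trace: Trace, max_steps: int = 20,
                 max_repeat_failures: int = 3,
                 max_latency_ms: float = 5000.0) -> list:
    """Return alert strings for loop length, repeated failures, and latency."""
    alerts = []
    if len(trace.calls) > max_steps:
        alerts.append(f"loop: {len(trace.calls)} steps exceeds {max_steps}")

    # Count consecutive failures against the same endpoint.
    streak, last_endpoint = 0, None
    flagged = set()
    for call in trace.calls:
        if call.status >= 400:
            streak = streak + 1 if call.endpoint == last_endpoint else 1
            last_endpoint = call.endpoint
            if streak >= max_repeat_failures and call.endpoint not in flagged:
                flagged.add(call.endpoint)
                alerts.append(f"repeated failure: {call.endpoint}")
        else:
            streak, last_endpoint = 0, None

    slow = [c for c in trace.calls if c.latency_ms > max_latency_ms]
    if slow:
        alerts.append(f"latency: {len(slow)} call(s) over {max_latency_ms:.0f} ms")
    return alerts


trace = Trace("exec-001")
trace.record(ToolCall("crm_lookup", "/crm/contacts", 120.0, 200), tokens=350)
for _ in range(4):
    trace.record(ToolCall("crm_update", "/crm/contacts/42", 90.0, 429))
print(check_alerts(trace))
# prints: ['repeated failure: /crm/contacts/42']
```

The alert fires on the third consecutive 429 against the same endpoint and is reported once per endpoint, which keeps a long retry storm from flooding the alert channel.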

Who's behind it: Platform engineering and DevOps. This is infrastructure work, not data science. The team that owns your logging pipeline, your alerting stack, your SLOs — they own observability. In practice, there's a gap here that shows up repeatedly in enterprise AI deployments: the team that built the agent assumes someone else will instrument it, and the platform team assumes the AI team handled it. Nobody handled it.

What makes it distinct: Observability is prospective. It catches things nobody knew to test for. It's also the only discipline that can catch the failure mode that actually dominates production incidents — more on that in a moment. Representative tooling includes Arize, which has the most mature MLOps lineage in this space with drift detection built for production scale, and Langfuse, which is popular in self-hosted enterprise contexts where data residency requirements rule out SaaS-first options.
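One simple way to picture the drift detection described in this section is to compare the mix of tool calls in a recent window against a baseline window. The sketch below uses total variation distance for that comparison. The metric choice, the 0.2 threshold, and the tool names are assumptions made for illustration, and none of this reflects how any named vendor implements drift detection.

```python
from collections import Counter


def distribution(tool_calls):
    """Turn a list of tool names into a {tool: share of calls} mapping."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in counts.items()}


def total_variation(baseline, current):
    """Total variation distance: 0.0 for identical mixes, 1.0 for disjoint ones."""
    tools = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(t, 0.0) - current.get(t, 0.0))
                     for t in tools)


def drifted(baseline_calls, current_calls, threshold=0.2,
            deployed_since_baseline=False):
    """Flag a shift in the tool-call mix that no deployment explains."""
    distance = total_variation(distribution(baseline_calls),
                               distribution(current_calls))
    return distance > threshold and not deployed_since_baseline


# Hypothetical windows of logged tool names.
baseline = ["search_kb"] * 70 + ["crm_lookup"] * 25 + ["issue_refund"] * 5
this_week = ["search_kb"] * 40 + ["crm_lookup"] * 25 + ["issue_refund"] * 35

print(drifted(baseline, this_week))
# prints: True
```

In this toy data the refund tool jumps from 5% to 35% of calls with no release in between, which is the kind of shift worth a look. Passing deployed_since_baseline=True suppresses the flag, since a behavior change right after a deployment is expected and belongs to the eval gate instead.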

Comparing the Two

When each applies

Offline evals apply before deployment — they're the gate the agent has to pass before it touches production traffic. The eval suite runs in CI/CD, the same way unit tests run. Online observability applies after deployment, continuously, for as long as the agent is running. There's no "done" state for observability; it's a monitoring discipline, not a milestone.

A team that has excellent evals but no observability has a quality gate with nothing behind it. They know the agent passed its tests. They don't know what it's doing at 2am on a Tuesday.

Who owns it

Evals are owned by whoever is responsible for model quality — typically ML engineering or data science. Observability is owned by whoever is responsible for production reliability — typically platform engineering or DevOps. These are often different teams with different reporting structures, different tooling preferences, and different definitions of "working." When a buyer asks "who owns AI quality?" and the answer is a single team, that's a signal that one of these disciplines is probably being neglected.

What it costs

Eval infrastructure costs are primarily human: building and maintaining the golden dataset is ongoing work, not a one-time investment. The dataset degrades as the use case evolves, and keeping it current requires domain expertise. LLM-as-judge scoring adds inference costs that scale with dataset size and eval frequency. Observability infrastructure costs are primarily operational: storage for traces (which can be voluminous at scale), alerting infrastructure, and the engineering time to instrument agent code and tune alert thresholds. A rough benchmark from enterprise deployments: teams that instrument properly typically budget 15-20% of their agent development effort for observability setup, and they tend to underestimate ongoing maintenance by a similar margin.

What it catches

The dominant concern in AI procurement conversations is model errors — hallucinations, factual inaccuracies, responses that don't meet quality standards. Evals are designed to catch exactly this. They're the right tool for that concern.

Model errors are not, however, the dominant failure mode in production. Across enterprise agent deployments tracked through 2025, tool-call failures and runaway loops account for roughly 70-80% of production incidents. The agent called an API that returned an unexpected schema. The agent retried a failed tool call without a backoff limit. The agent entered a reasoning loop that didn't have a termination condition. The CRM API started returning 429s at 2am and the agent retried 400 times before anyone noticed.

None of those incidents show up in an eval suite, because none of them are model errors. The model did exactly what it was trained to do. The infrastructure around it failed, or the tool it was calling failed, or the loop logic was wrong. Catching these requires observability — traces that show retry counts, tool-call error rates, loop depths, latency anomalies.
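The 400-retry incident above comes down to a missing retry budget. A minimal sketch of a bounded retry with exponential backoff follows. The tool callable, its (status, body) return shape, and the defaults of 5 attempts and a 0.5 second starting delay are hypothetical choices for the example.

```python
import time


class ToolCallFailed(Exception):
    """Raised when a tool call still fails after the retry budget is spent."""


def call_with_backoff(tool, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call `tool` until it succeeds, with a hard cap on attempts.

    `tool` is a hypothetical callable returning (status, body). Retries happen
    only on 429 and 5xx, doubling the delay each time up to `max_delay`.
    Returns (body, attempts_used) so the attempt count can go into a trace.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = tool()
        if status < 400:
            return body, attempt
        retryable = status == 429 or status >= 500
        if not retryable or attempt == max_attempts:
            raise ToolCallFailed(f"status {status} after {attempt} attempt(s)")
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Simulated tool: rate-limited twice, then succeeds. The sleep function is
# swapped for a recorder so the example runs instantly.
responses = iter([(429, None), (429, None), (200, {"id": 42})])
waits = []
body, attempts = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(body, attempts, waits)
# prints: {'id': 42} 3 [0.5, 1.0]
```

Returning the attempt count alongside the result is a deliberate choice: it gives the observability layer a retry number to record per tool call, which is the signal that would have surfaced the 2am incident early.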

Buyers worried about their agent "going rogue" will find the actual failure modes anticlimactic. Mundane failures at 2am in a production system are their own category of problem, and they require boring infrastructure work that most AI teams aren't doing yet.

Field Language Guide

Scenario	Do say	Why it matters
Buyer asks "how do we know the AI is working?"	"That question has two answers depending on where in the lifecycle you are — evals before deployment, observability after. They're different disciplines with different owners."	Signals fluency before the buyer has to explain the distinction themselves
Buyer is focused on hallucination risk	"Model errors are real, but in production, most incidents trace to tool-call failures and runaway loops — that's where your observability investment needs to go."	Reframes the risk conversation toward the actual failure mode
Budget conversation about AI quality	"Evals are a data science budget item; observability is a platform engineering budget item. If they're coming out of the same line, one of them is probably underfunded."	Helps buyers structure the right internal ask
You're probing a vendor's observability story	"What are you tracing at the tool-call level, and what's your alerting logic for retry loops?"	Specific enough to distinguish vendors who have thought about this from vendors who haven't
Buyer mentions golden datasets	"A golden dataset is your eval baseline — it's a snapshot of the scenarios you've anticipated. It's strong for regression testing; it can't catch what you didn't anticipate."	Acknowledges the tool's value while naming its limit
Buyer asks about LLM-as-judge	"LLM-as-judge is how you score outputs that don't have a deterministic right answer — you use a model to evaluate quality on a rubric. It adds inference cost to your eval pipeline, but it's the only way to score open-ended responses at scale."	Gives buyers the tradeoff, not just the definition
Buyer asks about drift detection	"Drift detection compares your agent's current behavior distribution against a baseline — if tool-call patterns shift without a deployment event, that's a signal. It's the observability equivalent of anomaly detection."	Maps to a concept buyers already hold
Buyer asks who should own AI observability	"Platform engineering, not the AI team. It's infrastructure work — logging, alerting, SLOs. The AI team builds the agent; the platform team instruments it."	Helps buyers assign ownership before there's an incident
Buyer mentions LangSmith, Langfuse, or Arize	"Those are representative of the current tooling landscape — LangSmith and Braintrust are strong on the eval side, Arize and Langfuse on observability. The tooling layer is moving fast; the patterns underneath it are more stable."	Positions you as someone who tracks the space without overclaiming
Buyer asks what a good observability answer looks like from a vendor	"Trace-level visibility into every tool call, configurable alerting on error rates and loop counts, and some form of drift detection. If the answer is 'we log to CloudWatch,' that's not an observability story."	Gives the buyer a concrete evaluation criterion

Okta Concept Mapping

The IDAM analog: The eval/observability split maps to a pattern that identity and access management (IDAM) and security teams know well: pre-deployment security review versus continuous monitoring. Before an agency grants an Authority to Operate (ATO), a security assessment runs: penetration testing, control validation, architecture review. A structured gate before the system touches production. After the ATO, continuous monitoring takes over: SIEM, UEBA, anomaly detection. Same timing logic, same basic purpose.

Where it holds: The ownership split is similar, too. Security review is typically owned by a security assessment team; continuous monitoring is owned by a SOC. Neither team owns both, and the handoff between them is a known governance gap. In AI deployments, the team that runs evals and the team that runs observability are often different, and the handoff is often implicit.

Where it breaks: In continuous monitoring for IDAM, you're watching known principals access known resources. The principal set is relatively stable, the access patterns are enumerable, and anomaly detection can be tuned against a well-understood baseline. In an agentic system, the agent can take novel action sequences that weren't anticipated at design time — the tool-call space is large enough that you can't enumerate expected behavior the way you can enumerate expected access patterns. Agent observability has to be more tolerant of novelty and more focused on structural signals (loop counts, error rates, latency spikes) than behavioral pattern matching. The concept transfers; the tuning logic doesn't.

What this means in a buyer conversation: When a CAIO asks how agent observability fits into their continuous monitoring program, the honest answer is that it's a new category that borrows the concept from SIEM but requires different tooling and different alert logic. Okta's role in this architecture is upstream: the identity governance layer — what the agent is authorized to call, what scopes it holds, what its authentication events look like — generates audit data that feeds the observability layer. That's a meaningful input, not a substitute for trace-level instrumentation of the agent itself.