Offline evals and online observability are the two disciplines that determine whether an AI agent system is working. You'll encounter both in any serious conversation with a CAIO or CISO about agent reliability. Offline evals test agent behavior before deployment against curated datasets. Online observability watches what actually happens at runtime. The distinction matters because most organizations are budgeting for the wrong one. They assume agent failures are model problems, and the MAST study out of UC Berkeley puts a number on how far off that assumption is. Seventy-six percent of system-addressable failures are tool-call breakdowns and coordination problems. Plumbing. Owning that number and knowing which discipline catches which failure type is what buys you credibility when the conversation turns to instrumentation budgets.
Offline Evals
What it is: Testing agent behavior against curated datasets and scoring criteria before the agent reaches production.
What it does: You assemble inputs with known-good outputs, a "golden dataset," and run your agent against them. Did it select the right tool? Produce the correct output? Stay within its task boundaries? You score the results with code (deterministic checks), with humans (annotation queues), or with another LLM acting as a judge. The best implementations run in CI/CD pipelines, gating deployments the way unit tests gate code merges. Score drops below threshold, deployment doesn't ship.
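To make the gating idea concrete, here is a minimal sketch of a golden-dataset check using only deterministic scoring. Everything in it is an illustrative assumption, not any vendor's API: the `GoldenCase` shape, the half-and-half scoring rule, the 0.9 threshold, and the `stub_agent` stand-in are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    """One curated input with its known-good expectations (hypothetical shape)."""
    prompt: str
    expected_tool: str
    expected_output: str


def score_case(case: GoldenCase, agent: Callable[[str], dict]) -> float:
    """Deterministic check: half credit for the right tool, half for the right output."""
    result = agent(case.prompt)
    tool_ok = result.get("tool") == case.expected_tool
    output_ok = result.get("output") == case.expected_output
    return 0.5 * float(tool_ok) + 0.5 * float(output_ok)


def run_gate(cases: list, agent: Callable[[str], dict], threshold: float = 0.9) -> tuple:
    """Return (mean score, passed). A CI job would fail the build when passed is False."""
    scores = [score_case(case, agent) for case in cases]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold


def stub_agent(prompt: str) -> dict:
    """Stand-in for a real agent call; assumed to return the tool used and the output."""
    if "weather" in prompt:
        return {"tool": "get_weather", "output": "sunny"}
    return {"tool": "search", "output": "unknown"}


GOLDEN = [
    GoldenCase("weather in Paris", "get_weather", "sunny"),
    GoldenCase("reset my password", "reset_password", "done"),
]

mean_score, passed = run_gate(GOLDEN, stub_agent)
```

With this stub, the second case misses on both tool and output, so the mean falls below the threshold and the gate reports a failure. A real suite would swap in human or LLM-as-judge scorers where exact matching is too strict.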
Where it comes from: Traditional ML, where you'd test a model against a held-out test set before promoting it. The agent version extends this to multi-step workflows. You're evaluating an entire trajectory: reasoning, tool calls, and decisions across multiple steps. LangSmith, Braintrust, Arize Phoenix, and Langfuse all offer offline eval capabilities with varying degrees of CI/CD integration and scoring flexibility. The tooling landscape here shifts fast enough that specific capabilities are best verified at evaluation time, not taken on faith from a piece like this one.
What makes it distinct: Offline evals catch regressions you introduce. You changed a prompt, swapped a model, updated a tool description. Did anything break? The golden dataset is your regression suite. Think of it as the discipline that protects you from yourself. Its blind spot is everything that only happens in production: user inputs you didn't anticipate, tool APIs returning unexpected data, load patterns triggering rate limits, and the slow behavioral drift that accumulates between deployments. If you didn't think to test for it, the golden dataset can't catch it.
Online Observability
What it is: Instrumenting a running agent system to capture, trace, and score every step of every request in production.
What it does: Every agent run generates a trace: a structured record of each LLM call, each tool invocation with arguments and responses, each decision point, and the final output. Online observability captures these traces, lets you drill into individual runs, and surfaces aggregate patterns. Which tools fail most often. Where latency spikes. Which workflow paths burn the most tokens. Where agents loop. More sophisticated implementations run automated scorers against production traces in real time, flagging quality degradation before users report it. Some detect drift by identifying when agent behavior distributions shift even though no deployment changed.
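A rough sketch of what a trace looks like as data, and how two of the aggregate questions above fall out of it. The `Step` and `Trace` field names are assumptions chosen for illustration; real tracing tools define their own schemas.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded step in an agent run (hypothetical schema)."""
    kind: str          # "llm_call" or "tool_call"
    name: str          # model or tool name
    latency_ms: float
    tokens: int = 0
    ok: bool = True    # whether the step's output passed its checks


@dataclass
class Trace:
    """The structured record of a single agent run."""
    run_id: str
    steps: list = field(default_factory=list)


def tool_failure_counts(traces: list) -> Counter:
    """Which tools fail most often across all runs."""
    counts = Counter()
    for trace in traces:
        for step in trace.steps:
            if step.kind == "tool_call" and not step.ok:
                counts[step.name] += 1
    return counts


def tokens_per_run(traces: list) -> dict:
    """Which runs burn the most tokens."""
    return {t.run_id: sum(s.tokens for s in t.steps) for t in traces}


traces = [
    Trace("run-1", [
        Step("llm_call", "plan", 420.0, tokens=300),
        Step("tool_call", "crm_lookup", 95.0, ok=False),
    ]),
    Trace("run-2", [
        Step("tool_call", "crm_lookup", 110.0, ok=False),
        Step("tool_call", "send_email", 60.0),
        Step("llm_call", "summarize", 380.0, tokens=150),
    ]),
]

failures = tool_failure_counts(traces)
token_totals = tokens_per_run(traces)
```

The point of the structure is that aggregate questions become simple queries once every step is captured with its name, outcome, and cost.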
Where it comes from: Application performance monitoring and distributed tracing, adapted for AI. If the buyer's team has worked with Datadog or Splunk, the shape is familiar. The difference is what you're tracing. Traditional APM traces HTTP requests through microservices. Agent observability traces reasoning chains through LLM calls and tool invocations, where a 200 OK response tells you absolutely nothing about whether the output was correct. (A successful HTTP status code on a tool call is roughly as informative as a green light on a fax machine. It tells you the signal went somewhere.) All four tools named above offer online observability capabilities. Arize Phoenix is built on OpenTelemetry, an open standard for application telemetry that many infrastructure teams already have deployed, which matters if the buyer's team already has OTel instrumentation in place.
What makes it distinct: Online observability catches failures that only exist at runtime. The tool API returning empty JSON with a 200 status code. The agent entering a retry loop, calling the same endpoint with the same malformed arguments twenty times before a timeout kills the session. The coordination failure where Agent A withholds information Agent B needs, producing a subtly wrong output that passes every automated check. These failures don't exist in your golden dataset because you didn't know to test for them. That's the whole point.
Offline evals map to testing an access policy in staging before deploying it: defined expected behavior, test cases, verified outputs. Online observability maps to your SIEM and runtime audit logs: watching what actually happens when real users hit real systems. Every IDAM seller has lived the moment where a policy that tested perfectly in staging produced unexpected results in production because of a federation trust interaction nobody anticipated. Same gap, different layer.
Where Agents Actually Break
The MAST study (UC Berkeley Sky Computing Lab, published at NeurIPS 2025) analyzed 1,642 execution traces across seven open-source multi-agent frameworks, with expert human annotators categorizing 14 distinct failure modes into three buckets. Inter-annotator agreement was strong (κ = 0.88), which is a credibility signal worth noting.
The distribution:
- System Design Issues: 44.2%. Step repetition (agents looping, 13.2%), disobeying task specifications (15.7%), loss of conversation history (8.2%), agents unaware of their own termination conditions (6.2%).
- Inter-Agent Misalignment: 32.3%. Information withholding between agents (12.4%), conversation resets (9.1%), agents ignoring other agents' input (6.8%).
- Task Verification: 23.5%. Incomplete or missing verification of outputs (11.8%) and premature termination (7.4%).
Add the first two categories together (44.2% + 32.3%) and 76.5% of categorized failures are system-level problems. The agents could do the work. They called the wrong tool, lost track of shared state, looped without terminating, or failed to pass critical information to the next agent in the chain.
Two caveats worth stating precisely. First, MAST deliberately excluded basic model-capability failures (hallucination, instruction-following errors) from its taxonomy. The study isolated failures addressable through better system design. So 76.5% represents the distribution among system-addressable failures, not among all failures. Second, MAST studied open-source frameworks (MetaGPT, ChatDev, and others) running coding and math tasks, not enterprise production deployments. How these distributions translate to your buyer's environment is genuinely unknown. But the directional signal is consistent with what practitioners report in production, and it's the best empirical data available.
When a federated SSO integration breaks in production, the root cause is rarely "the IdP can't authenticate users." It's a metadata mismatch, a clock skew, a certificate rotation that didn't propagate, a SCIM provisioning race condition. The authentication capability works fine. The plumbing between capabilities is where things go sideways. Agent systems exhibit the same pattern at a different layer, and the MAST data quantifies it.
Which Discipline Catches Which Failure
The comparison that follows is organized by failure type. For each category of agent failure, one discipline is primary and the other is supplementary. This mapping is what helps a buyer allocate budget correctly, so it's worth walking through with some care.
Tool-call failures (wrong tool selected, malformed arguments, silent empty responses): Online observability is primary. These failures happen when real tool APIs return unexpected data, when argument schemas drift between the agent's expectations and the tool's actual interface, or when a tool returns HTTP 200 with empty content. Offline evals can catch known tool-call patterns if you've seen the failure before and added it to your test set. But the first occurrence is always caught at runtime, or not at all. One practitioner writeup on agent error handling (a developer blog; the author describes production agent experience but I can't independently verify the scale of their deployments) captures the pattern well: the most dangerous tool-call failures look like success, because nothing in the error handling checked output quality, only HTTP status codes.
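The "looks like success" failure is easy to picture in code. The sketch below checks a tool response's content as well as its status code; the function name, the problem strings, and the idea of a per-tool `required_fields` list are assumptions for illustration only.

```python
import json


def check_tool_response(status_code: int, body: str, required_fields: list) -> list:
    """Return a list of problems; an empty list means the response looks usable."""
    if status_code != 200:
        return [f"http status {status_code}"]
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not payload:
        # The silent failure: a 200 status wrapped around nothing.
        return ["empty payload despite 200"]
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]
    problems = []
    for name in required_fields:
        if payload.get(name) in (None, "", [], {}):
            problems.append(f"missing or empty field: {name}")
    return problems


silent_failure = check_tool_response(200, "{}", ["id"])
partial = check_tool_response(200, '{"id": 7, "name": ""}', ["id", "name"])
healthy = check_tool_response(200, '{"id": 7, "name": "Ada"}', ["id", "name"])
```

Error handling that stops at the status check would pass all three of these responses. Only the content check separates the healthy one from the other two.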
Coordination breakdowns (information withholding, state desynchronization, ignored handoffs): Online observability is primary. Multi-agent coordination failures are emergent. They arise from the interaction between agents, not from any single agent's behavior in isolation. An offline eval that tests Agent A alone and Agent B alone will miss the failure that occurs when Agent A withholds context Agent B needs. Tracing the full multi-agent execution path is the only reliable detection mechanism.
Runaway loops (step repetition, retry storms, generate-validate-fail cycles): Online observability is primary, with cost and latency telemetry as the early warning. An agent in a retry loop can burn significant compute before anyone notices. The same practitioner source describes an agent looping for twenty minutes and $180 in tokens on a malformed input, never once deciding to stop and report the failure. Step-level tracing with cost attribution per step is what makes these visible.
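One way to picture the early warning is a small guard that watches for identical repeated calls and cumulative spend. This is a sketch under stated assumptions: the `LoopGuard` class, its default limits, and the per-call cost figures are hypothetical, and a production system would read these values from its tracing layer.

```python
import json
from collections import Counter
from typing import Optional


class LoopGuard:
    """Flags identical repeated tool calls and runaway spend within one agent run."""

    def __init__(self, max_repeats: int = 3, max_cost_usd: float = 5.0):
        self.max_repeats = max_repeats
        self.max_cost_usd = max_cost_usd
        self.seen = Counter()
        self.cost_usd = 0.0

    def record(self, tool: str, args: dict, cost_usd: float) -> Optional[str]:
        """Record one step; return a reason to halt, or None to continue."""
        # Serialize args with sorted keys so the same call always maps to the same key.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        self.cost_usd += cost_usd
        if self.seen[key] > self.max_repeats:
            return "repeated identical call"
        if self.cost_usd > self.max_cost_usd:
            return "cost budget exceeded"
        return None


guard = LoopGuard(max_repeats=3)
verdicts = [guard.record("crm_lookup", {"id": None}, 0.01) for _ in range(4)]

budget_guard = LoopGuard(max_cost_usd=1.0)
first = budget_guard.record("search", {"q": "a"}, 0.6)
second = budget_guard.record("search", {"q": "b"}, 0.6)
```

The fourth identical call trips the repeat limit, and the second expensive call trips the budget. Either signal gives the system a reason to stop and report instead of looping until a timeout.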
Regression (a prompt change, model swap, or tool update breaks previously working behavior): Offline evals are primary. This is the golden dataset's home turf. Known-good inputs and outputs. Make a change, run the suite, investigate before deploying if scores drop. The LangChain agent engineering report (vendor-sourced, self-selected sample of 1,300+ practitioners, so treat as directional) found 52.4% of organizations run offline evals. Among those, the most common method is human review (59.8%), followed closely by LLM-as-judge scoring (53.3%).
Drift and degradation (model behavior shifts over time without any deployment change): Both disciplines share this, but online observability detects it first. A model provider updates weights. A tool API changes its response format slightly. Nothing in your system changed, but outputs start degrading. Online scoring of production traces catches the drift. Offline evals then confirm it and quantify the impact against your golden dataset.
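As a simplified illustration of drift detection, the sketch below compares which tools agents called in a baseline window against a recent window, using total variation distance between the two distributions. The choice of metric, the sample data, and the 0.2 alert threshold are assumptions for illustration, not a description of how any particular product does it.

```python
from collections import Counter


def distribution(names: list) -> dict:
    """Turn a list of tool names into a probability distribution."""
    counts = Counter(names)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """Half the sum of absolute differences: 0.0 is identical, 1.0 is disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical tool-call logs from two time windows with no deployment in between.
baseline_calls = ["search"] * 8 + ["lookup"] * 2
recent_calls = ["search"] * 5 + ["lookup"] * 5

drift = total_variation(distribution(baseline_calls), distribution(recent_calls))
DRIFT_THRESHOLD = 0.2  # illustrative alert level
drift_detected = drift > DRIFT_THRESHOLD
```

Here the share of `lookup` calls rises from 20% to 50%, which yields a distance of about 0.3 and trips the illustrative threshold even though nothing was deployed.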
The pattern across all five failure types: online observability is primary for the majority of production agent failures. Offline evals are primary for the failures you introduce through your own changes. Both are necessary. If a buyer has budget for one discipline now and one later, the failure distribution says to instrument production first, then build the eval suite from what production surfaces. The data supports that sequencing.
The LangChain survey reinforces this: 89% of respondents report having some form of agent observability in place, compared to 52.4% running offline evals. Among teams with agents in production, observability adoption hits 94%. The market is already voting with its budget.
The most mature teams connect these disciplines into a closed loop: production traces that fail online scoring get converted into offline eval test cases, so the regression suite grows from real failures. This is the agent equivalent of how SOC incident post-mortems feed back into detection rules and policy updates. If a buyer asks "how do these two connect," that's the answer, and it signals architectural maturity.
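The closed loop can be sketched in a few lines: take production traces that scored poorly online and turn them into candidate regression cases. The trace dictionary keys, the 0.7 cutoff, and the convention of leaving `expected_output` empty for a human reviewer to fill in are all assumptions for illustration.

```python
def traces_to_eval_cases(traces: list, min_score: float = 0.7) -> list:
    """Convert low-scoring production traces into candidate golden-dataset cases."""
    cases = []
    seen_inputs = set()
    for trace in traces:
        if trace["online_score"] >= min_score:
            continue  # passed online scoring, nothing to learn here
        if trace["input"] in seen_inputs:
            continue  # one case per distinct failing input
        seen_inputs.add(trace["input"])
        cases.append({
            "input": trace["input"],
            "observed_output": trace["output"],
            "expected_output": None,  # to be supplied by a human reviewer
            "source_run_id": trace["run_id"],
        })
    return cases


production_traces = [
    {"run_id": "r1", "input": "refund order 88", "output": "done", "online_score": 0.95},
    {"run_id": "r2", "input": "merge accounts", "output": "", "online_score": 0.10},
    {"run_id": "r3", "input": "merge accounts", "output": "", "online_score": 0.20},
]

new_cases = traces_to_eval_cases(production_traces)
```

Only the first failing occurrence of each input becomes a case, and the expected output stays blank until someone who knows the right answer supplies it.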
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "You need AI testing" | "Are you running offline evals against golden datasets before deployment, or only catching issues at runtime?" | Specificity signals you know the discipline exists as a practice, not a buzzword. |
| "You need observability for your agents" | "What does your agent tracing look like? Can you drill into individual tool calls and see arguments, responses, and latency per step?" | Tool-call-level tracing is the specific capability that matters; generic "observability" could mean anything. |
| "AI agents hallucinate" | "The Berkeley MAST study found 76% of system-addressable agent failures are infrastructure problems. Tool-call breakdowns and coordination failures." | Moves the conversation from "AI is unreliable" to "AI plumbing needs instrumentation," which is solvable and budgetable. |
| "You should evaluate your models" | "Are you gating deployments on eval scores the way you'd gate a code merge on passing tests?" | Connects evals to CI/CD, which infrastructure buyers already understand and budget for. |
| "The tools are all pretty similar" | "The observability tooling landscape moves fast. The question is whether your team has tracing, offline evals, and a feedback loop between them." | Keeps you out of tool-picking conversations you can't win and focuses on the architectural question. |
| "We can help with AI security" | "Agent observability is where identity governance meets AI operations. Who authorized that tool call? What credentials did the agent use? Your SIEM needs to see agent traces." | Bridges directly to the identity conversation you own. |
| "Agents will transform your workflows" | "Which workflows are you putting agents on, and what's your plan for when a tool call fails silently at step two and corrupts everything downstream?" | Forces the buyer to think about failure modes, which is where your instrumentation conversation starts. |
| "You need better prompts" | "Better prompts help with the 23.5% of failures that are task verification issues. The other 76.5% are system design and coordination problems that prompts can't fix." | Uses the MAST data to redirect budget conversations toward infrastructure. |
| "LLM-as-judge is the standard for evals" | "LLM-as-judge scales your eval coverage, but about 60% of teams running offline evals still rely on human review. The question is which outputs need human eyes." | Shows you understand the eval method landscape without oversimplifying. |
| "Monitoring is table stakes" | "94% of teams with agents in production have some observability. The gap is between 'some observability' and step-level tracing with automated scoring. That's where failures hide." | Pushes past the checkbox conversation to depth-of-instrumentation. |
What This Means for Your Next Conversation
When a buyer talks about making agents more reliable, listen for where they're focusing. If the conversation is entirely about better models and better prompts, they're addressing roughly a quarter of their failure surface. The larger share lives in tool-call reliability, agent coordination, and the runtime infrastructure connecting components. That's where observability earns its budget.
The question to carry into the room: "What's your observability story?" Ask it as a genuine diagnostic. The answer tells you whether the buyer is instrumenting the system where it actually breaks, or polishing the part that was already working fine.
You can own this conversation through the budget framing and the failure-type data. When the buyer starts asking about specific tracing implementations, scoring framework selection, or how to wire evals into their CI/CD pipeline, that's when you bring in your SE. The architectural "what" and "why" is yours. The implementation "how" is theirs.
Things to follow up on...
- MAST failure taxonomy details: The full MAST paper from UC Berkeley breaks all 14 failure modes into granular subcategories with per-framework failure rates ranging from 41% to 86.7%, worth scanning if a buyer asks how failure distributions vary by architecture.
- Anthropic on tool descriptions: Anthropic Engineering's guide on writing tools for agents treats tool descriptions as prompts that shape agent behavior, with iterative refinement of descriptions producing dramatic performance improvements in their testing.
- Eval adoption is growing fast: The LangChain State of Agent Engineering 2025 report shows offline eval adoption jumped from roughly 40% in 2024 to 52.4% in 2025, with LLM-as-judge methods now used by over half of evaluating teams.
- Agent retry loop economics: A practitioner analysis of agent error handling patterns documents specific production incidents where retry loops consumed hundreds of dollars in tokens before timeouts intervened, reinforcing why per-step cost telemetry belongs in any observability budget.

