CURRENT | Foundations

Reliability Lens

The Number You Didn't Ask About

By Nora Kaplan— June 11, 2026

Feature image for article: The Number You Didn't Ask About

An agent succeeds at a task 70% of the time. Frame the results one way and you can report 97%. Frame them differently and you report 34%. Same system, same math, a 63-point gap created entirely by which question the benchmark decided to answer. Nearly every major agent benchmark chooses the generous framing. The number that captures what production actually feels like, consecutive successes across a full workload, is one that almost nobody reports.

Reliability Lens

The Number You Didn't Ask About

By Nora Kaplan— June 11, 2026

An agent succeeds at a task 70% of the time. Frame the results one way and you can report 97%. Frame them differently and you report 34%. Same system, same math, a 63-point gap created entirely by which question the benchmark decided to answer. Nearly every major agent benchmark chooses the generous framing. The number that captures what production actually feels like, consecutive successes across a full workload, is one that almost nobody reports.

Perception Walkthrough

The Accessibility Tree Wasn't Built for AI Agents

By Rina Takahashi— June 11, 2026

Feature image for article: The Accessibility Tree Wasn't Built for AI Agents

Accessibility trees were built for screen readers, not AI agents. But both share the same constraint: neither can see a web page. Both need it translated into structured, sequential text. Playwright MCP builds its entire browser automation architecture around that convergence, using accessibility snapshots that cost a fraction of what screenshots require per interaction. The token savings are real and they compound across multi-step workflows. So does the downside: on 95.9% of the web's top million pages, the underlying markup has detectable failures. The agent inherits every one of them.

Perception Walkthrough

The Accessibility Tree Wasn't Built for AI Agents

By Rina Takahashi— June 11, 2026

Accessibility trees were built for screen readers, not AI agents. But both share the same constraint: neither can see a web page. Both need it translated into structured, sequential text. Playwright MCP builds its entire browser automation architecture around that convergence, using accessibility snapshots that cost a fraction of what screenshots require per interaction. The token savings are real and they compound across multi-step workflows. So does the downside: on 95.9% of the web's top million pages, the underlying markup has detectable failures. The agent inherits every one of them.

Further Reading

The Three Architectures of Browser AgentsMaps the full perception design space the walkthrough occupies one corner of.

Demystifying Evals for AI AgentsA layered eval methodology for teams ready to move from measuring reliability to building around it.

Quick links

The Reliability Gap: Agent Benchmarks for Enterprise

How AI Agents See Your Website: The Accessibility Tree Explained

Key Distinctions

The agent ecosystem adopts vocabulary faster than it pressure-tests it. A term catches on, architectures get built around the assumption, and then someone discovers that two genuinely different concepts have been sharing a label. By then the confusion has already shaped security postures, eval pipelines, and deployment decisions.

These six distinctions can look like semantics from a distance. Up close, the gaps are structural. A sixty-percentage-point swing between two eval metrics applied to the same agent. A protocol everyone treats as a governance layer that explicitly declines to be one. Vocabulary worth sharpening before it costs you something.

Key Distinctions

The agent ecosystem adopts vocabulary faster than it pressure-tests it. A term catches on, architectures get built around the assumption, and then someone discovers that two genuinely different concepts have been sharing a label. By then the confusion has already shaped security postures, eval pipelines, and deployment decisions.

These six distinctions can look like semantics from a distance. Up close, the gaps are structural. A sixty-percentage-point swing between two eval metrics applied to the same agent. A protocol everyone treats as a governance layer that explicitly declines to be one. Vocabulary worth sharpening before it costs you something.

Eval Metrics

pass@k and pass^k Measure Completely Different Things

Confused: treating "succeeded in k tries" and "succeeded every try" as variations of the same score. They aren't. An agent with 70% per-trial success hits ~97% on pass@3 (capability) but just 34% on pass^3 (reliability). Same agent, opposite conclusions.

Security Vectors

Injection and Poisoning Enter Through Different Doors

Confused: lumping both under "prompt attacks." Prompt injection is an input-validation problem where malicious instructions slip through unsanitized user content. Tool poisoning is a supply-chain problem where server-side metadata the agent trusts was authored by an attacker. Defending the input channel does nothing about the supply chain, and vice versa.

System Design

Who Decides the Next Step: Code or Model

Confused: calling anything that uses an LLM an "agent." A system that calls GPT-4 ten times in a fixed sequence is a workflow if code controls the routing. A system where the model picks its own tools and loops is an agent. The dividing line is control architecture, not model size or capability.

Tool Protocols

MCP Standardizes Plumbing, Not Policy or Governance

Confused: assuming a standard integration protocol handles authorization and security. MCP defines how agents discover and call tools. It does not enforce least privilege, scope tokens, define audit policy, or verify tool quality. Authorization remains optional in the spec. Trend Micro found 492 MCP servers exposed with zero authentication.

Runtime Infrastructure

Durable Execution Survives Crashes, Regular Execution Doesn't

Confused: assuming a workflow that runs will keep running. Execution means the code ran. Durable execution means it checkpointed progress, survived a crash, and resumed without duplicating side effects. A server restart at 3am either loses the customer's half-finished task or retries and doubles the order. That's the gap.

Agent Surfaces

Browser Agents Inherit the User's Full Privilege Surface

Confused: treating browser-based and API-based agents as interchangeable approaches. API-first agents call documented endpoints with typed inputs and narrow contracts. Browser-first agents operate through the user's most privileged interface, crossing app boundaries and acting with whatever credentials the session holds. The privilege surface is everything a logged-in user can reach.

Benchmark Landscape

Six benchmarks cover the agent capability landscape from coding and web navigation to function calling and workplace autonomy. Scores are climbing fast. Nearly all of them answer the same narrow question, though: can the agent do this once?

One benchmark in this set, τ-bench, was built to ask the harder version: can it do this reliably, every time? That asymmetry shapes everything downstream. How we evaluate, what we miss when we deploy, and which failures never surface until production.

Scores shift as new models enter evaluation. Dates are noted on each card. Verify against primary leaderboards before making decisions.

Benchmark Landscape

Six benchmarks cover the agent capability landscape from coding and web navigation to function calling and workplace autonomy. Scores are climbing fast. Nearly all of them answer the same narrow question, though: can the agent do this once?

One benchmark in this set, τ-bench, was built to ask the harder version: can it do this reliably, every time? That asymmetry shapes everything downstream. How we evaluate, what we miss when we deploy, and which failures never surface until production.

Scores shift as new models enter evaluation. Dates are noted on each card. Verify against primary leaderboards before making decisions.

Coding Tasks

SWE-bench Verified: Real GitHub Bug Fixes

Agents tackle human-validated software engineering tasks drawn from actual GitHub issues. Leading score hit 74.4% by end of 2025, though scores have climbed further since, accompanied by growing contamination concerns. Reports pass@k only. Check the primary leaderboard for current numbers.

Reasoning Tasks

GAIA Pushes Multi-Step Reasoning With Tools

Tests real-world assistant tasks that chain together web browsing, tool use, and multi-step reasoning across three difficulty tiers. Orchestrated agent systems reached ~90% by end of 2025. Bare models score far lower. Reports pass@k only.

⚑ Reliability Metric

τ-bench Exposes What Repeated Runs Reveal

Policy-adherent tool use in retail and airline customer service. Agents hit 80% on a single attempt, but pass^8 drops below 25% in original experiments. The only benchmark here reporting both capability (pass@k) and reliability (pass^k). Worth sitting with that gap.

Web Navigation

WebArena Navigates Realistic E-Commerce and SaaS