The Number You Didn't Ask About

How the gap between pass@k and pass^k quietly distorts every agent benchmark, procurement decision, and deployment plan built on headline numbers.

An agent succeeds on a customer service task 70% of the time. Report the chance it gets at least one success in three tries, and you can write 97%. Report the chance it handles three consecutive requests without failing, and you write 34%. Same agent. Same math. A 63-point gap created entirely by which question you decided to answer.
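The two figures can be checked in a few lines. This is a minimal sketch that assumes every attempt succeeds independently with the same probability p; the function names are illustrative and do not come from any benchmark's code.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


if __name__ == "__main__":
    p, k = 0.70, 3  # the per-attempt success rate and attempt count from the text
    best_of = pass_at_k(p, k)
    all_of = pass_hat_k(p, k)
    print(f"at least one of {k}: {best_of:.1%}")
    print(f"all {k} in a row:    {all_of:.1%}")
    print(f"gap: {(best_of - all_of) * 100:.0f} points")
```

Both functions take the same two inputs. Only the question changes.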

The intuition is simple: each additional attempt is another chance to fail, and failures don't average out when you need every attempt to land. At a 90% per-trial success rate over eight consecutive tasks, you're down to 43%. The exponent is merciless. What looks production-ready in a single trial becomes a coin flip across a shift's worth of work.
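To see how fast the exponent bites, here is a small sketch that tabulates the all-succeed probability for growing run lengths and finds the longest run that still clears a target. It assumes independent tasks with a fixed success rate; `max_run_length` and the 50% target are hypothetical choices made for illustration.

```python
def all_succeed(p: float, k: int) -> float:
    """Probability that k independent tasks in a row all succeed."""
    return p ** k


def max_run_length(p: float, target: float) -> int:
    """Largest k for which p**k is still at least `target`."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    k = 0
    # Extend the run one task at a time until the next task would drop below target.
    while all_succeed(p, k + 1) >= target:
        k += 1
    return k


if __name__ == "__main__":
    p = 0.90
    for k in (1, 2, 4, 8, 16):
        print(f"k={k:>2}: {all_succeed(p, k):.1%}")
    print("longest run that stays at or above 50%:", max_run_length(p, 0.50))
```

Under these assumptions a 90% agent stays above even odds for only six consecutive tasks.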

The optimistic version, pass@k, is the probability that at least one of k attempts succeeds. It originated in code generation research, where it made perfect sense: generate several candidate solutions, run them against a test suite, ship the one that passes. You're selecting the best attempt from a batch. But a customer calling about a refund doesn't get three agents working in parallel. Neither does a fraud check, a booking flow, or a compliance verification. For sequential work, Sierra Research's tau-bench introduced pass^k: the probability that all k attempts succeed. It is the metric that captures what production actually feels like.
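In practice neither metric is computed from a known p. Both are estimated from n recorded trials of a task, c of which succeeded. The sketch below uses the standard combinatorial estimators: the share of size-k subsets of the trials that contain at least one success (pass@k) or only successes (pass^k). Treat it as an illustration under that assumption, not as any benchmark's official scoring code; the function names are made up for this example.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Share of k-subsets of n trials that contain at least one success."""
    _check(n, c, k)
    # comb(n - c, k) counts the subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k_estimate(n: int, c: int, k: int) -> float:
    """Share of k-subsets of n trials that contain only successes."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c, k = 8, 5, 3  # hypothetical task: 5 successes in 8 trials
    print(f"pass@{k} estimate: {pass_at_k_estimate(n, c, k):.3f}")
    print(f"pass^{k} estimate: {pass_hat_k_estimate(n, c, k):.3f}")
```

A benchmark-level score would then average these per-task estimates over all tasks.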

The results were clarifying. GPT-4o scored roughly 61% on tau-bench's retail tasks in a single trial. Respectable. But pass^8 in the same domain fell below 25%. A collapse of more than 35 points, from the best model tested at the time.

And tau-bench is essentially alone in reporting this number. SWE-bench, GAIA, WebArena, TheAgentCompany: the benchmarks generating headlines and shaping procurement decisions all report single-trial accuracy or pass@k. Researchers at KTH ran 60,000 trajectories on SWE-bench and found that single-run estimates vary by up to 6 percentage points depending on which run you happen to pick. Reported improvements of 2-3 points may be noise.
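A back-of-the-envelope calculation shows why differences of a few points can sit inside the noise. The sketch below computes how much a single-run accuracy score would wobble if every task were an independent coin flip with its own success probability. The 500-task benchmark and the 50% per-task rate are hypothetical inputs, and this is not a reproduction of the KTH analysis.

```python
import math
from typing import Sequence


def single_run_std(task_probs: Sequence[float]) -> float:
    """Standard deviation of one run's accuracy over a fixed task set,
    assuming each task succeeds independently with its own probability."""
    n = len(task_probs)
    if n == 0:
        raise ValueError("need at least one task")
    variance = sum(p * (1.0 - p) for p in task_probs) / (n * n)
    return math.sqrt(variance)


def two_run_gap_std(task_probs: Sequence[float]) -> float:
    """Standard deviation of the difference between two independent runs."""
    return math.sqrt(2.0) * single_run_std(task_probs)


if __name__ == "__main__":
    probs = [0.5] * 500  # hypothetical benchmark: 500 tasks, each a coin flip
    print(f"one run:        +/- {single_run_std(probs) * 100:.1f} points (1 sigma)")
    print(f"run-to-run gap: +/- {two_run_gap_std(probs) * 100:.1f} points (1 sigma)")
```

With these made-up inputs, two runs of the same unchanged system would routinely differ by a few points, which is the scale of many reported improvements.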

So the benchmarks measure capability. Practitioners experience consistency. The gap between those two properties of the same system is where production plans go wrong. In a study of 306 practitioners, reliability was the top development challenge, and teams responded by deliberately constraining agent autonomy: 68% of production agents execute at most ten steps before requiring human intervention. The LangChain developer survey tells the same story from a different angle, with 32% of respondents naming quality as their top barrier to production deployment. These teams watched the demo work. Then they watched Monday happen.

The industry's most-cited numbers are built on a metric designed for batch selection, applied to sequential execution. That gap is the distance between a compelling demo and a deployment that holds up across thousands of runs.

The next time an agent benchmark crosses your desk reporting 85% accuracy, you have the question worth asking: 85% of what? One success in several attempts, or reliable performance every time? In a field where almost no major benchmark reports the sequential number, the absence is itself a finding. Pass^k wasn't deliberately suppressed; it just never became the convention. And that silence has been shaping procurement conversations, deployment plans, and staffing estimates built on numbers that describe a system's best day, not its typical one.

Things to follow up on...

Failures aren't independent: A reliability science framework paper tested whether agent failures behave like independent coin flips and found systematic violations, meaning the pass^k = p^k formula actually understates the consistency problem in practice.
Autonomy keeps growing: Anthropic's research on agent autonomy found that 99.9th-percentile turn duration nearly doubled between October 2025 and January 2026, meaning more steps per run and more surface area for the exponential to bite.
Scaling is an allocation problem: A March 2026 survey of 650 enterprise technology leaders found that organizations reaching production scale weren't spending more on AI overall but were allocating proportionally more to evaluation infrastructure and monitoring than to model selection.
Phil Schmid's walkthrough: Google DeepMind's Phil Schmid published a practitioner-oriented explainer using a flight rebooking agent to show how pass^k changes staffing estimates and architectural decisions.
