Agent deployments quadrupled from Q2 to Q3, hitting 42% of enterprises. Researchers discovered WebArena—used by OpenAI and others—marked "45 + 8 minutes" as correct. Top agents score 5% on hard benchmarks. Organizations are racing to production anyway.
Something's breaking here. Enterprises deploy without knowing how to measure what matters. Benchmarks optimize for controlled accuracy. Production demands cost predictability, failure recovery, operational stability. The evaluation gap widens as adoption accelerates.
Within six months, early deployments will surface failures benchmarks couldn't predict. Not task completion failures. Reliability failures. Affordability failures. Safety-at-scale failures. Success metrics won't translate to production outcomes because current evaluation can't surface failure clustering, cost variance, or degradation patterns.
Teams building at scale will need internal frameworks measuring what matters in their environment. Track failure modes, not just success rates. The organizations developing production-relevant metrics first gain genuine advantage. But really, this points toward verification infrastructure that creates organizational trust—proving agent decisions are correct before they impact operations, not after.
