The best browser agents now achieve 89% success on standardized benchmarks. Humans hit 95.7% on identical tasks. Six point seven percentage points.
That gap sounds trivial. It's not. Recent benchmarks show architectural decisions, not model capability, drive performance. The hybrid context management that reached 85% represented a quantum leap from earlier 50% success rates. But closing that final stretch to human-level performance? Still out of reach.
Here's what the gap means in production: every failed task needs human intervention, fallback systems, or acceptance of incomplete data. At 10,000 daily tasks, you're looking at 1,100 failures versus 430 at human baseline. The math is unforgiving. "Good enough" automation still requires maintaining parallel human systems for the 11% that breaks.
