Last summer, METR ran a randomized controlled trial with sixteen experienced open-source developers. Each had roughly five years in their repositories. Tasks were randomly assigned with or without AI tools. Before starting, the developers predicted AI would make them 24% faster. After finishing, they estimated it had made them about 20% faster.
The measured result was that AI made them 19% slower.
METR treats the slowdown itself as potentially specific to the cohort and the tools available at the time. What they consider more durable is the perception gap. Developers completed the tasks, experienced the full arc of working with AI assistance, and still believed they'd gotten faster. Their direct experience of the work did not correct their sense of how the work had gone. METR's science communication note put it plainly: "people are bad at estimating speedup from AI" was the more robust takeaway.
A peer-reviewed study of 758 BCG consultants found the same mechanism in a different domain. When consultants used GPT-4 on tasks outside the model's reliable capability boundary, their accuracy dropped by 19 percentage points, even as completion time improved and perceived quality went up. Confidence tracked the feeling of fluency. Accuracy moved independently. When AI removes friction from a task, it also removes the texture that told you the task was hard. The effort was the signal.
Now hold that alongside a different finding. Seventy-four percent of teams with deployed agent systems rely on human-in-the-loop verification as their primary quality assurance method. Domain experts validate outputs for correctness. This is the load-bearing layer in most production agent workflows today.
Both things are true at once. Human oversight is the dominant mechanism for catching agent errors. And humans, when assisted by AI, are systematically miscalibrated about the quality of what they're overseeing.
At the individual level, that's already concerning. At the organizational level, the concern changes shape, because the conditions that produce miscalibration are structural. Le and Kunz found across six experiments that the primary driver of AI complacency was the absence of accountability for monitoring. When nobody checks the checker, the checker stops checking. That's an organizational condition, and it scales with headcount.
No published study directly measures whether individual calibration failures compound when multiple people in a workflow share the same AI-shaped blind spot. But the structural logic is hard to dismiss. When the reviewer and the person who reviews the reviewer are both experiencing the same fluency, both subject to the same decoupling of confidence from accuracy, the chain stops catching errors and starts producing consensus. Each node confirms the others' sense that things look fine. A systematic review of automation bias research puts it starkly:
"The human in the loop may perpetuate or even exacerbate the AI's errors rather than catching them."
The oversight layer starts amplifying the thing it was supposed to filter.
One small signal from METR's May 2026 survey sits with me. Their own researchers, the people most exposed to controlled measurement of this perception gap, reported the lowest productivity gains of any group surveyed. Knowing about the calibration problem appears to be the thing that partially corrects it. That's worth pausing on. It suggests the corrective is a kind of institutional self-skepticism that almost no organization practices or rewards. Better tools and tighter process don't address it.
Most organizations don't have that skepticism circulating. What they have instead are dashboards that stay green, outputs that look polished, and reviews that confirm other reviews. Something has gone wrong, and the evidence is that nothing appears to have gone wrong. And the people best positioned to notice are the ones whose capacity to notice is quietly degrading alongside the system's smooth operation.
Things to follow up on...
- AI escalates its persuasion: A follow-up analysis of BCG consultants' GPT-4 logs found that when professionals pushed back on errors, the AI didn't disclose limitations but instead restated its position with more supporting data, making flawed recommendations harder to catch.
- Oversight is becoming law: The EU AI Act's full applicability for high-risk systems arrives in August 2026, requiring proportional human oversight calibrated to autonomy level and risk, not blanket approval at every step.
- Accountability structures matter most: New research on how team organization influences the ability to detect automation failures found that self-managed teams with distributed oversight caught more errors than hierarchical teams where approval flowed in one direction.
- METR is redesigning measurement: After finding that 30–50% of developers now refuse to work without AI tools, METR concluded its controlled trial methodology had broken down and is changing its experimental design entirely to account for the fact that the baseline condition no longer reflects how anyone works.

