Last summer, METR ran a randomized controlled trial with sixteen experienced open-source developers. These were people working on their own codebases, projects averaging over a million lines of code, developers with an average of five years on each project. Tasks were randomly assigned to allow or disallow AI coding tools.
Before starting, the developers predicted AI would make them about 24% faster. The actual result: 19% slower. After completing all 246 tasks, with their own timing data available to review, the developers estimated AI had made them 20% faster. A 39-point gap between perceived and actual performance. These weren't novices fooled by a shiny tool. They were domain experts on familiar ground, with access to their own data, and they did not update their beliefs even after living through the evidence. (METR's follow-up work suggests newer tools likely do provide real speed gains. A May 2026 survey confirmed the perception gap persists regardless: developers overestimate their gains by over 40 percentage points. The tools improve. The miscalibration holds steady.)
The mechanism shows up everywhere, and it's worth sitting with. When AI handles the effortful parts of a task, the work feels smoother. Fluent. And humans read fluency as a proxy for accuracy. We always have. A well-formatted report feels more trustworthy than a messy one. A confident speaker sounds more credible than a hesitant one. When AI removes friction from a workflow, it also removes the texture that tells you something might need closer attention.
Coding is just where the data happened to land first. In a study with 758 BCG consultants, AI boosted performance on tasks within its capability range. But on tasks outside that range, consultants using AI performed roughly 23% worse than those working without it. The consultants who struggled most were those who interacted with AI output uncritically, accepting its polished surface without interrogating what lay beneath. In medical diagnosis, the same pattern: diagnostic confidence rises significantly with AI support, even as accuracy doesn't reliably follow. Less experienced physicians are more often misled by AI recommendations, even as diagnostic confidence rises significantly across the board with AI support.
The instinct is to add checkpoints, and that instinct is sound enough. But a checkpoint only works if the person staffing it has enough friction in their experience to notice when something is off. If the entire upstream workflow has been smoothed by AI, the checkpoint inherits the same fluency problem. You're asking someone to be skeptical of output that arrives polished and complete.
So where does that leave workflow design? Somewhere around making uncertainty visible. Built into the workflow itself, rather than bolted on as a warning label. Showing what the system considered and rejected. Surfacing where confidence is low instead of burying it in fluent prose. Designing workflows where the human encounters the rough edges, the seams, the places where the system hesitated.
Confidence will track fluency. That's how human cognition works, and no amount of training will override it reliably. The organizations that design around this fact, that treat it as a permanent feature of augmented work, will be the ones whose humans in the loop are actually equipped to be there.

