CURRENT | Vision

The Core Problem

The Quiet Part

By Nora Kaplan, June 11, 2026

In a controlled trial, experienced developers using AI tools estimated they'd gotten about 20% faster. Measured result: 19% slower. What lingers is that completing the work, experiencing the full arc of it, didn't correct their perception. Meanwhile, 74% of teams running agent systems in production rely on human reviewers as their primary quality gate. Both findings are solid. Together, they describe something uncomfortable about where organizational AI deployment is actually headed.

Lessons and Designs

What Aviation Built After It Noticed the Problem

Aviation solved the automation-complacency problem decades ago with simulator checks, mandatory manual flying, and crew resource management. Those countermeasures rest on a foundation enterprise AI lacks: a stable, agreed-upon definition of what correct human performance looks like. A pilot's hand-flown approach can be graded against known parameters. An analyst's judgment on a market entry recommendation cannot. That structural difference determines whether countermeasures are even possible.

Lessons and Designs

Monitoring the Monitors

Organizations are building increasingly sophisticated systems to monitor whether their AI agents perform well. Almost none are checking whether the humans evaluating those agents are still able to evaluate them. Evidence from the few domains where measurement exists (developer productivity, medical diagnostics) reveals a durable gap between perceived and actual human capability. The instrument to detect oversight degradation needs to be built now, while the people who could design it still have the expertise to know what it should measure.

The Stress Tester

The Woman Who Breaks Her Hospital's AI on Purpose

The Causal Evidence

The Evidence for AI-Induced Deskilling Is No Longer Theoretical

Polish researchers studying AI-assisted colonoscopy weren't looking for deskilling. They found it anyway. After endoscopists gained routine access to AI polyp detection, their unassisted adenoma detection rate fell from 28.4% to 22.4% across four centres and nearly 1,500 procedures. The degradation only surfaced when the tool was taken away. With it, performance held steady.

From a completely different domain, the same shape: experienced developers using AI coding tools took 19% longer on real tasks while reporting they'd become 20% faster. The tool hides the loss from the person who is losing the skill.

Two findings, two fields, one mechanism.

TAKE NOTE

Accidental discovery: The colonoscopy deskilling finding fell out of a trial designed to measure AI benefits, not to test for skill loss

Statistical strength: The 6-percentage-point drop in adenoma detection rate (ADR) was statistically significant (p=0.0089), though the study was observational rather than a randomized deskilling trial

Perception gap: METR's May 2026 survey of 349 technical workers confirmed developers overestimate AI productivity gains by roughly 40 percentage points

Automation bias: Erroneous AI prompts raised radiologist false-positive recalls by 12%, and pathologists abandoned 7% of correct diagnoses when shown wrong AI suggestions

Vanishing baseline: METR's 2026 replication attempt collapsed because developers refused to work without AI, rendering the control condition itself unsustainable