What Aviation Built After It Noticed the Problem

In 2017, the FAA updated its guidance on manual flight operations with SAFO 17007. Buried in the language is a phrase that says more about automation psychology than most academic papers: airlines should adopt policies that "permit and encourage manual flight operations."

The word permit is doing a lot of work there. Automation had become so thoroughly the default that pilots needed explicit organizational permission to practice flying without it. The systems, the culture, the workflow pressure all pointed toward letting the automation handle things. Countermeasures had to push against that current.

Aviation's countermeasures are well known. Crew Resource Management formalized oversight as a trained skill. Recurrent simulator training, required annually or semiannually, puts pilots through scenarios with evaluators present. SAFO 17007, which as a Safety Alert for Operators is advisory rather than regulatory, recommended training in manual slow flight, upset recovery, and instrument approaches without automation. How aviation knows whether any of this is working deserves closer attention.

The measurement infrastructure is remarkably concrete. ILS deviation scores quantify how precisely a pilot tracks an instrument approach. SAGAT probes, developed by Mica Endsley in 1988, freeze a simulation at specific moments and query pilots about their situational awareness. Flight Operational Quality Assurance programs monitor actual in-flight performance data from real operations. The FAA's own survey on SAFO 17007 compliance explicitly asked airlines how they used FOQA data to track manual flight proficiency. These produce numbers. A pilot's manual approach can be graded against a known standard.
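
To make "graded against a known standard" concrete, here is a minimal sketch of how an approach might be scored from sampled deviations. The function names, the sample values, and the one-dot limit are illustrative assumptions, not an FAA or airline scoring standard.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of deviations."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def score_approach(localizer_dots, glideslope_dots, limit_dots=1.0):
    """Summarize tracking precision for one approach.

    Deviations are in 'dots' of needle displacement. limit_dots is a
    hypothetical pass/fail threshold chosen only for this example.
    """
    peak = max(abs(v) for v in list(localizer_dots) + list(glideslope_dots))
    return {
        "localizer_rms": rms(localizer_dots),
        "glideslope_rms": rms(glideslope_dots),
        "peak": peak,
        "within_limit": peak <= limit_dots,
    }


# Hypothetical samples taken at intervals during a hand-flown approach.
result = score_approach(
    localizer_dots=[0.1, -0.2, 0.3, 0.0],
    glideslope_dots=[0.2, 0.4, -0.1, 0.1],
)
print(result)
```

The point of the sketch is not the particular threshold but that one can be written down at all: the score depends only on the recorded deviations and an agreed limit.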

The measurement is possible because the definition of "correct" is stable and agreed upon. That agreement is doing most of the structural work in making aviation's countermeasures viable.

A 2014 NASA study by Casner and colleagues tested sixteen airline pilots in a 747-400 simulator with varying automation levels. The headline finding surprised the researchers: instrument scanning and manual control skills were mostly intact, even among pilots who reported rarely practicing them. What had degraded were the cognitive skills: navigating, maintaining awareness of flight status, diagnosing unusual situations. The thinking behind the flying.

Haslbeck and Hoermann's 2016 study of 126 airline pilots found a complementary pattern. Short-haul crews, who perform five times as many flights as long-haul crews, showed better fine-motor performance and superior visual skills. Fleet assignment predicted degradation more strongly than experience or rank. Long-haul pilots, spending more time with automation engaged, showed measurably worse manual performance.

The cognitive finding is the one worth sitting with, because it maps directly onto AI oversight. The mechanical act of reviewing an AI-generated report, reading the text, checking the formatting, clicking approve, remains perfectly intact. Nobody's motor skills atrophy from using AI. The judgment behind the review is where the erosion happens: the ability to notice what's missing, to recognize when a confident-sounding analysis has elided something important, to hold an independent mental model of what the answer should look like before seeing what the AI produced.

And here the aviation parallel reaches its structural limit. Aviation can test whether a pilot's cognitive skills have degraded because it can define what non-degraded performance looks like. A hand-flown ILS approach to minimums is either within parameters or it isn't. ICAO Document 9995 codifies specific competencies. Evidence-Based Training assesses them in context.

Most enterprise AI oversight operates in a different universe. What counts as a good judgment call on marketing copy, vendor selection, competitive analysis, or strategic planning is contextual, contested, and often only legible in retrospect. You can't build a simulator scenario for "evaluate this AI-generated market entry recommendation" because reasonable people would disagree about what the correct evaluation looks like.

The few domains where objective measurement does exist are revealing. A Lancet colonoscopy study of more than 23,000 patients found that adenoma detection rates dropped from 28.4% to 22.4% after physicians had routine AI assistance removed. Colonoscopy has an ADR metric. It's measurable. And the measurement showed exactly the degradation pattern aviation's countermeasures were designed to prevent.
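
The size of that drop is easier to judge as arithmetic. A minimal sketch, using only the two percentages quoted above; the variable names are illustrative.

```python
# Adenoma detection rates quoted above, in percent.
adr_before = 28.4
adr_after = 22.4

# Absolute change in percentage points, and the change relative to the baseline.
absolute_drop = adr_before - adr_after
relative_drop = absolute_drop / adr_before * 100

print(f"Absolute drop: {absolute_drop:.1f} percentage points")  # 6.0
print(f"Relative drop: {relative_drop:.1f}% of the baseline rate")  # 21.1
```

Six percentage points sounds modest. Expressed against the baseline, it is roughly a fifth of the detections.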

The degradation is probably happening across enterprise AI use. We can see it in the domains where measurement exists. In the domains where it doesn't, the same mechanism is likely operating. We just have no instrument to detect it.

Aviation spent decades building the measurement infrastructure that makes its countermeasures possible. The flight data recorder came before CRM. The accident investigation methodology came before the simulator training curriculum. The instruments to detect the problem preceded the interventions to address it.

Enterprise AI has skipped that step entirely, deploying the automation and the human oversight simultaneously, with no baseline measurement of what the humans could do before the AI arrived. As a Strata.io analysis from May 2026 put it:

"Presence is not practice."

Even practice presupposes you know what competent oversight looks like in a given domain. For most enterprise AI applications, that definition doesn't exist yet.
