Monitoring the Monitors

If the measurement infrastructure is missing, the natural question is what building it would actually involve. The difficulty becomes apparent quickly.

I haven't found an enterprise organization that currently runs periodic proficiency checks on the humans who evaluate AI outputs. Checks on agent performance exist, increasingly. Observability dashboards, evaluation frameworks, output quality metrics. Checks on whether the people reviewing agent outputs can still tell good from bad? Whether the human half of human-in-the-loop is holding up over time? I haven't encountered any.

The evidence that the problem exists keeps accumulating from the few places where measurement is possible. The METR developer productivity study found that experienced developers believed they were 20% faster with AI coding tools while actually being 19% slower. A 39-point gap between perceived and actual performance. A follow-up survey found the overestimation is durable. Developers don't recalibrate over time. They continue to feel more productive than they are. If you asked these developers whether they needed a proficiency check, they would reasonably say no. They feel fine. And the gap between feeling fine and being fine is precisely what makes this hard to address through voluntary self-assessment.

The BCG/Harvard study on consultants using AI found something complementary. On tasks inside the AI's capability frontier, performance improved dramatically. On tasks outside it, performance dropped 19 percentage points below the no-AI condition. Twenty-seven percent of participants simply accepted AI output with zero modifications. The researchers called this the "jagged technological frontier," and the jaggedness matters because the humans couldn't tell which side of it they were on. Confidence was uniform across both zones.

Organizations, meanwhile, measure what's easy to measure. With AI-assisted workflows, what's easy to measure is volume: reports generated, tickets resolved, analyses completed. Throughput goes up. Confidence in the quality of oversight drifts down. Both trends can coexist for a long time before anyone connects them, because they're tracked by different teams looking at different dashboards.

There's a subtler dynamic underneath the throughput problem. Reviewing AI outputs and maintaining domain expertise feel identical from the inside. An analyst who spends her days evaluating AI-generated competitive reports is engaging with competitive intelligence all day. She's reading about competitors, assessing market positions, flagging inconsistencies. It looks like expertise maintenance, and it might be. It might also be something different: building a model of the AI's error patterns rather than building independent domain knowledge. The two activities produce similar confidence levels and very different capabilities when the AI is wrong in a novel way.

What would detection actually involve?

The first thing it would need is a baseline. And this is where the problem recurses. Establishing a baseline for human judgment quality requires knowing what good judgment looks like independent of AI assistance. For a pilot, that's straightforward: hand-fly an approach, measure the deviation. For an analyst evaluating a market entry strategy, it's genuinely unclear what the test would be.

Aviation's FOQA model suggests one starting point: instead of testing humans in isolation, monitor the pattern of their interactions with AI systems over time. Are override rates declining? Is the time spent reviewing each output shrinking? Are the modifications becoming more superficial? None of these individually proves degradation. Together, a trend line might suggest something worth investigating. Interaction monitoring tells you that behavior is changing, though. It can't tell you whether the change reflects growing trust in a reliable system or growing inability to evaluate it. You'd need something more direct.
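
To make the interaction-monitoring idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `WeeklyReviewStats` record, its field names, and the rule that a flag is raised only when all three signals trend downward together. It fits a simple least-squares slope to each signal and is not a validated detection method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WeeklyReviewStats:
    """One reviewer's week. All field names are hypothetical."""
    week: int
    override_rate: float          # share of AI outputs rejected or overridden
    median_review_seconds: float  # typical time spent per output
    mean_edit_fraction: float     # share of each output the reviewer changed


def slope(values):
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def declining_signals(history):
    """Names of the signals whose fitted trend is downward."""
    ordered = sorted(history, key=lambda s: s.week)
    series = {
        "override_rate": [s.override_rate for s in ordered],
        "median_review_seconds": [s.median_review_seconds for s in ordered],
        "mean_edit_fraction": [s.mean_edit_fraction for s in ordered],
    }
    return [name for name, values in series.items() if slope(values) < 0]


def worth_investigating(history):
    """Flag only when all three signals decline together."""
    return len(declining_signals(history)) == 3
```

A flag from `worth_investigating` says only that behavior is changing. It carries no information about why, which is the limitation the paragraph above describes.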

The harder version is periodic unassisted performance exercises, analogous to aviation's mandatory manual flying. Have the team produce an analysis without AI tools. Compare it to their AI-assisted work from the same period. Compare it to their unassisted work from six months ago. This sounds simple and is organizationally almost impossible. It feels like a waste of time to people who believe, with that durable 39-point confidence gap, that they're performing well. It requires management to value something other than throughput. And it requires someone senior enough to evaluate the unassisted work, which presupposes the evaluator's own expertise hasn't degraded.
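
The two comparisons in such an exercise can be sketched in a few lines of Python. The sketch assumes something the paragraph above flags as the hard part: that each piece of work has already been scored on some rubric by an evaluator whose own judgment is intact. The function name, the score lists, and the two output keys are hypothetical.

```python
from statistics import mean


def compare_exercise(unassisted_now, assisted_now, unassisted_before):
    """Compare mean rubric scores across the three sets of work.

    Each argument is a list of scores from a hypothetical rubric,
    assigned by an evaluator assumed to be competent.
    """
    return {
        # How much of current performance depends on the tools.
        "assistance_gap": mean(assisted_now) - mean(unassisted_now),
        # Negative values mean unassisted work has slipped since last time.
        "unassisted_drift": mean(unassisted_now) - mean(unassisted_before),
    }
```

A widening `assistance_gap` alongside a negative `unassisted_drift` would be the pattern of concern. Either number alone is ambiguous.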

Aviation discovered its degradation problem through accident investigation. Planes crashed, investigators worked backwards, and they found the automation-complacency link in the wreckage. Then they built the instruments. Then they built the countermeasures. That order mattered.

Enterprise AI doesn't produce crashes. That's the product promise, and it's also the epistemological problem. The system keeps running. Outputs keep flowing. Dashboards stay green. The Deloitte 2026 survey found only one in five companies has a mature governance model for autonomous AI agents. Even mature governance, as currently conceived, monitors the agents. Whether the humans governing them can still do so competently goes unexamined.

The recursive problem

The capacity to build the oversight test and the capacity that test is meant to measure sit on the same decay curve.

The uncomfortable possibility is that this instrument needs to be built now, while the humans who could design it still have the unassisted expertise to know what it should measure. Expecting to need oversight and actually investing in the capacity to provide it are very different commitments. And that curve only goes in one direction.
