The Test Nobody Wants to Take — A Conversation With the CLO Measuring What AI Leaves Behind

Ninety-four percent of organizational leaders have made little or no meaningful progress designing how humans and AI should work together.¹ The other six percent aren't saying much about what they've found. Probably because it's uncomfortable.

We spoke with a chief learning officer at a professional services firm — roughly 2,800 consultants across North America and Europe — who fourteen months ago began running quarterly unassisted performance assessments. The program is simple in concept: pull consultants out of their AI-assisted workflow, give them representative tasks, measure what they can do alone. The execution, she'll tell you, is anything but.

She asked that her firm not be identified. We've honored that request by also not identifying her, or, for that matter, confirming with any certainty that she exists. What we can confirm is that the problem she describes is real, the data she references is published, and the measurement gap she's trying to close is one almost nobody else is attempting.

What made you start measuring this?

Pepper: It wasn't some grand insight. It was a partner meeting. Someone said, "Our junior consultants are producing better work than ever," and I sat with how uncomfortable that made me. Because I couldn't tell whether that meant our people were getting better or our tools were getting better.

I'd been reading the BCG research, the Dell'Acqua study on the jagged frontier, where consultants using AI on tasks outside its reliable capability performed nineteen percentage points worse than those working without it.² And the follow-up that identified "self-automators" — twenty-seven percent of consultants who basically handed the entire problem to the AI and accepted whatever came back. Forty-four percent of that group made zero modifications to the output.³

Here's the thing that kept me up: self-automator work looks fine. In the short run, you genuinely cannot distinguish someone who wrestled with a problem alongside the AI from someone who copy-pasted a prompt and went to lunch. The difference only surfaces in what they can do next quarter, next year. And we had absolutely no instrument to detect it.

So you built one.

Pepper: We built a very imperfect one. Yes.

What does it actually look like?

Pepper: Every quarter, we pull a stratified sample — by tenure, practice area, office — and give them a half-day of tasks. No AI tools. No internet for some exercises. Closed-book, essentially.

The tasks mirror real deliverables: structure a problem framework from a client brief, identify gaps in a market analysis, build a recommendation from incomplete data. We also include what I privately call "frontier traps" — tasks designed to sit in the zone where AI tends to sound confident but get things wrong. Multi-step causal reasoning. Problems that require unstated context you'd only have if you'd actually been in the room with the client.

Output gets blind-scored by senior partners against a structured rubric. Speed, quality, problem-framing depth, and — this one matters most — whether they correctly identify what's missing. That last dimension is the hardest to score and the most diagnostic.
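The quarterly draw she describes, stratified by tenure, practice area, and office, can be sketched in a few lines of Python. This is a minimal illustration, not the firm's actual process: the roster, the field names, the 20 percent sampling fraction, and the `stratified_sample` helper are all assumptions made for the example.

```python
import random
from collections import defaultdict
from itertools import product


def stratified_sample(people, keys, fraction, seed=0):
    """Draw about `fraction` of every stratum, with at least one person each."""
    rng = random.Random(seed)  # fixed seed so a draw can be reproduced and audited
    strata = defaultdict(list)
    for person in people:
        # A stratum is one combination of the key values, e.g. (junior, strategy, europe).
        strata[tuple(person[k] for k in keys)].append(person)

    sample = []
    for members in strata.values():
        n = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical roster: 10 consultants in each tenure/practice/office cell.
KEYS = ("tenure", "practice", "office")
roster = [
    {"id": f"{t}-{p}-{o}-{i}", "tenure": t, "practice": p, "office": o}
    for t, p, o in product(("junior", "mid", "senior"),
                           ("strategy", "operations"),
                           ("north_america", "europe"))
    for i in range(10)
]

sample = stratified_sample(roster, KEYS, fraction=0.2)
print(len(sample), "of", len(roster), "selected")
```

Sampling within each cell, rather than across the whole roster, keeps small groups (a thinly staffed office, say) from dropping out of a given quarter's assessment by chance.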

I'm guessing this wasn't universally popular.

Pepper: (laughs) The utilization argument was the loudest. You're pulling consultants off billable work for a half-day every quarter to prove they can do something we've given them tools not to have to do. One practice leader told me it was like testing whether your accountant can still use an abacus. Which is a great line, and completely wrong, but I'll get to that.

The deeper resistance was more interesting. Some partners experienced the program as an accusation. Like I was saying their teams couldn't actually do the work. In a high-performance culture, measuring unassisted capability feels like measuring doubt. Nobody wanted to be the practice group whose numbers looked soft.

And then there's the political layer. We have a C-suite AI adoption mandate. Anything that generates data suggesting AI-assisted work might have costs gets read as disloyalty to the strategy. I spent more time on internal framing than on assessment design. Honestly, probably twice as much.

You've been at this for over a year. What's the data showing?

Pepper: I want to be careful, because a year is four data points. That's a suggestion, not a trend. But.

Senior consultants, five-plus years of experience, are stable. Their unassisted performance looks essentially the same as it did fourteen months ago. They had the skills before the tools arrived. The skills are holding.

The mid-career group, two to four years, is more interesting. On structured tasks — market sizing, framework application, the kind of thing where AI is genuinely helpful — they're fine. Maybe slightly slower unassisted, but quality holds. On novel problem framing, ambiguous briefs, the frontier-trap exercises? A divergence is emerging. It's modest. It's within the range where I could be wrong. But it's consistent across three quarters now.

The junior cohort — people hired into an AI-assisted environment from day one — that's where I lose sleep. Their AI-assisted deliverables are, by client metrics, excellent. Their unassisted scores on novel problem framing are not tracking where I'd expect for their career stage.

The performance paradox.

Pepper: That's the UTS researchers' term for it, and it's exactly right.⁴ Output quality goes up. Durable capability may not be keeping pace. And in my data, those two metrics are now decorrelated. I have consultants whose client satisfaction scores are the highest in their cohort and whose unassisted problem-framing scores are the lowest.

Same people.
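One way to make that comparison concrete is to correlate the two scores across a cohort and then list the people who sit at the top of one and the bottom of the other. The sketch below does both in Python. Every number in it is invented for illustration, and the field names, the quartile cutoff, and the `divergent` helper are assumptions, not her instrument.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def divergent(records, top=0.25):
    """IDs in the top `top` share on satisfaction and the bottom share on unassisted score."""
    k = max(1, int(len(records) * top))
    best_sat = sorted(records, key=lambda r: r["satisfaction"], reverse=True)[:k]
    worst_solo = sorted(records, key=lambda r: r["unassisted"])[:k]
    return sorted({r["id"] for r in best_sat} & {r["id"] for r in worst_solo})


# Hypothetical cohort: client satisfaction vs. unassisted problem-framing score.
cohort = [
    {"id": "A", "satisfaction": 9.5, "unassisted": 3.0},
    {"id": "B", "satisfaction": 9.0, "unassisted": 3.5},
    {"id": "C", "satisfaction": 8.0, "unassisted": 7.0},
    {"id": "D", "satisfaction": 7.5, "unassisted": 6.5},
    {"id": "E", "satisfaction": 7.0, "unassisted": 8.0},
    {"id": "F", "satisfaction": 6.5, "unassisted": 5.0},
    {"id": "G", "satisfaction": 6.0, "unassisted": 7.5},
    {"id": "H", "satisfaction": 5.5, "unassisted": 6.0},
]

r = pearson([c["satisfaction"] for c in cohort],
            [c["unassisted"] for c in cohort])
flagged = divergent(cohort)
print(f"correlation: {r:.2f}; flagged: {flagged}")
```

With four quarterly data points and small cohorts, a coefficient like this is a prompt for a closer look, not a finding, which matches the caution she applies to her own numbers.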

What can't you measure?

Pepper: Almost everything that matters most.

I can't measure tacit knowledge. The judgment you build from ten thousand hours of client interaction. Whether someone can read a room when the client goes off-script. Whether they know what they don't know. IMD published something in May saying it plainly: there is, as yet, no clear way to train and develop junior employees when AI is doing much of their work.⁵ Most firms can see the problem. Few have solved it.

I also can't establish a clean pre-AI baseline. We rolled out tools before I started measuring. So I know the trajectory, but I don't know the starting altitude.

And this is the one that really bothers me: I can't separate AI-induced atrophy from never having built the skill in the first place. For junior hires, those are different problems requiring different solutions, and my instrument can't distinguish them.

That's a significant limitation.

Pepper: There's one more. Eighteen months of AI-assisted work means my consultants may have internalized AI-generated patterns as their own thinking. When they perform "unassisted," some of what they produce might be recalled AI output they've absorbed without realizing it. I'm trying to measure a clean signal in a population where the signal is already contaminated.

I know that. I do it anyway, because a noisy signal beats flying blind.

Why do you think almost nobody else is doing this?

Pepper: Because the dashboards are green.

Every metric that matters to a quarterly business review — utilization, client satisfaction, deliverable speed, revenue per consultant — all going up. The degradation I'm watching for takes years to surface, and when it does, it'll look like a hiring problem or a culture problem or a market shift. The causal chain back to "we let a skill atrophy because we couldn't see it happening" will be untraceable.

I'm not trying to prove AI is bad. I genuinely believe these tools make our work better. What I'm trying to build is the measurement infrastructure that lets us know whether "better work" and "better workers" are still the same thing. Right now, for most organizations, that's an article of faith. I'd prefer it to be a data point.

And the abacus guy? The reason he's wrong is that nobody ever forgot arithmetic because they had a calculator. What they forgot was how to tell when the calculator was giving them a wrong answer. That's what I'm testing for.

1. Deloitte, 2026 Global Human Capital Trends. Only 6% of leaders report making real progress designing human-AI collaboration.
2. Dell'Acqua, F., et al. "Navigating the Jagged Technological Frontier." Organization Science, March 2026. https://pubsonline.informs.org/doi/10.1287/orsc.2025.21838
3. Randazzo, S., et al. "Cyborgs, Centaurs and Self-Automators." Harvard Business School Working Paper No. 26-036, December 2025. https://www.hbs.edu/faculty/Pages/item.aspx?num=68273
4. Lodge, J.M. and Loble, L. "Artificial Intelligence, Cognitive Offloading and Implications for Education." University of Technology Sydney, March 2026. https://www.uts.edu.au/news/2026/03/experts-warn-unstructured-ai-use-in-schools-risks-cognitive-atrophy
5. IMD. "How to Build Judgment When AI Does the Work." May 2026. https://www.imd.org/ibyimd/talent/how-to-build-judgment-when-ai-does-the-work/
