Try to benchmark a process that's already been automated. You'll find yourself interviewing people who half-remember how things used to work, reading documentation that was out of date before the migration, reconstructing a baseline from artifacts that were never designed to be evidence. You're measuring against a ghost.
This is where most organizations are headed. They haven't arrived yet, and that matters more than the industry conversation suggests.
The maturity data tells a familiar story if you read it the usual way: roughly 59% of enterprises have moved past AI piloting, but only about 9% report meaningful autonomous workflows in production. McKinsey puts the number with truly mature AI strategies at 1%. The more interesting signal may be about timing: most organizations are in the last period where certain things are still possible.
Right now, in most companies, the analysts still run the reports. The operational teams still carry judgment built from years of doing the work by hand. Someone still knows what a good output looks like because they've produced thousands of them manually. That knowledge has a half-life. Once agentic workflows are load-bearing, once staffing levels have adjusted and the old process lives only in a wiki page no one maintains, the raw material for building evaluation systems starts to degrade.
Meanwhile, success stories circulate with remarkable efficiency. A team reports dramatic reductions in review hours. The number makes the rounds. What doesn't travel with it is the methodology: how they measured, what baseline they used, what they controlled for. Forrester data suggesting 88% of agent pilots fail to reach production hints at the distance between a promising demo and a reliable deployment. The anecdotes move faster than the frameworks that would let anyone evaluate them.
This gap points toward something specific. The organizations that eventually cross from piloting to autonomy will need measurement infrastructure that distinguishes between "the system is running" and "the system is running well." The difference sounds trivial. It is enormous. One tracks execution health: uptime, latency, completion rates. The other tracks decision quality: whether the outputs are actually good, by standards that someone with domain expertise would recognize. By the time most organizations realize they need the second kind, the people who could have defined it have learned to trust the system's output. Or moved to other roles. Or both.
Only one in five companies reports a mature governance model for autonomous agents. The ones building evaluation infrastructure now, while old and new ways of working still coexist, are doing something that looks slow from the outside. They're running new workflows alongside old ones. They're sitting with experienced operators and capturing what those people actually check, what patterns make them pause, what "good" feels like before that feeling becomes impossible to reconstruct.
The window is the coexistence period. It won't last because it's supposed to end. Autonomy is the goal, and the foundation that makes autonomy trustworthy gets built here, in the overlap, or it doesn't get built at all.
Nobody will send a calendar invite when it closes.

