Benchmarks up, context windows longer, a thousand subagents running in parallel. The capacity to verify any of it is quietly disappearing.