Among organizations running agents in production, 89% have observability in place. Only 52% run evaluations of any kind. That 37-point spread is a confidence gap masquerading as coverage.
Observability tracks what happened: latency, token counts, tool calls, error rates. It confirms execution. Every step glows green. But when an agent corrupts its own context at turn six and compounds that corruption through turn eleven, step-level tracing registers nothing wrong. The reasoning broke, not the infrastructure.
Significant capital is flowing toward closing this gap. Whether the instruments being built can measure what actually matters is a harder question than the investment suggests.
