Something odd happened in December. AWS announced 13 pre-built evaluation systems for agents. Salesforce launched a Testing Center. Multiple vendors released evaluation frameworks within weeks of their agent products.
They're shipping the testing infrastructure because nobody knows how to test these things. Traditional QA breaks when systems are non-deterministic, operate across conversation turns, and call external tools. You can't write unit tests for hallucinations.
Companies with structured evaluation frameworks see 60% fewer production incidents. The bottleneck isn't building agents anymore. It's figuring out whether they actually work.
