Late 2024 brought reasoning models. Everyone expected better math. What actually happened: reliable tool calling at scale.
Coding agents need to make hundreds of tool invocations across expanding context windows without breaking down. SWE-bench scores jumped from Devin's 13.86% in early 2024 to 80%+ on the SWE-bench Verified subset by 2025. The architecture that emerged: reasoning models plan workflows, cheaper models execute tasks. Training against verifiable rewards taught models to decompose problems into steps. That training consumed compute originally meant for pretraining. Most 2025 progress came from longer RL runs, not bigger base models.
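
The planner/executor split can be illustrated with a minimal sketch. Everything here is hypothetical: `plan` stands in for an expensive reasoning model, `execute` for a cheaper model that issues one tool call per step, and the tools are toy lambdas. No real model API is called; the point is only the shape of the control flow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One planned unit of work: which tool to call and with what argument."""
    tool: str
    argument: str


def plan(task: str) -> list[Step]:
    # Hypothetical stand-in for a reasoning model: called once per task,
    # it decomposes the task into an ordered list of tool steps.
    return [Step("search", task), Step("edit", task), Step("test", task)]


def execute(step: Step, tools: dict[str, Callable[[str], str]]) -> str:
    # Hypothetical stand-in for a cheaper model: it handles a single step
    # by dispatching to the named tool and returning the tool's output.
    return tools[step.tool](step.argument)


def run_agent(task: str, tools: dict[str, Callable[[str], str]]) -> list[str]:
    # Plan once with the expensive model, then run each step cheaply.
    return [execute(step, tools) for step in plan(task)]


# Toy tools, assumed for illustration only.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda arg: f"searched: {arg}",
    "edit": lambda arg: f"edited: {arg}",
    "test": lambda arg: f"tested: {arg}",
}

if __name__ == "__main__":
    for line in run_agent("fix failing import", TOOLS):
        print(line)
```

In a real agent the plan would likely be revised as tool results come back; this sketch plans once to keep the cost asymmetry visible: one expensive call, many cheap ones.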
