The Context Window Fills Up Long Before It's Full

The distinction that matters landed about a year ago, when Karpathy observed that the serious work in LLM applications has moved from writing better prompts to deciding what enters the context window, when, and in what form. Cognition AI had been calling this "the #1 job" for engineers building agents. By September 2025, Anthropic had shipped platform primitives for managing it. The term "context engineering" won. But the problem it names is still mostly unsolved.

Start with a pair of numbers that make the problem concrete. On the LoCoMo benchmark, which tests memory across multi-session conversations, a system that selectively retrieves relevant facts scores 91.6% accuracy using roughly 7,000 tokens per retrieval. The full-context approach, which loads the entire conversation history into the window, scores 72.9% at around 26,000 tokens.

That is roughly a quarter of the tokens (about 3.7x fewer) and 18.7 points more accuracy. More information produced worse results.

Chroma's research explains why. They tested 18 frontier models and found that every one degraded as input length increased. Well before the context window limit, performance was already sliding. A model with a 200K-token window showed significant accuracy drops at 50K tokens. The decline was continuous, a steady degradation inside the window's stated capacity. Chroma calls it "context rot," and the implication is architectural: if longer context reliably degrades performance, then the system's job is to keep the context window clean.

The selective-retrieval system behind those benchmark numbers extracts key facts during ingestion rather than replaying raw transcripts. Its biggest accuracy gains land on temporal queries (+29.6 points) and multi-hop reasoning (+23.1 points), exactly the categories where accumulated noise does the most damage. The context window behaves like working memory.
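To make the ingest-then-select pattern concrete, here is a minimal Python sketch. It is not the benchmarked system's implementation: the `FACT:` tagging convention, the `Fact` class, and the word-overlap scoring are all assumptions chosen to keep the example self-contained. A real system would likely use a model to extract facts and embeddings to rank them.

```python
import re
from dataclasses import dataclass


@dataclass
class Fact:
    session: int
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_facts(session: int, transcript: list[str]) -> list[Fact]:
    """Hypothetical ingestion step: keep only lines tagged 'FACT:'.

    The tag is an assumption for illustration; raw chatter is dropped
    here instead of being replayed into the context later.
    """
    return [
        Fact(session, line.removeprefix("FACT:").strip())
        for line in transcript
        if line.startswith("FACT:")
    ]


def select(facts: list[Fact], query: str, k: int = 2) -> list[Fact]:
    """Return the k facts that share the most words with the query."""
    q = tokenize(query)
    ranked = sorted(facts, key=lambda f: len(q & tokenize(f.text)), reverse=True)
    return ranked[:k]


# Made-up multi-session history: mostly noise, a few facts.
sessions = {
    1: ["hi there", "FACT: Ana moved to Lisbon in March", "lol ok"],
    2: ["FACT: Ana adopted a dog named Pixel", "what's for lunch"],
    3: ["FACT: Ben started a new job in May", "brb"],
}

store = [f for s, lines in sessions.items() for f in extract_facts(s, lines)]
context = select(store, "When did Ana move to Lisbon?", k=1)

# Compare what each approach would place in the window (word counts as a rough proxy for tokens).
full_words = sum(len(line.split()) for lines in sessions.values() for line in lines)
selected_words = sum(len(f.text.split()) for f in context)
```

The selected context holds one relevant fact, while the full-context alternative would carry every line of every session, chatter included. The toy scoring is beside the point; what matters is that the filtering happens before anything reaches the window.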

Teams now have a clear vocabulary for context strategies: write facts externally, select what's relevant, compress what's redundant, isolate what interferes. LangChain's taxonomy organizes these neatly. But no benchmark tests whether one strategy produces more consistent outcomes than another over 20 steps. The benchmarking community measures token cost. The reliability community measures consistency: how often an agent succeeds across repeated attempts. These conversations happen in separate rooms.

The coupling between the two is real. Chroma showed that fewer tokens can mean better accuracy. Presumably that also means better consistency as sessions get longer. But "presumably" is doing real work in that sentence, because nobody has measured the interaction. Nobody is sitting on the data. It doesn't exist yet.
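The consistency half of that missing measurement has a well-known shape. Below is a minimal sketch of a pass^k-style estimator, the metric named in the tau-bench follow-up at the end of this piece, assuming each task is attempted n times and scored pass/fail. It uses the combinatorial form C(c, k) / C(n, k) averaged over tasks, which is how pass^k is commonly estimated. The function name, the two "strategies," and their trial counts are made up for illustration and are not results from any benchmark.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Estimate the chance that k independent attempts at a task all succeed.

    successes_per_task[i] is how many of the n trials succeeded on task i.
    For each task the estimate is C(c, k) / C(n, k); the result is the mean
    over tasks. math.comb returns 0 when k > c, so a task with fewer than k
    successes contributes 0.
    """
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(comb(c, k) / comb(n, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical data: 4 tasks, 8 trials each, under two context strategies.
strategy_a = [8, 8, 8, 4]  # perfect on three tasks, coin-flip on one
strategy_b = [7, 7, 7, 7]  # one failure on every task

a_single = pass_hat_k(strategy_a, n=8, k=1)
b_single = pass_hat_k(strategy_b, n=8, k=1)
a_all_eight = pass_hat_k(strategy_a, n=8, k=8)
b_all_eight = pass_hat_k(strategy_b, n=8, k=8)
```

Both made-up strategies have the same single-attempt success rate, yet one gets through all eight attempts on three of four tasks and the other never does. That gap is exactly what a token-cost benchmark cannot see.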

The deployer's blind spot

Every context strategy is also a bet on how the system fails at step 30, with no way to check the odds.

Which means the team choosing a context architecture today is choosing a reliability envelope without data on what that envelope contains. They can see the token cost. The consistency curve remains unmapped.

Things to follow up on...

Anthropic's context editing primitives: Their September 2025 framework reduced token consumption by 84% in 100-turn evaluations while enabling workflows that would otherwise fail from context exhaustion.
Pass^k and consistency decay: The tau-bench paper showed that even a model with 90% single-attempt success drops to 57% consistency at eight trials, making the case that capability and reliability are fundamentally different measurements.
Mem0's memory architecture evolution: Their April 2026 algorithm scored 20 points higher than the previous version on LoCoMo, with the largest gains on temporal and multi-hop queries where context noise does the most damage.
Context rot under real conditions: Chroma's original study notes that their controlled experiments likely understate the problem, since real-world applications involve greater complexity than the benchmark conditions tested.