Adeveloper running a coding agent documented the decay curve. For the first thirty minutes, accurate multi-file edits with correct cross-references. By sixty minutes, missing imports. By ninety, tunnel vision on single files, contradicting its own earlier decisions.
Same prompt. Same model. Ninety minutes of accumulated context.
His fix was to stop carrying context forward entirely. Fresh instance for each iteration. Inject only the relevant filesystem state. Never let conversational history accumulate. It worked, which tells you something about where agent behavior actually originates.
In a single-turn interaction, the quality of your instruction dominates the quality of the response. This is the world prompt engineering was built for. But multi-step agent execution introduces a different dynamic. Each tool call produces output that folds back into the context window. Conversation history accumulates. The original instructions get physically pushed deeper into the token sequence with every step. Research on long-context behavior has shown that models attend most strongly to information at the beginning and end of their context, with performance dropping 30% or more for material in the middle. A production agent making fifty tool calls gradually buries its own behavioral instructions under the weight of its own work product.
The prompt is still there. The model just can't find it as clearly.
The agent undermines its own instructions through the act of executing them. Calling tools, receiving outputs, maintaining history — each of those correct actions degrades the very context that tells the agent what to do next. And what accumulates carries weight beyond its volume. Intermediate reasoning, error traces, tool outputs the model treats as relevant signal. One production team found that a single hallucinated state would embed itself in the agent's goals, and because goals are referenced at every decision point, the false information reinforced itself across dozens of turns. The context gets longer and, worse, misleading.
Practitioners kept running into the asymmetry this produces: a carefully engineered prompt drowning in stale tool outputs performed worse than a simple prompt operating in a clean, focused context. The proof showed up in what teams actually built. Across multiple major agent frameworks, the same pattern emerged independently. Spin up sub-agents with isolated context windows. Return only the result. Discard the working history. If the prompt were the binding constraint, this would be pointless overhead. It became standard practice because a fresh context with the right information consistently beat an accumulated context with everything.
Andrej Karpathy captured the shift:
"Filling the context window with just the right information for the next step."
Curating what surrounds the instruction. That was the work.
The old vocabulary kept pulling attention toward the wrong surface. "Prompt engineering" centered the instruction as the thing to optimize. Production systems kept demonstrating that the informational environment surrounding the instruction determined whether the model could even follow it. The environment was doing most of the behavioral work, and surprisingly simple instructions performed well inside a well-curated one. Practitioners needed a word for what they were actually spending their time on.
The new vocabulary is a starting point, and an honest one. If the context window is where behavior gets determined, then every architectural decision about what enters it, what gets compressed, what persists from prior steps, is a behavioral decision. The code that selects which tool outputs to keep and which to summarize shapes how the agent acts. The logic that decides when to spawn a fresh sub-agent draws a line between reliable and unreliable reasoning. Teams are encoding behavioral specification in infrastructure code, often without recognizing it as such. "Context engineering" at least makes that visible. What follows from the visibility is still an open question.
Things to follow up on...
- Error accumulation has math: A formal analysis of tool-using agents proved that context distortion grows linearly across sequential steps, with periodic re-grounding every nine steps sufficient to control it.
- Bigger windows, worse performance: Chroma Research tested 18 frontier models and found 20–50% performance drops as input length grew from 10k to 100k+ tokens, with coherent documents paradoxically degrading performance more than shuffled ones.
- The eval gap compounds this: A LangChain survey of 1,300+ practitioners found that only 52% run offline evaluations on agent behavior, meaning most teams lack the instrumentation to detect context-driven failures before production.
- Google's architectural response: Google's Agent Development Kit explicitly lets parent agents scope what sub-agents see to prevent context explosion, suppressing ancestral history while preserving factual state in the session layer.

