Share

Context Over Instruction

Where Agent Behavior Actually Gets Decided

By Nora Kaplan— May 28, 2026

Feature image for article: Where Agent Behavior Actually Gets Decided

A developer's coding agent worked perfectly for thirty minutes, then started contradicting its own decisions. The prompt was the same. The model was the same. Everything else in the context window had been accumulating. His fix was radical: stop carrying context forward entirely. It worked. Across production agent systems, the same asymmetry kept surfacing — a carefully engineered prompt drowning in accumulated tool outputs performed worse than a simple prompt in a clean, focused context. The field needed a word for the thing practitioners were actually optimizing, because "prompt" didn't describe it.

Context Over Instruction

Where Agent Behavior Actually Gets Decided

By Nora Kaplan— May 28, 2026

A developer's coding agent worked perfectly for thirty minutes, then started contradicting its own decisions. The prompt was the same. The model was the same. Everything else in the context window had been accumulating. His fix was radical: stop carrying context forward entirely. It worked. Across production agent systems, the same asymmetry kept surfacing — a carefully engineered prompt drowning in accumulated tool outputs performed worse than a simple prompt in a clean, focused context. The field needed a word for the thing practitioners were actually optimizing, because "prompt" didn't describe it.

The Perception Layer

What Browser Agents See Before They Think

By Rina Takahashi— May 28, 2026

Feature image for article: What Browser Agents See Before They Think

Before a browser agent reasons about a task, it needs a representation of the page. Raw HTML, a screenshot, or the accessibility tree. That choice, made once and early, sets the token budget, failure modes, and reliability ceiling for everything downstream. OpenAI, Microsoft, and Perplexity built their agent systems independently and all three landed on the same primary perception layer: infrastructure originally designed for people who can't see screens. The convergence is worth understanding.

The Perception Layer

What Browser Agents See Before They Think

By Rina Takahashi— May 28, 2026

Before a browser agent reasons about a task, it needs a representation of the page. Raw HTML, a screenshot, or the accessibility tree. That choice, made once and early, sets the token budget, failure modes, and reliability ceiling for everything downstream. OpenAI, Microsoft, and Perplexity built their agent systems independently and all three landed on the same primary perception layer: infrastructure originally designed for people who can't see screens. The convergence is worth understanding.

Further Reading

Demystifying Evals for AI AgentsAnthropic's engineering team frames evals as compounding investments in behavioral judgment. Each eval built makes the next behavioral change easier to assess, and the returns accu...

State of Agent EngineeringThe number that should keep people up at night: 89% of practitioner teams have observability in place, but only 52% run evals. Nearly nine in ten teams can watch what their agents ...

Quick links

Model Context Protocol Threat Modeling and Analysis

The State of AI Agents in 2026: A Practitioner's Guide

How AI Agents Are Making Accessibility a Business-Critical Development Priority

Vocabulary That Shapes Decisions

The agent ecosystem borrows vocabulary freely and pays for it later. Terms drift in from adjacent fields, pick up new connotations, and settle into usage where two words that sound interchangeable actually point at different architectures, different risk surfaces, different engineering bets.

These five distinctions keep showing up because getting them wrong has traceable consequences. A team optimizes the wrong metric. A security review scopes against the wrong threat model. Someone builds retrieval infrastructure where they needed persistence, or vice versa. The vocabulary here is load-bearing. Swap one term for its near-neighbor and you end up building the wrong thing, confidently.

Vocabulary That Shapes Decisions

The agent ecosystem borrows vocabulary freely and pays for it later. Terms drift in from adjacent fields, pick up new connotations, and settle into usage where two words that sound interchangeable actually point at different architectures, different risk surfaces, different engineering bets.

These five distinctions keep showing up because getting them wrong has traceable consequences. A team optimizes the wrong metric. A security review scopes against the wrong threat model. Someone builds retrieval infrastructure where they needed persistence, or vice versa. The vocabulary here is load-bearing. Swap one term for its near-neighbor and you end up building the wrong thing, confidently.

Core Concepts

Context Fills the Window, Memory Outlives It

Context is what the model sees right now: the prompt, retrieved documents, tool outputs, prior steps in this turn. Memory is what the system decides to persist and may surface later, across sessions, through retrieval logic you have to build. Conflating them is how teams end up stuffing everything into the context window and wondering why their agent has no durable sense of anything it learned last Tuesday.

System Design

Who Picks the Next Step Changes Everything

If your code decides every step, you've built a workflow. It might call a model at each node, but the control flow is yours. If the model decides what to do next, you're in agent territory. The engineering, the eval strategy, and especially the failure modes are different animals. Treating an agent like a workflow means your error handling assumes predictability you don't have.

Eval Metrics

Succeeding Once Tells You Almost Nothing Useful

Pass@1 asks: did the agent get it right on the first try? Pass^k asks: does it keep getting it right across k repeated attempts? Tau-bench found agents scoring above 60% on pass@1 that crater below 25% at pass^8 in retail scenarios. If your production system needs to work reliably, not just occasionally, pass@1 is the wrong scoreboard.

Agent Protocols

Reaching Tools and Coordinating Agents Are Different

MCP gives an agent access to tools and data sources. A2A lets agents talk to other agents. They're complementary layers in a stack, but the risk surfaces barely overlap. Expanding what an agent can touch raises one set of security questions. Expanding who it can delegate to raises a genuinely different set. Scoping a security review against the wrong protocol means missing the threats that actually apply.

Production Readiness

Watching Failures Pile Up Won't Prevent Them

Observability tells you what happened: which tools fired, what errors surfaced, how state changed. Evals tell you whether what happened was correct. A LangChain survey found 89% of teams had adopted observability but only 52% were running evals. That gap explains a lot. Teams can see their agents failing in granular detail and still have no systematic way to catch the failures before users do.

Landscape Snapshot

While most of the industry took a long weekend, the agent stack kept hardening. Infrastructure choices are quietly becoming defaults. Security surfaces are getting named and ranked. Tooling ecosystems are consolidating before the practices they support have fully matured.

These six snapshots cover the layers where agent behavior actually gets shaped: markets, failure modes, architectural bets, security boundaries. What connects them is a timing problem. Decisions being made right now at the infrastructure level will constrain what's possible later at the governance level, and the governance vocabulary is still catching up.

Landscape Snapshot

While most of the industry took a long weekend, the agent stack kept hardening. Infrastructure choices are quietly becoming defaults. Security surfaces are getting named and ranked. Tooling ecosystems are consolidating before the practices they support have fully matured.

These six snapshots cover the layers where agent behavior actually gets shaped: markets, failure modes, architectural bets, security boundaries. What connects them is a timing problem. Decisions being made right now at the infrastructure level will constrain what's possible later at the governance level, and the governance vocabulary is still catching up.

Market Signal

Observability Platforms Hit $2.69 Billion in 2026

LLM observability grew 36% year-over-year and is projected to reach $9.26B by 2030. Gartner expects half of all GenAI deployments to invest in observability by 2028, up from roughly 15% today. Worth noting: agent observability is a different animal than model monitoring. Failures show up in multi-step causal chains, and diagnosing them requires full-session traces, not just token-level metrics.

Infrastructure Pattern

Durable Execution Returns Without the Custom Infrastructure

The OpenAI Agents SDK integrated with Temporal went GA in March 2026, turning agent reasoning steps into discrete activities with automatic retry and crash recovery. Microsoft's Agent Framework ships a parallel implementation. The idea underneath both: checkpoint state at every transition so a process can pause, crash, and pick up where it left off.

Security Risk

Tool Poisoning Ranks First Among MCP Threats

Tool poisoning scores Critical at 46.5 out of 50 in DREAD analysis and tops OWASP's LLM Top 10. Peer-reviewed research published in MDPI this May tested seven major MCP clients and found most lack sufficient static validation. Tool descriptions get reviewed once at connection time. After that, tool responses flow into agent context unchecked, and in multi-agent chains, the poisoning propagates.

Orchestration Risk

Infinite Handoff Loops Break Multi-Agent Systems First

The failure everyone hits first: Agent A hands off to B, B to C, C back to A, and nobody owns the task. Five agents can mean 25x the management complexity of one. Princeton NLP research found single agents matched or beat multi-agent setups on 64% of benchmarked tasks given identical tools and context. Practitioner consensus is settling: most teams reach for multi-agent orchestration well before a single agent hits its ceiling.