CURRENT | Foundations

Tool Architecture Study

Where the Snapshot Lives Changes Everything Else

By Rina Takahashi— June 18, 2026

Feature image for article: Where the Snapshot Lives Changes Everything Else

Playwright now ships two modes for agent-driven browser automation. One streams page snapshots into the LLM's context window. The other writes them to disk as files. The token difference is 4x. The difference in what each mode leaves behind after the session ends is harder to quantify and harder to ignore. One produces artifacts you can diff and inspect independently of the agent's own account. The other stores observation and inference in the same stream, inseparable after the fact. Where the snapshot lives turns out to determine what you can verify.

Tool Architecture Study

Where the Snapshot Lives Changes Everything Else

By Rina Takahashi— June 18, 2026

Playwright now ships two modes for agent-driven browser automation. One streams page snapshots into the LLM's context window. The other writes them to disk as files. The token difference is 4x. The difference in what each mode leaves behind after the session ends is harder to quantify and harder to ignore. One produces artifacts you can diff and inspect independently of the agent's own account. The other stores observation and inference in the same stream, inseparable after the fact. Where the snapshot lives turns out to determine what you can verify.

Conceptual Piece

The Context Window Fills Up Long Before It's Full

By Rina Takahashi— June 18, 2026

Feature image for article: The Context Window Fills Up Long Before It's Full

A model with a 200K token context window starts degrading at 50K. Selective retrieval using a quarter of the tokens beats full-context loading by nearly 20 accuracy points. The finding gives "context engineering" its teeth. The work is keeping the context window clean. But teams choosing a context strategy today are choosing a reliability envelope with no data on what that envelope actually contains. The research to measure it hasn't been done.

Conceptual Piece

The Context Window Fills Up Long Before It's Full

By Rina Takahashi— June 18, 2026

A model with a 200K token context window starts degrading at 50K. Selective retrieval using a quarter of the tokens beats full-context loading by nearly 20 accuracy points. The finding gives "context engineering" its teeth. The work is keeping the context window clean. But teams choosing a context strategy today are choosing a reliability envelope with no data on what that envelope actually contains. The research to measure it hasn't been done.

Further Reading

Building Effective Agents — Anthropic's Workflow Patterns as Implicit Context DecisionsEvery orchestration pattern here quietly decides what the model sees. Reread it with that lens after this issue.

τ-bench: The Benchmark That Gave Agent Inconsistency a NumberPass^k measures how often success repeats. Context degradation over long sessions is one reason it doesn't.

Quick links

OWASP Top 10 for Agentic Applications 2026

State of AI Agent Memory 2026 — Mem0

Playwright MCP — Official Getting Started Docs

MCP Threat Modeling: Tool Poisoning as Context Injection, Formalized

Concepts Worth Knowing

The agent ecosystem coins vocabulary faster than it retires confusion. Terms with real engineering consequences get compressed into conference shorthand until the original distinction vanishes entirely. Concepts that should have precise boundaries end up meaning whatever the speaker needs them to mean that afternoon.

These six cards mark where that confusion tends to concentrate. Each isolates one distinction and makes it concrete. A few will feel obvious once stated plainly. Others might shift something you thought was already settled. The goal is sharper lines between ideas that blur easily, so the harder conversations that build on them have somewhere solid to stand.

Concepts Worth Knowing

The agent ecosystem coins vocabulary faster than it retires confusion. Terms with real engineering consequences get compressed into conference shorthand until the original distinction vanishes entirely. Concepts that should have precise boundaries end up meaning whatever the speaker needs them to mean that afternoon.

These six cards mark where that confusion tends to concentrate. Each isolates one distinction and makes it concrete. A few will feel obvious once stated plainly. Others might shift something you thought was already settled. The goal is sharper lines between ideas that blur easily, so the harder conversations that build on them have somewhere solid to stand.

System Design

Where Control Lives Defines What You've Built

What matters is who decides what happens next. If code dictates every step, you have a workflow, even when it calls a model. If the model chooses tools and loops on observations, the system has agentic control. That line is more durable than chatbot versus agent.

Core Concepts

Context and Memory Break in Different Directions

Context is what the model sees right now. Memory is what the system persists and may retrieve later. They fail differently: context degrades over length as attention weakens across distant tokens. Memory degrades through consolidation, as summarization and lifecycle operations quietly erode precision.

Tool Design

A Tool Is a Contract Between Two Worlds

A tool is a contract between probabilistic reasoning and deterministic execution. Good contracts have narrow purpose, clear naming, typed inputs and outputs, explicit side effects, and permission boundaries aligned to the action. When the contract is sloppy, the model fills in plausible but invalid parameters.

Context Strategy

Four Strategies for Filling the Context Window

Lance Martin at LangChain organized context engineering into four strategies: write (save context externally), select (pull relevant context in), compress (retain only needed tokens), and isolate (separate unrelated context). The taxonomy names what most teams already do by instinct and exposes what they're skipping.

Security Model

MCP's Trust Gap Lives Between Connect and Runtime

Tool descriptions get reviewed once, when the agent first connects to a server. Tool responses flow straight into the LLM context with no equivalent check. That unguarded runtime channel is exactly what attackers exploit in tool poisoning. The asymmetry is structural, baked into the protocol's timing.

Evaluation Gap

Public Benchmarks Measure a Fundamentally Different World

Public benchmarks structurally miss real customer data messiness, private tools and permissions, enterprise policy exceptions, adversarial users, production cost and latency constraints, and long-running tasks spanning days. The gap has nothing to do with benchmarks being wrong. They measure a different environment than the one your agents operate in.

What's Happening Now

Six months ago, where browser state lives and who runs the reasoning loop looked like implementation details. They've turned out to be architectural choices with long shadows, shaping cost, security posture, and what's even visible to the people responsible for governing these systems.

Several developments this month make those fault lines tangible. A token benchmark puts a number on context architecture. A managed platform absorbs the agent harness entirely. A W3C draft rewires what browsers report in real time. And two lines of research quantify gaps the field has mostly hand-waved about: one in tool security, one in observability.

Infrastructure choices are becoming governance choices, whether anyone planned for that or not.

What's Happening Now

Six months ago, where browser state lives and who runs the reasoning loop looked like implementation details. They've turned out to be architectural choices with long shadows, shaping cost, security posture, and what's even visible to the people responsible for governing these systems.

Several developments this month make those fault lines tangible. A token benchmark puts a number on context architecture. A managed platform absorbs the agent harness entirely. A W3C draft rewires what browsers report in real time. And two lines of research quantify gaps the field has mostly hand-waved about: one in tool security, one in observability.

Infrastructure choices are becoming governance choices, whether anyone planned for that or not.

Browser Automation

Playwright CLI Saves State to Disk, Slashes Tokens

Microsoft's @playwright/cli, released early 2026, writes accessibility snapshots to disk instead of streaming them into the agent's context window. Benchmarks show roughly 27,000 tokens per task versus 114,000 via MCP, a gap that widens to 10x on longer sessions. Microsoft now recommends CLI for coding agents with filesystem access.

Agent Platforms

Google Moves the Agent Loop to Its Servers

The Gemini Managed Agents API, announced at Google I/O 2026, lets developers spin up persistent agents with a single API call. Google hosts the reasoning loop, sandbox execution, and state persistence. The tradeoff is real: when the harness belongs to someone else, so does your visibility into what it's doing.

Web Standards

WebDriver BiDi Adds Real-Time Events via WebSocket

Published June 1, 2026 as a W3C Working Draft, WebDriver BiDi replaces classic WebDriver's HTTP request-response model with bidirectional WebSocket communication. Browsers can now push network events, console logs, and exceptions as they happen. Selenium, Puppeteer, and Playwright are integrating it. For agent infrastructure, the significance is what becomes observable during automation that previously wasn't.

Agent Security

MCPTox: Capable Models Fall Hardest for Poisoned Tools