What Browser Agents See Before They Think

How browser agents represent web pages to themselves silently determines their reliability, cost, and failure modes before any reasoning begins.

Screen readers and browser agents have something in common that matters more than it might sound. Both are sequential perceivers that need to act on web content they can't see. Both need structure, not pixels. And both, it turns out, rely on the same piece of browser infrastructure to make sense of a page.

That infrastructure is the accessibility tree. Most major production agent systems are independently converging on it as the primary perception layer, and the reasons why say a lot about where the real architectural decisions in browser automation happen.

There are three ways to represent a page to an agent, and each one forecloses something that matters.

Raw DOM gives you everything on the page, all at once, with no filtering. A login page generates 15,000+ tokens of context: tracking scripts, wrapper divs that exist only for CSS grid layout, inline styles repeated on every element, hidden containers for A/B testing frameworks. An enterprise dashboard generates far more. The vast majority of that HTML carries zero information relevant to clicking a button or filling a form. The context window fills up with structural noise, and the agent hasn't formed a single intention yet. A decorative <div> and a submit button get equal representation.

Screenshots give spatial understanding. Anthropic's Computer Use tool trains Claude to count pixels from screen edges to locate targets. This works across arbitrary desktop applications where no structured data exists. But a button that says "Submit" in a screenshot is just pixels. The agent doesn't know it's a button until it reasons about the visual context, and that reasoning costs tokens and time while introducing ambiguity that structured data would have resolved already. Appearance comes through clearly. Semantic structure drops away.

The accessibility tree occupies genuinely different ground. The browser builds it by walking the DOM and stripping out everything non-semantic and non-interactive. What remains are roles, labels, states, and relationships. Playwright MCP's documentation reports ~200–400 tokens per snapshot for simple pages. A practitioner measured ~3,800 tokens for a login form that generates 15,000+ tokens as raw HTML. The compression is semantic: the filtering the accessibility tree performs is exactly the filtering a task-oriented agent needs, because it was designed for perceivers who navigate by meaning and ignore appearance.
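To make that filtering concrete, here is a minimal sketch in Python. It is a toy, not how any browser builds its tree: the role map, the class name ToyOutline, and the sample page are all assumptions made for illustration. It keeps only elements that have a role and a name, and drops wrapper divs, inline styles, and script content.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified tag-to-role map. It ignores the input
# "type" attribute; real browsers apply a far larger mapping.
IMPLICIT_ROLES = {"h1": "heading", "a": "link", "button": "button", "input": "textbox"}
VOID_TAGS = {"input", "img", "br", "hr", "meta", "link"}
_SKIP = object()  # stack marker for <script>/<style> content


class ToyOutline(HTMLParser):
    """Collects role/name entries in document order; assumes well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # entries that survive the filter
        self._stack = []  # one slot per open element: node dict, None, or _SKIP

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        role = attr.get("role") or IMPLICIT_ROLES.get(tag)
        node = None
        if role:
            name = attr.get("aria-label") or attr.get("placeholder") or ""
            node = {"role": role, "name": name, "text": []}
            self.nodes.append(node)
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        self._stack.append(_SKIP if tag in ("script", "style") else node)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        node = self._stack.pop()
        if isinstance(node, dict) and not node["name"]:
            # No explicit label: fall back to the element's text content.
            node["name"] = " ".join(node["text"])

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for entry in reversed(self._stack):
            if entry is _SKIP:
                return  # script or style source, never user-facing
            if entry is not None:
                entry["text"].append(text)
                return


def outline(html):
    parser = ToyOutline()
    parser.feed(html)
    parser.close()
    return [(n["role"], n["name"]) for n in parser.nodes]


PAGE = """
<div class="grid"><div class="wrapper" style="margin:0">
  <script>window.track('pageview')</script>
  <h1>Sign in</h1>
  <input type="email" placeholder="Email">
  <input type="password" aria-label="Password">
  <div class="spacer"></div>
  <button type="submit"><span class="icon"></span>Submit</button>
</div></div>
"""

for role, name in outline(PAGE):
    print(role, repr(name))
```

Four entries survive out of a dozen tags. A real tree also carries states, relationships, and static text, which this sketch leaves out, but the shape of the reduction is the same: what is left is what an agent can act on.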

Look at the production systems. OpenAI's Atlas directs publishers to add ARIA tags for agent compatibility, using the same labels and roles that support screen readers. Microsoft's Playwright MCP operates on accessibility snapshots by default, no vision models needed. Reverse-engineering of Perplexity's Comet shows it using Chrome's accessibility tree with element reference IDs, falling back to vision selectively. Three systems, built independently, landing on the same primary perception layer. Anthropic's Computer Use is the instructive exception: it chose screenshots first because it targets desktop automation across arbitrary applications where the accessibility tree doesn't exist uniformly. The tradeoff is explicit in their documentation: high token consumption.
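The element reference IDs mentioned above are worth a quick illustration. The sketch below is a guess at the general pattern, not the format Comet or Playwright MCP actually emit: the function names, the "e1" style of ID, and the rendered line layout are all hypothetical. The idea is that the model sees a compact text listing, answers with a reference, and the harness maps that reference back to a concrete node.

```python
def assign_refs(nodes):
    """Give each snapshot node a short, stable-for-this-snapshot ID."""
    return {f"e{i}": node for i, node in enumerate(nodes, start=1)}


def render(ref_map):
    """Produce the compact text view that would go into the model's context."""
    return "\n".join(
        f'[{ref}] {node["role"]} "{node["name"]}"' for ref, node in ref_map.items()
    )


def resolve(ref_map, ref):
    """Map a reference chosen by the model back to its node."""
    try:
        return ref_map[ref]
    except KeyError:
        # Refs belong to one snapshot; after the page changes they may be stale.
        raise LookupError(f"stale or unknown ref: {ref}") from None


snapshot = [
    {"role": "heading", "name": "Sign in"},
    {"role": "textbox", "name": "Email"},
    {"role": "button", "name": "Submit"},
]

refs = assign_refs(snapshot)
print(render(refs))
print(resolve(refs, "e3"))
```

The design choice that matters is that the action is expressed against a named element rather than a coordinate, so there is nothing to locate visually and a bad reference fails loudly instead of clicking the wrong spot.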

But the accessibility tree has hard boundaries. Canvas elements are opaque to it. A chart, a custom editor, a game interface: no tree entries at all. Playwright MCP includes a vision mode specifically for pages with incomplete accessibility trees, like canvas apps and custom-rendered UIs. Missing ARIA labels create a subtler failure. An icon-only button with no aria-label appears in the tree but has no name. The agent finds the element but can't identify what it does. Dynamic content that changes state without proper aria-expanded attributes leaves the agent unable to tell whether its last action worked.
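Two of these failures are easy to check for once a snapshot is in hand. The following sketch assumes a snapshot is a list of dictionaries with "ref", "role", "name", and an optional "expanded" key; that shape, the role list, and both function names are hypothetical and chosen only for illustration.

```python
# Hypothetical subset of roles an agent would want to act on.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def unnamed_controls(nodes):
    """Return interactive nodes whose accessible name is missing or blank."""
    return [
        node
        for node in nodes
        if node["role"] in INTERACTIVE_ROLES and not (node.get("name") or "").strip()
    ]


def toggle_confirmed(before, after):
    """Report whether a toggle took effect, based only on the tree.

    Returns True or False when the expanded state is exposed in both
    snapshots, and None when it is not, meaning the agent cannot tell.
    """
    if "expanded" not in before or "expanded" not in after:
        return None
    return before["expanded"] != after["expanded"]


snapshot = [
    {"ref": "e1", "role": "heading", "name": "Settings"},
    {"ref": "e2", "role": "button", "name": ""},       # icon-only, no aria-label
    {"ref": "e3", "role": "button", "name": "Save"},
    {"ref": "e4", "role": "link"},                     # name absent entirely
]

print([node["ref"] for node in unnamed_controls(snapshot)])

menu_before = {"ref": "e5", "role": "button", "name": "Menu", "expanded": False}
menu_after = {"ref": "e5", "role": "button", "name": "Menu", "expanded": True}
print(toggle_confirmed(menu_before, menu_after))
print(toggle_confirmed({"ref": "e6", "role": "button", "name": "Menu"},
                       {"ref": "e6", "role": "button", "name": "Menu"}))
```

Canvas content cannot be audited this way, because there is nothing in the tree to inspect. That asymmetry is why a check like this can flag weak labeling but cannot replace a vision fallback.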

The empirical anchor

Research from UC Berkeley and the University of Michigan, presented at CHI 2026, found that task success dropped from 78% to 42% when a computer-use agent was constrained to keyboard-only navigation, the mode where agents depend entirely on structured representation rather than visual cues.

When that structure is sparse or poorly labeled, the agent's world shrinks to what the tree exposes. And what the tree exposes depends on markup most teams think about in terms of compliance checklists, not production reliability.

The convergence pattern leads somewhere worth naming. The web's accessibility infrastructure was designed to help people who can't see screens navigate by meaning. Now it's becoming the primary substrate through which software navigates the web too. When infrastructure teams start treating ARIA quality as a production metric for agent reliability, the pressure to improve web accessibility will come from people with different priorities than the disability advocates who built the standards in the first place. The values embedded in that infrastructure, what gets labeled, what gets exposed, what gets left opaque, are propagating into agent architecture in ways nobody specifically decided on. By the time teams notice the connection between their ARIA coverage and their agent success rates, the pattern will already be set.

Things to follow up on...

Accessibility as agent infrastructure: SitePoint covers how the CHI 2026 research and production choices by Atlas, Playwright MCP, and Comet are turning accessible markup into a business-critical development priority.
Token bloat in practice: A practitioner benchmark found that Playwright MCP's standard snapshot mode burned 114K tokens per test, prompting a shift to the newer CLI mode at roughly a quarter of the cost.
Hybrid perception architecture: The arXiv paper on building browser agents describes how combining accessibility tree snapshots with selective vision achieved ~85% success on the WebGames benchmark, compared to ~50% for prior browser agents.
Comet's perception layer reverse-engineered: Zenity Labs published a detailed teardown showing how Perplexity's Comet uses Chrome's accessibility tree with element reference IDs and falls back to on-screen coordinates only when structured data is missing.
