The Accessibility Tree Wasn't Built for AI Agents

Screen readers and LLMs have almost nothing in common, except for one thing: neither of them can see a web page. Both need the page translated into structured, sequential text before they can do anything useful with it.

The accessibility tree was built decades ago to solve that problem for blind users. It strips a page down to its semantic bones: roles, labels, states, hierarchy. A textbox labeled "Email." A button labeled "Sign In." Layout, color, pixel coordinates all drop out. What remains is what things are and what they do.

It turns out that's almost exactly what an LLM needs too. The convergence suggests that, for any non-visual consumer, semantic structure is what carries a page's meaning and makes it navigable. Playwright MCP is the clearest place to watch what happens when you build an entire automation architecture around that bet.

Snapshot, Reason, Act

The pattern is called snapshot-and-ref. When an agent needs to interact with a page, it requests a snapshot: a YAML representation of the page's accessibility tree. Each element comes back with a role, a label, and a unique reference number. That textbox gets [ref=1]. That button gets [ref=3].

The agent reads the tree, decides what to do, and acts on a ref: browser_click({ ref: "3" }). A fresh snapshot comes back with updated refs. The agent reasons again. Snapshot, reason, act, snapshot.

The page is always text, never pixels. And because refs regenerate with each snapshot, the agent works from the current state of the page rather than its memory of what the page looked like a moment ago.
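The loop is easier to see in code. What follows is a minimal sketch, not Playwright MCP's implementation: FakeBrowser is a hypothetical in-memory stand-in for the browser tools, the one-line-per-element snapshot format is a simplification of the real YAML, and the "reasoning" step is a hard-coded lookup where an LLM would normally sit. The refs are plain counters purely for illustration.

```python
import re
from dataclasses import dataclass

# Simplified, assumed snapshot line format:  - role "label" [ref=N]
LINE = re.compile(r'-\s+(?P<role>\w+)\s+"(?P<label>[^"]*)"\s+\[ref=(?P<ref>\w+)\]')


@dataclass
class Node:
    role: str
    label: str
    ref: str


def parse_snapshot(text: str) -> list[Node]:
    """Extract role, label, and ref from each line of a snapshot string."""
    return [Node(**m.groupdict()) for m in LINE.finditer(text)]


class FakeBrowser:
    """Hypothetical stand-in for a browser tool. Every snapshot issues new refs."""

    def __init__(self) -> None:
        self.counter = 0
        self.signed_in = False
        self.live: dict[str, str] = {}   # refs valid for the latest snapshot only
        self.clicked: list[str] = []

    def snapshot(self) -> str:
        if self.signed_in:
            elements = [("heading", "Welcome")]
        else:
            elements = [("textbox", "Email"), ("textbox", "Password"), ("button", "Sign In")]
        self.live = {}
        lines = []
        for role, label in elements:
            self.counter += 1
            ref = str(self.counter)
            self.live[ref] = label
            lines.append(f'- {role} "{label}" [ref={ref}]')
        return "\n".join(lines)

    def click(self, ref: str) -> None:
        if ref not in self.live:
            raise KeyError(f"stale or unknown ref: {ref}")
        self.clicked.append(ref)
        if self.live[ref] == "Sign In":
            self.signed_in = True


def run(browser: FakeBrowser, max_steps: int = 5) -> list[str]:
    """Snapshot, reason, act, repeat until the Sign In button is gone."""
    for _ in range(max_steps):
        nodes = parse_snapshot(browser.snapshot())            # snapshot
        target = next((n for n in nodes
                       if n.role == "button" and n.label == "Sign In"), None)  # reason
        if target is None:
            return [n.label for n in nodes]                   # nothing left to click
        browser.click(target.ref)                             # act
    raise RuntimeError("goal not reached within max_steps")
```

Because the fake browser only accepts refs from its latest snapshot, an agent that tries to reuse a ref from an earlier snapshot gets an error instead of clicking the wrong element. That is the property the regenerating refs are meant to provide.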

What the Representation Costs

A typical page snapshot runs roughly 500 to 5,000 tokens of structured text. The same page as a screenshot, processed through a vision model, runs 10,000 to 50,000 tokens. Even where the two ranges come closest (a 5,000-token snapshot against a 10,000-token screenshot), the snapshot costs half as much; at the other extreme it costs a hundredth. The gap compounds with every action in a multi-step workflow. For workflows that touch dozens of pages across hundreds of runs, this is the difference between a viable operating cost and an untenable one.
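To make the compounding concrete, here is a rough back-of-the-envelope sketch using the token ranges above. The workload (30 steps per run, 200 runs) is a made-up example, and the model is deliberately crude: it counts one observation per step and ignores prompt history, caching, and output tokens.

```python
# Token ranges quoted in the text: (low, high) per page observation.
SNAPSHOT_TOKENS = (500, 5_000)
SCREENSHOT_TOKENS = (10_000, 50_000)


def workflow_tokens(per_step: int, steps: int, runs: int) -> int:
    """Total observation tokens if every step costs `per_step` tokens."""
    return per_step * steps * runs


def ratio_range() -> tuple[float, float]:
    """Smallest and largest screenshot-to-snapshot cost ratios the ranges allow."""
    narrowest = SCREENSHOT_TOKENS[0] / SNAPSHOT_TOKENS[1]   # cheapest screenshot vs priciest snapshot
    widest = SCREENSHOT_TOKENS[1] / SNAPSHOT_TOKENS[0]      # priciest screenshot vs cheapest snapshot
    return narrowest, widest


# Hypothetical workload, chosen only for illustration.
STEPS, RUNS = 30, 200
worst_case_snapshots = workflow_tokens(SNAPSHOT_TOKENS[1], STEPS, RUNS)
best_case_screenshots = workflow_tokens(SCREENSHOT_TOKENS[0], STEPS, RUNS)
```

Under these assumptions the most expensive snapshots total 30 million tokens and the cheapest screenshots total 60 million, so the least favorable comparison for snapshots still comes out at half the cost.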

Serialization choices compound too

One developer found that small decisions, like whether to assign refs to every element or only to interactive ones, produced 51–79% variation in token cost on real pages.

The representation you choose matters, and how you serialize it matters almost as much.
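A toy serializer shows where that variation comes from. This is a sketch under stated assumptions: AXNode, the set of interactive roles, and the output format are all hypothetical, and character count is used as a crude proxy for tokens. The numbers it produces on a five-element page say nothing about the 51–79% figure measured on real pages.

```python
from dataclasses import dataclass, field

# Assumed set of roles an agent can act on; a real serializer would use a fuller list.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


@dataclass
class AXNode:
    role: str
    name: str = ""
    children: list["AXNode"] = field(default_factory=list)


def serialize(root: AXNode, refs_for_all: bool) -> str:
    """Emit one indented line per node, attaching refs to all or only interactive nodes."""
    lines: list[str] = []
    next_ref = 1

    def visit(node: AXNode, depth: int) -> None:
        nonlocal next_ref
        line = "  " * depth + f"- {node.role}"
        if node.name:
            line += f' "{node.name}"'
        if refs_for_all or node.role in INTERACTIVE_ROLES:
            line += f" [ref={next_ref}]"
            next_ref += 1
        lines.append(line)
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines)


page = AXNode("main", children=[
    AXNode("heading", "Sign in"),
    AXNode("paragraph", "Use your work account."),
    AXNode("textbox", "Email"),
    AXNode("textbox", "Password"),
    AXNode("button", "Sign In"),
])

full = serialize(page, refs_for_all=True)    # every node gets a ref
lean = serialize(page, refs_for_all=False)   # only actionable nodes get a ref
saving = 1 - len(lean) / len(full)           # fraction of characters saved
```

The lean form drops refs the agent could never act on, so the saving grows with the share of non-interactive nodes on the page.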

Where the Bet Breaks

The accessibility tree is only as good as the markup behind it. WebAIM's 2026 analysis of one million home pages found 95.9% had detectable WCAG failures, averaging 56.1 errors per page. Pages with ARIA present actually had more errors on average than pages without it. The tree an agent receives is often incomplete, mislabeled, or structurally misleading.

Then there are cases where no tree exists at all. Canvas-rendered applications such as Figma, and WebGL scenes generally, expose nothing meaningful to the accessibility tree. The agent sees the toolbar buttons (those are standard HTML), but the canvas itself is blank. Shadow DOM components in modern design systems can stop tree traversal at their boundaries, making visually present elements invisible to the agent.

These are the conditions under which the accessibility tree has always been fragile. Screen reader users have navigated around these gaps for years, building workarounds and institutional knowledge about which sites work and which don't. Now agents inherit the same limitations, for the same reasons, because they're built on the same substrate. Agent capability on any given page is bounded by accessibility decisions made for a completely different kind of user, often decades ago, by developers who had no idea this second consumer was coming.

Choosing how your agent perceives a page determines what it can act on and what it silently misses. Anyone building on the accessibility tree should know where both boundaries sit.
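Some of those gaps can be spotted before the agent acts. The sketch below walks a parsed tree and flags two of the failure modes described in this section: interactive elements with no accessible name, and a canvas with nothing beneath it. AXNode and find_blind_spots are hypothetical names, and the role string "canvas" is an assumption about how a given serializer labels that element. The check cannot detect content that never reaches the tree in the first place.

```python
from dataclasses import dataclass, field

# Assumed set of actionable roles, for illustration only.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


@dataclass
class AXNode:
    role: str
    name: str = ""
    children: list["AXNode"] = field(default_factory=list)


def find_blind_spots(root: AXNode) -> list[str]:
    """Return human-readable warnings for nodes an agent is likely to misread."""
    problems: list[str] = []

    def visit(node: AXNode, path: str) -> None:
        here = f"{path}/{node.role}"
        if node.role in INTERACTIVE_ROLES and not node.name.strip():
            problems.append(f"{here}: interactive element has no accessible name")
        if node.role == "canvas" and not node.children:
            problems.append(f"{here}: canvas exposes no children")
        for child in node.children:
            visit(child, here)

    visit(root, "")
    return problems


page = AXNode("main", children=[
    AXNode("button", "Sign In"),
    AXNode("button", ""),      # e.g. an icon-only button with no label
    AXNode("canvas"),          # e.g. a drawing surface with no exposed content
])
problems = find_blind_spots(page)
```

A non-empty result is a signal that the snapshot alone may not be enough for this page, which is worth knowing before the agent commits to an action.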

Things to follow up on...

MCP vs CLI tokens: The Playwright team benchmarked the same browser automation task at roughly 114,000 tokens through MCP versus 27,000 through the CLI. MCP streams the full accessibility tree into context at every step, while the CLI saves snapshots to disk and lets the agent read only what it needs.
Web accessibility is getting worse: WebAIM's 2026 data shows page complexity up 22.5% year over year and ARIA usage up 27%, but pages with ARIA present averaged more errors than pages without it, which means the substrate agents depend on is degrading even as adoption grows.
Snapshot noise from hidden elements: A filed GitHub issue documents how Playwright MCP's snapshots include elements behind modal dialogs and off-screen content as valid refs, causing agents to attempt interactions with elements they can't actually reach.
Format choices compound quietly: A head-to-head comparison of accessibility tree serialization approaches found that decisions like ref assignment scope and attribute inclusion produced 51–79% token cost variation on the same pages, suggesting the serialization layer deserves as much attention as the perception layer itself.