The Web Wasn't Built for This

A single form field on a modern web page can require more than a hundred lines of HTML. Wrapper divs nest inside wrapper divs. Styling hooks multiply. ARIA attributes scatter across the markup. To a person with a browser, it's a text box. To a machine trying to operate that page, it's a thicket of signals with no obvious hierarchy, the semantically important element buried under layers of presentational scaffolding that a visual cortex would simply ignore.

This is the gap Magnus Müller walked into when he started building Browser Use. The founding question, as he's described it, was disarmingly simple: how hard could it be to connect LLMs to the web? The early answer looked like a familiar automation problem. CSS selectors break when a designer renames a class. XPath expressions shatter when the DOM restructures. Every web-scraping practitioner knows this frustration. But Müller's trajectory suggests he hit something deeper than a tooling problem. The breakages traced back to a structural mismatch: the web encodes meaning through spatial arrangement, visual convention, and contextual cues that assume a human interpreter. No selector strategy, however clever, can bridge that gap, because the mismatch is baked into how the web was made.

His most consequential design choice was to reject vision-based navigation, the approach that seems most natural: giving an AI a screenshot and letting it figure things out the way a person would.

"A lot of agents rely on vision-based systems and try and navigate websites through screenshots, and in that process, things break."

— Magnus Müller, TechCrunch, March 2025

Instead, Browser Use converts the DOM into a structured, indexed representation. Elements get numbered. The semantic skeleton is preserved. The visual layout, the pixel-level rendering that makes a page legible to a human: discarded.
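What that conversion might look like can be sketched in a few lines. This is an illustration of the general idea only, not Browser Use's actual code: the tag list, the attribute list, the output format, and every name in it (`InteractiveIndexer`, `index_page`, `render`, `SAMPLE`) are assumptions made for the example, and it uses nothing beyond Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumption, not Browser Use's list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Attributes kept because they carry meaning rather than styling (also an assumption).
SEMANTIC_ATTRS = ("type", "name", "placeholder", "aria-label", "href", "value")
# Elements that never have a closing tag, so they never collect inner text.
VOID_TAGS = {"input"}


class InteractiveIndexer(HTMLParser):
    """Collects interactive elements and numbers them in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per interactive element
        self._open = []     # interactive elements still collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return  # wrapper divs, spans, and other scaffolding are dropped
        attr_map = dict(attrs)
        element = {
            "index": len(self.elements),
            "tag": tag,
            "attrs": {k: attr_map[k] for k in SEMANTIC_ATTRS if attr_map.get(k)},
            "text": "",
        }
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text inside presentational children still belongs to the open element.
        if self._open:
            self._open[-1]["text"] += data


def index_page(html):
    """Return the numbered interactive elements found in an HTML string."""
    parser = InteractiveIndexer()
    parser.feed(html)
    parser.close()
    return parser.elements


def render(elements):
    """Format the elements as the compact text a language model would be shown."""
    lines = []
    for el in elements:
        attrs = " ".join(f'{k}="{v}"' for k, v in el["attrs"].items())
        opening = f"<{el['tag']} {attrs}>" if attrs else f"<{el['tag']}>"
        text = " ".join(el["text"].split())  # collapse layout whitespace
        lines.append(f"[{el['index']}] {opening}{text}")
    return "\n".join(lines)


# A made-up form fragment with typical wrapper and styling noise.
SAMPLE = """
<div class="form-row"><div class="field-wrap css-1x2y3z">
  <label for="email">Email</label>
  <input id="email" class="input input--lg" type="email" name="email"
         placeholder="you@example.com">
</div></div>
<div class="actions">
  <button class="btn btn-primary" type="submit"><span>Sign in</span></button>
</div>
"""

if __name__ == "__main__":
    print(render(index_page(SAMPLE)))
```

The model never sees the wrapper divs or class names. It sees two numbered elements and can answer with something as small as "type into 0, click 1", which is the point of indexing.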

This works because it sidesteps a translation problem vision agents face on every page load. Humans infer structure from appearance effortlessly. We have spent our entire lives learning that menus live at the top, that buttons look clickable, that grayed-out text means unavailable. A vision model has to rediscover these conventions from pixels, every time, on pages it has never seen. DOM-driven approaches outperform vision-driven ones by twelve to seventeen percentage points on common tasks. The gap reflects how much work the translation layer has to do before reasoning even begins.

But the web doesn't just resist spatial translation. It resists temporal stability. Müller has pointed to LinkedIn as a site that "changes the way the website works all the time, so agents often fail on sites like those." Every redesign is an implicit assertion that the previous arrangement was wrong. This is the normal condition of the web. Browser Use's response is a learning pattern: the first time an agent encounters a site, it navigates and maps the structure; subsequent runs execute against what it learned. The architecture treats instability as something to absorb and adapt to.
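The learn-then-replay pattern described above can be sketched as a small cache with a fallback. Everything here is hypothetical and written for illustration: `LearnedRoutes`, `StepFailed`, the string step format, and the `explore` and `execute` callables stand in for whatever Browser Use does internally, which the source does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Step = str  # e.g. "click:apply-v1"; the step format is invented for this sketch


class StepFailed(Exception):
    """Raised by the executor when a recorded step no longer matches the page."""


@dataclass
class LearnedRoutes:
    # Slow path: the agent navigates the live site and returns the steps it took.
    explore: Callable[[str, str], List[Step]]
    # Fast path: perform one previously recorded step, or raise StepFailed.
    execute: Callable[[Step], None]
    routes: Dict[Tuple[str, str], List[Step]] = field(default_factory=dict)

    def run(self, site: str, task: str) -> str:
        key = (site, task)
        if key in self.routes:
            try:
                for step in self.routes[key]:
                    self.execute(step)
                return "replayed"
            except StepFailed:
                # The site changed underneath the recording: forget it, relearn.
                del self.routes[key]
        self.routes[key] = self.explore(site, task)
        return "explored"


if __name__ == "__main__":
    layout = {"version": 1}  # stands in for a site that gets redesigned

    def explore(site: str, task: str) -> List[Step]:
        return [f"click:apply-v{layout['version']}"]

    def execute(step: Step) -> None:
        if not step.endswith(f"v{layout['version']}"):
            raise StepFailed(step)

    agent = LearnedRoutes(explore, execute)
    print(agent.run("example.com", "apply"))  # first visit
    print(agent.run("example.com", "apply"))  # second visit
    layout["version"] = 2                     # the redesign
    print(agent.run("example.com", "apply"))  # recording is stale
```

The design choice worth noticing is that a redesign is not treated as an error. A stale recording just sends the agent back down the slow path once, after which the fast path works again.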

When Browser Use achieved 97% on Online-Mind2Web, the revealing detail was where the remaining failures came from. Products had gone out of stock since the benchmark was authored. Prompts had become ambiguous as sites changed underneath them. Two tasks were structurally impossible for any browser agent, requiring image editing and chess-piece interaction outside the browser's medium entirely. The failures belonged to the environment.

Müller himself has noted that current benchmarks still can't capture tasks like extracting a thousand products across platforms, because the compositional demands of real enterprise use exceed what evaluation frameworks were designed to measure. The trajectory from hobbyist frustration to that observation traces where the real complexity lives. Smarter agents keep arriving. The harder problem persists underneath them: the space between a human artifact that was never meant to be read by machines and the infrastructure required to read it anyway, reliably, at scale, on a surface that won't hold still.

Things to follow up on...

Google's WebMCP origin trial: Chrome 149 introduces an experimental standard that lets websites expose structured tools directly to browser agents, a move to retrofit the web with the machine-readable layer it never had.
The benchmark's own fragility: Online-Mind2Web researchers found that a naive search agent scores 51% on WebVoyager simply by Googling answers, which prompted the harder benchmark Browser Use ultimately topped.
Stagehand's self-healing alternative: Rather than giving the LLM full browser control, Stagehand limits AI to targeted actions with Zod schema validation, re-engaging the model only when cached DOM mappings break.
The hybrid architecture emerging: Production teams are converging on a split where stable, high-frequency workflows stay in deterministic Playwright scripts while dynamic interfaces get AI browser agents, treating the two as complementary layers rather than competing approaches.