Between Complete and Correct

Browser Use's CDP migration and unsolved replay problem trace the gap between agents that finish tasks and agents that finish them correctly.

A browser tab looks like a single thing. It isn't. Under the surface, a tab is a collection of targets: a root document, cross-origin iframes, service workers, each with their own frames and nodes. Test automation frameworks like Playwright were built to hide that complexity. That abstraction is useful for QA scripts, where the test author knows the page and the assertion in advance. Agents navigating live websites need the opposite: visibility into what's actually happening inside the browser, especially when things go wrong.
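
To make the "collection of targets" idea concrete, here is a minimal Python sketch that groups target descriptions by type. The dictionaries imitate the shape of CDP's TargetInfo objects (targetId, type, url), but the sample values and the helper name group_targets are invented for illustration and are not taken from Browser Use's code.

```python
from collections import defaultdict

# Hypothetical sample data shaped like CDP TargetInfo objects.
# The IDs and URLs are invented for illustration.
SAMPLE_TARGETS = [
    {"targetId": "T1", "type": "page", "url": "https://shop.example/"},
    {"targetId": "T2", "type": "iframe", "url": "https://ads.example/frame"},
    {"targetId": "T3", "type": "iframe", "url": "https://pay.example/widget"},
    {"targetId": "T4", "type": "service_worker", "url": "https://shop.example/sw.js"},
]


def group_targets(targets):
    """Return {target type: [urls]} so the pieces behind one tab become visible."""
    grouped = defaultdict(list)
    for target in targets:
        grouped[target["type"]].append(target["url"])
    return dict(grouped)


if __name__ == "__main__":
    for kind, urls in group_targets(SAMPLE_TARGETS).items():
        print(f"{kind}: {len(urls)} target(s)")
```

What a user sees as one tab shows up here as four separate entries, which is the complexity a test framework normally hides.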

Gregor Zunic and Magnus Müller, both ETH Zurich data science graduates who co-founded Browser Use, built one of the fastest-growing open-source browser agent frameworks on top of Playwright. Then, in August 2025, they tore it out and rebuilt on raw CDP, the Chrome DevTools Protocol. Their technical post catalogues the specific failure modes that forced the decision: GPU process crashes, out-of-memory page renderer crashes, spinlocks from infinite loops in page JavaScript, race conditions when popups arrive before a target is activated.

"Playwright handled about half of these well, and presented impassible barrier to solving the other half."

What CDP gave them was events. Their new architecture subscribes to CDP signals through watchdog services: one monitors file downloads triggered by any method, another catches page crashes from a single event stream. Before the migration, the agent only updated its picture of the world between actions. After, it could observe changes as they happened. It is the difference between issuing commands into a browser and actually watching what the browser does.
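
The watchdog pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Browser Use's implementation: the class names EventBus, CrashWatchdog, and DownloadWatchdog are hypothetical, the message stream is simulated rather than read from a real browser, and the choice of the CDP events Target.targetCrashed and Browser.downloadWillBegin is a guess at what such watchdogs might listen for.

```python
import json
from collections import defaultdict


class EventBus:
    """Routes CDP-style event messages to subscribed handlers (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, method, handler):
        self._handlers[method].append(handler)

    def dispatch(self, raw_message):
        message = json.loads(raw_message)
        # Messages carrying an "id" are replies to commands, not events.
        if "id" in message:
            return
        for handler in self._handlers.get(message.get("method"), []):
            handler(message.get("params", {}))


class CrashWatchdog:
    """Records every target that reports a crash."""

    def __init__(self, bus):
        self.crashed_targets = []
        bus.on("Target.targetCrashed", self._on_crash)

    def _on_crash(self, params):
        self.crashed_targets.append(params.get("targetId"))


class DownloadWatchdog:
    """Records every download, however it was triggered."""

    def __init__(self, bus):
        self.downloads = []
        bus.on("Browser.downloadWillBegin", self._on_download)

    def _on_download(self, params):
        self.downloads.append(params.get("suggestedFilename"))


def run_demo():
    bus = EventBus()
    crashes = CrashWatchdog(bus)
    downloads = DownloadWatchdog(bus)
    # Simulated stream; the payload values are invented for illustration.
    simulated = [
        json.dumps({"method": "Browser.downloadWillBegin",
                    "params": {"guid": "g-1",
                               "url": "https://files.example/report.pdf",
                               "suggestedFilename": "report.pdf"}}),
        json.dumps({"id": 7, "result": {}}),  # command reply, ignored
        json.dumps({"method": "Target.targetCrashed",
                    "params": {"targetId": "T2", "status": "crashed",
                               "errorCode": 1}}),
    ]
    for raw in simulated:
        bus.dispatch(raw)
    return crashes.crashed_targets, downloads.downloads


if __name__ == "__main__":
    print(run_demo())
```

Neither watchdog waits for the agent's next action. Each one updates its state the moment an event arrives, which is the shift from polling between steps to observing continuously.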

Zunic's later writing pushes the logic further. LLMs, he argues, already know CDP because they were trained on millions of tokens of it, and every helper function between the model and the protocol is a constraint the model works around. The deeper point is epistemological: when the agent speaks the browser's native protocol, it sees raw state rather than a framework's summary of state. Whether that additional transparency helps the agent verify its own actions, or just gives it more data to narrate a confident but wrong completion, is a question the CDP migration opens without answering.

An agent can now observe more of what happens inside a browser. Observation and verification, though, are different operations. The agent that observes a page, finds a matching element, and acts on it has observed everything available to it. If it found the wrong element, the observation was fine. The check was never performed.
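
The distinction can be shown with a toy example in Python. Everything here is hypothetical: the Expected fields, the candidate items, and the two act_* functions stand in for whatever an agent framework would actually use. The sketch only illustrates that observing candidates and verifying one against the request are separate steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expected:
    """What the task asked for (hypothetical fields)."""
    name_contains: str
    size: str


def matches(observed: dict, expected: Expected) -> bool:
    """Postcondition: does the observed item satisfy every requested attribute?"""
    name_ok = expected.name_contains.lower() in observed.get("name", "").lower()
    size_ok = observed.get("size") == expected.size
    return name_ok and size_ok


def act_without_check(candidates):
    """Observation only: take the first element that looks plausible."""
    return candidates[0]


def act_with_check(candidates, expected):
    """Observation plus verification: take the first element that passes the check."""
    for candidate in candidates:
        if matches(candidate, expected):
            return candidate
    return None  # Nothing verified; reporting failure beats guessing.


# Invented page contents: two near-identical items.
CANDIDATES = [
    {"name": "USB-C Cable", "size": "1 m"},
    {"name": "USB-C Cable", "size": "2 m"},
]
WANTED = Expected(name_contains="usb-c cable", size="2 m")

if __name__ == "__main__":
    print(act_without_check(CANDIDATES))       # the 1 m cable: plausible, wrong
    print(act_with_check(CANDIDATES, WANTED))  # the 2 m cable: verified
```

Both functions see exactly the same page. Only the second one compares what it found against what was asked for.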

An agent asked to add a specific sneaker to an Amazon cart navigated successfully, found a sneaker, added it, and reported back: "Successfully navigated through Amazon." The item it added was the wrong shoe. Every step completed. The task failed. Browser Use's own benchmark methodology uses an LLM judge with binary verdicts because rubric-based scoring produces middling scores for both successes and failures. Against 200 hand-labeled traces, the judge agrees with humans 87% of the time.

Thirteen percent of runs may receive the wrong verdict. That number sits quietly in the methodology, doing a lot of work.
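
The arithmetic is worth spelling out. The Python sketch below uses the two figures from the methodology (200 labeled traces, 87% agreement) and adds two assumptions of its own: that every disagreement counts as a wrong verdict, and that the same error rate carries over to unlabeled runs. The reported success rate of 80% is a made-up input, not a Browser Use result.

```python
LABELED_TRACES = 200     # hand-labeled traces, from the methodology
AGREEMENT_PERCENT = 87   # judge-human agreement, in percent


def disagreements(total: int, agreement_percent: int) -> int:
    """Number of traces on which judge and human labels differ."""
    agreed = total * agreement_percent // 100
    return total - agreed


def plausible_range(reported_percent: int, agreement_percent: int):
    """Crude bounds on the true success rate, assuming every disagreement
    is a wrong verdict and all errors could lean in one direction."""
    error = 100 - agreement_percent
    low = max(0, reported_percent - error)
    high = min(100, reported_percent + error)
    return low, high


if __name__ == "__main__":
    print(disagreements(LABELED_TRACES, AGREEMENT_PERCENT))  # 26
    # Hypothetical reported success rate of 80%.
    print(plausible_range(80, AGREEMENT_PERCENT))            # (67, 93)
```

Under those assumptions, 26 of the 200 labeled traces got a verdict a human would dispute, and a headline score could sit anywhere in a band 26 points wide.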

On Browser Use's hiring page, among the open problems listed for engineering interns, four words appear without elaboration: "rerun agent workflows deterministic." No roadmap. Just the problem, named and unsolved. Müller had told TechCrunch earlier that their approach lets agents "run the same tasks again and again." The distance between that aspiration and those four words on the hiring page contains most of the hard work. The web resists replay. Pages change between runs. A/B tests serve different variants. Session state expires. As Müller himself noted, sites like LinkedIn "change the way the website works all the time, so agents often fail." Recording a workflow perfectly doesn't help when the surface it ran against no longer exists.
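
A toy Python model shows why a perfect recording is not enough. It treats a page as nothing more than a set of CSS selectors, which real pages are not, and the selector names, the step format, and the replay function are all hypothetical. The point is only that the recording is identical in both runs while the surface underneath it differs.

```python
def missing_selectors(recording, current_selectors):
    """Selectors the recording depends on that the current page lacks."""
    return [step["selector"] for step in recording
            if step["selector"] not in current_selectors]


def replay(recording, current_selectors):
    """Replay only if every recorded selector still exists on the page."""
    missing = missing_selectors(recording, current_selectors)
    if missing:
        return {"status": "aborted", "missing": missing}
    return {"status": "replayed", "steps_run": len(recording)}


# A workflow recorded against the original page (invented selectors).
RECORDING = [
    {"action": "type", "selector": "#search"},
    {"action": "click", "selector": "#add-to-cart"},
]

PAGE_AT_RECORD_TIME = {"#search", "#add-to-cart", "#checkout"}
PAGE_AB_VARIANT = {"#search", "#buy-now", "#checkout"}  # button was renamed

if __name__ == "__main__":
    print(replay(RECORDING, PAGE_AT_RECORD_TIME))
    print(replay(RECORDING, PAGE_AB_VARIANT))
```

The second call aborts because "#add-to-cart" no longer exists. A replay system that skipped this check would click nothing, or the wrong thing, and could still report that every step ran.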

The CDP migration and the deterministic replay problem address complementary needs: real-time visibility into what the browser is doing right now, and reproducibility of what the agent did before. Together they sketch a verification challenge that neither fully resolves. Even with both, the gap between "task completed" and "task completed correctly" persists. Research on web agent benchmarks has found that reported completion rates overstate actual success by more than a third once policy compliance is measured. The agent completed. Whether it was correct remains, by the founders' own framing, an open question.

Things to follow up on...

Building verifiers from scratch: Microsoft Research's April 2026 paper on building verifiers for computer use agents found that prior verification approaches like WebVoyager and WebJudge had false positive rates of 45% and 22% respectively, frequently accepting incorrect completions as successes.
Deterministic replay is harder than it sounds: A practitioner deep-dive on debugging non-deterministic AI agents explains why temperature=0 doesn't produce determinism and why replay systems must intercept everything from token sampling to system clock calls.
Silent failure as a pattern: MindStudio's taxonomy of agent failure modes identifies silent failure as the costliest category, with examples including document processing agents that skip unparseable records and return summaries missing 20% of source data.
The field's own practitioner assessment: Michael Livs's May 2026 state of browser agents summarizes the landscape as "benchmarks claiming 97% on tasks you wouldn't trust an intern with, and a prompt injection war that nobody is winning."
