Two Kinds of Debt

The Operational Inheritance

Two decades of browser automation taught engineers to distrust the gap between command and outcome. Web agents inherited the gap and every failure mode it produces. The skepticism didn't transfer.

By Rina Takahashi, June 11, 2026

In June 2009, Simon Stewart published a post on the Google Testing Blog titled "My Selenium Tests Aren't Stable!" The comment thread became its own practitioner document. Engineers described tests where waitForElement succeeded and the very next action failed to locate the same element. Tests passed on two machines and failed on two others for no discernible reason. This was five years after Selenium's creation, and the community was already describing a problem without a solution.

The problem was structural. Selenium's own documentation describes it plainly: sometimes the browser reaches the right state first and things work as intended; sometimes the automation code executes first and things don't. A race condition between the script and the page.

What followed was a decade of workarounds. Thread.sleep() came first: pause execution for a fixed duration, hope the page catches up. The duration was always either too short or too long. Implicit waits came next, a global setting that polled for element presence. But presence and interactability are different things; an element can exist in the DOM while an animation overlay covers it completely.
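The difference between a fixed sleep and a polling wait can be sketched without a browser. What follows is a minimal Python model, not Selenium's implementation: `FakePage`, `ready_after`, `fixed_sleep`, and `wait_until` are hypothetical names invented for this illustration, and a tick counter stands in for the clock so the result is deterministic.

```python
class FakePage:
    """Hypothetical stand-in for a page whose element appears after some ticks."""

    def __init__(self, ready_after):
        self.ready_after = ready_after  # tick at which the element appears
        self.now = 0                    # simulated clock, in ticks

    def advance(self, ticks):
        self.now += ticks

    def element_present(self):
        return self.now >= self.ready_after


def fixed_sleep(page, duration):
    """Thread.sleep() style: wait a fixed time, then check exactly once."""
    page.advance(duration)
    return page.element_present(), page.now


def wait_until(page, timeout, interval=1):
    """Polling style: re-check every `interval` ticks, give up at `timeout`."""
    while True:
        if page.element_present():
            return True, page.now
        if page.now >= timeout:
            return False, page.now
        page.advance(interval)


# A 5-tick sleep is too short for a page that needs 7 ticks ...
too_short = fixed_sleep(FakePage(ready_after=7), duration=5)
# ... and wasteful for a page that needs only 2.
too_long = fixed_sleep(FakePage(ready_after=2), duration=5)

# Polling adapts to both pages under the same 10-tick budget.
polled_slow = wait_until(FakePage(ready_after=7), timeout=10)
polled_fast = wait_until(FakePage(ready_after=2), timeout=10)
```

In this model the fixed sleep either misses the element or burns three extra ticks, while polling returns as soon as the element exists. It still answers only the presence question, so a covered element would pass it just the same.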

Explicit waits with WebDriverWait were the more surgical response. Engineers wrote code that waited for specific conditions before acting. Wait until the button is clickable, then click it. This was genuinely closer to a solution. It worked well enough that large test suites came to depend on it. And then practitioners discovered its blind spot: the wait checks one element. It confirms the button is clickable. It has no awareness of the modal still animating into view, the invisible overlay about to intercept the click. One practitioner described the gap precisely:

"WebDriverWait says 'the button is clickable.' Reality says 'there's an invisible overlay from an animation that will intercept your click.'"

So teams added retry logic on top. Re-run failed tests automatically. Retries masked flakiness rather than fixing it.

Each workaround encoded another lesson about the gap between issuing a command and the command doing what was intended.
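The masking effect of retries is easy to put numbers on. As a rough illustration, assume a test that fails one run in five because of a race, and assume attempts are independent. Both the 20% figure and the independence assumption are invented for this example, not measured from any suite.

```python
def observed_pass_rate(failure_rate, attempts):
    """Chance that at least one of `attempts` independent runs passes."""
    return 1 - failure_rate ** attempts


# Hypothetical flaky test: fails 20% of the time on any single run.
single_run = observed_pass_rate(0.20, attempts=1)    # no retries
with_retries = observed_pass_rate(0.20, attempts=3)  # first run plus 2 retries

print(f"single run: {single_run:.1%}, with retries: {with_retries:.1%}")
```

Under those assumptions the dashboard moves from 80% green to roughly 99% green, while the underlying race still fires on one run in five.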

When Microsoft released Playwright in 2020, it formalized what this community had learned the hard way. Before executing any action, Playwright runs a series of actionability checks: Is the element attached to the document? Visible? Stable, meaning not mid-animation? Enabled? And critically: does it receive events? That last check directly addresses the failure mode that explicit-wait engineers kept encountering. Playwright asks a subtler question than whether an element is clickable. It asks whether the click will arrive. If the checks don't pass, it throws a TimeoutError with a diagnostic message. It fails loudly.
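The distance between "is clickable" and "receives events" can be shown with a toy hit test. This is a simplified model in plain Python, not Playwright's or Selenium's actual implementation: the `Box` class, its fields, and both check functions are hypothetical, and real browsers resolve hit targets with far more detail (transforms, scrolling, CSS pointer-events).

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical page element: a rectangle with a stacking order."""
    name: str
    x: float
    y: float
    width: float
    height: float
    z: int = 0
    displayed: bool = True  # takes part in layout, even if fully transparent
    enabled: bool = True

    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)

    def contains(self, px, py):
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def looks_clickable(target):
    """Single-element check: inspects only the target itself."""
    return target.displayed and target.enabled


def receives_events(target, page):
    """Hit test: is the target the topmost box at its own click point?"""
    px, py = target.center()
    hits = [b for b in page if b.displayed and b.contains(px, py)]
    if not hits:
        return False
    topmost = max(hits, key=lambda b: b.z)
    return topmost is target


button = Box("submit", x=100, y=200, width=80, height=30, z=1)
# Transparent to the eye, but still stacked above the button.
overlay = Box("animation overlay", x=0, y=0, width=800, height=600, z=10)
page = [button, overlay]

clickable = looks_clickable(button)                   # True
arrives = receives_events(button, page)               # False: overlay wins
arrives_after_dismiss = receives_events(button, [button])  # True
```

The single-element check passes in both situations. Only the hit test, which looks at everything stacked over the click point, tells them apart.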

Web agents operate on this same substrate. They navigate the same dynamic DOMs, encounter the same overlays, face the same race conditions. A Microsoft Research study found that even when a screenshot clearly shows a pop-up, and even when the DOM contains a clickable close button, agents cannot infer they must dismiss the overlay before proceeding.

The Agent-E project's failure analysis found something sharper. In more than half of failed tasks, the agent was self-aware that something went wrong. But the remaining failures were what the researchers called "oblivious." DOM elements reset silently. Dates reverted to incorrect formats. The agent didn't notice. It reported success.

A Selenium test that clicks a covered button gets an error, and a human investigates. A web agent that clicks a covered button may receive no error signal at all. It proceeds to the next step and reports "task complete."
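The contrast can be sketched as the difference between trusting a command and reading back its outcome. The code below is a hypothetical illustration in plain Python, not taken from Agent-E or any agent framework: `FlakyDateField`, `StepFailed`, and `act_and_verify` are invented names, and the field's behavior (silently dropping input it cannot parse) is a deliberate simplification of the silent resets described above.

```python
class StepFailed(Exception):
    """Raised when the page does not show what the action was meant to produce."""


class FlakyDateField:
    """Hypothetical form field that silently discards input it cannot parse."""

    def __init__(self):
        self.value = ""

    def fill(self, text):
        # Only ISO-style dates (YYYY-MM-DD) stick; anything else is dropped
        # without raising, which models the "oblivious" case.
        if len(text) == 10 and text[4] == "-" and text[7] == "-":
            self.value = text


def act_and_verify(action, read_back, expected):
    """Run the action, then confirm the outcome instead of assuming it."""
    action()
    actual = read_back()
    if actual != expected:
        raise StepFailed(f"expected {expected!r}, page shows {actual!r}")
    return actual


field = FlakyDateField()

# Trusting the command: no exception is raised, so the step looks successful.
field.fill("06/11/2026")
trusted_result = "task complete"

# Checking the outcome: the same step now fails loudly.
try:
    act_and_verify(lambda: field.fill("06/11/2026"),
                   lambda: field.value,
                   expected="06/11/2026")
    verified_result = "task complete"
except StepFailed as err:
    verified_result = f"step failed: {err}"
```

Both paths issue the identical command. Only the second one compares the page against the intent, and only the second one produces a signal that something went wrong.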

Twenty years of browser automation produced a community that learned, through painful repetition, to distrust the gap between command and outcome. Playwright encoded that distrust into its architecture. Web agents inherited the substrate where all of this happens, along with every failure mode it produces. The skepticism that took twenty years to develop didn't come along for the ride.
