A QA Engineer Watches History Repeat Itself

Patience Waite is not a real person, though anyone who spent years writing Selenium tests between 2009 and 2015 might feel uncomfortably certain she is. We found her — or assembled her — from the accumulated scar tissue of a generation of QA automation engineers who learned, one flaky test at a time, that the browser does not care about your deadlines. She now consults on reliability engineering. In the past year, three separate clients have asked her to evaluate whether their agent deployments are "production-ready." She agreed to this conversation on the condition that we not call her an "AI expert."

"I'm a person who has been lied to by green checkmarks for fifteen years," she said. "That's my expertise."

You've described your Selenium years as formative. What was the actual texture of that work?

Patience: Okay, so. Imagine you're trying to have a conversation with someone who keeps walking away mid-sentence. That's the browser. You find a button. You reach for it. By the time your hand gets there, the DOM has updated and the button is somewhere else. Or it's the same button but there's an invisible overlay sitting on top of it, so you think you clicked it, but actually you clicked a tooltip.

And your test says "pass."

That was every day. Simon Stewart, the person who created WebDriver, wrote a blog post in 2009 literally titled "My Selenium Tests Aren't Stable!" with an exclamation point.¹ The comments were just... people describing their wounds. Someone wrote about how finding an element and then immediately clicking it would fail because the element vanished between those two operations. Between two lines of code. The kind of thing that makes you question your understanding of sequential execution.

How did you learn to deal with it?

Patience: Badly, at first. Everyone's first instinct is Thread.sleep(5000). Just wait five seconds and hope. It works on your laptop. Then it fails on the CI server because the CI server is slower. Or it passes but your test suite now takes forty minutes, and your build pipeline becomes a place where people go to get coffee and reconsider their career choices.²

Then you discover implicit waits. Then you discover implicit waits interact badly with explicit waits. Then you learn the difference between "element is present" and "element is visible" and "element is clickable," which are three different things that sound like they should be the same thing but absolutely are not.³

Each of those distinctions cost someone a week of debugging. Usually me.
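For readers who never lived through it, the gap between a fixed sleep and a condition-based wait can be sketched without a browser at all. The snippet below is a minimal illustration, not Selenium's API: `wait_until` and `FakeButton` are hypothetical names invented here, and the fake element simply becomes clickable after a short delay while being "present" the whole time.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], Optional[T]],
    timeout: float = 5.0,
    poll_interval: float = 0.05,
) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakeButton:
    """Hypothetical stand-in for an element that is present before it is clickable."""

    def __init__(self, ready_after: float) -> None:
        self._ready_at = time.monotonic() + ready_after
        self.present = True  # in the "DOM" from the very start

    @property
    def clickable(self) -> bool:
        return time.monotonic() >= self._ready_at


button = FakeButton(ready_after=0.2)
start = time.monotonic()
# Wait on the condition the test actually needs (clickable), not mere presence.
ready = wait_until(lambda: button if button.clickable else None, timeout=2.0)
elapsed = time.monotonic() - start
print(f"clickable after {elapsed:.2f}s instead of a fixed 5s sleep")
```

The wait returns as soon as the condition holds and fails loudly when it never does, which is the behavior a fixed sleep cannot offer on either a fast laptop or a slow CI server.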

Tell me about the checkout form.

Patience: Ah. Yeah. Green suite. We shipped. A customer called because the checkout form was eating their payment information and returning a success message even when the payment processor rejected it. Our test checked for the success message. The success message appeared regardless. We were testing the UI's ability to display text, not the system's ability to process a payment.

That was the day I understood that a passing test is a claim, and claims can be false.
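The shape of that checkout bug can be reproduced in a few lines. This is a hedged sketch under invented assumptions: `FakeProcessor`, `buggy_checkout`, and both check functions are hypothetical names, and the processor is hard-coded to reject every charge so the contrast is visible.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProcessor:
    """Hypothetical payment processor that rejects every charge."""

    captured: list = field(default_factory=list)

    def charge(self, amount_cents: int) -> bool:
        return False  # rejected, so nothing is appended to `captured`


def buggy_checkout(processor: FakeProcessor, amount_cents: int) -> str:
    """Reproduces the bug: the result of charge() is ignored."""
    processor.charge(amount_cents)
    return "Payment successful"


def weak_check(message: str) -> bool:
    # What the original test asserted: the UI displayed the success text.
    return message == "Payment successful"


def strong_check(message: str, processor: FakeProcessor, amount_cents: int) -> bool:
    # Also requires evidence that the processor captured the charge.
    return weak_check(message) and amount_cents in processor.captured


processor = FakeProcessor()
message = buggy_checkout(processor, 4999)
print(weak_check(message))                     # True: the green checkmark
print(strong_check(message, processor, 4999))  # False: no payment was captured
```

Both checks run against the same system. Only the one that inspects the processor's state notices that no money moved.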

What changed after that?

Patience: Everything. Before that, I thought my job was making tests pass. After, I understood my job was making tests mean something. Completely different problem. And I was not alone in failing to grasp the distinction early enough.

Most of my team spent their time maintaining existing tests, not writing new ones. Four out of five engineers fixing broken selectors instead of expanding coverage.⁴ We called it automation. It was a different kind of manual work, with fancier error messages.

You've been watching agent demos recently. What do you see?

Patience: [long pause]

I see Playwright.

Which — look, Playwright is genuinely good. Before it clicks anything, it checks: is the element stable? Visible? Enabled? Is it actually the thing that would receive the click event at that coordinate, or is something sitting on top of it?⁵ Every single one of those checks exists because Selenium didn't do it and someone shipped a bug. I know what each of those checks cost in human misery.

But here's what keeps me up. Agents run on Playwright, and Playwright handles the mechanical reliability. The click actually clicks the right thing. Great. Solved. Fifteen years of pain, addressed. Except the agent decided what to click using a language model, and that decision might be wrong in ways that are much harder to detect than a stale element reference.

Walk me through what that looks like in practice.

Patience: In Selenium, when something broke, it broke loudly. Element not found. Stale element reference. Timeout. You got an error. You'd curse, but you could fix it.

An agent that completes a task incorrectly doesn't throw an error. It says "Done!" with the confidence of a golden retriever who just ate your passport. The equivalent of my checkout test: it verified the wrong thing, but it verified it successfully. An agent fills out a form, submits it, sees a confirmation page, reports success. But it entered the wrong value in field three because the label was ambiguous, or because there were two fields with similar names and it picked the wrong one.⁶

No error. No exception. Just quiet wrongness.

So the failure got harder to see.

Patience: And that's the dangerous part. When Selenium tests were flaky, teams eventually stopped trusting them. They'd re-run the pipeline until it went green and shrug off the failures.⁷ That's bad, but at least people knew the tests were unreliable. The unreliability was visible. It announced itself.

With agents, the failure looks like success. You have to go check the actual system state to know whether the task worked. And nobody builds that verification infrastructure during the demo phase. They build it after the first production incident. If they're lucky, only one.
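What "go check the actual system state" might look like, in the smallest possible form: compare what the task intended to write against what the system actually stored. Everything in this sketch is an assumption made for illustration, including the `verify_outcome` helper, the field names, and the idea that the persisted record can be read back as a simple mapping.

```python
from typing import Mapping


def verify_outcome(intended: Mapping[str, str], persisted: Mapping[str, str]) -> dict:
    """Return {field: (expected, actual)} for every field that does not match."""
    return {
        name: (expected, persisted.get(name))
        for name, expected in intended.items()
        if persisted.get(name) != expected
    }


# Values the task was supposed to enter (hypothetical example data).
intended = {"billing_zip": "94107", "shipping_zip": "10001", "email": "a@example.com"}

# Hypothetical record read back after the agent reported "Done!".
# Two similarly named fields, and the agent filled one with the other's value.
persisted = {"billing_zip": "10001", "shipping_zip": "10001", "email": "a@example.com"}

agent_reported_success = True
mismatches = verify_outcome(intended, persisted)
task_verified = agent_reported_success and not mismatches

print(mismatches)     # {'billing_zip': ('94107', '10001')}
print(task_verified)  # False
```

The agent's own report is treated as one input among several. The task counts as done only when the stored record agrees with the intent.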

Does anything about the current moment feel genuinely new to you?

Patience: The semantic understanding is real. I'll give them that. An agent can find "the Submit button" even if someone renamed the CSS class from btn-primary to button-main. That would have broken every Selenium test I ever wrote.⁸ Real improvement. Full credit.

But the math hasn't changed. If each step in a workflow has a 95% success rate and the steps fail independently, a ten-step workflow succeeds about 60% of the time. Twenty steps, about 36%.⁹ We learned this by watching our test suites degrade as we added more tests. Agents face the same compounding, plus they've added probabilistic planning on top of the existing mechanical uncertainty. A new source of unreliability stacked on a foundation that already had plenty.

Ninety-five percent reliability sounds great until you multiply it by itself a dozen times. Then it sounds like my Selenium suite circa 2013.
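The arithmetic behind those figures is short enough to check by hand. The sketch below assumes every step succeeds with the same probability and that failures are independent, which is a simplification. `workflow_success_rate` is a name chosen here for illustration.

```python
def workflow_success_rate(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps."""
    return per_step ** steps


for steps in (1, 10, 12, 20):
    rate = workflow_success_rate(0.95, steps)
    print(f"{steps:>2} steps: {rate:.0%}")

# Output:
#  1 steps: 95%
# 10 steps: 60%
# 12 steps: 54%
# 20 steps: 36%
```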

What would you want agent builders to know?

Patience: That the browser was never their collaborator. It wasn't mine either. It's built for humans, and everything about it — the asynchronous loading, the security overlays, the session timeouts — treats automation as the exotic case.¹⁰ That hasn't changed. The tooling got better at handling it, but the fundamental adversarial relationship is the same.

"It works in the demo" is the most dangerous sentence in this field. I've been hearing it for fifteen years. It always works in the demo.

Last question. Do you think the lessons transfer?

Patience: [laughs]

No. They never do. The people who have the scar tissue aren't the people making the adoption decisions. That's not a technology problem. That's an org chart problem.

And I don't have a Playwright auto-wait for org charts.

¹ Simon Stewart, "My Selenium Tests Aren't Stable!" Google Testing Blog, June 2, 2009. https://testing.googleblog.com/2009/06/my-selenium-tests-arent-stable.html
² QATechXperts, "Flaky Test Cases in Automation Testing." https://qatechxperts.com/flaky-test-cases-in-automation-testing/
³ Semaphore, "How to Avoid Flaky Tests in Selenium." https://semaphore.io/blog/flaky-tests-selenium
⁴ ContextQA, "Selenium Alternatives in 2026." https://contextqa.com/blog/selenium-alternatives-2026/
⁵ AgentQL, "Actionability Check — Playwright." https://docs.agentql.com/automation/actionability-check
⁶ TinyFish internal knowledge base, "Web Agent Stack": documents ambiguous controls as a persistent failure mode for agents encountering multiple similar labels across locales and layouts.
⁷ testRigor, "Why Selenium Sucks for End-to-End Testing in 2026." https://testrigor.com/blog/why-selenium-sucks-for-end-to-end-testing/
⁸ Matthew Allen, "Common Selenium Automation Testing Mistakes." https://medium.com/@Matthew_allen13/common-selenium-automation-testing-mistakes-and-how-to-avoid-them-4daff21b6ee1
⁹ TinyFish internal knowledge base, "Demo to Production Gap": documents compounding failure rates across multi-step agent workflows.
¹⁰ TinyFish internal knowledge base, "Web Agent Stack": "Web agents are hard because the web was built for humans, scripts, browsers, standards, trackers, accessibility tools, anti-abuse systems, and business workflows at once."