The 3am Screenshot That Revealed the Bot Was Never Really Inside the System

Deb Kowalski led the RPA Center of Excellence at a large Midwest insurer from 2018 to 2021. She started in claims processing and got promoted into automation because she understood the workflows better than anyone in IT. She spent the next three years discovering what that knowledge was actually worth when the bots stopped working.

We spoke with her. Or rather, we constructed this conversation from the operational reality that dozens of RPA leads lived through in 2019, because Deb is less a single person than a composite ghost haunting every Center of Excellence that ever existed. She asked that we not name the insurer. We asked that she not pretend the bots were fine.

You were running the RPA program at a large insurer in 2019. What did that actually mean day to day?

Deb: Officially I ran the "Center of Excellence," which sounds very impressive until you realize it was me, two developers, and a shared Slack channel that IT never checked. We had about forty bots in production across claims intake, compliance reporting, and some underwriting support workflows. My job was supposed to be identifying new processes to automate. What it actually was, and I didn't fully understand this until later, was keeping the existing bots alive.

What does "keeping bots alive" look like?

Deb: Monitoring. Fixing. Apologizing.

A well-built bot, when it hits something unexpected, sends you an email with a screenshot of where it got stuck and the error details.¹ So you'd get these emails, sometimes at 3am, and it's a screenshot of a login page that looks almost right but the bot is frozen because a button moved twelve pixels or a field got renamed.

We called those "application exceptions," meaning the bot hit a screen it didn't recognize. Different from "business exceptions," which is when the bot understands the screen fine but the data doesn't fit the rules. An invoice over the auto-approve threshold, that kind of thing.² Application exceptions were the ones that made you want to quit.
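To make the distinction concrete, here is a minimal Python sketch of how a bot's error handler might route the two kinds of exception differently. The class names, the `AUTO_APPROVE_LIMIT` value, the `approve_button` element ID, and the two routing labels are all hypothetical and chosen for illustration. Real RPA platforms define their own exception types and handlers.

```python
from dataclasses import dataclass

# Hypothetical threshold, for illustration only.
AUTO_APPROVE_LIMIT = 10_000


class BusinessException(Exception):
    """The bot read the screen fine, but the data breaks a business rule."""


class ApplicationException(Exception):
    """The bot could not find or interact with an expected screen element."""


@dataclass
class Invoice:
    invoice_id: str
    amount: float


def process_invoice(invoice: Invoice, screen_elements: set) -> str:
    # Application check: the UI element the bot depends on must exist.
    # "approve_button" is a hypothetical element ID.
    if "approve_button" not in screen_elements:
        raise ApplicationException(
            f"approve_button not found while processing {invoice.invoice_id}"
        )
    # Business check: the data must fit the auto-approve rule.
    if invoice.amount > AUTO_APPROVE_LIMIT:
        raise BusinessException(f"{invoice.invoice_id} exceeds auto-approve limit")
    return "approved"


def handle(invoice: Invoice, screen_elements: set) -> str:
    """Route each outcome; the returned labels are hypothetical."""
    try:
        return process_invoice(invoice, screen_elements)
    except BusinessException:
        # A legitimate edge case: hand it to a person.
        return "route_to_human_queue"
    except ApplicationException:
        # The bot itself is broken: alert someone with a screenshot.
        return "alert_with_screenshot"
```

The design point is the routing. A business exception is a legitimate case for a human to decide, while an application exception means the bot can no longer operate at all and someone has to fix the bot.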

Can you walk through the specific incident that changed how you thought about all of this?

Deb: One of our vendor partners, a third-party claims platform we used for a specific line of business, pushed a routine update to their portal. From their perspective, nothing dramatic. They restyled some form elements, changed a couple of internal field identifiers, reorganized a settings page. Their human users probably didn't even notice.

But we had four bots that navigated that portal daily, and they all broke simultaneously.

The thing people don't understand is that the bot isn't "seeing" the screen. It's executing a memorized sequence of selectors and coordinates.³ When the vendor changed their field names, the bot wasn't confused. It was blind. Clicking on things that no longer existed. And because those four bots fed into downstream workflows, the failure cascaded. Claims queues went silent. Work just stopped arriving at the next step.
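A toy model shows why a renamed field is fatal rather than merely confusing. In this hypothetical Python sketch, a "page" is just a dictionary keyed by element ID, and the bot replays a fixed list of recorded steps. The step list, the element IDs, and the `ElementNotFound` exception are invented for illustration and do not correspond to any particular RPA product's API.

```python
# Hypothetical recorded sequence: each step names an exact element ID.
RECORDED_STEPS = [
    ("type", "claimNumber", "CLM-001"),
    ("type", "policyId", "POL-42"),
    ("click", "submitBtn", None),
]


class ElementNotFound(Exception):
    """Raised when a recorded element ID is missing from the page."""


def replay(steps, page):
    """Replay recorded steps against a page modeled as {element_id: value}.

    Returns a log of (action, element_id) pairs for the steps that ran.
    """
    log = []
    for action, element_id, value in steps:
        if element_id not in page:
            # No notion of "the field that moved": an unknown ID is a hard stop.
            raise ElementNotFound(f"{action} failed: no element '{element_id}'")
        if action == "type":
            page[element_id] = value
        log.append((action, element_id))
    return log
```

Run it against the page the steps were recorded on and it succeeds. Rename `policyId` to `policy_id`, as a vendor might in a routine update, and the replay stops at the first missing identifier. Nothing in the loop can infer that the two names refer to the same field.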

How long before anyone noticed?

Deb: That's the part that still bothers me. The bots failed around 11pm. I got the exception emails around midnight, four of them, all screenshots of the same portal showing elements the bot couldn't find. But the business didn't notice until the next morning when the claims team came in and their queues were empty. Not "caught up" empty. Suspiciously empty.

Someone in claims called me and said, "Did we process everything overnight or did nothing come through?" And I had to explain that the bots had been frozen for nine hours.

Who owned the problem at that point?

Deb: (laughs) Nobody. Which is the same as everybody, which is the same as me.

I called IT. IT said the vendor changed their portal, not our systems, so it wasn't an infrastructure issue. I called the vendor. The vendor said they'd pushed a routine update and had no obligation to notify us. They didn't even know we had bots on their platform. Which, fair. The bots were logging in with a shared service account. From the vendor's perspective, a user just stopped showing up one day.⁴

Operations thought IT owned the bots. IT thought I owned the bots. I thought I was supposed to be finding new processes to automate, not running a 24/7 break-fix operation for forty fragile screen-scrapers.

Nobody had set up a communication channel with the vendor about UI changes because nobody had thought of the UI as an integration surface that needed a contract. It was just a screen. Screens are for people. We were treating it like a system interface, but it had none of the guarantees. No versioning. No deprecation notices. No changelog we could subscribe to.⁵

What happened to the team that used to do this work manually?

Deb: Before the bots, we had six people processing that claims intake workflow. After the bots, we had two people who were supposed to handle exceptions, the cases the bot couldn't figure out. So they'd gone from doing the whole job to handling only the weird edge cases, which are actually the hardest part of the job. And they'd been doing that for about eighteen months.

When the bots broke, we needed those people to process the full queue manually while my developers rebuilt the automations. But they hadn't done the routine work in a year and a half. The process knowledge was still there, sort of, but the speed wasn't. The muscle memory was gone. And some of the workflows had been modified by the bots in ways that weren't documented. Little routing decisions, field mappings, things the humans had never seen because the bot just handled them.⁶

So the moment we needed humans most was the moment they were least prepared. We'd automated the easy part and left people holding the hard part, and then when the easy part broke, they couldn't even do that anymore.

You're describing something a researcher named Lisanne Bainbridge wrote about in 1983. She called it one of the "ironies of automation."

Deb: I didn't know that name until years later. But yeah. The idea that the better your automation works, the worse your fallback position gets? I lived that. I built that.

We had created a situation where the humans were monitors of a system they could no longer operate. And the monitoring itself was degrading. You can only stare at a dashboard that says "all bots running" for so many months before you stop really looking at it.⁷ We had a green light on a wall. It was green for so long that when it turned red, the person sitting nearest to it thought the display was broken.

Looking back, what did that failure actually reveal?

Deb: That the bot was never in the system. It was pressing buttons on the outside of someone else's system, pretending to be a person who understood what they were doing. And the system had no idea it was there. No contract. No recognition.

When the vendor changed their UI, they were updating their product for their users. We weren't their users. We were, I don't know, parasites on a surface designed for human eyes.

An actual integration would have meant the vendor knew we existed. There'd be an API, a contract, versioning. We skipped all of that because it was faster to just drive the screen. And it was faster. For about fourteen months.

Was it worth it?

Deb: Ask me on a day the bots were running and I'd say absolutely. Ask me at 3am looking at a screenshot of a frozen login page and I'd say something I can't repeat here.

The honest answer is that we solved a real problem in a way that created a different, less visible problem. And the less visible problem was worse because nobody budgeted for it, nobody staffed for it, and nobody wanted to hear about it. You can't go back to the executive who approved the RPA business case and say, "Great news, the bots work. Bad news, I need a full-time team to keep them working, and also we should probably rehire the people we let go, just in case."

That's not a conversation anyone wants to have fourteen months into a successful automation program.

¹ Exception-handling best practices for production RPA bots typically include automated email alerts with error screenshots and key argument values. See CAI, "RPA Exception Handling: Be in Control or Be Controlled." https://www.cai.io/resources/thought-leadership/rpa-exception-handling-be-in-control-or-be-controlled
² The distinction between "business exceptions" (rule-based data mismatches) and "application exceptions" (UI or system failures) is standard in RPA operations vocabulary. See CAI, ibid.
³ RPA bots rely on stable UI selectors, coordinates, and field identifiers rather than visual comprehension. See Technovatime, "The Sight-First Shift: Moving Beyond Brittle Scripts to Vision-Based Automation," May 2026. https://technovatime.com/blog/vision-based-rpa-operational-efficiency-1a9c19
⁴ Vendor-bot communication gaps were a well-documented governance failure. See MuleSoft, "Importance of RPA Governance." https://www.mulesoft.com/automation/rpa-governance
⁵ A 2019 EY study found that 50% of RPA projects failed to meet objectives, with maintenance burden cited as a primary cause. See Blueprint Systems, "How to Reduce the Costs of RPA Maintenance and Support." https://www.blueprintsys.com/blog/rpa/reduce-rising-costs-rpa-maintenance-and-support
⁶ Bainbridge, L., "Ironies of Automation," Automatica, 1983. Bainbridge argued that automation creates conditions where human operators lose the skills needed for manual takeover precisely because the automation has removed their opportunity to practice. https://en.wikipedia.org/wiki/Ironies_of_Automation
⁷ Bainbridge specifically identified monitoring degradation as a structural consequence: operators left with "an unsupportive mix of boring monitoring tasks without opportunities to develop or use important skills." Ibid.