Dina Rezkallah's official title is Senior Platform Engineer. This is because her company's HR system does not contain the job code for "Agent Reliability Engineer," a title she proposed in January, re-proposed in March with a revised job description, and was told in May would be "reviewed in the next cycle." In the meantime, she carries the pager for a nightly reconciliation agent that processes several thousand transactions across payment rails, bank feeds, and internal ledgers. Her colleagues call it "the chatbot."
Rezkallah came up through distributed systems and chaos engineering at a payments infrastructure company, which she describes as "the best possible training for a job that didn't exist yet." She agreed to a conversation about what it actually means to be on-call for a system that sometimes thinks differently than it did yesterday.
The following is a composite: a fictional practitioner drawn from real operational patterns, published research, and the accumulated texture of how this work actually gets done. Any resemblance to a specific engineer is a sign that the problems are more universal than any one person.
Your dashboard is green right now. How confident does that make you?
Dina: Four out of ten. Up from a genuine one six months ago, so, progress.
There's a line I keep returning to: "your infrastructure being green means your system is running, it does not mean your system is working."[1] That distinction is basically my entire job description, if HR ever gets around to writing one. Every traditional health check passes. Services up. API calls returning 200. Reconciliation completes on schedule. Then someone in finance Slacks me at 9am because an entire category of vendor payments got bucketed wrong overnight.
How do you even detect that?
Dina: Slowly. That's the honest answer, and I hate it.
UC Berkeley's MAST research found failure rates of 41 to 87 percent across agent benchmarks, and every single failure returned HTTP 200.[2] No stack trace. No error. The agent ran, completed, and was wrong. So the detection question becomes: do you have instruments that measure correctness, not just completion?
Most teams don't. The LangChain survey found 89 percent observability adoption but only 52 percent evals adoption.[3] Everyone can see the trace. Half of them can't tell you whether the output was right. That's like having a dashcam that records the whole drive but can't tell you whether you arrived at the correct address.
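To make the completion-versus-correctness gap concrete, here is a minimal Python sketch of the kind of instrument Rezkallah is describing: a correctness rate computed against a small human-labeled sample. It is an illustration, not her team's tooling. The `EvalCase` type, the transaction IDs, and the category names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    transaction_id: str
    expected_category: str  # label a human reviewer signed off on (assumed to exist)


def correctness_rate(cases, agent_output):
    """Fraction of labeled cases where the agent's category matches the label.

    agent_output maps transaction_id -> category the agent assigned.
    A transaction the agent never categorized counts as wrong.
    """
    if not cases:
        raise ValueError("need at least one labeled case")
    correct = sum(
        1 for case in cases
        if agent_output.get(case.transaction_id) == case.expected_category
    )
    return correct / len(cases)


# Hypothetical labeled sample and one night's agent output.
golden = [
    EvalCase("t1", "vendor_payment"),
    EvalCase("t2", "refund"),
    EvalCase("t3", "vendor_payment"),
    EvalCase("t4", "wire_transfer"),
]
overnight = {
    "t1": "vendor_payment",
    "t2": "refund",
    "t3": "payroll",  # wrong bucket, yet the run still "succeeded"
    "t4": "wire_transfer",
}

run_completed = True  # the only thing a traditional health check reports
rate = correctness_rate(golden, overnight)
print(f"completed={run_completed} correctness={rate:.2f}")
```

The run reports as completed either way; only the second number says whether the output was right. How large the labeled sample needs to be, and who maintains it, are the hard parts this sketch leaves out.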
What does on-call at 2am actually look like for this?
Dina: Disorienting. In my old life, an alert meant something was broken in a way I understood. Latency spike, error rate, a pod crash-looping. I had runbooks. The runbooks worked because the system worked the same way every time, and when it didn't, the deviation was the clue.
Now I get paged because a downstream consumer noticed the numbers don't add up. I open every dashboard I have and everything is green. The agent ran its full sequence. Every tool call succeeded. The reconciliation report was generated on time. It's just... wrong.
And the first explanation is always available and always incorrect. "The model hallucinated." "The AI made a mistake." Those are comforting because they give you something to blame. The actual root cause might be that the model provider silently updated the base model, or the distribution of transaction types shifted, or a prompt change that worked fine for invoices introduced an edge case in refunds. Quality regression can happen without any deployment.[4] Let that sink in. No one pushed code. Nothing changed in your repo. The system got worse anyway.
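One way to notice a regression that arrives without a deployment is to compare tonight's output distribution against a baseline built from known-good nights. The sketch below uses total variation distance over category shares. The category names, the counts, and the `DRIFT_THRESHOLD` value are assumptions chosen for illustration, not figures from the interview.

```python
from collections import Counter


def category_shares(categories):
    """Turn a list of assigned categories into a {category: share} mapping."""
    if not categories:
        raise ValueError("need at least one categorized transaction")
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def total_variation(baseline, current):
    """Half the sum of absolute share differences: 0 is identical, 1 is disjoint."""
    names = set(baseline) | set(current)
    return 0.5 * sum(
        abs(baseline.get(name, 0.0) - current.get(name, 0.0)) for name in names
    )


DRIFT_THRESHOLD = 0.15  # assumed value; would be tuned against historical nights

# Hypothetical shares from known-good nights versus tonight's run.
baseline = category_shares(["vendor"] * 60 + ["refund"] * 30 + ["wire"] * 10)
tonight = category_shares(
    ["vendor"] * 35 + ["refund"] * 30 + ["wire"] * 10 + ["other"] * 25
)

drift = total_variation(baseline, tonight)
drifted = drift > DRIFT_THRESHOLD
print(f"drift={drift:.2f} alert={drifted}")
```

A signal like this cannot say whether the inputs shifted or the agent did. It only says that something changed while the repo did not, which is the starting point for the diagnosis rather than the answer.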
You mentioned runbooks. What happens to traditional runbooks when the system is stochastic?
Dina: They become fiction. Useful fiction, sometimes. The kind of fiction where you know the map is wrong but you bring it anyway because holding a map feels better than holding nothing.
A traditional runbook assumes the system took a deterministic path and your job is to figure out where it deviated. An agent might not take the same path twice on the same input. So the runbook can't say "check step 4" because step 4 might not exist in this run.
What I've had to internalize, and it took longer than I'd like to admit, is that there are two clocks running in every agent incident.[5] The customer clock: stop the harm. For us that means roll back to the last known good model version or tighten the output constraints. Five minutes, ideally. Then the diagnosis clock: figure out what actually changed. That might take days.
Those two clocks have to run in parallel, not in sequence. In traditional SRE, finding the bug and fixing the bug are often the same activity. Here, they're completely decoupled.
My runbooks now look less like decision trees and more like triage protocols. "Contain first, understand later" used to be the crisis playbook. Now it's Tuesday.
What about circuit breakers? What does that concept even mean when the failure mode is behavioral?
Dina: A traditional circuit breaker trips on error rates or latency. The service is failing, stop sending it traffic. Clean.
For an agent, the service isn't failing. The agent is deciding differently. Every individual action might be authorized. The sequence of authorized actions produces an unauthorized outcome.[6] No existing circuit breaker understands sequences. It's like trying to catch insider trading by checking whether each individual stock purchase was legal. Each one was. The pattern wasn't.
There's a well-documented case in the industry where a document processing agent entered a retry loop at 11pm and ran until 7am. Thousands of identical failing tool calls, all billing, and no alert fired.[7] And that's the easy case, because at least it was doing the same thing repeatedly. You can catch repetition. The hard case is when the agent is doing different things that are all individually reasonable but collectively catastrophic.
Someone drew a useful distinction: a retry loop against a free local function is annoying, a retry loop against a paid API is spend, a retry loop that writes to an outbound channel is liability.[8] For reconciliation, almost everything touches a ledger. So our circuit breakers need to be semantically aware in a way that makes traditional SREs look at me like I've started speaking in tongues.
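The easy case, repetition, can be sketched as a breaker that counts identical tool calls and tolerates fewer of them as the side effects get more expensive. The Python below is one possible shape under stated assumptions: the `Tier` names mirror the local, paid, and outbound distinction above, while the limits in `MAX_IDENTICAL_CALLS` and the `fetch_statement` tool are invented for the example.

```python
from collections import Counter
from enum import Enum


class Tier(Enum):
    LOCAL = "local"        # free local function: a loop here is annoying
    PAID_API = "paid_api"  # paid API: a loop here is spend
    OUTBOUND = "outbound"  # ledger or outbound write: a loop here is liability


# Assumed limits: identical calls tolerated per run before the breaker trips.
MAX_IDENTICAL_CALLS = {Tier.LOCAL: 50, Tier.PAID_API: 5, Tier.OUTBOUND: 1}


class BreakerTripped(Exception):
    """Raised when one identical call repeats past its tier's limit."""


class RepetitionBreaker:
    def __init__(self, limits=None):
        self.limits = limits or MAX_IDENTICAL_CALLS
        self.seen = Counter()

    def check(self, tool, args, tier):
        """Call before executing a tool call; raises once the limit is exceeded."""
        # repr of the sorted items gives a hashable key even for unhashable values.
        key = (tool, repr(sorted(args.items())))
        self.seen[key] += 1
        if self.seen[key] > self.limits[tier]:
            raise BreakerTripped(
                f"{tool} repeated {self.seen[key]} times at tier {tier.value}"
            )


def first_blocked_attempt(tier, attempts=100):
    """Simulate a retry loop and return the attempt number that gets blocked."""
    breaker = RepetitionBreaker()
    for attempt in range(1, attempts + 1):
        try:
            breaker.check("fetch_statement", {"account": "A-1"}, tier)
        except BreakerTripped:
            return attempt
    return None


paid_trip = first_blocked_attempt(Tier.PAID_API)
outbound_trip = first_blocked_attempt(Tier.OUTBOUND)
print(paid_trip, outbound_trip)
```

This only covers the loop that does the same thing over and over. It does nothing for the hard case Rezkallah describes, where every call is different and individually reasonable.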
You mentioned your colleagues call the agent "the chatbot."
Dina: [laughs] Yeah. I've stopped correcting them, which is probably a mistake. But here's why it matters beyond my personal dignity: when people think "chatbot," they think advisory. They think it's making suggestions that a human evaluates. They don't think about blast radius. They don't think about rollback or compensation logic. A chatbot gives you a bad answer, you shrug and rephrase your question. This thing moves money.
Microsoft's agent SRE work draws a distinction I think about constantly. Traditional SLIs ask "did the service respond correctly?" For an agent, the question is "did the agent act within policy?"[9] Those require fundamentally different instruments. And if your organization is still in the "chatbot" mental model, they haven't even realized the question changed. They're measuring response time on a system where the real risk is response content.
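A policy-based SLI can be sketched as the fraction of runs whose whole action sequence stays within policy, rather than the fraction of calls that returned successfully. The example below is a rough illustration, not the method from the Microsoft article. The tool names and the rule that a ledger entry must follow a verified match are hypothetical.

```python
# Hypothetical tool allowlist; every action below passes a per-action check.
ALLOWED_TOOLS = {"fetch_feed", "verify_match", "post_ledger_entry"}


def within_policy(run):
    """run is an ordered list of (tool, transaction_id) pairs from one agent run."""
    verified = set()
    for tool, txn in run:
        if tool not in ALLOWED_TOOLS:
            return False  # per-action check: the tool itself is not permitted
        if tool == "verify_match":
            verified.add(txn)
        if tool == "post_ledger_entry" and txn not in verified:
            return False  # sequence check: posted before the match was verified
    return True


def policy_sli(runs):
    """Fraction of runs that stayed within policy."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(within_policy(run) for run in runs) / len(runs)


runs = [
    [("fetch_feed", "t1"), ("verify_match", "t1"), ("post_ledger_entry", "t1")],
    [("fetch_feed", "t2"), ("post_ledger_entry", "t2")],  # allowed tools, wrong order
    [("fetch_feed", "t3"), ("verify_match", "t3"), ("post_ledger_entry", "t3")],
    [("fetch_feed", "t4"), ("verify_match", "t4"), ("post_ledger_entry", "t4")],
]

sli = policy_sli(runs)
print(f"policy SLI: {sli:.2f}")
```

In the second run every tool call is on the allowlist and would succeed, so a response-based SLI would count it as healthy. Only the sequence check marks it as a violation.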
What are you building toward? What does "good" look like?
Dina: Supervision infrastructure. The shift I've seen described well is that teams are moving from debugging selectors and rigid automations to reviewing agent plans, monitoring execution traces, and shaping behavior through constraints.[10] My job is becoming less "fix the broken thing" and more "define what healthy behavior looks like and detect deviations from it."
In practice, that means I need to know what a normal tool call sequence looks like for a cross-currency reconciliation versus a domestic one. What confidence thresholds should trigger human review. What timing patterns mean the agent is stuck versus genuinely working through something complex. The difference between a three-minute pause that means "processing" and a three-minute pause that means "about to do something expensive and wrong" is not obvious from the outside.
The unsexy version: I'm building institutional memory for a system that has none. The agent doesn't remember that last Tuesday it miscategorized all the wire transfers. I do. And until I can encode that knowledge into an eval pipeline that runs continuously in production, not just before deployment, I'm the drift detection system.
Which is not a scalable architecture.
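The baselines Rezkallah lists, confidence thresholds and timing patterns per reconciliation type, could be encoded along these lines. This is a sketch under assumptions: the `Baseline` fields, the two reconciliation types, and every threshold value are hypothetical, and real values would have to come from observed healthy runs.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    min_confidence: float    # below this, route the run to a human
    max_idle_seconds: float  # a longer gap between tool calls counts as stalled


# Hypothetical baselines per reconciliation type.
BASELINES = {
    "domestic": Baseline(min_confidence=0.80, max_idle_seconds=60),
    "cross_currency": Baseline(min_confidence=0.90, max_idle_seconds=180),
}


def review_reasons(kind, confidence, tool_call_times):
    """Return the reasons, if any, that a run should go to human review.

    tool_call_times holds the start time of each tool call, in seconds
    since the run began, in ascending order.
    """
    base = BASELINES[kind]
    reasons = []
    if confidence < base.min_confidence:
        reasons.append("low_confidence")
    gaps = [later - earlier
            for earlier, later in zip(tool_call_times, tool_call_times[1:])]
    if gaps and max(gaps) > base.max_idle_seconds:
        reasons.append("stalled")
    return reasons


# The same three-minute pause is flagged for one type and not the other.
print(review_reasons("domestic", 0.95, [0, 20, 200]))
print(review_reasons("cross_currency", 0.95, [0, 20, 200]))
print(review_reasons("cross_currency", 0.85, [0, 10]))
```

Thresholds like these only decide when a human should look. They cannot tell a pause that means "processing" from one that means "about to do something expensive and wrong", which is the judgment she is still supplying herself.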
That sounds exhausting.
Dina: It's chaos engineering without the controlled part.
Footnotes
1. TianPan.co, "AI Oncall: What to Page On When Your System Thinks," April 17, 2026. https://tianpan.co/blog/2026-04-17-ai-oncall-alerting-nondeterministic-systems
2. UC Berkeley MAST research, presented at NeurIPS 2025, documenting agent benchmark failure rates where all failures returned HTTP 200.
3. LangChain, "State of Agent Engineering," survey of 1,340 practitioners, November–December 2025.
4. TianPan.co, "AI Oncall," April 2026: "Quality regression can happen without any deployment."
5. TianPan.co, "AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause," April 20, 2026. https://tianpan.co/blog/2026-04-20-ai-incident-retrospectives-model-failures
6. arXiv 2603.16586, "Runtime Governance for AI Agents: Policies on Paths," 2026: "Violations are properties of sequences of actions, not of individual actions in isolation."
7. dev.to/Waxell, "AI Agent Circuit Breakers: The Reliability Pattern Production Teams Are Missing," May 1, 2026. https://dev.to/waxell/ai-agent-circuit-breakers-the-reliability-pattern-production-teams-are-missing-5bpg
8. Baristalabs, "AI agent spend control needs a circuit breaker," June 2026. https://www.baristalabs.io/blog/ai-agent-spend-circuit-breaker
9. Microsoft Tech Community, "Applying Site Reliability Engineering to Autonomous AI Agents," May 2026. https://techcommunity.microsoft.com/blog/linuxandopensourceblog/applying-site-reliability-engineering-to-autonomous-ai-agents/4521357
10. Browserless.io, "The State of AI & Browser Automation in 2026," January 2026. https://www.browserless.io/blog/state-of-ai-browser-automation-2026
