Where You Can Afford to Be Surprised

Choosing where the model picks the next step is a risk-tolerance decision that reshapes how failures look, debug, and get explained.

"Agent" means at least three different things right now. Salesforce uses it for rebranded copilots. ServiceNow advertises thousands of pre-built ones. Anthropic acknowledges the confusion directly in their own guide: some people mean fully autonomous systems, others mean prescriptive implementations following predefined steps. Same word, structurally different things.

Worth getting past the terminology, though, because underneath it sits a design decision that actually matters: who picks the next step?

In a workflow, code does. An LLM might generate text or classify inputs at each node, but the sequence is fixed. Step two follows step one. If step three fails, you know where to look. You wrote step three.

In an agent loop, the model picks. It observes, decides, acts, observes again. The path through the system emerges at runtime rather than being predetermined.

Both are reasonable architectures. Moving control to the model changes how failure works in ways that ripple through everything downstream.

Code-determined paths produce bounded failures. A timeout. A validation rejection. An API error at a known step. You can write tests against these because you know what the steps are before anything runs. You can replay the exact sequence that broke. You can explain what happened to a VP who doesn't write code, because you can point at the step.

Model-determined paths produce creative failures. A practitioner account on Wire Blog describes a pattern anyone who's run agent loops will recognize: an agent asked to fix a flaky test starts, a few steps in, refactoring the test harness instead. No step failed. No instruction was violated. The original goal is still somewhere in context. It's just no longer the most salient thing driving decisions. The system isn't broken after this kind of drift. It's just somewhere else, and the distance grows quietly with each step.

This kind of failure doesn't trigger alerts. There's no principled methodology for regression-testing intent alignment across a sequence of model-chosen actions. Debugging requires trace infrastructure to reconstruct a path that was chosen at runtime, and the same inputs might produce a different path tomorrow. Recovery shifts from prevention to cleanup, because side effects may land before any checkpoint fires.

And explaining what happened to stakeholders changes in a way that matters more than the engineering. With a workflow, you point at the step that broke and describe the fix. With an agent loop, you're explaining that the system did something unexpected and you can't fully promise it won't do something differently unexpected next time. That's a harder conversation, and it's the one decision-makers will actually sit through.

Agent architectures earn their keep in genuinely ambiguous tasks. They're also expensive in a specific way: the operational overhead of managing systems whose failure modes you can't fully enumerate in advance, which compounds faster than compute or token costs alone. Anthropic's own recommendation is to find the simplest solution possible and increase autonomy only when simpler approaches fall short.

Most production systems end up as hybrids: deterministic workflows handling steps with known paths, model-directed loops handling the parts where the task genuinely can't be predetermined. There's a spectrum between the two.

The load-bearing decision

Where you draw the line between deterministic workflow and model-directed loop is a bet on where you can tolerate surprise and where you need the boring predictability of a fixed path. It looks like architecture. Underneath, it's risk tolerance.

Teams that treat it as a risk question tend to draw better lines than teams chasing capability for its own sake.

Things to follow up on...

Consistency across repeated runs: Sierra's τ-bench found that an agent achieving ~85% success on a single run drops to roughly 25% when measured across eight consecutive attempts, quantifying the gap between demo performance and production reliability.
Goal drift as formal measurement: A 2025 AAAI/ACM paper formalizes goal drift as a measurable failure mode, describing how a coding agent gradually expands its actions into forbidden directories despite initially respecting scope restrictions.
Real-time failure detection gaps: The Partnership on AI identifies cross-step behavioral monitoring as the least developed category of agent controls, noting that tracking whether an agent is still doing what the user intended requires fundamentally different approaches than rule-based checks.
Hybrid architecture in practice: A practitioner breakdown from Graph Digital describes the dominant production pattern as deterministic workflows with agent reasoning injected only at decision boundaries, where the workflow provides auditability and the model handles genuine ambiguity.