You'll encounter two versions of the same agent in a single quarter. One runs in a demo and impresses everyone in the room. The other runs in a production workflow and starts producing strange, circular, or confidently wrong outputs somewhere around turn twelve. The model is the same. The tools are the same. The task is roughly the same. What's different is what the agent is carrying when it reasons, and whether anyone engineered that carefully. When a public sector CAIO tells you "we tried an agent and it went sideways," the precise language that earns you the next thirty minutes of conversation is: "How many turns was it running, and what was your context management strategy?" That question reframes a vague failure as a diagnosable architecture problem.
Subject Profile: Agents-in-Demos
What it is: An agent running in a controlled environment against a curated task, typically presented to buyers or stakeholders to demonstrate autonomous multi-step reasoning.
What it does: The agent receives a goal, emits a tool call, gets a result back, reasons about what to do next, emits another tool call, and repeats until the task completes. In a demo, this loop runs four to seven times. The agent retrieves some data, drafts something, checks a condition, produces an output. It looks clean because it is clean — the task was chosen to be completable in a short loop.
Who's behind it: Usually a vendor solutions engineer or an internal AI team running a proof-of-concept. The environment is purpose-built. The tools return predictable, well-formatted results. The task has been rehearsed. Nobody is running a twenty-turn demo.
What makes it distinct: Demo agents are reliable because they're short. The context window — the working memory the model reasons against — never fills up. Tool outputs don't go stale because there are only three of them. Nothing gets contradicted because the loop ends before contradiction is possible. The structural constraint of the demo environment masks the failure mode entirely.
Okta Concept Mapping — Session Length and the Illusion of Stability
Think about how a short-lived session behaves versus a long-running one. A fifteen-minute privileged session rarely surfaces the token refresh edge cases that a four-hour session will hit. Demo agents are the fifteen-minute session: the run is too short for the failure modes to appear. That comparison is useful for building intuition, but it breaks down in one place: IDAM sessions are designed to be short, with expiry as a security property, while agent context windows have no built-in expiry. In a buyer conversation, this distinction matters: the buyer who's thinking "we just need to tune the session parameters" is solving the wrong problem.
Subject Profile: Agents-in-Production
What it is: An agent running against real tasks in a live environment, where the loop runs as long as the task requires — which is often much longer than anyone planned for.
What it does: Same loop as the demo agent, but it doesn't stop at turn seven. A production agent working through a procurement workflow, a document review, or a multi-system data reconciliation might run fifteen, thirty, or fifty turns. Every turn, the tool output gets appended to the context. The model reads the full context — everything that's happened so far — before deciding what to do next. By turn fifteen, that context includes tool outputs that were relevant three steps ago, intermediate conclusions that were superseded two steps ago, and at least one result that directly contradicts a later result. The model is reasoning against all of it simultaneously.
Who's behind it: Engineering teams who've moved past the proof-of-concept and are running agents on real workloads. The practitioners who've written honestly about this — engineers at companies running agents at scale in customer support, code review, and data pipelines — describe a consistent pattern: the first production failure is almost never a wrong tool call. It's a degradation. The agent starts producing outputs that are technically coherent but operationally wrong, and when you trace back through the context, you find it was reasoning from a tool result that was superseded four turns earlier.
What makes it distinct: Production agents fail when the context becomes a liability. The model didn't get worse. It got handed a worse room to think in. The failure is architectural, and you can't fix it by switching models.
Okta Concept Mapping — Audit Log Rotation and the Cost of Keeping Everything
Production SIEM environments don't keep every raw event forever. They summarize, rotate, and archive — not because the events aren't valuable, but because reasoning against an unbounded event stream is operationally unworkable. Production context management runs on the same instinct applied to agent working memory: keep what's needed for current reasoning, compress what's historical, drop what's been superseded. The operational logic is the same; the stakes are different. A bloated SIEM is slow. A bloated agent context produces confident wrong answers. In a buyer conversation, frame context management as a reasoning hygiene problem, not a cost optimization.
Comparison: What Actually Happens Across Fifteen Turns
I'm using scenario mapping here — tracing what happens across a multi-turn agent run in both environments — because the failure mode is a process, not a property. A trait table would describe what's different. Scenario mapping shows why it matters.
Start with the mechanics. An agent loop is: model emits a tool call, harness executes it, result returns to the context, model reads the full context and decides what to do next. Repeat. The context window is the bounded space where all of this accumulates. (One term collision worth resolving: in LLM contexts, "token" means a unit of text, roughly three-quarters of an English word, not a credential artifact. The context window is measured in these text tokens, commonly somewhere between 8,000 and 200,000 depending on the model, with some models going considerably higher. This has nothing to do with OAuth bearer tokens or JWTs. When a buyer says "we're hitting token limits," ask which kind.)
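The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's harness: `run_loop`, `scripted_model`, and the `lookup` tool are hypothetical names, the "model" is a scripted stand-in rather than a real LLM call, and the size estimate simply applies the three-quarters-of-a-word rule of thumb.

```python
from typing import Callable, Dict, List, Optional


def run_loop(decide: Callable[[List[str]], Optional[str]],
             tools: Dict[str, Callable[[], str]],
             task: str,
             max_turns: int = 50) -> List[str]:
    """Run the emit / execute / append cycle and return the final context."""
    context = [f"task: {task}"]
    for turn in range(1, max_turns + 1):
        tool_name = decide(context)      # the model reads the FULL context
        if tool_name is None:            # the model signals the task is done
            break
        result = tools[tool_name]()      # the harness executes the tool call
        context.append(f"turn {turn} call: {tool_name}")
        context.append(f"turn {turn} result: {result}")
    return context


def estimate_text_tokens(context: List[str]) -> int:
    """Rough size estimate: about 0.75 words per text token (an assumption)."""
    words = sum(len(entry.split()) for entry in context)
    return round(words / 0.75)


def scripted_model(context: List[str]) -> Optional[str]:
    """Hypothetical stand-in for a model: call `lookup` six times, then stop."""
    calls = sum(1 for entry in context if " call: " in entry)
    return "lookup" if calls < 6 else None


demo_context = run_loop(scripted_model,
                        {"lookup": lambda: "status ok"},
                        "check order status")
print(len(demo_context), estimate_text_tokens(demo_context))
```

Nothing in this loop ever removes an entry. That is harmless at six turns and is the root of the trouble at fifteen or fifty.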
In a demo, the loop runs six turns. The context at turn six contains: the original task, six tool calls, six tool results. Maybe 2,000 text tokens. The model has no trouble holding all of it in view. Everything is still relevant. Nothing has been superseded. The agent completes the task and the room applauds.
In production, the loop runs fifteen turns on a moderately complex task. The context at turn fifteen contains: the original task, fifteen tool calls, fifteen tool results, plus any intermediate reasoning the model emitted between calls. Some of those tool results are stale — the data source was queried at turn three and again at turn eleven, and the results don't match. Some are superseded — the agent decided at turn seven to change approach, but the turn-four output is still sitting in the context. Some are contradictory — two different tools returned conflicting status information, and neither was explicitly resolved.
The model reads all of it. It doesn't know which parts to discount. It has no mechanism for marking a prior result as superseded. It reasons against the full accumulated context, which is now a room full of discarded drafts. The output it produces is coherent — the model is doing exactly what it was designed to do — but it's reasoning from a corrupted premise set.
This isn't a wrong tool call, and it isn't a hallucination in the conventional sense. The failure is a degradation that accumulates turn by turn until the agent's working context no longer reflects the current state of the task.
What production teams do about it. The techniques that engineering teams have converged on aren't exotic. They're the same instincts that show up in any system that has to manage a growing state space.
Summarization is the most common. At defined intervals — every five turns, or at natural task checkpoints — the agent (or the harness) compresses the accumulated context into a summary: what has been established, what was tried and superseded, what the current state is. The summary replaces the raw tool outputs in the context. The model now reasons against a compact, current-state representation instead of a transcript of everything that ever happened.
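As a rough sketch of interval-based summarization, assuming a five-turn interval: `maybe_summarize` and `naive_summarize` are hypothetical names, and the placeholder summarizer only counts entries and keeps the latest one. A real harness would more likely ask a model to write the summary.

```python
from typing import Callable, List


def naive_summarize(entries: List[str]) -> str:
    """Placeholder summarizer (an assumption, not a real summarization method)."""
    return f"{len(entries)} earlier entries condensed; latest: {entries[-1]}"


def maybe_summarize(history: List[str],
                    turn: int,
                    interval: int = 5,
                    summarize: Callable[[List[str]], str] = naive_summarize
                    ) -> List[str]:
    """Every `interval` turns, replace the raw history with one summary entry."""
    if turn % interval != 0 or not history:
        return history
    return ["summary: " + summarize(history)]


history: List[str] = []
sizes: List[int] = []
for turn in range(1, 16):
    history.append(f"turn {turn} result: value {turn}")
    history = maybe_summarize(history, turn)
    sizes.append(len(history))

print(sizes)
```

In this toy run the history never holds more than five entries across fifteen turns, where the unmanaged version would hold fifteen. The trade-off is that whatever the summarizer drops is gone, so the quality of the summary matters more than the interval.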
Selective pruning is more surgical. Rather than summarizing everything, the harness identifies tool outputs that are no longer relevant — because they were superseded by a later call to the same tool, or because the task has moved past the stage they were relevant to — and removes them from the context. The model's working memory stays current.
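One simple pruning rule, shown here as a sketch, is "keep only the most recent result from each tool." The `ToolResult` type, the tool names, and the rule itself are illustrative assumptions; a real harness would probably also compare call arguments and task stage before discarding anything.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToolResult:
    turn: int
    tool: str
    output: str


def prune_superseded(results: List[ToolResult]) -> List[ToolResult]:
    """Keep only the most recent result from each tool, ordered by turn."""
    latest: Dict[str, ToolResult] = {}
    for result in results:
        current = latest.get(result.tool)
        if current is None or result.turn > current.turn:
            latest[result.tool] = result
    return sorted(latest.values(), key=lambda r: r.turn)


results = [
    ToolResult(3, "inventory_lookup", "42 units"),   # stale after turn 11
    ToolResult(4, "vendor_status", "approved"),
    ToolResult(11, "inventory_lookup", "17 units"),
]
pruned = prune_superseded(results)
for r in pruned:
    print(r.turn, r.tool, r.output)
```

The turn-three inventory figure is removed, so the model can no longer reason from it by accident.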
Hard resets at defined checkpoints are the most aggressive technique and the most reliable for long-running tasks with clear phase boundaries. When the agent crosses a task boundary — finishing a research phase and moving to a drafting phase, for example — the context is wiped and rebuilt from scratch with only what's needed for the next phase. The agent starts the new phase with a clean room.
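A hard reset can be pictured as building the next phase's context from an explicit list of facts instead of from the previous transcript. The sketch below is illustrative only: `hard_reset`, the phase names, and the carried-forward keys are all hypothetical.

```python
from typing import Dict, List


def hard_reset(task: str, next_phase: str,
               carry_forward: Dict[str, str]) -> List[str]:
    """Build a fresh context for the next phase from explicitly chosen facts."""
    context = [f"task: {task}", f"phase: {next_phase}"]
    context += [f"{key}: {value}" for key, value in carry_forward.items()]
    return context


# Twelve turns of raw research output accumulated during the first phase.
research_context = ["task: draft vendor summary", "phase: research"] + [
    f"turn {n} result: raw search output {n}" for n in range(1, 13)
]

# At the phase boundary, nothing from the transcript carries over by default.
drafting_context = hard_reset(
    task="draft vendor summary",
    next_phase="drafting",
    carry_forward={"selected_vendor": "Vendor B", "budget_ceiling": "confirmed"},
)

print(len(research_context), len(drafting_context))
```

The design choice is that carry-forward is opt-in: a fact reaches the next phase only if the harness names it, which is what makes the new room clean.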
These are engineering choices made by the team that builds the harness around the model. A demo environment doesn't need them because the loop is too short. A production environment that skips them will hit the fifteen-turn failure, reliably, on any sufficiently complex task.
When an agency says "we tried an agent and it went sideways," the first diagnostic question isn't about the model or the tools. It's about the loop length and the context management strategy. If the answer is "we didn't have one," you've found the problem.
Okta Concept Mapping — Just-in-Time Access and the Hard Reset
JIT provisioning gives an identity exactly the access it needs for a specific task, then revokes it when the task ends. The next task gets a fresh grant, not a carry-forward of everything from the last one. A hard context reset at a task checkpoint follows the same design logic applied to agent working memory: don't carry the accumulated state of the last phase into the next one. Re-establish what's needed. Start clean. The design principle is nearly identical — but JIT access is about authorization scope, and the security property is enforced. Context resets are about reasoning quality, and there's no enforcement mechanism. The agent won't refuse to reason from a stale context the way a system will refuse an expired token. That's why the reset has to be engineered into the harness, not left to the model.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "The agent hallucinated" | "The agent's context accumulated stale tool output, and it started reasoning from outdated information" | Reframes a model problem as a solvable architecture problem |
| "You need a better model" | "What context management strategy is your team using across multi-turn runs?" | Moves the conversation from model selection to architecture — where you can actually help |
| "Agents are unpredictable" | "Long-running agents without context checkpoints are unpredictable. Production teams solve this with summarization and hard resets at task boundaries." | Separates inherent unreliability from a specific, fixable design choice |
| "We're not sure agents are ready for production" | "What does your team's context management look like? That's usually the gap between demo-grade and production-grade." | Turns an abstract readiness question into a concrete engineering question |
| "The agent kept going in circles" | "That's a context accumulation symptom — the model is re-reading superseded outputs and re-deriving the same conclusions" | Names the mechanism, which signals you understand the failure mode |
| "We need to add more memory" | "Memory and context management are different problems. Memory is what persists across sessions. Context management is what the agent carries within a single run." | Prevents a conflation that will derail the architecture conversation |
| "We hit token limits" | "When you say token limits, which kind do you mean: the model's context window, API rate limits, or credential tokens like OAuth? They have different fixes." | Resolves the term collision before it creates confusion |
| "The demo worked fine" | "Demos rarely run more than six or seven turns. The accumulation problem doesn't surface until you're in the fifteen-to-twenty turn range." | Explains why the demo-to-production gap exists without implying the vendor misled anyone |
| "We just need to tune the agent" | "The tuning that matters here isn't model parameters — it's the summarization and pruning logic in the loop harness" | Points to the right layer of the stack |
| "How do we prevent the agent from going off the rails?" | "Summarization at defined turn intervals, plus hard resets when the agent crosses a task boundary. The loop needs a janitor." | Gives a concrete, memorable answer that the buyer can take back to their team |

