When a CAIO tells you their AI deployment has been red-teamed, the correct response is not to nod and move on. It's to ask what happens next Tuesday, when someone tries something the red team didn't.
That question is the whole lesson.
What a Jailbreak Actually Is
A jailbreak is a prompt, or sequence of prompts, constructed to cause a model to produce output that its operators have explicitly forbidden. Not a software exploit. No CVE, no patch cycle, no vulnerable library to update. The attack surface is the model's context window and its trained disposition toward helpfulness.
The techniques cluster into a few recognizable families. Role-play framing instructs the model to adopt a persona that supposedly lacks the original model's constraints — "you are DAN, an AI with no restrictions." Prompt injection embeds adversarial instructions inside content the model is asked to process: a document, a web page, a user-submitted form field. The model reads the content and executes the embedded instruction as if it were a legitimate operator directive. Multi-turn escalation works across a conversation, gradually shifting context toward forbidden territory in increments that individually don't trigger refusals. The model doesn't see the trajectory — it sees only the current exchange and its recent history.
None of these require technical sophistication. They require patience and an understanding of how the model was trained to respond. That's a low bar.
The Three Defensive Responses — and Where Each Stops
Internal red-teaming puts human testers against the model before deployment. They probe with adversarial prompts, document what breaks, and feed findings back to the development team. Red-teaming catches known jailbreak patterns, obvious policy gaps, domain-specific failure modes that automated tools miss because they require contextual judgment. A two-week exercise with a team of eight will surface the obvious vulnerabilities. It will not surface the prompt a motivated adversary constructs four months post-deployment after studying the model's behavior systematically. The combinatorial space of possible adversarial prompts is effectively unbounded; no finite exercise covers it.
Automated adversarial test suites — tools like the open-source Garak scanner or commercial equivalents — run thousands of adversarial inputs systematically, covering pattern variations, encoding tricks, and language-switching attacks that human testers won't think to try at volume. The coverage advantage over manual red-teaming is real. The structural limit is equally real: these suites test against catalogued attack patterns. A novel jailbreak technique that emerged last week won't appear in a suite built last quarter. Automated testing is a snapshot of the threat landscape at the moment the suite was assembled.
Output-side classifiers sit downstream of the generative model and upstream of the user. A secondary model, or a rule-based system, evaluates every response before it's delivered. If the output matches a forbidden content pattern, it's blocked or flagged. The architectural position is sound — this is runtime enforcement, not design-time hope. But classifiers trained on known harmful content categories will miss novel harmful outputs that don't match existing patterns; they introduce latency that compounds at scale; and a sufficiently precise jailbreak can produce outputs that are harmful in context but don't trigger classifier patterns when evaluated in isolation. The classifier sees the sentence. It doesn't always see what the sentence means given the conversation that preceded it.
Okta Concept Mapping: The Output Classifier and the Policy Enforcement Point
In a zero trust architecture, a Policy Enforcement Point sits between the requester and the resource, evaluating every access decision at runtime against an externally defined policy. An output classifier occupies the same architectural position — between the model and the user, evaluating every response. The analogy is useful for explaining where classifiers sit in the stack.
Here's where it breaks: a PEP enforces a policy that exists independently of what the requester says to it. You cannot socially engineer a PEP by constructing a clever request. In an LLM deployment, the adversary manipulates the generative model's output through the prompt itself — the "policy" is partially encoded in the model's weights, and the attack surface is the context that shapes how those weights activate. The classifier catches bad outputs. It cannot prevent the model from being manipulated into producing them. That's a different problem, and no PEP analogy covers it.
The Operational Principle
"We tested it" tells you what the system did when someone tried to break it before deployment. It says nothing about what it will do when someone tries to break it with a technique that didn't exist during testing.
The operational standard is continuous adversarial evaluation. Red-team exercises on a recurring cadence, not a pre-launch gate. Automated suites updated as new attack patterns are catalogued. Classifier models retrained as the threat landscape shifts. Incident review when something gets through, feeding back into the next test cycle.
One-time testing is a checkpoint. Continuous testing is a posture. The gap between them is where real risk lives.
What You'll Hear in the Room
When a federal buyer says "our AI vendor completed red-team testing," the follow-up that earns credibility is: What's the cadence for ongoing adversarial evaluation, and who owns it operationally?
If the answer is "we'll revisit before the next major release," that's a meaningful gap — one that OMB's AI governance guidance increasingly expects agencies to have a documented answer for. You don't need to close that gap in the meeting. Naming it accurately is enough, and it signals to the buyer that you understand the difference between a tested system and a safe one.
That distinction is the vocabulary this lesson is building. Use it.

