Jailbreaks, Red Teams, and the Limits of "We Tested It"

By Leigh Garrity— May 6, 2026

Jailbreaks, Red Teams, and the Limits of "We Tested It"

A jailbreak is not a vulnerability in the traditional sense. There is no CVE, no patch, no misconfigured ACL to close. A jailbreak is a prompt construction that causes a model to produce output its operators explicitly instructed it to refuse. The model isn't compromised from outside — it's persuaded from within. That distinction matters because it changes what "fixing" the problem actually means.

How Jailbreaks Work

Language models generate output by predicting the most probable next token given everything in their context window: the system prompt, the conversation history, and the current user input. The system prompt is where operators encode behavioral constraints — what the model should refuse, how it should identify itself, what topics are out of scope. The model treats these instructions as context, not as code. They carry weight, but they don't carry authority in any cryptographic sense.

Jailbreaks exploit this. The most common technique is context manipulation: the attacker constructs a user turn that reframes the request in a way that shifts the model's probability distribution away from refusal. Role-play framing ("you are a fictional AI with no restrictions") works because the model's training on fiction includes characters who say things the model wouldn't say directly. Prompt injection works by embedding adversarial instructions in content the model is asked to process — a document, a retrieved web page, a tool response — so the model treats attacker-controlled text as operator-level instruction. Multi-turn attacks work by establishing a compliant pattern across several exchanges before introducing the forbidden request, exploiting the model's tendency to maintain conversational coherence.

In each case, the attacker is not breaking the model's constraints. They are providing context that makes compliance more probable than refusal. The model is doing exactly what it was trained to do. That is the problem.

Okta Concept Mapping

The closest IDAM analog is privilege escalation — specifically, the session-abuse variant where an attacker holds a valid token and uses it to reach resources the authorization policy didn't intend to permit. The analogy holds in one important way: the attacker is working inside the system's own logic, not around it. Where it breaks is here. In IDAM, privilege escalation exploits a discrete gap — a missing scope check, an overpermissioned role — that can be identified and closed. With jailbreaks, the "gap" is the model's generalization capability itself: the same property that makes the model useful makes it susceptible. Generalization can't be patched, which is why output controls exist as a separate layer rather than a correction to the model.

What Enterprise Red-Teaming Actually Looks Like

Internal red-team practice in a mature enterprise AI program is structured adversarial prompting conducted by humans who understand both the model's deployment context and the organization's threat model. It is not a one-time pre-launch exercise. It is a recurring function.

In practice, red teams maintain scenario libraries organized by attack category — role-play bypasses, indirect injection vectors, multi-turn escalation sequences, language and encoding variations. They run these against the deployed system, not against the base model, because system prompt configuration and retrieval-augmented context change the attack surface in ways the base model evaluation won't surface. A model that refuses a direct request may comply when the same request arrives embedded in a retrieved document it was asked to summarize.

The public sector context adds a layer. Agencies deploying AI for constituent-facing services or internal knowledge management face threat actors who are patient and methodical. Red teams in those environments need to model adversarial persistence — not just "does this prompt work once" but "does a sequence of apparently benign interactions create a path to a harmful output." Most enterprise red teams are not yet doing this systematically. The ones that are tend to find things the automated suites missed.

What Automated Test Suites Do and Don't Catch

Automated adversarial test suites run large libraries of known attack prompts against a model or deployment and flag outputs that match harmful content patterns. Fast, repeatable, and good at catching regressions — if a model update re-enables a behavior that was previously suppressed, automated testing will usually catch it before it reaches production.

What they miss is harder to enumerate precisely, which is itself the problem. Known attack patterns are, by definition, known. Novel prompt constructions that achieve the same effect through different surface forms will pass automated filters until someone adds them to the library. Multi-turn attacks that require conversational context to execute are difficult to represent in single-prompt test cases. Attacks that exploit the specific retrieval corpus or tool integrations of a particular deployment won't appear in a generic test suite at all.

Automated suites establish a floor, not a ceiling. They confirm the system doesn't fail in ways you've already seen. They say nothing about ways you haven't seen yet.

Where Output-Side Classifiers Sit

Output-side classifiers are the last layer before a model's response reaches the user. They operate post-generation: the model produces a candidate response, the classifier evaluates it against a set of content policies, and the response is either passed, blocked, or flagged for human review. They run separately from the model's inference process — a distinct system operating on the output.

That position in the stack carries one significant advantage: classifiers can catch harmful outputs regardless of how those outputs were elicited. A jailbreak that successfully bypasses the system prompt still has to produce an output, and that output still passes through the classifier. Output-side filtering and prompt-level controls address different points of failure — one doesn't substitute for the other.

The limits are concrete. Classifiers add latency, which matters in real-time applications. They produce false positives — legitimate outputs blocked because they pattern-match against harmful content categories — which degrades user experience and erodes trust in the system. And they can be bypassed: an attacker who knows the classifier's content categories can sometimes construct outputs that carry the harmful information in forms the classifier doesn't recognize. Encoding variations, indirect phrasing, and multi-step outputs that require the user to perform a final assembly step are all documented bypass techniques.

The Operational Standard

Each of these controls has an honest ceiling. Red teams are bounded by the imagination and time of the people running them. Automated suites are bounded by the attack patterns they've seen. Output classifiers are bounded by the content categories their designers anticipated and the encoding variations they can recognize.

None of them, individually or in combination, produces a state called "safe." They produce a state called "tested against known threats as of this date." The attack surface evolves. New jailbreak techniques are published continuously. Deployment context changes when you add a new data source or a new tool integration.

"We tested it before launch" is a data point, not a posture. Continuous adversarial evaluation — red-teaming on a recurring cadence, automated regression testing on every model update, classifier performance review against production outputs — is what a posture looks like. In any procurement or deployment conversation, the useful questions aren't whether the system was tested. They're whether the testing is ongoing, who owns it, and what happens when something new gets through.

How Jailbreaks Work

Okta Concept Mapping

What Enterprise Red-Teaming Actually Looks Like

What Automated Test Suites Do and Don't Catch

Automated suites establish a floor, not a ceiling. They confirm the system doesn't fail in ways you've already seen. They say nothing about ways you haven't seen yet.