A language model's most exploitable property is the same one that makes it useful: it follows instructions.
Safety training overlays constraints on that instruction-following, but those constraints are learned behaviors, not hard-coded rules. They're patterns the model acquired during training — patterns that tell it to refuse certain categories of requests. A jailbreak is any input that causes the model to execute its instruction-following training while the safety constraints fail to activate. The model isn't malfunctioning. It's doing exactly what it learned to do. The attack exploits the seam between two things it learned simultaneously.
That precise framing matters. "Tricking the AI" obscures the mechanism. Understanding the mechanism is what lets you evaluate whether a defense actually addresses it.
How the Attack Surface Works
Jailbreaks don't require technical sophistication in the traditional sense. Common techniques include role-play framing ("respond as a character without restrictions"), hypothetical distancing ("for a novel I'm writing, describe how..."), incremental escalation through a sequence of individually benign steps, and prompt injection via content the model is asked to process — a document, a retrieved web page, a tool output — that contains embedded instructions the model then follows.
Each technique reframes the request in a way that the model's safety training doesn't pattern-match as a refusal trigger, while the instruction-following training still activates. The attack surface is not a list of known vulnerabilities. It's the gap between the model's learned refusal patterns and the effectively infinite space of ways a request can be framed in natural language. That gap cannot be fully mapped in advance.
Red-Teaming: What It Does and Doesn't Do
Internal red-teaming is the practice of having a dedicated team attempt to elicit harmful outputs before deployment. Structured exercises define a threat model — what harm categories are in scope, what adversary capabilities are assumed — and the team works systematically to find failure modes. Documented findings drive mitigations before the model ships.
This is genuinely valuable. It's also bounded by human creativity and calendar time. A red team finds a meaningful sample of failure modes, not an exhaustive set. The model deploys with unknown vulnerabilities because the attack surface isn't enumerable.
Automated adversarial test suites extend the coverage. Tools like Garak, an open-source LLM vulnerability scanner, run thousands of known jailbreak patterns against a model systematically — coverage that no manual team can match, and repeatable as the model or its configuration changes. The limit is the same limit that applies to any signature-based detection: the suite tests known patterns. Novel jailbreaks aren't in it yet. A model that passes a comprehensive automated suite today may fail against a technique that circulates tomorrow.
Output-Side Classifiers: The Runtime Filter
Rather than preventing harmful inputs from reaching the model, output classifiers intercept the response before it reaches the user. A classifier trained to detect harmful content (dangerous instructions, certain categories of sensitive information, outputs that match known harm patterns) can flag or block responses that match its training.
This adds a meaningful layer. It also faces adversarial pressure directly. If the classifier's behavior can be probed — and it can, through the same interface as the model — adversaries can iteratively craft outputs that evade it. Harmful content encoded in unusual syntax, delivered through indirect reference, or formatted in ways the classifier wasn't trained on can pass through. Classifiers also introduce latency and false-positive rates that matter in production, particularly in high-volume deployments where every flagged output has an operational cost.
Okta Concept Mapping
Red-teaming maps most naturally to penetration testing: structured adversarial exercise, defined scope, documented findings, mitigations applied before go-live. The analogy holds far enough to be useful and breaks exactly where it matters. A web application's attack surface is bounded — by its codebase, its configuration, its exposed interfaces. Pen test coverage can approach completeness for a bounded system. A model's effective attack surface is partially defined by human creativity applied against natural language, which means it expands continuously as adversaries develop new techniques. You can achieve high confidence that a web application has no SQL injection vulnerability. You cannot achieve equivalent confidence that a model has no jailbreak vulnerability, because the next jailbreak may be a framing pattern that didn't exist when you tested.
The Conversation You're Going to Have
A CISO asks: "Has this model been red-teamed?" That question sounds like it has a yes-or-no answer. It doesn't.
Worth pressing further: What threat model did the red team use? When was the last automated adversarial run, and against what suite? What output monitoring is running in production, and what does it alert on? What's the process when a new jailbreak technique surfaces publicly?
A buyer who's done their homework will ask some version of these questions. A seller who engages with them, rather than accepting "yes, we red-teamed it" as a complete answer, is demonstrating the kind of fluency that earns trust in a federal procurement conversation. The CISO already knows testing happened. The question is whether the vendor treats it as ongoing work or a closed item.
The Operational Principle
Red-teaming, automated test suites, and output classifiers are each necessary. None is sufficient. The attack surface of a deployed model is not fixed at deployment — it expands as new jailbreak techniques emerge in the wild, as the model's context changes, and as adversaries probe the specific deployment configuration.
Treating adversarial evaluation as a pre-deployment gate produces a documented false sense of closure. The discipline is continuous: test before deployment, test after deployment, test when the configuration changes, test when new attack patterns surface in the research community or on public forums.
"We tested it" describes a point in time. "It is safe" describes a continuous state. The gap between those two claims is where the risk lives, and it's the gap your buyers are starting to ask about.

