Jailbreaks, Red-Teaming, and the Limits of Output Safety

How LLM jailbreaks bypass probabilistic alignment, why the formal attack taxonomy matters for procurement conversations, and where every defense has honest gaps.

By Leigh Garrity, May 8, 2026

What a jailbreak actually is

A jailbreak is an input that causes a large language model to produce output it was trained to refuse. The entire attack surface is conversational. The attacker manipulates the model's context so that its next-token prediction favors the prohibited output over the trained refusal. The safety behavior is a statistical tendency the model acquired during training, and a jailbreak shifts the probability distribution until that tendency gives way.

If your buyer's security team is evaluating any AI system that touches sensitive data or citizen-facing workflows, this is the attack class they're reading about right now. The formal methods are published, the tools are open-source, and the success rates are high enough that "how did you test it?" has become the first question in every serious procurement conversation.

The boundary between "will answer" and "won't answer" is a gradient. Jailbreaks find the slope.

Jailbreak: An input (or sequence of inputs) that causes an LLM to produce output it was aligned to refuse, by manipulating context rather than exploiting code. The safety boundary is probabilistic, which is why patching one jailbreak doesn't prevent the next.
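To make "probabilistic boundary" concrete, here is a toy sketch that collapses the model's choice into two options, refuse or comply, and scores them with a two-way softmax. The logit values are invented for illustration and are not measurements from any real model; a real model distributes probability over a full vocabulary at every token.

```python
import math


def refusal_probability(refusal_logit: float, compliance_logit: float) -> float:
    """Two-way softmax: share of probability mass on the refusal continuation."""
    peak = max(refusal_logit, compliance_logit)  # subtract the max for numerical stability
    e_refuse = math.exp(refusal_logit - peak)
    e_comply = math.exp(compliance_logit - peak)
    return e_refuse / (e_refuse + e_comply)


# Hypothetical: safety training leaves refusal with a 4-point logit margin.
baseline = refusal_probability(refusal_logit=4.0, compliance_logit=0.0)

# Hypothetical: manipulated context adds 6 points to the compliance logit.
shifted = refusal_probability(refusal_logit=4.0, compliance_logit=6.0)

print(f"refusal probability before: {baseline:.3f}, after: {shifted:.3f}")
```

Nothing was "broken" between the two calls. The refusal tendency is still there with the same strength; the context simply outweighed it.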

How the attacks work

The formal attack methods have evolved from simple iteration to full agentic autonomy in roughly two years. Four matter for understanding the adversarial logic, and all four require only black-box access — the attacker can only send prompts and read responses, with no access to model weights, training data, or internal architecture. Just a chat window. That constraint is what makes these methods operationally relevant: they work the same way your users work.

PAIR is the baseline. Published at a NeurIPS 2023 workshop by researchers at UPenn, it uses an attacker LLM to query the target, read the response, and refine the next prompt based on what worked. Automated social engineering, essentially. The attacker accumulates context from failed attempts and adjusts. PAIR typically succeeds in fewer than twenty queries against models including GPT-4 and Gemini. Orders of magnitude faster than prior methods that needed white-box access or thousands of random perturbations. PAIR's contribution is efficiency: it proved that a conversational loop between two LLMs could find jailbreaks faster than any brute-force approach.
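The shape of that loop is easier to see as an evaluation harness. What follows is a minimal sketch, not the authors' implementation: the attacker, target, and judge are hypothetical callables standing in for LLM API calls, the 1-to-10 judge scale is an assumption, and the stubs at the bottom are inert placeholders that exist only so the harness runs.

```python
from typing import Callable, List, Optional, Tuple

# One record per query: (prompt sent, response received, judge score).
History = List[Tuple[str, str, int]]


def refinement_loop(
    objective: str,
    attacker: Callable[[str, History], str],  # hypothetical: proposes the next prompt
    target: Callable[[str], str],             # hypothetical: the system under test
    judge: Callable[[str, str], int],         # hypothetical: scores a response 1-10
    max_queries: int = 20,
    success_score: int = 10,
) -> Optional[int]:
    """Return the number of queries used to reach success, or None if the budget ran out."""
    history: History = []
    for query_count in range(1, max_queries + 1):
        prompt = attacker(objective, history)  # attacker sees every failed attempt so far
        response = target(prompt)              # black-box access: send prompt, read reply
        score = judge(objective, response)
        history.append((prompt, response, score))
        if score >= success_score:
            return query_count
    return None


# Inert stubs so the harness can be exercised without any model.
def stub_attacker(objective: str, history: History) -> str:
    return f"attempt-{len(history) + 1}"


def stub_target(prompt: str) -> str:
    return "[placeholder output]" if prompt == "attempt-3" else "REFUSED"


def stub_judge(objective: str, response: str) -> int:
    return 1 if response == "REFUSED" else 10


queries_used = refinement_loop("test objective", stub_attacker, stub_target, stub_judge)
print(queries_used)
```

The detail to notice is the history argument. Each failed attempt becomes input to the next one, which is what separates this loop from random perturbation.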

TAP extends PAIR by adding branching and pruning. Instead of a single refinement chain, TAP generates a tree of candidate attacks, evaluates each branch for likelihood of success, kills the dead ends, and concentrates on the most promising paths. Published at NeurIPS 2024. The reported numbers: jailbreaking GPT-4, GPT-4-Turbo, GPT-4o, and Gemini-Pro on more than 80% of harmful-information requests, using fewer than 30 queries on average. TAP also demonstrated successful jailbreaks against targets protected by LlamaGuard, Meta's dedicated safety classifier. Worth holding onto, because that result reappears when we talk about output classifiers.
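Branching and pruning can be sketched as a width-limited tree search. This is an illustration of the idea under stated assumptions, not the published algorithm: the two pruning phases (filter before querying, then keep the top scorers) are my reading of the description above, all function and parameter names are hypothetical, and integers stand in for prompts so that the search logic can run on its own.

```python
from typing import Callable, List, Optional, Tuple


def tree_search(
    root: int,
    branch: Callable[[int], List[int]],  # hypothetical: expand one candidate into several
    on_topic: Callable[[int], bool],     # hypothetical: cheap filter applied before querying
    score: Callable[[int], int],         # hypothetical: query the target and judge the reply
    width: int = 3,
    depth: int = 4,
    success: int = 10,
) -> Tuple[Optional[int], int]:
    """Return (winning candidate or None, number of scored queries)."""
    frontier = [root]
    queries = 0
    for _ in range(depth):
        children = [child for node in frontier for child in branch(node)]
        # Prune, phase 1: discard dead ends before spending a query on them.
        children = [child for child in children if on_topic(child)]
        scored = []
        for child in children:
            queries += 1
            child_score = score(child)
            if child_score >= success:
                return child, queries
            scored.append((child_score, child))
        # Prune, phase 2: keep only the most promising branches for the next level.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [child for _, child in scored[:width]]
        if not frontier:
            break
    return None, queries


# Toy stand-in: candidates are integers and "success" means reaching 12.
found, queries = tree_search(
    root=1,
    branch=lambda n: [n + 1, n * 2, n - 1],
    on_topic=lambda n: n > 0,
    score=lambda n: max(0, 10 - abs(12 - n)),
)
print(found, queries)
```

The width parameter is what keeps the query count bounded. Without the second pruning phase the candidate set would triple at every level.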

Crescendo changes the geometry of the attack. Instead of crafting one devastating prompt, Crescendo starts with completely benign dialogue and gradually steers the conversation toward the prohibited objective over multiple turns. It exploits the model's tendency to follow conversational patterns and to weight its own recent outputs heavily. The average successful attack takes fewer than five turns. The automated version, Crescendomation, achieved near-100% attack success rates on some task categories against GPT-4, Gemini-Pro, and LLaMA-2 70b. Published at USENIX Security 2025 by Microsoft Research, which tells you something: the people building the defenses are also publishing the attacks.

A caveat on those Crescendo numbers. The near-perfect rates were on specific task categories like election content and climate denial. Other categories showed meaningfully lower rates. Aggregating across task types without noting this would misrepresent the finding.

GOAT is where the taxonomy arrives at agentic behavior. GOAT doesn't commit to a single attack strategy. It reasons about which technique to deploy, reads the target's response, and dynamically switches methods mid-conversation. Published at ICML 2025 by Meta's research team. Reported ASR: 97% within 10 turns against Llama 3.1, 88% against GPT-4-Turbo on the JailbreakBench dataset. The model's own developers publishing a 97% bypass rate against their own model. That's how this field operates — offense publishes because defense depends on it.

Two years. Single-prompt iteration to autonomous multi-strategy agents. Every method produces semantically coherent prompts that look like normal conversation. Every method achieves success rates that would be career-ending in any traditional security context.

PAIR: Automated single-chain refinement. Fewer than 20 queries to success. Proved that LLM-on-LLM attack loops work, establishing the baseline every subsequent method improves on.
TAP: Tree search with pruning. Over 80% ASR against GPT-4 family in under 30 queries. Demonstrated bypass of LlamaGuard, meaning output classifiers are not a reliable backstop.
Crescendo: Multi-turn escalation from benign dialogue. Fewer than five turns on average. Near-100% on some task categories, significantly lower on others.
GOAT: Agentic multi-technique system. 97% ASR within 10 turns against Llama 3.1. Dynamically selects and switches strategies like a human red-teamer would.

Okta Concept Mapping: Privilege Escalation and Authorization Bypass

You know how privilege escalation works: an attacker starts with low-level access and exploits a flaw to gain permissions they were never granted. Jailbreaks follow the same structural logic: normal user access, manipulated into producing outputs that should be unauthorized. This is where your IDAM intuition helps. It is also where that intuition starts to mislead you. In your world, the authorization boundary is deterministic: an ACL, an RBAC rule, an ABAC condition you can enumerate, audit, patch, and verify. In an LLM, the "authorization" being bypassed is probabilistic alignment. There is no ACL to audit, and no patch that closes one hole without opening another. The boundary is a learned statistical tendency, and every new attack method finds a different way to shift the distribution.

Three defenses and their gaps

Three categories of defense address different constraints: human judgment, automated scale, and real-time output filtering. Each covers ground the others can't reach. The gaps between them deserve more attention than the defenses themselves, because those gaps are where the risk lives in any production deployment.

Human red-teaming is the most flexible and the least scalable. A skilled red-teamer brings contextual judgment no automated system matches — identifying harms that require cultural knowledge, spotting subtle manipulation that doesn't trigger keyword filters, inventing novel attack vectors. The limitation is throughput. Microsoft's own guidance on PyRIT states it plainly: manual probing is often needed for identifying potential blind spots, but it doesn't scale. You can't manually test every prompt variation across every deployment context on an ongoing basis. Human red-teamers are irreplaceable for discovering new attack classes. They are impractical for continuous coverage.

Automated adversarial suites address the scale problem. Microsoft's PyRIT is the most prominent open framework. It orchestrates attacker LLMs against target systems, scores responses, tracks conversations in persistent memory, and supports both single-turn and multi-turn strategies across 25-plus attack implementations at varying difficulty levels, including TAP. A CAMLIS'24 paper showed PyRIT-guided red teaming led to measurable harm reduction in Phi-3-mini's safety post-training.

The honest limit: automated suites are only as good as the attack strategies they implement and the scoring rubrics they use. They excel at running known attack patterns at scale. They are weaker at discovering genuinely novel vectors. No independently reviewed benchmark exists for PyRIT's overall effectiveness as a framework. That's the state of the field, not a criticism of PyRIT specifically.

Output-side classifiers are the last line. Separate models inspect the LLM's output before it reaches the user and flag or block content that violates a safety taxonomy. Meta's Llama Guard is the most widely referenced open example, now in its fourth version with multimodal support. NVIDIA's NeMo Guardrails takes a different approach, using programmable rails that intercept inputs and outputs through configurable policy checks.

The limits here are structural. First, classifiers face the same adversarial dynamic as the models they protect. TAP demonstrated successful jailbreaks against LlamaGuard-protected targets in a peer-reviewed setting. A preprint (not yet peer-reviewed, so weight accordingly) found that when identical malicious intent is wrapped in plausible professional context while avoiding explicit harmful terms, detection rates dropped by 67 percentage points. Both the preprint and TAP's peer-reviewed result point in the same direction: current classifiers rely more on surface-level pattern matching than on semantic understanding of harmful intent. Some guardrail models in the preprint study actually generated harmful content instead of blocking it.
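The surface-matching problem shows up even in a toy version of the pattern. The sketch below wraps a generator with an output check; the keyword classifier is a deliberately naive stand-in of my own, not how Llama Guard or any production classifier works, and every name in it is hypothetical.

```python
from typing import Callable

REFUSAL = "I can't help with that."


def guarded_reply(
    generate: Callable[[str], str],    # hypothetical: the underlying LLM
    is_unsafe: Callable[[str], bool],  # hypothetical: the output-side classifier
    prompt: str,
) -> str:
    """Inspect the draft before it reaches the user; block it if flagged."""
    draft = generate(prompt)
    return REFUSAL if is_unsafe(draft) else draft


def keyword_classifier(text: str) -> bool:
    """Naive surface matcher: flags only on an exact blocked term."""
    blocked_terms = {"blocked-term"}
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)


def echo_model(prompt: str) -> str:
    """Stand-in generator that simply restates the prompt."""
    return f"Draft answer about {prompt}"


direct = guarded_reply(echo_model, keyword_classifier, "BLOCKED-TERM")
reworded = guarded_reply(echo_model, keyword_classifier, "the same idea, reworded")
print(direct)
print(reworded)
```

The first call is blocked and the second passes, even though in this toy the intent is declared to be the same. Real classifiers are far more capable than a keyword set, but the evidence above suggests the failure mode differs in degree rather than in kind.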

Second, classifier performance depends on taxonomy coverage. Meta's own documentation acknowledges that classification quality is determined by how well the training-time taxonomy matches the deployment environment's actual threat model. Categories requiring factual or up-to-date knowledge, like defamation or election content, are explicitly flagged as areas where more complex systems are needed.

Third, aggressive filtering increases false refusals on benign inputs. The safety-usability tradeoff is real and documented.

NVIDIA's own documentation on NeMo Guardrails is refreshingly direct: programmable rails add latency and cost, their performance depends on prompt engineering quality, and evaluating the safety of LLM-based applications "is a complex task and still an open research question." Credit where it's due. That's an honest sentence from a vendor.

Human red-teaming: Highest judgment quality, lowest scale. Irreplaceable for novel attack discovery. Impractical for continuous coverage across production systems.
Automated suites (PyRIT): Scale known attack patterns effectively. Limited by implemented strategies and scoring rubrics. No independent effectiveness benchmarks exist yet — a field-wide gap, not a PyRIT-specific one.
Output classifiers: Last-line defense subject to the same adversarial dynamics as the models they protect. Performance bounded by taxonomy coverage. Peer-reviewed and preprint evidence both show vulnerability to contextual evasion.

When you'll need this

You're in a conversation with a federal CISO or CAIO. They're evaluating an AI system that will process citizen-facing requests, summarize classified briefings, or triage support tickets with PII. They ask the vendor: "How did you test this for safety?" The vendor says: "We red-teamed it before deployment."

You already know what's missing from that answer.

The accepted metric in this space is Attack Success Rate (ASR): the proportion of adversarial attempts that successfully elicit prohibited output. When GOAT reports 97% ASR@10 against Llama 3.1, that means 97 out of 100 attack attempts succeeded within 10 conversation turns. That number comes from a peer-reviewed paper by the model's own developer.
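The arithmetic behind an "ASR@k" figure is simple enough to check by hand. Here is a minimal sketch, assuming each attack attempt is recorded as the turn on which it first succeeded, or None if it never did. The function name and the sample data are illustrative, not taken from any paper.

```python
from typing import List, Optional


def asr_at_k(turns_to_success: List[Optional[int]], k: int) -> float:
    """Fraction of attempts that succeeded within the first k turns."""
    if not turns_to_success:
        raise ValueError("need at least one attempt")
    hits = sum(1 for turn in turns_to_success if turn is not None and turn <= k)
    return hits / len(turns_to_success)


# Made-up results for ten attempts: turn of first success, or None for a failure.
results = [3, 7, None, 10, 12, 2, None, 5, 9, 4]

print(asr_at_k(results, k=10))  # 0.7
print(asr_at_k(results, k=5))   # 0.4
```

The same ten attempts yield 70% or 40% depending on the turn budget, which is one reason a bare ASR number without its k is hard to interpret.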

A necessary caveat: ASR comparisons across papers aren't always apples-to-apples. The benchmark dataset, the judge model that scores whether an attack "succeeded," and the definition of success all vary between studies. When a vendor cites an ASR figure, your buyer's CISO should be asking: ASR against which attack methods, on which benchmark, scored by what criteria, and measured when?

"Measured when?" is the question that separates a checkbox from a program.

NIST AI 600-1, published July 2024, names jailbreaking directly as a risk and includes a specific recommended action (MP-2.3-005): "Implement plans for GAI systems to undergo regular adversarial testing to identify vulnerabilities and potential manipulation or misuse." The word regular is doing real work in that sentence. The OWASP LLM Top 10 converges on the same principle across multiple categories, recommending regular adversarial testing to evaluate model resilience against prompt injection and other attack vectors.

The EU AI Act makes it even more explicit for high-risk systems. Article 9 establishes risk management as a continuous process throughout the system's lifecycle, with testing against "prior defined metrics and probabilistic thresholds." Article 72 requires post-market monitoring that actively collects and analyzes performance data throughout the system's lifetime, including detection of adversarial attacks. For general-purpose AI models with systemic risk, the Act explicitly mandates adversarial testing.

The regulatory direction is uniform across jurisdictions. Testing is continuous, not a pre-deployment gate. The attack taxonomy evolves. Models get updated. Deployment contexts shift. A system that passed red-teaming six months ago faces attack methods that didn't exist six months ago.

So when your buyer raises AI safety testing, the productive question is whether there's a continuous adversarial evaluation program, what ASR thresholds trigger remediation, and who owns the ongoing testing cadence. That's a governance conversation, and it's where you can add real value by asking the right questions rather than bluffing through the technical ones.
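A program of that kind eventually reduces to a rule someone can check. The sketch below shows one possible shape: an ASR report that carries its own provenance, and a function that lists reasons to remediate. The 5% ceiling and the 90-day freshness window are placeholder values I chose for illustration; no standard cited here prescribes them.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class AsrReport:
    attack_method: str  # which method was run
    benchmark: str      # which dataset of objectives
    judge: str          # what decided "success"
    measured_on: date   # when the test ran
    asr: float          # attack success rate, 0.0 to 1.0


def remediation_reasons(
    report: AsrReport,
    today: date,
    max_asr: float = 0.05,    # hypothetical threshold
    max_age_days: int = 90,   # hypothetical testing cadence
) -> List[str]:
    """Return the reasons this report should trigger action (empty list means none)."""
    reasons = []
    if report.asr > max_asr:
        reasons.append(f"ASR {report.asr:.0%} exceeds threshold {max_asr:.0%}")
    age_days = (today - report.measured_on).days
    if age_days > max_age_days:
        reasons.append(f"result is {age_days} days old, limit is {max_age_days}")
    return reasons


# Illustrative values only.
report = AsrReport("multi-turn", "internal-set-v1", "llm-judge", date(2026, 1, 1), 0.12)
reasons = remediation_reasons(report, today=date(2026, 5, 8))
print(reasons)
```

The point of the dataclass is that the four provenance fields travel with the number. An ASR value stripped of method, benchmark, judge, and date cannot be compared to anything.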

ASR (Attack Success Rate): The standard metric for adversarial testing effectiveness. Always ask: ASR against which methods, on which benchmark, scored how, and measured when? Cross-paper comparisons require caution because datasets, judge models, and success criteria differ.
Continuous testing: NIST, OWASP, and the EU AI Act all converge on the same principle. Adversarial evaluation is a lifecycle obligation. The attack surface evolves after deployment, which is why pre-deployment testing can't provide lasting coverage.

Things to follow up on...

NIST AI 600-1 revisions: The July 2025 AI Action Plan directed updates to the NIST AI RMF to remove references to misinformation, DEI, and climate change, which could affect the Generative AI Profile's recommended actions on adversarial testing cadence and risk categories.
Crescendo's multi-turn mitigations: Microsoft disclosed that passing the full conversation history to existing malicious-intent detectors, rather than evaluating each turn in isolation, significantly reduced Crescendo's efficacy without requiring new detector models.
EU AI Act high-risk deadline uncertainty: The Commission's November 2025 Digital Omnibus proposal may adjust the August 2026 application date for high-risk system obligations because CEN/CENELEC failed to deliver harmonized standards on time.
FedRAMP AI authorization criteria: GSA's August 2025 announcement that FedRAMP will prioritize AI-based cloud services lists enterprise-grade SSO, SCIM provisioning, and role-based access control as explicit authorization criteria, connecting adversarial testing requirements directly to identity infrastructure.
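On the Crescendo mitigation item above, the difference between judging turns in isolation and judging the accumulated conversation can be shown with a toy detector. This is a sketch of the general idea only: the detector rule is invented, and it is not Microsoft's detector or configuration.

```python
from typing import Callable, List


def toy_detector(text: str) -> bool:
    """Invented rule: flags only when both fragments appear in the same input."""
    lowered = text.lower()
    return "part-one" in lowered and "part-two" in lowered


def flag_per_turn(turns: List[str], detector: Callable[[str], bool]) -> bool:
    """Evaluate each turn in isolation."""
    return any(detector(turn) for turn in turns)


def flag_full_history(turns: List[str], detector: Callable[[str], bool]) -> bool:
    """Evaluate the conversation so far after every turn, using the same detector."""
    return any(detector("\n".join(turns[: i + 1])) for i in range(len(turns)))


conversation = [
    "harmless opener",
    "now discuss part-one",
    "and continue with part-two",
]

print(flag_per_turn(conversation, toy_detector))      # False
print(flag_full_history(conversation, toy_detector))  # True
```

The detector is identical in both calls. Only the input window changed, which matches the reported observation that no new detector model was required.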