Red-Teaming, Classifiers, and Why "We Tested It" Isn't Enough

By Carey Whitten— May 5, 2026

Red-Teaming, Classifiers, and Why "We Tested It" Isn't Enough

What Output Policy Circumvention Actually Is

Output policy circumvention is the class of attack in which an adversary manipulates a deployed model into producing content its operators explicitly prohibited — harmful instructions, restricted information, policy-violating outputs — without exploiting the model's underlying infrastructure. The manipulation operates at the semantic level: the attacker crafts inputs that cause the model to interpret its own safety constraints as inapplicable, suspended, or overridden. The model's weights are unchanged; its behavior is not. This is distinct from prompt injection, which targets the model's instruction-following architecture to redirect its actions. Output circumvention targets the content policy itself as the object of attack.

This lesson covers what happens after the model decides what to say. Input-side attacks were covered in the OWASP lesson. These are different problems with different defenses.

Three Defenses, and Where Each One Fails

Internal red-teaming is the practice of deploying authorized human adversaries to systematically attempt policy violations before a model reaches production. Red-teamers work from a defined policy scope — what outputs are prohibited, under what conditions, for what user populations — and probe for failures across categories: harmful content generation, restricted information disclosure, identity misrepresentation, safety-critical misinformation. The NIST AI Risk Management Framework (AI RMF 1.0) identifies adversarial testing as a core practice within its GOVERN and MEASURE functions, treating it as a structured organizational activity rather than an optional pre-launch exercise.

Human red-teaming is bounded by the imagination and availability of the team. A two-week exercise before deployment cannot anticipate the full range of adversarial inputs a production system will encounter across millions of interactions. The OWASP LLM Top 10 project explicitly notes that novel jailbreak techniques frequently emerge after deployment. A clean red-team result is a snapshot of what a specific team found in a specific window. It is not a characterization of the model's behavior under adversarial conditions generally.

Automated adversarial test suites address the coverage problem by generating adversarial inputs at scale. Open-source frameworks like Garak (developed by NVIDIA researchers and maintained as an LLM vulnerability scanner) and Microsoft's PyRIT systematically probe models across thousands of attack patterns — role-play bypass attempts, encoded instruction sequences, multi-turn manipulation chains. These tools can cover attack surface that human teams would miss on time alone.

Automated suites test against known attack patterns. A jailbreak technique that hasn't been catalogued won't appear in the suite. This is structurally similar to signature-based malware detection — comprehensive against known threats, blind to novel ones. Automated testing also tends to underweight multi-turn attacks, where policy circumvention is assembled gradually across a conversation rather than triggered by a single input. The suite passes; the conversation-level attack succeeds anyway.

Output-side classifiers are the runtime control layer. After the model generates a response, a separate system evaluates whether that output violates policy before it reaches the user. Classifiers can be rule-based, model-based, or hybrid. They operate independently of the generating model, which means they can be updated without retraining the underlying system — a meaningful operational advantage. OpenAI's Moderation API and Anthropic's Constitutional AI approach both incorporate output-side evaluation as a production control, not just a pre-deployment gate.

Classifiers introduce their own failure modes. A sufficiently patient adversary can probe the classifier's decision boundary and craft outputs that the generating model considers policy-violating but the classifier passes. Classifiers also add latency and cost, which creates operational pressure to tune toward lower false-positive rates — which directly increases false-negative risk on actual violations. And a classifier trained against a static policy definition will drift out of alignment as the threat landscape evolves, unless it's actively maintained.

“

Okta Concept Mapping

Closest IDAM analog: penetration testing. You hire an authorized team to attack your system before deployment, document what they find, and patch the gaps. Red-teaming AI models follows the same adversarial logic. Where the analogy breaks: a network's attack surface is relatively stable after deployment. An LLM's effective attack surface is not. New jailbreak techniques are discovered continuously, model behavior can shift with context changes, and the classifier layer adds its own evolving failure modes. A pen test report has a shelf life measured in months. A red-team report on a deployed LLM has a shelf life measured in weeks, if that. The operational implication is different: AI output safety requires continuous adversarial evaluation, not a one-time gate.

The Buyer Conversation

When a CISO or procurement lead tells you their AI vendor "red-teamed the model and it passed," that claim is not false. It is also not sufficient. What it establishes is that a specific team, using a specific methodology, found no policy violations in a specific window before deployment. It says nothing about what the model does under adversarial conditions in production, against attack patterns that didn't exist when the test ran, or after any fine-tuning or configuration change since the test was conducted.

The federal procurement context makes this sharper. Agencies evaluating AI systems under OMB M-24-10 and the AI RMF are increasingly expected to demonstrate ongoing monitoring, not just pre-deployment assessment. "We tested it" satisfies a checkbox. Continuous adversarial evaluation satisfies a risk posture.

The question worth asking in that conversation: When did you last run adversarial evaluation against this model, and what changed in the system since the initial red-team?

It distinguishes a vendor who understands output safety as an ongoing operational practice from one who ran a pre-launch exercise and filed the report. And it signals to the buyer that you understand the difference, which is the more durable asset in that room.

Testing is necessary. It is not sufficient. The gap between those two statements is where the risk lives.