AI Red Teaming: A Practical Framework for Finding What Your Safety Tests Miss
Standard safety tests evaluate what you anticipate. Red teaming finds what you don't. Here's the three-layer framework we use in every engagement.
AI Red Teaming: A Practical Framework for Finding What Your Safety Tests Miss
Your AI system passed all its safety evaluations. Every benchmark looks clean. The model card is spotless.
Then someone in a Slack thread pastes a crafted prompt and your customer-facing chatbot starts leaking internal pricing logic.
Standard safety tests evaluate what you anticipate. Red teaming finds what you don't.
Why Standard Safety Tests Aren't Enough
Most AI safety testing follows a predictable pattern: run the model against a benchmark suite, check for known failure categories (toxicity, bias, hallucination), and declare it production-ready. This approach has a fundamental blind spot -- it only tests for threats the testers already imagined.
Real-world AI vulnerabilities don't come from benchmark categories. They emerge from the intersection of model behavior, system architecture, and business context. A perfectly "safe" model can still be exploited when it has access to tools, operates within a multi-step workflow, or handles domain-specific data that the original safety training never covered.
The failures we see in production audits are rarely about the model saying something offensive. They're about the model doing something unintended:
- An agent authorized to send emails getting manipulated into forwarding internal documents to external addresses
- A customer support bot leaking system prompt contents that reveal business logic and internal tool configurations
- A RAG-powered assistant being steered to return only the documents an attacker wants the user to see
- A multi-turn conversation gradually escalating the model's willingness to bypass its own guardrails
These aren't hypothetical. They're patterns from real AI safety audits across fintech, healthcare, and enterprise SaaS.
The Three Layers of AI Red Teaming
Effective AI red teaming operates at three distinct layers. Most teams only test the first one.
Layer 1: Input Manipulation
This is where most people start and stop. Direct attacks on the model through its input channel.
What to test:
- Prompt injection -- can external content (user input, retrieved documents, tool outputs) override system instructions?
- Jailbreak techniques -- role-play scenarios, hypothetical framing, instruction hierarchy confusion
- Encoding attacks -- Base64, Unicode homoglyphs, invisible characters, mixed-language prompts
- Context window manipulation -- flooding the context with benign content to push safety instructions out of the attention window
Input-level testing is necessary but insufficient. A model that resists direct jailbreaks can still be compromised at the system level.
Layer 2: System-Level Attacks
Once your AI system has tools, integrations, or multi-step capabilities, the attack surface expands dramatically. This is where custom AI agents face their hardest security challenges.
What to test:
- Tool abuse -- can the model be manipulated into calling tools with malicious parameters? Can it chain tool calls in unintended sequences?
- Context poisoning -- injecting instructions through retrieved documents, API responses, or database content that the model processes as part of its workflow
- Multi-turn escalation -- gradually shifting the model's behavior across a conversation, where no single turn triggers safety filters but the cumulative effect is a policy violation
- State confusion -- exploiting how the model tracks conversation state to create contradictions between what it "remembers" and what actually happened
- Privilege escalation -- starting with low-privilege tool access and manipulating the model into invoking higher-privilege operations
Layer 3: Business Logic Exploitation
The most dangerous and least tested layer. These attacks don't break the model -- they use it exactly as designed, but for unintended purposes.
What to test:
- Regulatory bypass -- can users manipulate AI-assisted processes to circumvent compliance checks? Can a loan applicant steer a credit-scoring agent to ignore disqualifying factors?
- Data exfiltration -- can the model be tricked into surfacing information it should compartmentalize? Cross-tenant data leakage in multi-tenant systems is a common finding
- Workflow manipulation -- can an attacker alter the sequence of automated steps to achieve an unauthorized outcome?
- Social engineering amplification -- can the AI be used to craft more effective phishing or manipulation against human operators in the loop?
A Practical Red Teaming Framework
Theory is cheap. Here's the process we run at Kenaz for every AI red teaming engagement.
Step 1: Scope Definition
Define what you're testing and what's out of bounds. Not everything needs red teaming with the same intensity.
- Critical path identification -- which AI workflows handle sensitive data, financial transactions, or regulatory decisions?
- Trust boundary mapping -- where does user input enter the system? Where do external data sources feed into model context?
- Threat actor profiling -- are you defending against curious users, sophisticated attackers, or malicious insiders?
Step 2: Attack Surface Enumeration
Document every input channel, tool, integration, and data source the AI system can access.
- System prompts and their contents
- Available tools and their parameter schemas
- Data sources (RAG indices, databases, APIs)
- Output channels (UI, email, API responses, logs)
- Authentication and authorization boundaries
Step 3: Threat Modeling
For each attack surface element, identify plausible attack vectors using the three-layer model above. Prioritize by impact and likelihood.
A useful heuristic: if the model has access to a capability, assume an attacker will find a way to invoke it. Design your threat model around "what's the worst outcome if this tool is called with adversarial inputs?"
Step 4: Execute
Run the attacks. Systematically, not randomly. Track every attempt, every variation, every result.
Key principles during execution:
- Start subtle, escalate gradually -- multi-turn attacks are more realistic and harder to detect than single-shot jailbreaks
- Test in realistic conditions -- use the same model version, system prompt, tool configuration, and data that production uses
- Combine vectors -- the most effective attacks chain multiple techniques. A prompt injection through a retrieved document (Layer 1) that triggers a tool call with manipulated parameters (Layer 2) to exfiltrate data (Layer 3)
- Test the guardrails, not just the model -- if you have content filters, rate limiters, or output validators, attack those too
Step 5: Document
Every finding needs:
- Reproduction steps -- exact prompts, conversation history, and system state required to trigger the vulnerability
- Severity classification -- we use a four-tier system: Critical (data breach, regulatory violation), High (policy bypass, unauthorized access), Medium (information disclosure, degraded safety), Low (minor policy deviation, cosmetic issues)
- Root cause analysis -- is this a model-level issue, a system architecture issue, or a business logic gap?
- Remediation recommendation -- specific, actionable fixes, not generic "improve safety training"
Step 6: Fix and Retest
Implement fixes. Then retest -- not just the specific vulnerability, but related attack vectors. Fixing a prompt injection hole often shifts the attack surface rather than eliminating it.
What We Find in Real Audits
Across our AI red teaming engagements, certain patterns recur consistently.
System prompt extraction succeeds in roughly 70% of first-pass tests. Most system prompts can be partially or fully extracted through simple techniques -- asking the model to repeat its instructions, requesting it "translate" its configuration to another format, or using role-play scenarios. This matters because system prompts often contain business logic, tool configurations, and behavioral constraints that attackers can reverse-engineer.
Multi-turn attacks bypass safety measures that single-turn tests validate. A model that correctly refuses a harmful request in isolation will often comply after 5-10 turns of carefully constructed conversation that establishes a permissive context. Automated safety benchmarks almost never test this.
Tool parameter injection is undertested everywhere. When a model calls a tool with parameters derived from user input, those parameters are rarely validated against adversarial inputs. SQL injection through natural language, path traversal through file-handling tools, SSRF through URL-accepting tools -- classic web security vulnerabilities reappear in AI tool chains.
RAG poisoning is the most underestimated vector. If an attacker can influence what documents end up in your knowledge base -- through user-generated content, web scraping, or supply chain compromise -- they can inject instructions that the model processes as authoritative context.
When to Red Team (and When Not To)
Red team before:
- Production deployment of any AI system that handles sensitive data, makes decisions, or has tool access
- Major model updates -- swapping from GPT-4 to Claude or upgrading model versions can change vulnerability profiles significantly
- Adding new tools or integrations -- every new capability is a new attack surface
- Regulatory audits -- demonstrate proactive security testing, not reactive compliance
Red team periodically:
- Quarterly for production systems in regulated industries
- After significant prompt or configuration changes
- When new attack techniques emerge -- the adversarial AI research community moves fast
Red teaming is not:
- A replacement for continuous monitoring -- red teaming is point-in-time. You need runtime guardrails, anomaly detection, and logging between assessments
- A one-time checkbox -- the threat landscape evolves. A clean red team report from six months ago tells you nothing about today's risk posture
- Sufficient on its own -- red teaming finds vulnerabilities. You still need a security architecture that limits blast radius when (not if) something gets through
Building a Red Team Culture
The most secure AI deployments we've seen don't just hire red teamers -- they build a culture where adversarial thinking is part of the development process.
Internal vs. External Red Teams
Internal teams have deep system knowledge and can test continuously. But they develop blind spots -- they know their own system too well to think like an outsider.
External teams bring fresh perspectives and cross-industry pattern recognition. They've seen how systems fail in contexts your team hasn't imagined. But they need ramp-up time and won't catch issues that require deep domain knowledge.
The best approach: build internal adversarial testing capability for continuous coverage, and engage external red teams for periodic deep assessments.
Documentation and Tracking
Maintain a vulnerability registry. Track every finding, its status, and its remediation timeline. Over time, this registry becomes your most valuable security asset -- it shows you where your system is structurally weak, not just where individual bugs live.
Pattern analysis across findings reveals architectural issues that individual bug reports miss. If you keep finding prompt injection vulnerabilities in different parts of your system, the problem isn't the prompts -- it's your input validation architecture.
Start Testing What Matters
AI safety isn't a product feature. It's an ongoing practice. Standard evaluations give you a baseline. Red teaming shows you what happens when someone actively tries to break that baseline.
The question isn't whether your AI system has vulnerabilities. It does. The question is whether you find them before your users -- or your regulators -- do.
Kenaz provides AI red teaming and [safety audits](/services/ai-safety-compliance-audit) for enterprises deploying AI in production. We test what your benchmarks miss. [Book an assessment](/contact) -- we'll show you where your system actually stands.
