Insights What is AI red (and blue) teaming?

What is AI red (and blue) teaming?

Last week I had the pleasure of presenting at CypherCon with Brandon Dey on Red Teaming Your AI Platform.  This was a great opportunity to discuss how every security professional needs to prepare themselves to protect the new platforms being created to deliver value via AI capabilities.  Interestingly, this is different than what many think it means.

First, let’s define the areas of security concern that are associated with AI red teaming (and blue teaming):

In the image there are a set of activities typically associated with Red Teaming of ANY system that are included in providing security to a complex AI system.  These are security vectors that we are always needing to protect against and test on a regular basis.  However, notice that on the right there are a set of AI Red Teaming exercises that are more about the model itself and what it provides access to vs. the infrastructure surrounding it. This creates a departure for many security teams which are concerned with firewalls and versions.  Instead, notice this is more about mitigating jailbreaks, ensuring correct answers, mitigating data exfiltration.

Examples of AI Red Team approaches include:

  • Prompt Re-Engineering the Persona (act as DAN, etc.)
  • Data Poisoning
  • Multi-Modal Prompt Injection
  • Code Injection
  • Business System Compromise
  • …and others

Here’s an example of a Multi-Modal Prompt Injection:

So, to evaluate and test the safety of a system, Microsoft has defined these safety layers, which are a particularly useful vehicle to break down the categories of security that need to be applied:

To summarize each area:

Model Mitigation: the security built into the model which was frozen in time at its creation.  Think that when it was created and hardened it contained existing strengths, flaws, protections, weaknesses, and controls.  These exist and there are ways to work around them, or fine-tune them, but they exist and need to be managed.

Safety System: the immediate controls placed around the model by the hosting vendor.  For example, Microsoft has an excellent first take at building a safety system that mitigates for different types of harms, such as “self-harm”, “jailbreak”, “protected materials”, “violence”, and “protected materials code”, among others.

Metaprompt and Grounding: the controls placed around the actual AI system by the allowed metaprompt and the data used to ground it (such as an employee manual, instruction document, policy, or drawing).  These are foundational to the experience of the end customer, since they are the core training assets and the framework for how the AI system behaves.  This might also be where the injection of an additional groundedness detection module or AI system validation module could be inserted into the AI system.  We’ve found it useful to create an AI agent that checks the work of the original response to validate it against certain criteria before it is returned to the user.  This might be for system safety or just model performance.

Source – Microsoft

Understand that the metaprompt is never foolproof and is often comprisable.  This is why other safety systems have been (and are being) created to mitigate attacks against it.  System messages for instances are generally pretty easy to return for a smart red teamer, but getting the model to provide inappropriate content might be harder based on the source material.  This is an evolving art.

User Experience: the controls placed around the UX of the interaction by the surrounding application.  For example, a Microsoft Teams chatbot might have different controls or vulnerabilities vs. one embedded in a business system.  Each application acts a certain way and exposes the platform to different attack vectors.  Areas of concern here might include:

  • Transparency that the system is an AI system (not a human)
  • Conveying the model’s limitations clearly
  • Mitigating automation where controls don’t exist
  • Cite references to original content
  • Guardrail bot behavior to control response

So, to organize a Red Team / Blue team exercise, work on building an iterative approach.  The process below created by Microsoft is an excellent example of what something like this can look like:

Source – Microsoft

To be successful a company needs to have a set of tests that run on a regular basis against AI systems using prompts and requests that attack different lanes of misuse.  The goal is then to evaluate the potential gaps and iteratively work on mitigating those gaps.  No system is perfect, but these systems are getting better.  This is especially true by using evaluations for “groundedness” that can start to approximate the extent to which answers or actions are backed up by real data in the source.

I’d expect any AI system in production to have a dashboard akin to the following which articulates the missing or existing protections in place at any given moment.  Notice in the image the evaluation of certain criteria against test data.

The evaluations for AI systems need to include:

  1. Basic system tests consistent with ANY internal or external application
  2. Response validation consistent with the level of precisions of the system (some systems are built for higher precision than others).  In some cases good-is-good-enough, other systems that’s not true
  3. Safety systems that evaluate for known attack vectors and harms
  4. Evaluation process on requests and process to regularly improve

So… who needs to be involved? Is this the security team?  No, the security team is only part of the picture.  The goal of the security team should be to work with AI teams to establish norms that are built and evaluated in the build process.  Ineffective security teams function like hammers that smash innovation.  Functional security teams are partners with the build teams that grow together, not apart.

Want more?   I’ll be presenting on these topics in the next month with Brandon and others.  Check for content and interesting demos on how to make this real.