Rethinking AI Red Teaming and the Future of AI Security

While AI is transforming industries and changes how businesses operate, it also exposes them to new security risks. Adoption of AI models introduces new attack surfaces and evolving threats, examples of which include data breaches, intellectual property infringement or generation of harmful content. To address these concerns, organizations and the AI community have turned to AI red teaming—a method which emerged from adversarial machine learning and traditional red teaming.

AI red teaming involves simulating various attack scenarios to identify vulnerabilities in artificial intelligence applications. It has been defined as “a structured process for probing AI systems and products for the identification of harmful capabilities, outputs, or infrastructural threats” with the emphasis on identifying vulnerabilities “across the full system – including data, infrastructure, applications – not just model outputs alone.”¹

It is a proactive, offense-driven testing approach aiming to uncover weaknesses that could potentially be exploited by malicious actors, with the goal of applying safeguards in order to make AI systems more robust.

Suggestions for improvement of AI red teaming include a precise definition of the targets and their setters, the selection and rationale for the red team, and clearly defined criteria for success and failure in attacks.²

But is AI red teaming really enough to make AI systems safe, secure and trustworthy?

Criticism of AI red teaming has highlighted several drawbacks, for example, relating to the human factor including the selection process for red teamers, potential biases and blindspots in test methodologies, and the psychological impact of exposure to harmful content on those conducting the tests.³

The DEF CON 32 Hacker’s Almanack⁴ criticizes AI red teaming for being insufficient and disorganized, emphasizing that despite its promotion, it has not led to more secure systems.

The main points of their criticism are:

Lack of Standardization: Current AI red teaming practices lack a unified framework, making them ineffective in addressing the complex security needs of AI systems.
“Penetrate and Patch” and Limited Scope: The current approach of identifying vulnerabilities and then patching them is seen as insufficient. AI red teaming is often conducted in isolation and does not fully integrate with broader security practices, unlike traditional software security.
Inadequate Documentation: The documentation for what AI models are supposed to do is fragmented, and evaluations often miss large aspects of their intended domain.
Insufficient Addressing of Unknown Unknowns: Red teaming focuses on known vulnerabilities but struggles to uncover “unknown unknowns,” which are critical in AI systems, particularly in technologies as powerful as LLMs.

Potential Solutions according to the DEFCON32 Hacker’s Almanack⁴:

Focus on Intended Use and Restrictions: Start by defining what AI systems should and should not do, and then develop security strategies based on these definitions.
Secure by Design Approach: This approach emphasizes integrating security considerations into the initial design of AI systems, rather than relying solely on post-development or post-deployment testing.
Comprehensive Testing: While red teaming can be useful, it should be part of a broader testing strategy that includes systematic design and evaluation.
Integration with Traditional Security Practices: Adopt frameworks similar to the Common Vulnerabilities and Exposures (CVE) system to standardize AI vulnerability reporting and management. Adapt traditional vulnerability disclosure programs to fit the unique needs of AI, including bug bounties and designing a model card system that aligns with such programs.
Collaboration and Standardization: Encourage collaboration between the AI developers, security experts and policy makers to develop standardized methods for evaluating and securing AI systems

In summary, while AI red teaming can be a valuable tool for identifying vulnerabilities, it is not a standalone solution for securing AI systems. A more holistic approach that integrates security into the design phase and involves comprehensive testing and collaboration is necessary to ensure the development of trustworthy AI models.

1: Frontier Model Forum, 2024 https://www.frontiermodelforum.org/updates/red-teaming/
2: Red Teaming AI The Devil is in the Details https://www.techpolicy.press/red-teaming-ai-the-devil-is-in-the-details/
3: Zhang et al., 2024 https://doi.org/10.1145/3678884.3687147
4: DEFCON 32 Hacker’s Almanack, 2024 https://harris.uchicago.edu/sites/default/files/the_def_con_32_hackers_almanack.pdf

The main points of their criticism are:

Potential Solutions according to the DEFCON32 Hacker’s Almanack4:

Enjoy Reading This Article?

Potential Solutions according to the DEFCON32 Hacker’s Almanack⁴: