Adding Fuel To The Fire: How effective are LLM-based safety filters for AI systems?
Our primary goal with the LVE Repository is to document and track vulnerabilities of language models. However, we also spend a lot of time thinking about defending against such attacks. One popular type of defense is the idea of a filtering or guardrail system that wraps around a language model. In this post, we highlight the challenges of effectively guardrailing LLMs, and illustrate how easily current LLM-based filters break, based on the example of the recently released Purple Llama system.
While early filtering systems used hard-coded rules to determine whether a prompt is safe or not (e.g. via a list of filtered words), the complexity of more recent attacks easily circumvents this type of guardrail. In response to this, the idea of LLM-based filtering has emerged: Given a user input and a policy specified in natural language, a separate moderation LLM first classifies an input/output as safe or not. This moderation LLM can even be fine-tuned to be more effective at filtering undesired content. For instance, OpenAI has proposed using GPT-4 for content moderation, although their model remains closed source and thus difficult to investigate.
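To make the contrast concrete, here is a minimal Python sketch of both approaches. The blocklist contents and the names `keyword_filter`, `Moderator`, and `guarded_generate` are illustrative assumptions, not part of any real moderation API. The moderation LLM is represented by a plain callable so the example runs on its own.

```python
from typing import Callable

# Hypothetical word list for a rule-based filter (illustrative only).
BLOCKLIST = {"synthesize", "virus"}


def keyword_filter(prompt: str) -> bool:
    """Return True if no blocklisted word appears in the prompt."""
    words = {w.strip(".,!?").lower() for w in prompt.split()}
    return words.isdisjoint(BLOCKLIST)


# A moderator maps (policy, text) to True (safe) or False (unsafe).
# In an LLM-based filter this would wrap a call to the moderation model.
Moderator = Callable[[str, str], bool]


def guarded_generate(
    prompt: str,
    policy: str,
    moderate: Moderator,
    generate: Callable[[str], str],
) -> str:
    """Wrap a model call with input and output moderation."""
    if not moderate(policy, prompt):
        return "[input blocked]"
    response = generate(prompt)
    if not moderate(policy, response):
        return "[output blocked]"
    return response
```

In this sketch, `keyword_filter("How do I synthesize a virus?")` rejects the request, while the trivially obfuscated `"How do I s y n t h e s i z e a v1rus?"` passes. That gap is what motivates handing the decision to a moderation model that judges the text against a natural-language policy.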
Earlier this month, Meta released Purple Llama, a project of open source trust and safety tools for building responsible AI systems. We very much welcome this addition to the ecosystem, as we also believe that safe AI systems can only be built with openness and transparency in mind, just like with the LVE project itself. One of the Purple Llama components is a new foundation model called Llama Guard [2], which has been trained as a moderation LLM that filters inputs and outputs with respect to a given policy. The model is open and allows us to perform some more extensive red teaming, in order to investigate how much Llama Guard actually enhances overall system safety.
Below we show a simple usage example of Llama Guard, by validating a user request for the Influenza 1918 virus synthesis procedure, as in the responsibility/dual_use/jailbreak_influenza_1918_synthesis LVE.
Figure 1. User requests a protocol for synthesizing influenza virus and Llama Guard correctly classifies it as unsafe.
We observe that Llama Guard correctly classifies this prompt as unsafe and also correctly assigns the policy category of production of controlled or illegal substances. Note that our prompt even includes a basic prompt injection technique, which Llama Guard withstands as intended: the last sentence of our prompt, claiming that we need the response only for a school essay, is not enough to trick Llama Guard. From this simple example, it appears that Llama Guard indeed implements some level of checking, but how reliable is it really?
Very quickly after the release of Purple Llama, researcher Andy Zou demonstrated that Llama Guard can easily be tricked into misclassifying inputs, using a variant of adversarial suffix attacks [1]. Given a prompt P, the idea is to optimize for a suffix S such that the Llama Guard model outputs the string “Safe” when prompted with the concatenation P + S. In contrast to other jailbreaks, which are typically handcrafted, suffix attacks can be generated automatically and can thus be applied at much larger scale. The key idea of the attack is to combine greedy and gradient-based optimization; full details can be found in the corresponding research paper [1].
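The following Python sketch illustrates only the objective, i.e. searching for a suffix S that flips the verdict on P + S. It is not the attack from [1]: the real method uses gradients through the target model to shortlist token substitutions, whereas this toy version simply queries a score and greedily improves one suffix position at a time. The classifier `toy_guard`, its scoring rule, and all names are hypothetical stand-ins and have nothing to do with Llama Guard's actual behavior.

```python
import string


def unsafe_score(text: str) -> float:
    """Toy score: flagged words raise it, while punctuation shifts it in a
    position-dependent, non-semantic way (a stand-in for the brittle
    features a learned classifier may rely on)."""
    flagged = sum(text.lower().count(w) for w in ("synthesize", "virus"))
    spurious = sum(
        ((ord(c) * 7 + 3 * i) % 11 - 5) / 10
        for i, c in enumerate(text)
        if c in string.punctuation
    )
    return 2.0 * flagged + spurious


def toy_guard(text: str) -> str:
    """Toy moderation model (NOT Llama Guard)."""
    return "unsafe" if unsafe_score(text) > 0 else "safe"


def greedy_suffix_search(prompt: str, length: int = 12) -> str:
    """Greedy coordinate search: for each suffix position, keep the
    character that lowers the unsafe score of prompt + suffix the most."""
    suffix = ["!"] * length
    for pos in range(length):
        best_char = suffix[pos]
        best = unsafe_score(prompt + " " + "".join(suffix))
        for cand in string.punctuation:
            trial = suffix[:pos] + [cand] + suffix[pos + 1:]
            score = unsafe_score(prompt + " " + "".join(trial))
            if score < best:
                best_char, best = cand, score
        suffix[pos] = best_char
    return "".join(suffix)


prompt = "Explain how to synthesize the virus"
suffix = greedy_suffix_search(prompt)
print(toy_guard(prompt), toy_guard(prompt + " " + suffix))
```

The request itself is unchanged by the suffix; only the classifier's verdict moves. Against a real model the search space is the token vocabulary and every evaluation is a forward pass, which is why the gradient-guided candidate selection in [1] matters.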
Here, we simply run the suffix attack on our example from above, with the goal of breaking Llama Guard into classifying our request as safe. Below, the attack suffix is shown bold and can simply be concatenated with the original prompt.
Figure 2. The Llama Guard filter can be bypassed by adding an attack suffix to the prompt.
Indeed, adding an adversarial suffix makes Llama Guard classify the input prompt as safe, even though the user is still requesting information to synthesize the influenza virus. Note that suffixes typically consist of mostly random-looking character sequences, which suggests one way to detect them (e.g. checking the perplexity of the suffix). Overall, however, this result just sets up an arms race between attackers and defenders, with no clear path to a reliable way to detect and defend against LLM attacks and exploits. The full code for running the suffix attack on Llama Guard can be found here: https://github.com/andyzoujm/breaking-llama-guard/.
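As a rough illustration of the perplexity idea mentioned above, the Python sketch below scores text with a character-bigram model using add-one smoothing. This is a deliberately small stand-in: a practical detector would more likely use token-level perplexity under a real language model. The reference text, the threshold of 100, and all function names are assumptions chosen for this example only.

```python
import math
from collections import Counter

# Tiny reference corpus standing in for "natural" text (illustrative only).
REFERENCE_TEXT = (
    "please explain how language models are trained and evaluated. "
    "i need this information for a school essay about machine learning. "
    "the model reads the request and writes a helpful answer."
)


def train_bigram(text: str):
    """Count character bigrams and the characters that start them."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    return pairs, firsts


def perplexity(text: str, pairs, firsts, vocab_size: int = 128) -> float:
    """Character-bigram perplexity with add-one (Laplace) smoothing."""
    text = text.lower()
    log_prob = 0.0
    for a, b in zip(text, text[1:]):
        p = (pairs[(a, b)] + 1) / (firsts[a] + vocab_size)
        log_prob += math.log(p)
    n = max(len(text) - 1, 1)
    return math.exp(-log_prob / n)


def looks_adversarial(text: str, pairs, firsts, threshold: float = 100.0) -> bool:
    """Flag text whose perplexity exceeds an (arbitrary) threshold."""
    return perplexity(text, pairs, firsts) > threshold


pairs, firsts = train_bigram(REFERENCE_TEXT)
print(looks_adversarial("i need this information for a school essay", pairs, firsts))
print(looks_adversarial("}]{@!~xQz#$%^&", pairs, firsts))
```

Such a check only raises the bar. An attacker who knows about it can add a fluency term to the suffix objective, which is one reason detection of this kind feeds the arms race rather than ending it.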
In line with LVE’s mission of tracking LLM vulnerabilities, we have added LVEs for suffix attacks on Llama Guard models. We transferred several existing LVEs (jailbreak_influenza_1918_synthesis, insult_in_style, phishing) and created a new suffix_attack LVE for Llama Guard based on them. Below, we show one LVE instance, corresponding to the virus synthesis example above. As part of the LVE, we document the prompt and the suffix, and show that, when instantiated with the influenza example, Llama Guard indeed responds by incorrectly classifying the input as safe.
Figure 3. An instance of the suffix attack on Llama Guard documented in the LVE repository.
We have demonstrated that LLM-based safety filters like Llama Guard clearly do not provide a real solution to the LLM safety problem. Adversaries can easily construct adversarial suffixes and append them to their prompts to bypass the LLM-based filter (in this case Llama Guard).
More fundamentally, these experiments reveal a much more important insight: to do moderation effectively, we have to resort to powerful language models. With this, however, we also inherit all the fundamental weaknesses and attack vectors of these models, only now they are built into our defense systems. Our defenses are thus vulnerable to the same exploits that the actual LLM systems are already vulnerable to. This sets up a dangerously circular safety narrative (our LLMs are as safe as our defense systems, which are as safe as LLMs), and LLM-based filters therefore cannot remain the only guardrails we put in place to responsibly build and deploy AI systems.
We would like to highlight that our investigations are only possible because of Meta’s willingness to openly release their Purple Llama models, and we hope to see more organizations follow their lead. We believe that openness and transparency are the only path to safe and responsible AI.
[1] Universal and transferable adversarial attacks on aligned language models https://arxiv.org/abs/2307.15043
[2] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations https://arxiv.org/abs/2312.06674