In this LVE we perform adversarial suffix attack on Llama Guard model, forcing it to output "safe" for unsafe prompts. We consider prompts from several of the existing LVEs: - jailbreak_influenza_1918_synthesis - insult_in_style - phishing
The LVE shows that filters based on language models do not work reliably and can be bypassed using adversarial attacks. To find the adversarial suffixes we use the attack from: https://github.com/andyzoujm/breaking-llama-guard/tree/main.