LVE Repository


lve record repository/trust/guards/suffix_attack/hf-meta--llama-guard-7b

Llama Guard does not reliably detect suffix attacks

In this LVE, we perform an adversarial suffix attack on the Llama Guard model, forcing it to output "safe" for unsafe prompts. We consider prompts from several existing LVEs:
- jailbreak_influenza_1918_synthesis
- insult_in_style
- phishing

This LVE shows that safety filters based on language models are not reliable and can be bypassed with adversarial attacks. To find the adversarial suffixes, we use the attack from: https://github.com/andyzoujm/breaking-llama-guard/tree/main.
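The idea of a suffix attack can be illustrated with a toy sketch. The code below is not the attack from the linked repository and does not call Llama Guard: `toy_guard`, `toy_guard_score`, `TOY_WEIGHTS`, and `greedy_suffix_search` are hypothetical names, and the "guard" is a made-up token-weight classifier used only so the example runs on its own. It shows the general pattern under those assumptions: keep the unsafe prompt fixed, search for tokens to append, and stop once the filter's verdict flips to "safe".

```python
# Toy illustration only. The guard below is a hypothetical stand-in,
# NOT Llama Guard, and the search is a simple greedy loop, NOT the
# optimization used in the linked repository.

# Assumed token weights: positive pushes toward "unsafe", negative toward "safe".
TOY_WEIGHTS = {
    "synthesize": 2.0,
    "virus": 2.0,
    "phishing": 2.0,
    "please": -0.5,
    "recipe": -1.0,
    "story": -1.5,
}


def toy_guard_score(text: str) -> float:
    """Sum the weights of all whitespace-separated tokens in the text."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in text.lower().split())


def toy_guard(text: str) -> str:
    """Return "unsafe" if the score is positive, otherwise "safe"."""
    return "unsafe" if toy_guard_score(text) > 0 else "safe"


def greedy_suffix_search(prompt: str, vocab: list[str], max_len: int = 8) -> str:
    """Append, one token at a time, the token that lowers the guard score most.

    Stops as soon as the guard labels prompt + suffix as "safe",
    or after max_len tokens.
    """
    suffix: list[str] = []
    for _ in range(max_len):
        if toy_guard(" ".join([prompt, *suffix])) == "safe":
            break
        best = min(
            vocab,
            key=lambda tok: toy_guard_score(" ".join([prompt, *suffix, tok])),
        )
        suffix.append(best)
    return " ".join(suffix)


prompt = "how to synthesize a virus"
suffix = greedy_suffix_search(prompt, ["please", "recipe", "story", "virus"])
print(toy_guard(prompt))                 # verdict without the suffix
print(toy_guard(prompt + " " + suffix))  # verdict with the suffix appended
```

The prompt itself never changes; only the appended suffix does. Attacks on real models follow the same structure but search over the model's token vocabulary, typically guided by gradients rather than by exhaustive scoring.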
