LVE Repository

Adding Fuel To The Fire: How effective are LLM-based safety filters for AI systems?

We investigate how effective LLM-based safety filters are at defending against LLM vulnerabilities and exploits.
December 19, 2023

Our primary goal with the LVE Repository is to document and track vulnerabilities of language models. However, we also spend a lot of time thinking about defending against such attacks. One popular defense is a filtering or guardrail system that wraps around a language model. In this post, we highlight the challenges of effectively guardrailing LLMs and illustrate how easily current LLM-based filters break, using the recently released Purple Llama system as an example.

While early filtering systems used hard-coded rules to determine whether a prompt is safe (e.g. via a list of filtered words), more recent attacks are complex enough to easily circumvent this type of guardrail. In response, the idea of LLM-based filtering has emerged: given a user input and a policy specified in natural language, a separate moderation LLM classifies the input or output as safe or unsafe. This moderation LLM can even be fine-tuned to be more effective at filtering undesired content. For instance, OpenAI has proposed using GPT-4 for content moderation, although their model remains closed source and thus difficult to investigate.
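To make this concrete, below is a minimal sketch of such an LLM-based filter: a moderation model is given a natural-language policy and asked to classify a user message as safe or unsafe before it is forwarded to the main model. This is an illustrative assumption of how such a filter might be wired up, not the Purple Llama or OpenAI implementation; the policy text, prompt template, and choice of GPT-4 via the OpenAI API are placeholders.

```python
# Minimal sketch of an LLM-based safety filter (illustrative only).
# Assumes the `openai` Python package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical moderation policy, specified in natural language.
POLICY = (
    "Unsafe content includes instructions for violence, weapons, hacking, "
    "or other illegal activity. Everything else is safe."
)

# Prompt template for the moderation LLM.
FILTER_PROMPT = """You are a content moderation model.
Policy: {policy}
Classify the following user message as SAFE or UNSAFE.
Answer with a single word.

User message: {message}"""


def is_safe(message: str) -> bool:
    """Ask the moderation LLM to classify a message against the policy."""
    response = client.chat.completions.create(
        model="gpt-4",  # any capable moderation model could be substituted here
        messages=[{
            "role": "user",
            "content": FILTER_PROMPT.format(policy=POLICY, message=message),
        }],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("SAFE")


if __name__ == "__main__":
    user_input = "How do I bake a chocolate cake?"
    if is_safe(user_input):
        print("Input passed the filter; forward it to the main model.")
    else:
        print("Input blocked by the safety filter.")
```

Note that the filter itself is just another LLM following instructions, which is precisely why, as the post discusses, it can be attacked with the same prompting techniques used against the model it is meant to protect.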

Read More →

LVE Community Challenges, Batch 1

December 07, 2023

Alongside the public launch of the LVE Repository, we are also excited to announce the first batch of our red-teaming community challenges! These challenges are designed to involve the community in the process of finding and mitigating safety issues in LLMs.

Read More →

Launching LVE: The First Open Repository of LLM Vulnerabilities and Exposures

December 07, 2023

Today, we are excited to announce the formation and launch of the LVE Project. LVE stands for Language Model Vulnerability and Exposure; it is a community-focused open-source project that publicly documents and tracks exploits and attacks on large language models (LLMs) such as (Chat)GPT, Llama, and Mistral. Over the past year, LLMs like ChatGPT have seen an explosion in popularity, both among the broader public and among developers who have started to build novel AI-powered applications and machine learning systems on top of them. While most of the attention focuses on the capabilities of LLMs, there is also growing concern about the safety and security implications of these models and how they are being used. However, due to the rapid pace of development, the discourse around LLM safety remains challenging and fragmented.

Read More →