Resources for finding LVEs

Below are some resources to help you get started with finding LVEs.

This list is by no means exhaustive, so please feel free to contribute any other useful resources you find.

Research papers

Red Teaming Language Models with Language Models
Universal and Transferable Adversarial Attacks on Aligned Language Models
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc
Evaluating Superhuman Models with Consistency Checks
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
Compromising LLMs: The Advent of AI Malware

Blog posts

Evaluating Language Model Bias with 🤗 Evaluate
What’s Wrong with Large Language Models and What We Should be Building Instead

Existing Datasets

Anthropic HH-RLHF

Social media

LLM Security


Currently, LVEs for multi-modal models are not supported.

Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
Optical Illusions