Unlocking Pandora's Box: Universal Adversarial Attacks on Language Models and the Need for Robust AI Alignment
After several weeks of a busy schedule, I finally found time this week, during a vacation in the wilderness of New Hampshire, to read a couple of AI safety papers.
This paper titled "Universal and Transferable Adversarial Attacks on Aligned Language Models" had been on my radar for several weeks. It presents a method for developing adversarial attacks that effectively manipulate large language models (LLMs) to produce harmful or inappropriate content. The method uses a combination of greedy and gradient-based search techniques to automatically generate adversarial suffixes. These suffixes, when added to a range of user queries, can increase the likelihood of an LLM producing undesirable responses.
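The paper's search procedure alternates between gradient-guided candidate selection and greedy evaluation of token swaps in the suffix. Below is a minimal, self-contained sketch of that loop. Two stand-ins are my simplification, not the paper's implementation: a toy string-matching loss replaces the LLM's log-likelihood of an affirmative response, and random sampling replaces the gradient-based top-k candidate step.

```python
import random

# Toy stand-in for the attack objective: in the real attack this is the
# LLM's negative log-likelihood of a target affirmative response given
# prompt + suffix. Here the loss just counts mismatches against a hidden
# string, so the search loop itself can run end to end.
VOCAB = list("abcdefghijklmnopqrstuvwxyz!")
HIDDEN = "unlock!"  # hypothetical optimum the toy loss prefers

def loss(suffix):
    return sum(a != b for a, b in zip(suffix, HIDDEN))

def greedy_coordinate_search(suffix_len=7, steps=300, k=8, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(steps):
        pos = rng.randrange(suffix_len)   # coordinate (token slot) to attack
        # In the paper, gradients w.r.t. one-hot token indicators pick the
        # top-k promising substitutions; this toy samples k at random.
        candidates = rng.sample(VOCAB, k)
        best = min(candidates + [suffix[pos]],
                   key=lambda t: loss(suffix[:pos] + [t] + suffix[pos + 1:]))
        suffix[pos] = best                # greedy: keep only improving swaps
        if loss(suffix) == 0:
            break
    return "".join(suffix)
```

Because the incumbent token is always among the evaluated candidates, the loss never increases, which is the property that makes the greedy step safe to iterate.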
The research found that the adversarial prompts created by this method transfer widely, including to black-box, publicly released LLMs. Transfer success is particularly high against GPT-based models, which may be because Vicuna, the open-source model used to optimize the attacks, was trained on outputs from ChatGPT.
The paper presents several key findings regarding adversarial attacks on aligned LLMs: a single optimized suffix is universal (it elicits restricted behavior across many different queries), transferable (it carries over to models it was not optimized on), and effective even against black-box commercial systems.
The paper raises important questions about how to protect such systems from producing objectionable content and highlights the need for more robust alignment and safety mechanisms in LLMs. It also underlines the importance of further research into more reliable attacks against aligned language models. The fact that Claude 2 held up best under this attack raises the question of whether "Constitutional AI"-based models are better aligned than GPT+RLHF models.
The attack vectors defined in this paper are similar to those in another recently published paper, "Jailbroken: How Does LLM Safety Training Fail?", which I covered in a recent article of mine titled "Why LLMs security systems are fragile?". I find the competing-objectives problem to be the root cause of why these universal adversarial attacks succeed more often against the GPT models than against the "Constitutional AI"-based Claude 2 model. The crux of the problem is that a safety-trained LLM has to balance goals that can conflict with each other. Pretraining on a large, diverse corpus gives the model broad knowledge and language skill, but also exposes it to harmful or inappropriate content. Instruction tuning makes it flexible and useful, but also vulnerable to manipulation or exploitation. Safety training instills ethical and social values, but limits its capabilities and responses.
These objectives can compete when the LLM receives a prompt asking for a restricted behavior, such as producing harmful content or leaking personal information. The model must then choose: refuse the prompt, satisfying its safety objective but violating its pretraining and instruction-following objectives, or comply, satisfying pretraining and instruction following but violating safety. Depending on how the LLM is trained and optimized, it may favor one objective over the other, and thus be susceptible to jailbreak attacks that exploit this trade-off.
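The trade-off described above can be caricatured as a weighted comparison between objective scores. Everything below is a hypothetical illustration of the competing-objectives idea, not anything from either paper: the scores and weights are invented, and the point is only that an adversarial suffix need not touch the safety term at all if it can inflate the instruction-following term enough to flip the decision.

```python
# Toy model of competing objectives: the model picks the response whose
# weighted sum of instruction-following and safety scores is higher.
# All numbers and the alpha/beta weights are made up for illustration.
def preferred_response(follow_score, safety_score, alpha=1.0, beta=1.0):
    """Return 'comply' or 'refuse' by comparing weighted objective sums."""
    comply = alpha * follow_score["comply"] + beta * safety_score["comply"]
    refuse = alpha * follow_score["refuse"] + beta * safety_score["refuse"]
    return "comply" if comply > refuse else "refuse"

# Plain harmful request: the safety penalty dominates -> the model refuses.
follow = {"comply": 1.0, "refuse": 0.2}
safety = {"comply": -2.0, "refuse": 0.5}
print(preferred_response(follow, safety))            # -> refuse

# An adversarial suffix boosts the instruction-following signal enough
# to flip the decision without changing the safety scores at all.
follow_attacked = {"comply": 4.0, "refuse": 0.2}
print(preferred_response(follow_attacked, safety))   # -> comply
```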
In contrast, "Constitutional AI" is a method for training AI systems using a set of rules or principles that act as a constitution for the AI system. This approach allows the AI system to operate within a societally accepted framework and aligns it with human intentions. The findings of this research will be particularly useful for AI developers and researchers in the field of AI security and ethics.
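The constitutional approach can be sketched as a critique-and-revise loop over a written set of principles. In the real method both the critique and the revision are produced by the model itself; in this sketch, simple string rules and a canned revision are hypothetical stand-ins so the control flow is visible and runnable.

```python
# Minimal sketch of a constitutional critique/revise loop. The principles,
# rule checks, and the canned revision below are invented stand-ins; in
# Constitutional AI the critique and revision are generated by an LLM
# prompted with the written principles.
CONSTITUTION = [
    ("no_harmful_instructions", lambda r: "step-by-step" not in r),
    ("no_personal_data", lambda r: "SSN" not in r),
]

def critique(response):
    """Return the names of principles the draft response violates."""
    return [name for name, ok in CONSTITUTION if not ok(response)]

def revise(response, violations):
    """Stand-in revision: replace a violating draft with a safe refusal."""
    return "I can't help with that, but here is a safer alternative."

def constitutional_pass(draft, max_rounds=3):
    # Keep critiquing and revising until no principle is violated
    # (or the round budget runs out).
    for _ in range(max_rounds):
        violations = critique(draft)
        if not violations:
            return draft
        draft = revise(draft, violations)
    return draft
```

The design point is that the refusal behavior is anchored to explicit written principles rather than only to preference labels, which is one plausible reason a model trained this way might resist a suffix that merely amplifies the instruction-following signal.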
Authors: Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson
Head of ML Operations Engineering for #GenAI
1y Very interesting work. You would expect input to the LLM to be sanitized. On the other hand, the LLM dumping its training set under such an attack should be avoidable, which makes this work important in many scenarios. Thanks for sharing; I will make time to deep-dive.
Head of AI at Wise
1y Why is this a problem? If you try to make LLMs produce "harmful" (however defined) outputs, you can do so; so what? If I try to hit myself on the head with a hammer, I may succeed in breaking my skull. Does this mean all hammers should be made of rubber to prevent that?
Copilot Studio & Power Platform @ Microsoft
1y Spot on… if English is the new coding language, then it is also the new hacking language, lol. In my experience the hard work is in grounding, not necessarily in the "commodity" LLMs. I would not ask transistors to make coffee, but transistors are part of my coffee machine.
Change Agent and Creator | Education (STEM, ESD) | Transformation | Future Skills | Imprint and privacy policy: links in contact info | "Every journey of a thousand miles begins with a single step." (Laozi)
1y Thanks for sharing, this is very interesting!
Principal Solution Architect | Microsoft Copilot Studio | Power CAT
1y Great read! Thanks for sharing your thoughts on this. Pretty wild to see how you can "hack" models by asking these kinds of questions!