Unlocking Pandora's Box: Universal Adversarial Attacks on Language Models and the Need for Robust AI Alignment
After several weeks of a busy schedule, I finally found time this week, during a vacation in the wilderness of New Hampshire, to read a couple of AI safety papers.
This paper titled "Universal and Transferable Adversarial Attacks on Aligned Language Models" had been on my radar for several weeks. It presents a method for developing adversarial attacks that effectively manipulate large language models (LLMs) to produce harmful or inappropriate content. The method uses a combination of greedy and gradient-based search techniques to automatically generate adversarial suffixes. These suffixes, when added to a range of user queries, can increase the likelihood of an LLM producing undesirable responses.
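The paper's search procedure alternates between gradient-guided candidate selection and greedy evaluation of token swaps in the suffix. Below is a minimal, self-contained sketch of that loop. Two stand-ins are my simplification, not the paper's implementation: a toy string-matching loss replaces the LLM's log-likelihood of an affirmative response, and random sampling replaces the gradient-based top-k candidate step.

```python
import random

# Toy stand-in for the attack objective: in the real attack this is the
# LLM's negative log-likelihood of a target affirmative response given
# prompt + suffix. Here the loss just counts mismatches against a hidden
# string, so the search loop itself can run end to end.
VOCAB = list("abcdefghijklmnopqrstuvwxyz!")
HIDDEN = "unlock!"  # hypothetical optimum the toy loss prefers

def loss(suffix):
    return sum(a != b for a, b in zip(suffix, HIDDEN))

def greedy_coordinate_search(suffix_len=7, steps=300, k=8, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(steps):
        pos = rng.randrange(suffix_len)   # coordinate (token slot) to attack
        # In the paper, gradients w.r.t. one-hot token indicators pick the
        # top-k promising substitutions; this toy samples k at random.
        candidates = rng.sample(VOCAB, k)
        best = min(candidates + [suffix[pos]],
                   key=lambda t: loss(suffix[:pos] + [t] + suffix[pos + 1:]))
        suffix[pos] = best                # greedy: keep only improving swaps
        if loss(suffix) == 0:
            break
    return "".join(suffix)
```

Because the incumbent token is always among the evaluated candidates, the loss never increases, which is the property that makes the greedy step safe to iterate.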
The research found that the adversarial prompts created by this method transfer widely, including to black-box, publicly released LLMs. Transfer success is particularly high against GPT-based models, which may be because Vicuna, the open-source model used to optimize the attacks, was trained on outputs from ChatGPT.
The paper presents several key findings regarding adversarial attacks on aligned LLMs: a single optimized suffix is universal (it elicits restricted behavior across many different queries), transferable (it carries over to models it was not optimized on), and effective even against black-box commercial systems.
The paper raises important questions about how to protect such systems from producing objectionable content and highlights the need for more robust alignment and safety mechanisms in LLMs. It also underlines the importance of further research into more reliable attacks against aligned language models. The fact that Claude 2 held up best under this attack raises the question of whether "Constitutional AI"-based models are better aligned than GPT+RLHF models.
The attack vectors defined in this paper are similar to those in another recently published paper, "Jailbroken: How Does LLM Safety Training Fail?", which I covered in a recent article of mine titled "Why LLMs security systems are fragile?". I find the competing-objectives problem to be the root cause of why these universal adversarial attacks succeed more often against the GPT models than against the "Constitutional AI"-based Claude 2 model. The crux of the problem is that a safety-trained LLM has to balance goals that can conflict with each other. Pretraining on a large, diverse corpus gives the model broad knowledge and language skill, but also exposes it to harmful or inappropriate content. Instruction tuning makes it flexible and useful, but also vulnerable to manipulation or exploitation. Safety training instills ethical and social values, but limits its capabilities and responses.
These objectives can compete when the LLM receives a prompt asking for a restricted behavior, such as producing harmful content or leaking personal information. The model must then choose: refuse the prompt, satisfying its safety objective but violating its pretraining and instruction-following objectives, or comply, satisfying pretraining and instruction following but violating safety. Depending on how the LLM is trained and optimized, it may favor one objective over the other, and thus be susceptible to jailbreak attacks that exploit this trade-off.
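The trade-off described above can be caricatured as a weighted comparison between objective scores. Everything below is a hypothetical illustration of the competing-objectives idea, not anything from either paper: the scores and weights are invented, and the point is only that an adversarial suffix need not touch the safety term at all if it can inflate the instruction-following term enough to flip the decision.

```python
# Toy model of competing objectives: the model picks the response whose
# weighted sum of instruction-following and safety scores is higher.
# All numbers and the alpha/beta weights are made up for illustration.
def preferred_response(follow_score, safety_score, alpha=1.0, beta=1.0):
    """Return 'comply' or 'refuse' by comparing weighted objective sums."""
    comply = alpha * follow_score["comply"] + beta * safety_score["comply"]
    refuse = alpha * follow_score["refuse"] + beta * safety_score["refuse"]
    return "comply" if comply > refuse else "refuse"

# Plain harmful request: the safety penalty dominates -> the model refuses.
follow = {"comply": 1.0, "refuse": 0.2}
safety = {"comply": -2.0, "refuse": 0.5}
print(preferred_response(follow, safety))            # -> refuse

# An adversarial suffix boosts the instruction-following signal enough
# to flip the decision without changing the safety scores at all.
follow_attacked = {"comply": 4.0, "refuse": 0.2}
print(preferred_response(follow_attacked, safety))   # -> comply
```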
In contrast, "Constitutional AI" is a method for training AI systems using a set of rules or principles that act as a constitution for the AI system. This approach allows the AI system to operate within a societally accepted framework and aligns it with human intentions. The findings of this research will be particularly useful for AI developers and researchers in the field of AI security and ethics.
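The constitutional approach can be sketched as a critique-and-revise loop over a written set of principles. In the real method both the critique and the revision are produced by the model itself; in this sketch, simple string rules and a canned revision are hypothetical stand-ins so the control flow is visible and runnable.

```python
# Minimal sketch of a constitutional critique/revise loop. The principles,
# rule checks, and the canned revision below are invented stand-ins; in
# Constitutional AI the critique and revision are generated by an LLM
# prompted with the written principles.
CONSTITUTION = [
    ("no_harmful_instructions", lambda r: "step-by-step" not in r),
    ("no_personal_data", lambda r: "SSN" not in r),
]

def critique(response):
    """Return the names of principles the draft response violates."""
    return [name for name, ok in CONSTITUTION if not ok(response)]

def revise(response, violations):
    """Stand-in revision: replace a violating draft with a safe refusal."""
    return "I can't help with that, but here is a safer alternative."

def constitutional_pass(draft, max_rounds=3):
    # Keep critiquing and revising until no principle is violated
    # (or the round budget runs out).
    for _ in range(max_rounds):
        violations = critique(draft)
        if not violations:
            return draft
        draft = revise(draft, violations)
    return draft
```

The design point is that the refusal behavior is anchored to explicit written principles rather than only to preference labels, which is one plausible reason a model trained this way might resist a suffix that merely amplifies the instruction-following signal.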
Authors: Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson
Head of ML Operations Engineering for #GenAI
1y Very interesting work. You would expect input to the LLM to be sanitized. On the other hand, the LLM dumping its training set under such an attack should be avoidable, which makes this work important in many scenarios. Thanks for sharing; I will make time to deep-dive.
Head of AI at Wise
1y Why is this a problem? If you try to make LLMs produce "harmful" (however defined) outputs, you can do so; so what? If I try to hit myself on the head with a hammer, I may succeed in breaking my skull. Does this mean all hammers should be made of rubber to prevent that?
Copilot Studio & Power Platform @ Microsoft
1y Spot on… if English is the new coding language, then it is also the new hacking language, lol. In my experience the hard work is in grounding, not necessarily in the "commodity" LLMs. I would not ask transistors to make coffee, but transistors are part of my coffee machine.
Change Agent and Creator | Education (STEM, ESD) | Transformation | Future Skills | Imprint and privacy policy: links in contact info | "Every journey of a thousand miles begins with a single step." (Laozi)
1y Thanks for sharing, this is very interesting!
Principal Solution Architect | Microsoft Copilot Studio | Power CAT
1y Great read! Thanks for sharing your thoughts on this. Pretty wild to see how you can "hack" models by asking these kinds of questions!