Breaking the Jargons #Issue 10


Have you tried the Gandalf Challenge? It’s an interactive game in which players attempt to uncover a secret password guarded by Gandalf the wizard, who stands in for a real-world LLM application. As you progress, Gandalf becomes better at concealing the password. I highly recommend trying it if you haven’t already. The game makes a serious point: just as we can trick Gandalf into revealing his password, real LLMs can be manipulated into divulging sensitive information.

Many in the ML field focus on benefits such as automation and increased revenue, but a subset of practitioners view computer systems adversarially. These individuals work to protect organizations from those who seek to exploit ML systems for malicious purposes. Understanding ML security requires adopting that adversarial mindset and considering the intentional abuse and misuse of ML systems, including the ones we're currently building. This practice, known as red teaming, involves teams of skilled experts attempting to breach ML systems and sharing their findings with product owners, thereby safeguarding an organization's valuable data. After all, the best way to stop an attacker is to think like one!


This week's newsletter presents a series of research papers on red teaming for generative models: a comprehensive survey of red-teaming strategies for LLMs, a study on curiosity-driven red teaming, Anthropic’s Many-shot Jailbreaking, the Crescendo multi-turn LLM jailbreak attack, and a paper on gaps and opportunities in AI audit tooling. Alongside these, you'll find the Generative AI Red Teaming transparency report, Snap's AI safety measures, and a discussion of the energy demands of AI. A webinar by Sara Hooker makes the case for model efficiency, and a newly introduced course covers red-teaming techniques for LLM applications. Lastly, the Code Corner spotlights TrustAIRLab, a research lab dedicated to trustworthy machine learning.


Research Rundown

Research Papers I enjoyed last week:[1]

1. Against The Achilles Heel: A Survey on Red Teaming for Generative Models

This paper provides an in-depth examination of red-teaming challenges and strategies for generative models. The authors surveyed 129 papers to address gaps in understanding prompt attacks on large language models and vision-language models (LLMs and VLMs).

The authors structure their review of these attacks around risk taxonomy, attack strategies, evaluation metrics, and defensive approaches. They propose a comprehensive taxonomy of LLM attack strategies grounded in the capabilities models acquire during pretraining and fine-tuning. The classification focuses on abilities such as instruction following and generation, providing a more fundamental framework that can be extended to other modalities.

Furthermore, the authors frame automated red-teaming methods as search problems, breaking popular methods down into three components: state space, search goal, and search operation. This framing broadens the design space for future automated red-teaming methodologies and, with it, the prospects for more secure models.
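
To make that decomposition concrete, here is a minimal Python sketch of an automated red-teaming loop expressed as a search problem. This is my own framing, not code from the survey, and the class and function names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RedTeamSearch:
    """Toy framing of automated red teaming as a search problem.

    state_space:      prompts explored so far.
    search_goal:      scores a prompt highly when the target model's
                      response to it is undesirable (e.g., a toxicity score).
    search_operation: proposes new candidate prompts from the current state
                      (mutation, paraphrasing, LLM-based generation, ...).
    """
    search_goal: Callable[[str], float]
    search_operation: Callable[[List[str]], List[str]]
    state_space: List[str] = field(default_factory=list)

    def run(self, seed_prompts: List[str], steps: int, threshold: float) -> List[str]:
        self.state_space = list(seed_prompts)
        successful = []
        for _ in range(steps):
            candidates = self.search_operation(self.state_space)
            for prompt in candidates:
                if self.search_goal(prompt) >= threshold:
                    successful.append(prompt)  # prompt elicited an undesirable response
            self.state_space.extend(candidates)
        return successful
```

Different published methods then amount to different choices of scoring function and search operation within this loop.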


2. Curiosity-driven Red-teaming for Large Language Models

Traditionally, detecting undesirable responses from LLMs has involved human testers forming a "red team" to create input prompts that provoke these responses. However, this manual approach is resource-intensive and time-consuming.

An alternative is to automate test-case generation with a separate LLM, termed the red-team LLM, trained via reinforcement learning (RL). This red-team LLM acts as a policy: it learns to generate prompts that elicit responses from the target LLM while maximizing a reward defined by an undesirability metric. However, existing RL-based methods often generate test cases that lack diversity, giving inadequate coverage of the prompts that elicit unwanted responses. This deficiency arises because RL typically optimizes only for reward, neglecting coverage of the space of potential test cases.

To address this issue, the paper introduces curiosity-driven red teaming (CRT), leveraging principles of curiosity-driven exploration to broaden test case coverage. By optimizing for novelty alongside task rewards, CRT enhances the diversity and effectiveness of generated test cases.
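
As a rough illustration (my own simplification, not the authors' implementation), the curiosity objective can be thought of as the usual red-teaming reward plus a weighted novelty bonus. Here the novelty signal is cosine distance to previously generated prompt embeddings, and all function names are hypothetical:

```python
import numpy as np


def novelty_bonus(prompt_embedding: np.ndarray, history: list) -> float:
    """Reward prompts that are dissimilar from previously generated ones."""
    if not history:
        return 1.0
    sims = [
        float(np.dot(prompt_embedding, h)
              / (np.linalg.norm(prompt_embedding) * np.linalg.norm(h) + 1e-8))
        for h in history
    ]
    return 1.0 - max(sims)  # low similarity to every past prompt => high novelty


def crt_reward(toxicity_score: float,
               prompt_embedding: np.ndarray,
               history: list,
               novelty_weight: float = 0.5) -> float:
    """Curiosity-style reward: task reward plus a weighted novelty term."""
    return toxicity_score + novelty_weight * novelty_bonus(prompt_embedding, history)
```

The red-team policy is then trained against this combined reward, so repeating a previously successful prompt verbatim earns less than discovering a genuinely new failure mode.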

Evaluation results demonstrate that CRT significantly improves coverage compared to current RL-based red-teaming methods, effectively eliciting toxic responses from LLMs even after fine-tuning with reinforcement learning from human feedback (RLHF).

3. Anthropic’s Many-shot Jailbreaking

Anthropic’s new paper explores a jailbreaking technique they term Many-shot Jailbreaking, which exploits the long context windows of large language models (LLMs). While large context windows enhance model performance, they also open a new attack surface. How does the technique work? The attacker submits a single prompt containing numerous faux dialogues in which an AI assistant appears to comply with harmful requests; this conditioning primes the model to bypass its safety training on the final, real request.

The likelihood of a harmful response increases with the number of faux dialogues, following a predictable scaling law. The study highlights the trade-off of growing LLM context windows: they make models more useful but also more susceptible to this kind of adversarial attack.
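
The sketch below shows how a red team might probe that scaling behaviour. It is a hedged illustration only; `query_model` and `is_harmful` are hypothetical placeholders, and the faux dialogues would come from the red team's own test fixtures:

```python
def build_many_shot_prompt(faux_dialogues: list, target_question: str) -> str:
    """Concatenate many faux user/assistant exchanges before the real question.

    faux_dialogues: (question, compliant_answer) pairs written by the red team.
    """
    shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in faux_dialogues)
    return f"{shots}\n\nUser: {target_question}\nAssistant:"


def attack_success_rate(faux_dialogues, target_questions, num_shots,
                        query_model, is_harmful) -> float:
    """Fraction of target questions answered harmfully at a given shot count."""
    hits = 0
    for question in target_questions:
        prompt = build_many_shot_prompt(faux_dialogues[:num_shots], question)
        if is_harmful(query_model(prompt)):
            hits += 1
    return hits / len(target_questions)


# Sweeping num_shots (e.g., 4, 32, 256) and plotting attack_success_rate
# is how one would observe the scaling behaviour the paper describes.
```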

4. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

Developing ethical LLMs involves defining and enforcing boundaries between acceptable and unacceptable topics of conversation. Despite being trained to avoid certain content, LLMs may still possess knowledge of it. However, they are expected to reject or divert discussions on prohibited topics. This gap between capability and behavior creates opportunities for jailbreak attacks, aiming to bypass ethical constraints.

The paper introduces Crescendo, a new jailbreak attack that gradually steers a conversation from innocuous dialogue toward forbidden topics. It leverages the LLM's inclination to follow patterns and to focus on recent text, including its own output. The paper demonstrates Crescendo's execution against ChatGPT (GPT-4) and Gemini Ultra.

Anthropic’s Many-shot Jailbreaking (discussed above) achieves a similar effect to Crescendo, but within the user-supplied context rather than the model's own output. Crescendo, however, differs in two main respects:

  • First, it does not assume the attacker already possesses the malicious content that must be inserted into the prompt.
  • Second, Crescendo works with models that have smaller context windows, making it more practical and cost-effective. A basic input filter can defend against Many-shot Jailbreaking, but it is far less effective against Crescendo.

Crescendo's design makes it resistant to traditional detection techniques, because each of its prompts looks like an ordinary request and never explicitly mentions the malicious task.
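
To see why a basic input filter catches Many-shot Jailbreaking but not Crescendo, consider this naive sketch (my own, not from either paper) that simply counts how many faux dialogue turns are embedded in a single input:

```python
import re


def looks_like_many_shot(user_input: str, max_embedded_turns: int = 5) -> bool:
    """Flag a single input that embeds many faux user/assistant exchanges.

    This crude check catches the long, self-contained prompts used in
    many-shot jailbreaking, but not Crescendo, whose individual turns are
    short, ordinary-looking messages spread across a real conversation.
    """
    embedded_turns = len(re.findall(r"(?mi)^\s*(user|assistant)\s*:", user_input))
    return embedded_turns > max_embedded_turns
```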

5. Towards AI Accountability Infrastructure: Gaps and Opportunities in AI Audit Tooling

The paper assesses the tools available for AI audits and identifies gaps in how they support AI accountability. Although many tools exist for setting standards and evaluating AI systems, in practice they often fall short of achieving accountability goals. The research combines interviews with 35 AI audit practitioners and an analysis of 390 tools to map the ecosystem of AI audit tooling. Based on the interviews and landscape analysis, the researchers categorize the tools into seven critical stages of the AI audit process, highlighting where tools are abundant and where they are lacking. A significant finding is the concentration of tools focused on evaluation and standards management, alongside a relative scarcity of tools designed for harm discovery, advocacy, and audit communication.

The study recommends developing a more comprehensive infrastructure for AI accountability that includes tools for harm discovery, advocacy, and effective communication of audit results.

The research revealed various AI audit tools developed by academic, commercial, non-profit, and governmental entities, each serving distinct purposes. Furthermore, the authors have provided an interactive dataset at tools.auditing-ai.com, offering a more comprehensive view of the AI auditing tool landscape.


Curated Blogs and Articles

1. Generative AI Red Teaming Transparency Report

This report explores the impact of public red teaming, focusing on the Generative AI Red Teaming Challenge held at DEF CON 31, where eight leading LLMs were tested in scenarios designed to mimic dangerous real-world situations.

2. Snap's Safety Efforts With AI Red Teaming From HackerOne

Before launching its first text-to-image AI feature, Snap conducted a red teaming exercise to ensure the technology wouldn't produce content that could negatively impact its community.

3. Should AI Be Scaled Down?

As LLMs grow increasingly sophisticated, producing text that closely mimics human writing, their energy demands soar; training GPT-3 consumed as much energy as 120 American households use in a year. Critics argue that the tech industry's race to build ever-larger models is both costly and unsustainable, likening it to building a ladder to the moon.

The article is based on a talk from the Harvard Efficient ML Seminar Series; the webinar is linked in the next section.


Webinar Watch

What does scale give us: Why we are building a ladder to the moon.

Sara Hooker of Cohere advocates shifting the focus from simply increasing model size to improving model efficiency and data quality. She also proposes a centralized system for rating models on energy efficiency, aiming to make the AI field more sustainable and conscious of its environmental impact.



Course

Red Teaming LLM Applications, by DeepLearning.AI in association with Giskard, teaches you to find and fix vulnerabilities in your LLM-based applications by carrying out red-teaming attacks against them.


Code Corner: Featured Repositories and Resources

TrustAIRLab (Trustworthy AI Research Lab) is a research lab dedicated to trustworthy machine learning, with a focus on safety, privacy, and security.


Tweet Spotlight

Behind every groundbreaking AI making headlines, there's a team of real people making it all happen.



I hope this edition of “Breaking the Jargons” provides valuable insights and stimulating reads. Until next time, keep reading.


[1] Image credits: All images featured in this newsletter are sourced directly from the research papers mentioned in this edition.


