Anthropic's Constitutional Classifiers for Jailbreak Defense

Anthropic has introduced "Constitutional Classifiers," a new approach for defending large language models (LLMs) against adversarial "jailbreak" attacks designed to bypass safety guardrails.

The goal is to enable safer deployment of more powerful AI models in the future. The research focuses on preventing AI models, specifically Claude 3.5 Sonnet, from producing harmful outputs such as instructions for creating biological or chemical weapons.

Concepts and Challenges:

  • Jailbreaks: These are inputs specifically crafted to circumvent an LLM's safety training and elicit harmful or prohibited responses. Examples include flooding the model with very long prompts or using unusual capitalization. Attacks of this kind have been studied for over a decade, yet robust defenses remain a significant challenge.

"Some jailbreaks flood the model with very long prompts; others modify the style of the input, such as uSiNg uNuSuAl cApItALiZaTiOn."

  • Responsible Scaling Policy: Under Anthropic's policy, deploying increasingly capable models is contingent on mitigating risks to acceptable levels through appropriate safeguards. Jailbreaks undermine those safeguards.

"Under our Responsible Scaling Policy, we may deploy such models as long as we’re able to mitigate risks to acceptable levels through appropriate safeguards—but jailbreaking lets users bypass these safeguards."

  • CBRN Capability Threshold: The research aims to create a system that could mitigate jailbreaking risks for models exceeding the CBRN capability threshold. That threshold covers systems able to help individuals with basic technical backgrounds create or obtain, and deploy, chemical, biological, radiological, and nuclear weapons, representing "a substantially higher risk of catastrophic misuse compared to non-AI baselines."

"In particular, we’re hopeful that a system defended by Constitutional Classifiers could allow us to mitigate jailbreaking risks for models which have passed the CBRN capability threshold outlined in our Responsible Scaling Policy."

  • Universal Jailbreaks: The primary goal of attackers is to find an attack capable of bypassing safety measures for a wide range of forbidden queries, rather than a single query in isolation.

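To make "universal" concrete, here is a minimal, hypothetical sketch of the success criterion. The `model_answers_harmfully` judge is a placeholder assumption, not something described in the article.

```python
from typing import Callable, List

def is_universal_jailbreak(
    attack_template: str,           # contains a "{query}" placeholder
    forbidden_queries: List[str],
    model_answers_harmfully: Callable[[str], bool],  # assumed harmfulness judge
) -> bool:
    """An attack counts as universal only if the *same* template elicits a
    harmful answer for every forbidden query, not just one of them."""
    return all(
        model_answers_harmfully(attack_template.format(query=q))
        for q in forbidden_queries
    )
```

This is the same criterion applied in the red-teaming described below: one jailbreak that works for all ten forbidden queries.
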
Constitutional Classifiers: Methodology and Approach:

  • Constitutional Basis: The system is inspired by "Constitutional AI," using a "constitution" of principles defining allowed and disallowed content. This constitution informs the training of classifiers, rather than being an inherent part of the AI model itself. For instance, recipes for mustard are allowed, while those for mustard gas are not.

"In the case of Constitutional Classifiers, the principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not)."

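As a purely illustrative sketch (assuming a structure the article does not spell out), such a constitution can be represented as named content classes with allowed and disallowed principles; the entries below are invented examples in the spirit of the mustard versus mustard-gas contrast.

```python
# Invented stand-in for a classifier "constitution"; not Anthropic's actual text.
constitution = {
    "allowed": [
        "Everyday cooking content, e.g. recipes for mustard and other condiments.",
        "Textbook-level chemistry and biology education without operational detail.",
    ],
    "disallowed": [
        "Step-by-step synthesis routes for chemical weapons such as mustard gas.",
        "Guidance that meaningfully helps acquire or deploy biological agents.",
    ],
}

def constitution_as_text(constitution: dict) -> str:
    """Flatten the principles into a prompt-ready description of the classes."""
    lines = []
    for label, principles in constitution.items():
        lines.extend(f"[{label.upper()}] {p}" for p in principles)
    return "\n".join(lines)
```
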
  • Synthetic Data Generation: The system leverages the LLM (Claude) to generate large amounts of synthetic data (prompts and completions) across different content classes based on the constitution. These are augmented via translation and stylistic variations resembling known jailbreak attempts.

"With the help of Claude, we use this constitution to generate a large number of synthetic prompts and synthetic model completions across all the content classes. We augment these prompts and completions to ensure a varied and diverse list: this includes translating them into different languages and transforming them to be written in the style of known jailbreaks."

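A rough sketch of that generation-and-augmentation loop is below. The `claude_generate` helper is a placeholder for a real LLM API call (the exact prompts used are not public), and the random-capitalization transform is just one example of the jailbreak-style variations mentioned above.

```python
import random

def claude_generate(instruction: str) -> str:
    """Placeholder for an LLM API call (e.g. via the official `anthropic` SDK)."""
    raise NotImplementedError  # assumed, not shown: the real prompts are not public

def random_capitalization(text: str) -> str:
    """Mimic the 'uNuSuAl cApItALiZaTiOn' style of some known jailbreaks."""
    return "".join(c.upper() if random.random() < 0.5 else c.lower() for c in text)

def build_synthetic_examples(constitution_text: str, n_per_class: int) -> list:
    """Generate prompt/completion pairs per content class, then augment them."""
    examples = []
    for label in ("allowed", "disallowed"):
        for _ in range(n_per_class):
            prompt = claude_generate(
                f"Write a user prompt that falls under the {label} classes of:\n"
                f"{constitution_text}"
            )
            completion = claude_generate(f"Write a model completion for:\n{prompt}")
            # Keep the original plus a jailbreak-styled variant of each prompt.
            for variant in (prompt, random_capitalization(prompt)):
                examples.append(
                    {"prompt": variant, "completion": completion, "label": label}
                )
    return examples
```
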
  • Input and Output Classifiers: Trained on the generated synthetic data, these classifiers screen both the user's prompt and the model's completion, flagging (and blocking) potentially harmful content according to the constitution.

"We then use these synthetic data to train our input and output classifiers to flag (and block) potentially harmful content according to the given constitution."

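Conceptually, deployment wraps the model in two checks, one on the incoming prompt and one on the outgoing completion. The sketch below is an assumption about the overall shape of such a wrapper, not Anthropic's implementation; real deployments also have to handle streamed output, which is omitted here.

```python
from typing import Callable

def guarded_generate(
    user_prompt: str,
    generate: Callable[[str], str],             # the underlying model
    input_classifier: Callable[[str], float],   # estimated P(prompt is disallowed)
    output_classifier: Callable[[str], float],  # estimated P(completion is disallowed)
    threshold: float = 0.5,
    refusal: str = "Sorry, I can't help with that.",
) -> str:
    """Minimal sketch: screen the prompt, generate, then screen the completion."""
    if input_classifier(user_prompt) >= threshold:
        return refusal                       # blocked before any generation
    completion = generate(user_prompt)
    if output_classifier(completion) >= threshold:
        return refusal                       # blocked before reaching the user
    return completion
```
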
  • Minimizing Over-Refusals: The classifiers are also trained on a set of benign queries to reduce the chance of incorrectly flagging harmless content as harmful.

"To help minimize over-refusals (i.e., harmless content incorrectly flagged as harmful), we also train the classifiers on a fixed set of benign queries generated by a contractor."

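In training-data terms this just means mixing labeled benign examples in with the constitution-derived ones, so the classifiers learn a boundary rather than "flag anything suspicious-looking." A minimal, hypothetical sketch:

```python
def build_training_set(synthetic_examples: list, benign_queries: list) -> list:
    """Combine constitution-derived examples with benign queries labeled 'allowed',
    so the classifiers see plenty of harmless traffic during training."""
    dataset = list(synthetic_examples)
    dataset.extend(
        {"prompt": q, "completion": "", "label": "allowed"} for q in benign_queries
    )
    return dataset
```
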
Testing and Results:

  • Human Red Teaming: In a bug-bounty program, 183 participants spent over 3000 hours attempting to jailbreak a prototype system. Participants were challenged to make the system respond to all ten forbidden queries with a single jailbreak. No one was able to achieve a "universal" jailbreak.

"Despite the large amount of effort, none of the participants were able to coerce the model to answer all ten forbidden queries with a single jailbreak—that is, no universal jailbreak was discovered."

  • Automated Evaluation: A version of Claude 3.5 Sonnet protected with Constitutional Classifiers was tested with 10,000 synthetically generated jailbreak prompts, including advanced attacks.

Without classifiers, the jailbreak success rate was 86%. With classifiers, the success rate dropped to 4.4%, meaning that over 95% of these advanced jailbreak attempts were blocked.
"Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86%—that is, Claude itself blocked only 14% of these advanced jailbreak attempts. Guarding Claude using Constitutional Classifiers, however, produced a strong improvement: the jailbreak success rate was reduced to 4.4%, meaning that over 95% of jailbreak attempts were refused."

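For readers who want to sanity-check those headline numbers, a tiny harness of the following shape (with an assumed `is_harmful` judge) is all that is needed to compute a success rate; the arithmetic on the reported figures is included as comments.

```python
def jailbreak_success_rate(prompts, respond, is_harmful) -> float:
    """Fraction of jailbreak prompts whose response is judged harmful."""
    successes = sum(1 for p in prompts if is_harmful(respond(p)))
    return successes / len(prompts)

# Reported figures over 10,000 synthetic jailbreak prompts:
#   no classifiers:    86%  success  -> about 8,600 harmful responses
#   with classifiers:  4.4% success  -> about   440 harmful responses
# Relative drop in successful attempts: (0.86 - 0.044) / 0.86 ≈ 0.95.
```
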
  • Over-Refusal Rate: The addition of the classifiers resulted in a very small increase of 0.38% in the refusal rate on harmless queries, which was not statistically significant.

"At the same time, our Constitutional Classifiers did not substantially increase the over-refusal rate on harmless user queries: with the classifiers, Claude had a slightly increased refusal rate (by 0.38%), but this was not statistically significant in a random sample of 5,000 conversations."

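Whether a 0.38-percentage-point bump over a 5,000-conversation sample clears statistical significance can be checked with a standard two-proportion z-test. In the sketch below, only the 0.38-point gap and the 5,000-conversation sample size come from the text; the baseline refusal rate and the per-group split are invented placeholders.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(refusals_a: int, n_a: int, refusals_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two refusal rates."""
    p_a, p_b = refusals_a / n_a, refusals_b / n_b
    pooled = (refusals_a + refusals_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Invented example: assume a 1.00% baseline refusal rate vs 1.38% with classifiers,
# each measured over 5,000 conversations (50 vs 69 refusals).
print(two_proportion_p_value(50, 5000, 69, 5000))  # ≈ 0.08, not significant at 0.05
```
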
  • Computational Cost: The system introduced a moderate increase of 23.7% in compute cost.

"Finally, the compute cost was moderately higher (by 23.7%) than that of the unguarded model."

Limitations and Future Work:

  • Potential for Future Jailbreaks: Constitutional Classifiers may not prevent every jailbreak attempt. The system is intended to dramatically increase the effort required to discover a successful attack.

"Constitutional Classifiers may not prevent every universal jailbreak, though we believe that even the small proportion of jailbreaks that make it past our classifiers require far more effort to discover when the safeguards are in use."

  • New Attack Vectors: The possibility of novel jailbreaking techniques makes it necessary to continually adapt the system. The constitution used to train the classifiers can be adapted to address new threats as they arise.

"It’s also possible that new jailbreaking techniques might be developed in the future that are effective against the system; we therefore recommend using complementary defenses."

  • Continuous Improvement: The team is actively working to further reduce over-refusals and computational costs associated with this system.

Conclusion:

Anthropic's Constitutional Classifiers show significant promise as a defense against jailbreak attacks, demonstrating high efficacy in both human red-teaming and automated evaluations. While not a perfect solution, the approach sharply reduces the success rate of jailbreaks while preserving usability, with minimal over-refusals and a reasonable additional compute cost. It represents a promising step toward the safe deployment of more advanced AI models. Anthropic continues to refine the method, including by reducing over-refusals and compute costs, and is inviting public testing of a live demo system to improve it further.
