Anthropic's Constitutional Classifiers for Jailbreak Defense

Anthropic has introduced "Constitutional Classifiers," a new approach for defending large language models (LLMs) against adversarial "jailbreak" attacks designed to bypass safety guardrails.

The goal is to enable safer deployment of more powerful AI models in the future. The research focuses on preventing AI models, specifically Claude 3.5 Sonnet, from producing harmful outputs such as instructions for creating biological or chemical weapons.

Concepts and Challenges:

  • Jailbreaks: These are inputs specifically crafted to circumvent an LLM's safety training and elicit harmful or prohibited responses. Examples include flooding the model with very long prompts or using unusual capitalization. Attacks of this kind have been studied for over a decade, yet robust defenses remain a significant challenge.

"Some jailbreaks flood the model with very long prompts; others modify the style of the input, such as uSiNg uNuSuAl cApItALiZaTiOn."

  • Responsible Scaling Policy: Under Anthropic's policy, deploying increasingly capable models is contingent on mitigating risks to acceptable levels through appropriate safeguards. Jailbreaks undermine those safeguards.

"Under our Responsible Scaling Policy, we may deploy such models as long as we’re able to mitigate risks to acceptable levels through appropriate safeguards—but jailbreaking lets users bypass these safeguards."

  • CBRN Capability Threshold: The research aims to create a system that could mitigate jailbreaking risks for models exceeding the CBRN capability threshold. That threshold covers systems able to help individuals with basic technical backgrounds create or obtain, and deploy, chemical, biological, radiological, and nuclear weapons, representing "a substantially higher risk of catastrophic misuse compared to non-AI baselines."

"In particular, we’re hopeful that a system defended by Constitutional Classifiers could allow us to mitigate jailbreaking risks for models which have passed the CBRN capability threshold outlined in our Responsible Scaling Policy."

  • Universal Jailbreaks: The primary goal of attackers is to find an attack capable of bypassing safety measures for a wide range of forbidden queries, rather than a single query in isolation.

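To make "universal" concrete, here is a minimal, hypothetical sketch of the success criterion. The `model_answers_harmfully` judge is a placeholder assumption, not something described in the article.

```python
from typing import Callable, List

def is_universal_jailbreak(
    attack_template: str,           # contains a "{query}" placeholder
    forbidden_queries: List[str],
    model_answers_harmfully: Callable[[str], bool],  # assumed harmfulness judge
) -> bool:
    """An attack counts as universal only if the *same* template elicits a
    harmful answer for every forbidden query, not just one of them."""
    return all(
        model_answers_harmfully(attack_template.format(query=q))
        for q in forbidden_queries
    )
```

This is the same criterion applied in the red-teaming described below: one jailbreak that works for all ten forbidden queries.
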
Constitutional Classifiers: Methodology and Approach:

  • Constitutional Basis: The system is inspired by "Constitutional AI," using a "constitution" of principles defining allowed and disallowed content. This constitution informs the training of classifiers, rather than being an inherent part of the AI model itself. For instance, recipes for mustard are allowed, while those for mustard gas are not.

"In the case of Constitutional Classifiers, the principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not)."

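As a purely illustrative sketch (assuming a structure the article does not spell out), such a constitution can be represented as named content classes with allowed and disallowed principles; the entries below are invented examples in the spirit of the mustard versus mustard-gas contrast.

```python
# Invented stand-in for a classifier "constitution"; not Anthropic's actual text.
constitution = {
    "allowed": [
        "Everyday cooking content, e.g. recipes for mustard and other condiments.",
        "Textbook-level chemistry and biology education without operational detail.",
    ],
    "disallowed": [
        "Step-by-step synthesis routes for chemical weapons such as mustard gas.",
        "Guidance that meaningfully helps acquire or deploy biological agents.",
    ],
}

def constitution_as_text(constitution: dict) -> str:
    """Flatten the principles into a prompt-ready description of the classes."""
    lines = []
    for label, principles in constitution.items():
        lines.extend(f"[{label.upper()}] {p}" for p in principles)
    return "\n".join(lines)
```
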
  • Synthetic Data Generation: The system leverages the LLM (Claude) to generate large amounts of synthetic data (prompts and completions) across different content classes based on the constitution. These are augmented via translation and stylistic variations resembling known jailbreak attempts.

"With the help of Claude, we use this constitution to generate a large number of synthetic prompts and synthetic model completions across all the content classes. We augment these prompts and completions to ensure a varied and diverse list: this includes translating them into different languages and transforming them to be written in the style of known jailbreaks."

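A rough sketch of that generation-and-augmentation loop is below. The `claude_generate` helper is a placeholder for a real LLM API call (the exact prompts used are not public), and the random-capitalization transform is just one example of the jailbreak-style variations mentioned above.

```python
import random

def claude_generate(instruction: str) -> str:
    """Placeholder for an LLM API call (e.g. via the official `anthropic` SDK)."""
    raise NotImplementedError  # assumed, not shown: the real prompts are not public

def random_capitalization(text: str) -> str:
    """Mimic the 'uNuSuAl cApItALiZaTiOn' style of some known jailbreaks."""
    return "".join(c.upper() if random.random() < 0.5 else c.lower() for c in text)

def build_synthetic_examples(constitution_text: str, n_per_class: int) -> list:
    """Generate prompt/completion pairs per content class, then augment them."""
    examples = []
    for label in ("allowed", "disallowed"):
        for _ in range(n_per_class):
            prompt = claude_generate(
                f"Write a user prompt that falls under the {label} classes of:\n"
                f"{constitution_text}"
            )
            completion = claude_generate(f"Write a model completion for:\n{prompt}")
            # Keep the original plus a jailbreak-styled variant of each prompt.
            for variant in (prompt, random_capitalization(prompt)):
                examples.append(
                    {"prompt": variant, "completion": completion, "label": label}
                )
    return examples
```
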
  • Input and Output Classifiers: Trained on the generated synthetic data, these classifiers screen both the user's prompt and the model's completion, flagging (and blocking) potentially harmful content according to the constitution.

"We then use these synthetic data to train our input and output classifiers to flag (and block) potentially harmful content according to the given constitution."

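Conceptually, deployment wraps the model in two checks, one on the incoming prompt and one on the outgoing completion. The sketch below is an assumption about the overall shape of such a wrapper, not Anthropic's implementation; real deployments also have to handle streamed output, which is omitted here.

```python
from typing import Callable

def guarded_generate(
    user_prompt: str,
    generate: Callable[[str], str],             # the underlying model
    input_classifier: Callable[[str], float],   # estimated P(prompt is disallowed)
    output_classifier: Callable[[str], float],  # estimated P(completion is disallowed)
    threshold: float = 0.5,
    refusal: str = "Sorry, I can't help with that.",
) -> str:
    """Minimal sketch: screen the prompt, generate, then screen the completion."""
    if input_classifier(user_prompt) >= threshold:
        return refusal                       # blocked before any generation
    completion = generate(user_prompt)
    if output_classifier(completion) >= threshold:
        return refusal                       # blocked before reaching the user
    return completion
```
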
  • Minimizing Over-Refusals: The classifiers are also trained on a set of benign queries to reduce the chance of incorrectly flagging harmless content as harmful.

"To help minimize over-refusals (i.e., harmless content incorrectly flagged as harmful), we also train the classifiers on a fixed set of benign queries generated by a contractor."

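In training-data terms this just means mixing labeled benign examples in with the constitution-derived ones, so the classifiers learn a boundary rather than "flag anything suspicious-looking." A minimal, hypothetical sketch:

```python
def build_training_set(synthetic_examples: list, benign_queries: list) -> list:
    """Combine constitution-derived examples with benign queries labeled 'allowed',
    so the classifiers see plenty of harmless traffic during training."""
    dataset = list(synthetic_examples)
    dataset.extend(
        {"prompt": q, "completion": "", "label": "allowed"} for q in benign_queries
    )
    return dataset
```
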
Testing and Results:

  • Human Red Teaming: In a bug-bounty program, 183 participants spent over 3000 hours attempting to jailbreak a prototype system. Participants were challenged to make the system respond to all ten forbidden queries with a single jailbreak. No one was able to achieve a "universal" jailbreak.

"Despite the large amount of effort, none of the participants were able to coerce the model to answer all ten forbidden queries with a single jailbreak—that is, no universal jailbreak was discovered."

  • Automated Evaluation: A version of Claude 3.5 Sonnet protected with Constitutional Classifiers was tested with 10,000 synthetically generated jailbreak prompts, including advanced attacks.

Without classifiers, the jailbreak success rate was 86%. With classifiers, the success rate dropped to 4.4%, meaning that over 95% of these advanced jailbreak attempts were blocked.
"Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86%—that is, Claude itself blocked only 14% of these advanced jailbreak attempts. Guarding Claude using Constitutional Classifiers, however, produced a strong improvement: the jailbreak success rate was reduced to 4.4%, meaning that over 95% of jailbreak attempts were refused."

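For readers who want to sanity-check those headline numbers, a tiny harness of the following shape (with an assumed `is_harmful` judge) is all that is needed to compute a success rate; the arithmetic on the reported figures is included as comments.

```python
def jailbreak_success_rate(prompts, respond, is_harmful) -> float:
    """Fraction of jailbreak prompts whose response is judged harmful."""
    successes = sum(1 for p in prompts if is_harmful(respond(p)))
    return successes / len(prompts)

# Reported figures over 10,000 synthetic jailbreak prompts:
#   no classifiers:    86%  success  -> about 8,600 harmful responses
#   with classifiers:  4.4% success  -> about   440 harmful responses
# Relative drop in successful attempts: (0.86 - 0.044) / 0.86 ≈ 0.95.
```
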
  • Over-Refusal Rate: The addition of the classifiers resulted in a very small increase of 0.38% in the refusal rate on harmless queries, which was not statistically significant.

"At the same time, our Constitutional Classifiers did not substantially increase the over-refusal rate on harmless user queries: with the classifiers, Claude had a slightly increased refusal rate (by 0.38%), but this was not statistically significant in a random sample of 5,000 conversations."

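Whether a 0.38-percentage-point bump over a 5,000-conversation sample clears statistical significance can be checked with a standard two-proportion z-test. In the sketch below, only the 0.38-point gap and the 5,000-conversation sample size come from the text; the baseline refusal rate and the per-group split are invented placeholders.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(refusals_a: int, n_a: int, refusals_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two refusal rates."""
    p_a, p_b = refusals_a / n_a, refusals_b / n_b
    pooled = (refusals_a + refusals_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Invented example: assume a 1.00% baseline refusal rate vs 1.38% with classifiers,
# each measured over 5,000 conversations (50 vs 69 refusals).
print(two_proportion_p_value(50, 5000, 69, 5000))  # ≈ 0.08, not significant at 0.05
```
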
  • Computational Cost: The system introduced a moderate increase of 23.7% in compute cost.

"Finally, the compute cost was moderately higher (by 23.7%) than that of the unguarded model."

Limitations and Future Work:

  • Potential for Future Jailbreaks: Constitutional Classifiers may not prevent every jailbreak attempt. The system is intended to dramatically increase the effort required to discover a successful attack.

"Constitutional Classifiers may not prevent every universal jailbreak, though we believe that even the small proportion of jailbreaks that make it past our classifiers require far more effort to discover when the safeguards are in use."

  • New Attack Vectors: The possibility of novel jailbreaking techniques makes it necessary to continually adapt the system. The constitution used to train the classifiers can be adapted to address new threats as they arise.

"It’s also possible that new jailbreaking techniques might be developed in the future that are effective against the system; we therefore recommend using complementary defenses."

  • Continuous Improvement: The team is actively working to further reduce over-refusals and computational costs associated with this system.

Conclusion:

Anthropic's Constitutional Classifiers show significant promise as a defense against jailbreak attacks, demonstrating high efficacy in both human red-teaming and automated evaluations. While not a perfect solution, the approach sharply reduces the success rate of jailbreaks while preserving usability, with minimal over-refusals and a reasonable additional compute cost. It represents a promising step toward the safe deployment of more advanced AI models. Anthropic continues to refine the method, including by reducing over-refusals and compute costs, and is inviting public testing of a live demo system to improve it further.
