Anthropic's Constitutional Classifiers for Jailbreak Defense
Nagesh Nama
CEO at xLM | Transforming Life Sciences with AI & ML | Pioneer in GxP Continuous Validation |
"Constitutional Classifiers," a new approach for defending large language models (LLMs) against adversarial "jailbreak" attacks designed to bypass safety guardrails.
The research focuses on preventing a model, specifically Claude 3.5 Sonnet, from being coerced into generating harmful outputs such as instructions for creating biological or chemical weapons, with the broader goal of enabling the safer deployment of more powerful AI models in the future.
Concepts and Challenges:
"Some jailbreaks flood the model with very long prompts; others modify the style of the input, such as uSiNg uNuSuAl cApItALiZaTiOn."
"Under our Responsible Scaling Policy, we may deploy such models as long as we’re able to mitigate risks to acceptable levels through appropriate safeguards—but jailbreaking lets users bypass these safeguards."
"In particular, we’re hopeful that a system defended by Constitutional Classifiers could allow us to mitigate jailbreaking risks for models which have passed the CBRN capability threshold outlined in our Responsible Scaling Policy."
Constitutional Classifiers: Methodology and Approach:
"In the case of Constitutional Classifiers, the principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not)."
"With the help of Claude, we use this constitution to generate a large number of synthetic prompts and synthetic model completions across all the content classes. We augment these prompts and completions to ensure a varied and diverse list: this includes translating them into different languages and transforming them to be written in the style of known jailbreaks."
"We then use these synthetic data to train our input and output classifiers to flag (and block) potentially harmful content according to the given constitution."
"To help minimize over-refusals (i.e., harmless content incorrectly flagged as harmful), we also train the classifiers on a fixed set of benign queries generated by a contractor."
Testing and Results:
"Despite the large amount of effort, none of the participants were able to coerce the model to answer all ten forbidden queries with a single jailbreak—that is, no universal jailbreak was discovered."
In automated evaluations, the jailbreak success rate without classifiers was 86%; with classifiers it dropped to 4.4%, meaning over 95% of jailbreak attempts were blocked.
"Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86%—that is, Claude itself blocked only 14% of these advanced jailbreak attempts. Guarding Claude using Constitutional Classifiers, however, produced a strong improvement: the jailbreak success rate was reduced to 4.4%, meaning that over 95% of jailbreak attempts were refused."
"At the same time, our Constitutional Classifiers did not substantially increase the over-refusal rate on harmless user queries: with the classifiers, Claude had a slightly increased refusal rate (by 0.38%), but this was not statistically significant in a random sample of 5,000 conversations."
"Finally, the compute cost was moderately higher (by 23.7%) than that of the unguarded model."
Limitations and Future Work:
"Constitutional Classifiers may not prevent every universal jailbreak, though we believe that even the small proportion of jailbreaks that make it past our classifiers require far more effort to discover when the safeguards are in use."
"It’s also possible that new jailbreaking techniques might be developed in the future that are effective against the system; we therefore recommend using complementarydefenses."
Conclusion:
Anthropic's Constitutional Classifiers show significant promise as a defense against jailbreak attacks, demonstrating high efficacy in both human red-teaming and automated evaluations. While not a perfect solution, the approach sharply reduces the jailbreak success rate while preserving usability, with a minimal increase in over-refusals and a reasonable additional compute cost. It represents a promising step toward the safe deployment of advanced AI models. Anthropic is continuing to refine the method, including reducing over-refusals and compute costs, and is relying on public testing of a live demo system to further strengthen the defense.