Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Today's paper introduces "Rapid Response", a method that protects Large Language Models (LLMs) from misuse. Instead of trying to build perfectly robust defenses, the paper proposes quickly adapting to new jailbreak attempts after observing just a few examples. This represents a shift from static defenses to dynamic response systems that can evolve to counter emerging threats.

Method Overview

The core approach centers around "jailbreak proliferation" - a technique that automatically generates additional examples of jailbreak attempts similar to observed ones. When a new jailbreak attack is detected, the system uses an LLM to generate many variations of that attack, which are then used to strengthen the model's defenses.

The paper introduces RapidResponseBench, a benchmark to evaluate how well different rapid response techniques work. It tests both in-distribution attacks (similar to observed examples) and out-of-distribution attacks (novel variations), while also measuring the impact on legitimate queries.

Five different rapid response methods are implemented and evaluated, with the most effective being "Guard Fine-tuning" - fine-tuning an input classifier using the proliferated jailbreak examples along with benign prompts. This helps the system learn to recognize and block similar attacks while maintaining normal functionality for legitimate users.

Results

As it comes to the results, Guard Fine-tuning, achieved the best results:

  • Reduced attack success rate by a factor greater than 240 on in-distribution jailbreaks
  • Reduced attack success rate by a factor greater than 15 on out-of-distribution variants
  • Required only one example of each jailbreaking strategy to achieve significant improvements
  • Maintained low false positive rates on benign queries
  • Showed better performance with increased proliferation attempts and more capable proliferation models

Conclusion

The paper demonstrates that rapid response is a promising alternative to static defenses for protecting LLMs. By quickly adapting to new threats using jailbreak proliferation and fine-tuning techniques, systems can significantly reduce the success rate of both known and novel attack variants while maintaining usability for legitimate users. For more information please consult the?full paper.

Congrats to the authors for their work!

Peng, Alwin, et al. "Rapid Response: Mitigating LLM Jailbreaks with a Few Examples." arXiv preprint arXiv:2411.07494 (2024).

要查看或添加评论,请登录