Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
Today's paper introduces "Rapid Response", a method for defending Large Language Models (LLMs) against jailbreak attacks. Instead of trying to build perfectly robust defenses up front, the paper proposes quickly adapting to new jailbreak attempts after observing just a few examples. This represents a shift from static defenses to dynamic response systems that evolve to counter emerging threats.
Method Overview
The core approach centers on "jailbreak proliferation", a technique that automatically generates additional examples of jailbreak attempts similar to observed ones. When a new jailbreak attack is detected, the system uses an LLM to generate many variations of that attack, which are then used to strengthen the model's defenses.
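The proliferation step can be sketched as follows. In the paper an LLM rewrites each observed attack; in this illustrative sketch, simple string-level transformations stand in for that model call, and the function name and transformations are assumptions, not the paper's implementation.

```python
import random

def proliferate_jailbreak(seed_prompt, n_variants=5, rng=None):
    """Generate variations of an observed jailbreak prompt.

    NOTE: a hypothetical sketch. The paper prompts an LLM to produce
    paraphrases and stylistic mutations; these string transforms merely
    stand in for that generation step.
    """
    rng = rng or random.Random(0)
    transforms = [
        lambda p: p.upper(),                    # capitalization variant
        lambda p: p.replace(" ", "  "),         # spacing perturbation
        lambda p: "Ignore prior rules. " + p,   # prefix injection
        lambda p: p + " Respond in JSON.",      # suffix mutation
        lambda p: p[::-1],                      # reversed-text obfuscation
    ]
    return [rng.choice(transforms)(seed_prompt) for _ in range(n_variants)]
```

The proliferated variants are then added to the training pool for the defense, so the system generalizes beyond the exact attack string it observed.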
The paper introduces RapidResponseBench, a benchmark to evaluate how well different rapid response techniques work. It tests both in-distribution attacks (similar to observed examples) and out-of-distribution attacks (novel variations), while also measuring the impact on legitimate queries.
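The benchmark's three axes of evaluation can be summarized in a small scoring function. This is a simplified sketch under an assumption the paper does not make: that a defense is a boolean block/allow predicate over prompts (the real benchmark judges whether the model's response actually complies with the attack). All names here are illustrative.

```python
def evaluate_defense(defense, in_dist, out_dist, benign):
    """Score a defense on the three RapidResponseBench-style axes.

    defense  -- callable(prompt) -> True if the prompt is blocked
                (a simplification; the benchmark checks model responses)
    in_dist  -- jailbreaks similar to the observed examples
    out_dist -- novel variations of the attack strategy
    benign   -- legitimate user queries that should NOT be blocked
    """
    asr_in = sum(not defense(p) for p in in_dist) / len(in_dist)
    asr_out = sum(not defense(p) for p in out_dist) / len(out_dist)
    refusal = sum(defense(p) for p in benign) / len(benign)
    return {
        "asr_in_dist": asr_in,        # lower is better
        "asr_out_dist": asr_out,      # lower is better
        "benign_refusal": refusal,    # lower is better (usability)
    }
```

A good rapid-response method drives both attack success rates down while keeping the benign refusal rate near zero.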
Five different rapid response methods are implemented and evaluated, with the most effective being "Guard Fine-tuning" - fine-tuning an input classifier using the proliferated jailbreak examples along with benign prompts. This helps the system learn to recognize and block similar attacks while maintaining normal functionality for legitimate users.
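The idea behind Guard Fine-tuning can be illustrated with a toy classifier: label proliferated jailbreaks as "block" and benign prompts as "allow", then fit an input filter on both. The paper fine-tunes an LLM-based guard model; this sketch substitutes a tiny bag-of-words logistic regression in pure Python, so every name and detail here is an assumption chosen for readability, not the paper's setup.

```python
import math
from collections import Counter

def _sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def featurize(prompt):
    """Bag-of-words token counts (stand-in for guard-model features)."""
    return Counter(prompt.lower().split())

def train_guard(jailbreaks, benign, epochs=50, lr=0.5):
    """Fit a tiny logistic input classifier: 1 = block, 0 = allow.

    NOTE: a toy stand-in for fine-tuning an LLM guard model on
    proliferated jailbreaks plus benign prompts.
    """
    data = [(featurize(p), 1.0) for p in jailbreaks] + \
           [(featurize(p), 0.0) for p in benign]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = bias + sum(weights.get(t, 0.0) * c for t, c in x.items())
            grad = _sigmoid(z) - y  # logistic-loss gradient
            bias -= lr * grad
            for t, c in x.items():
                weights[t] = weights.get(t, 0.0) - lr * grad * c
    return weights, bias

def guard_blocks(model, prompt, threshold=0.5):
    """Return True if the trained guard would block this prompt."""
    weights, bias = model
    x = featurize(prompt)
    z = bias + sum(weights.get(t, 0.0) * c for t, c in x.items())
    return _sigmoid(z) > threshold
```

Because the classifier sees both attack variants and benign traffic, it learns to block prompts resembling the proliferated jailbreaks while leaving ordinary queries untouched, which is the trade-off the benchmark measures.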
Results
Among the five methods evaluated, Guard Fine-tuning achieved the strongest results: it sharply reduced the success rate of both in-distribution and out-of-distribution attacks, even after observing only a handful of examples per jailbreak strategy, while keeping the refusal rate on benign queries low.
Conclusion
The paper demonstrates that rapid response is a promising alternative to static defenses for protecting LLMs. By quickly adapting to new threats using jailbreak proliferation and fine-tuning techniques, systems can significantly reduce the success rate of both known and novel attack variants while maintaining usability for legitimate users. For more information, please consult the full paper.
Congrats to the authors for their work!
Peng, Alwin, et al. "Rapid Response: Mitigating LLM Jailbreaks with a Few Examples." arXiv preprint arXiv:2411.07494 (2024).