SAIL: Self-Improving Efficient Online Alignment of Large Language Models
Most existing #LLM #alignment & #RLHF pipelines rely on offline data or an oracle teacher model, so they are constrained by data quality and model limits. This often results in suboptimal performance on new, real-world data.
How do we go beyond the limitations of static data? How can we achieve better-quality responses? Can the model achieve self-improvement via self-selection of preferences, eliminating the costly human feedback bottleneck?
Online generation of responses using the model itself is the key!
However, existing online RLHF overlooks the interdependence between the data and the model: the responses used to (implicitly) fit the reward that guides model updates are generated by the model itself.
In a prior paper, PARL, we model this interdependence correctly using bilevel optimization: an upper-level reward optimization that depends on the optimal policy π*, which is itself the solution of a lower-level RL problem.
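Schematically, the bilevel structure looks something like the following; this is my own notation as a sketch, not the exact formulation from PARL, with U standing for the upper-level reward-learning objective and π_ref for a reference policy:

```latex
% Schematic bilevel structure (my notation, not verbatim from PARL/SAIL):
% upper level -- learn the reward from responses drawn from the optimal policy \pi^*_r
\max_{r}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi^*_r(\cdot \mid x)} \big[ U(r; x, y) \big]
% lower level -- \pi^*_r is the KL-regularized RL solution for that reward
\text{s.t.}\quad \pi^*_r \in \arg\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\!\big( \pi \,\|\, \pi_{\mathrm{ref}} \big)
```

The key point is the coupling: the upper-level objective is evaluated on responses generated by the lower-level solution, so the two levels cannot be optimized independently.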
However, while bilevel optimization is a principled approach to online RLHF, it suffers from computational tractability issues and requires estimating hyper-gradients.
Introducing SAIL, which transforms the bilevel problem into a single-level optimization. It turns out that, compared to DPO gradient updates, SAIL's update has an additional term that induces exploration.
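For intuition, here is the standard DPO loss alongside a schematic of where an extra gradient term appears once the preference pairs are sampled from the current policy π_θ itself. This is my sketch of the mechanism, not the exact expression derived in the paper:

```latex
% Standard DPO loss on preference pairs (y_w preferred to y_l), with reference policy \pi_{ref}:
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \Big[ \log \sigma\Big(
      \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]
% Schematic online gradient when the pairs themselves are sampled from \pi_\theta:
\nabla_\theta \mathcal{L}_{\mathrm{online}}
  \;\approx\; \underbrace{\nabla_\theta \mathcal{L}_{\mathrm{DPO}}}_{\text{standard DPO term}}
  \;+\; \underbrace{\mathbb{E}\big[ \ell_{\mathrm{DPO}}(x, y_w, y_l)\, \nabla_\theta \log \pi_\theta(y_w, y_l \mid x) \big]}_{\text{extra term from the sampling distribution}}
```

The second term exists because the data distribution now depends on θ; it is this dependence, ignored by offline DPO, that produces the exploration-inducing correction mentioned above.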
SAIL comes as a unified framework with user-defined online adaptability: you can choose static or dynamic responses, and static or dynamic preferences.
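A minimal, hypothetical configuration sketch of those four regimes follows; the names and fields are mine for illustration, not the actual SAIL codebase:

```python
# Hypothetical config sketch (not the real SAIL API): the four regimes come
# from crossing the response source with the preference source.
from dataclasses import dataclass
from itertools import product


@dataclass
class SailConfig:
    dynamic_responses: bool    # True: sample responses from the current policy; False: reuse offline responses
    dynamic_preferences: bool  # True: re-label preferences online; False: keep the offline preference labels


# Enumerate the static/dynamic combinations the post refers to.
for resp, pref in product([False, True], repeat=2):
    print(SailConfig(dynamic_responses=resp, dynamic_preferences=pref))
```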
SAIL dramatically improves alignment with reduced computational demands.
This work paves the way for more resilient and adaptable language models that better reflect evolving human preferences. We're excited to see where this leads us! Read more about SAIL and its implications here: https://arxiv.org/abs/2406.15567 .
SAIL is joint work with my awesome collaborators Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, and Amrit Singh Bedi.