SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Most existing #LLM #alignment and #RLHF methods rely on offline data or an oracle teacher model, and are therefore constrained by the quality of that data and the limits of that model. This often results in suboptimal performance on new, real-world data.

The responses to a prompt and the preference labels come either from an offline dataset or from an oracle teacher/reference model.

How do we go beyond the limitations of static data? How can we achieve better-quality responses? Can the model achieve self-improvement via self-selection of preferences, eliminating the costly human feedback bottleneck?

Online generation of responses using the model itself is the key!


The responses are generated by the model itself. The model also self-critiques.
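A minimal sketch of what one such online self-improvement step could look like (illustrative only; `policy.generate` and `policy.self_critique_score` are hypothetical placeholders, not SAIL's actual API):

```python
# Sketch of one online self-improvement step: the model generates its own
# responses and labels its own preferences, then a DPO-style update is applied.
# The method names on `policy` are hypothetical, for illustration only.

def online_preference_step(policy, prompts, dpo_update):
    preference_pairs = []
    for x in prompts:
        # 1. The model itself generates candidate responses (online data).
        y1 = policy.generate(x)
        y2 = policy.generate(x)

        # 2. The model also self-critiques: it scores its own responses,
        #    so no human annotator or oracle teacher is needed for labels.
        s1 = policy.self_critique_score(x, y1)
        s2 = policy.self_critique_score(x, y2)
        y_win, y_lose = (y1, y2) if s1 >= s2 else (y2, y1)
        preference_pairs.append((x, y_win, y_lose))

    # 3. Update the model on the self-generated, self-labelled pairs.
    dpo_update(policy, preference_pairs)
    return policy
```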

However, existing online RLHF overlooks the interdependence between the data and the model: the responses used to (implicitly) fit the reward that guides model updates are themselves generated by the model being updated.



In a prior paper, PARL, we model this interdependence correctly using bilevel optimization: an upper-level reward-learning problem that relies on the optimal policy π*, which is itself the solution of a lower-level RL problem.
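Schematically, the bilevel structure looks something like the following (a rough sketch in my own notation of the KL-regularized RLHF bilevel problem, not the exact objective from the papers):

```latex
% Schematic bilevel RLHF objective (requires amsmath; notation illustrative):
% upper level: fit the (implicit) reward on preferences over responses drawn
% from the optimal lower-level policy; lower level: KL-regularized RL.
\[
\begin{aligned}
\min_{\theta} \quad & \mathbb{E}_{x \sim \mathcal{D},\; (y_w, y_l) \sim \pi^{*}_{\theta}(\cdot \mid x)}
  \big[ -\log \sigma\big( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \big) \big] \\
\text{s.t.} \quad & \pi^{*}_{\theta} = \arg\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r_{\theta}(x, y) \big]
  \;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\big( \pi \,\|\, \pi_{\mathrm{ref}} \big).
\end{aligned}
\]
```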

The bilevel formulation avoids using suboptimal data generated in previous rounds for (implicit) reward learning.

However, while bilevel optimization is a principled approach to online RLHF, it suffers from computational tractability issues and requires estimating hyper-gradients.

Introducing SAIL, which transforms the bilevel problem into a single-level optimization. It turns out that, compared to the DPO gradient update, SAIL's update has an additional term that induces exploration.
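For reference, here is the standard DPO loss that this update extends, as a minimal PyTorch-style sketch; SAIL's extra exploration-inducing term is not reproduced here, so see the paper for its exact form:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on a batch of (winner, loser) response pairs.

    logp_*     : summed log-probs of the responses under the current policy
    ref_logp_* : summed log-probs under the frozen reference policy
    SAIL's single-level objective adds an exploration-inducing term on top
    of this gradient (not shown here).
    """
    # Implicit rewards are the (scaled) log-prob ratios against the reference.
    chosen_rewards = beta * (logp_w - ref_logp_w)
    rejected_rewards = beta * (logp_l - ref_logp_l)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```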

SAIL comes as a unified framework offering user-defined online adaptability: you can select static or dynamic responses, and static or dynamic preferences.
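One way to picture this 2×2 design space (the variant names and flags below are mine, purely illustrative, not SAIL's actual configuration options):

```python
# Illustrative 2x2 design space for SAIL-style online adaptability.
# Names and flags are hypothetical, chosen only to convey the choices.
SAIL_VARIANTS = {
    # responses from a fixed offline dataset, preferences from a fixed labeler
    "offline":            {"dynamic_responses": False, "dynamic_preferences": False},
    # responses regenerated by the current model, labels from a fixed labeler
    "online-responses":   {"dynamic_responses": True,  "dynamic_preferences": False},
    # offline responses, but preferences re-labelled by the current model
    "online-preferences": {"dynamic_responses": False, "dynamic_preferences": True},
    # fully online: the model generates and labels its own preference data
    "fully-online":       {"dynamic_responses": True,  "dynamic_preferences": True},
}
```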

SAIL dramatically improves alignment with reduced computational demands.


This work paves the way for more resilient and adaptable language models that better reflect evolving human preferences. We're excited to see where this leads us! Read more about SAIL and its implications here: https://arxiv.org/abs/2406.15567 .


SAIL is joint work with my awesome collaborators Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, and Amrit Singh Bedi.

