LLM Alignment: Direct Preference Optimization
Jayant Kumar
Principal ML Scientist at Adobe | Technical Advisor at Preffect | Multimodal AI | Large language models and Knowledge Graph applications
In the realm of language models (LMs), alignment is essential to ensure that the outputs generated by these models meet human preferences and expectations. Direct Preference Optimization (DPO) is a groundbreaking algorithm developed by Stanford researchers that simplifies the alignment process compared to traditional methods like Reinforcement Learning from Human Feedback (RLHF). In this article, I delve into this topic to share my understanding, based on an insightful talk given by Lewis Tunstall and Edward Beeching from Hugging Face about their work on Zephyr.
Why Align Language Models?
Alignment in language models involves fine-tuning models so that their outputs align with human values and preferences. This process is essential to make LMs useful for practical applications, such as chatbots and virtual assistants.
Initially, language models are pre-trained on vast datasets to predict the next token in a sequence. While powerful, these models often need additional tuning to ensure their responses align with human expectations, especially in specific contexts like customer service.
Traditional Alignment Techniques
Supervised Fine-Tuning (SFT)
Supervised fine-tuning involves training models on a curated dataset of questions and answers. This step helps models generate more contextually appropriate responses but may still carry biases from the training data.
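To make the SFT step concrete, here is a minimal sketch using the Transformers library. The model name and the question-answer pair are placeholders; a real run would use a curated dataset, batching, and an optimizer loop.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model purely for illustration; any causal LM follows the same pattern.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One curated question-answer pair from a hypothetical SFT dataset.
text = (
    "Question: How do I reset my password?\n"
    "Answer: Open Settings > Security and choose 'Reset password'."
)
batch = tokenizer(text, return_tensors="pt")

# SFT is plain next-token cross-entropy on the curated text.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # one gradient step; optimizer and batching omitted for brevity
```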
Reinforcement Learning from Human Feedback (RLHF)
RLHF, pioneered by OpenAI, involves training a model using human feedback to rank responses. This process includes:
- Generating multiple responses to a prompt.
- Having human labelers rank these responses.
- Training a reward model to predict the preferred responses based on these rankings (a sketch of this pairwise loss follows the list).
- Fine-tuning the model using reinforcement learning algorithms like Proximal Policy Optimization (PPO).
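The reward-model step boils down to a pairwise (Bradley-Terry style) loss that pushes the score of the human-preferred response above that of the rejected one. A minimal sketch, with toy tensors standing in for a real reward model's outputs:

```python
import torch
import torch.nn.functional as F

# Toy scalar rewards standing in for reward_model(prompt, response) outputs.
reward_chosen = torch.tensor([1.7, 0.9], requires_grad=True)    # human-preferred responses
reward_rejected = torch.tensor([0.4, 1.1], requires_grad=True)  # rejected responses

# Pairwise loss: the preferred response should receive the higher score.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
```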
Direct Preference Optimization (DPO)
DPO eliminates the need for a separate reward model and reinforcement learning. Instead, it integrates preference learning directly into the language model's training process, making it simpler and more efficient.
How DPO Works
- Prompt and Response Pairing: Provide a prompt and two responses—one preferred and one not.
- Log Probability Ratios: Compute the log probabilities of the preferred and non-preferred responses under the model being trained and under a frozen reference model (typically the SFT checkpoint), and form the log-ratios.
- Optimization: Use these log-ratios to adjust the model weights through backpropagation, encouraging the model to favor preferred responses (see the sketch below).
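Putting the three steps together, the DPO loss fits in a few lines. This is a simplified sketch: the sequence-level log-probabilities are stand-in numbers, the frozen reference model is typically the SFT checkpoint, and beta is the usual trade-off hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the trained policy against the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the preferred response's log-ratio above the rejected one's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy sequence-level log-probabilities (sums over response tokens in practice).
policy_chosen = torch.tensor([-12.3], requires_grad=True)
policy_rejected = torch.tensor([-11.8], requires_grad=True)
ref_chosen = torch.tensor([-12.5])
ref_rejected = torch.tensor([-11.9])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # standard backpropagation updates the policy weights
```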
Benefits of DPO
- Simplicity: Reduces complexity by eliminating the need for a separate reward model.
- Efficiency: No separate reward-model training or reinforcement-learning loop, so training is cheaper and typically converges faster than RLHF.
- Differentiable: Fully differentiable, allowing for straightforward optimization using standard backpropagation techniques.
Practical Applications
Implementation Example
The Hugging Face team applied DPO to the Mistral 7B base model to build Zephyr. By leveraging synthetic feedback datasets such as UltraFeedback and fine-tuning, they achieved a model competitive with much larger models on chat benchmarks.
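A minimal sketch of what such a run looks like with TRL's DPOTrainer. The model and dataset names follow the Zephyr recipe, but exact argument names vary across TRL versions, and a real run needs far more configuration (LoRA or quantization, chat templating, multi-GPU, etc.).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # in practice, start from the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns
# (may need light preprocessing depending on the TRL version).
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(output_dir="mistral-dpo", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(
    model=model,                 # the reference model defaults to a frozen copy
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```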
Industry Adoption
DPO has become a popular alignment technique in the open-source community, with libraries like Hugging Face's TRL (Transformer Reinforcement Learning) and Axolotl supporting its implementation. Researchers continue to explore and expand DPO's capabilities, including online and iterative training methods.
Enhancements and Alternatives
Researchers are continuously seeking to improve alignment techniques. Notable advancements include:
- Identity Preference Optimization (IPO): Adds a regularization term to the preference objective to prevent overfitting to the preference data (a toy comparison with the DPO loss follows this list).
- Kahneman-Tversky Optimization (KTO): Simplifies preference data collection by decoupling good and bad responses, requiring only binary "good"/"bad" labels rather than paired preferences.
- Iterative DPO by Snorkel: The model is improved through successive rounds of preference-based training, incorporating fresh feedback at each round to refine alignment progressively.
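To see how IPO differs from DPO at the loss level, here is a toy comparison on the same preference margin (the difference of policy-vs-reference log-ratios from the DPO sketch above); the numbers and hyperparameters are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Toy margins: (log-ratio of the chosen response) - (log-ratio of the rejected one).
margin = torch.tensor([0.6, -0.2, 1.4])

beta = 0.1
dpo_loss = -F.logsigmoid(beta * margin).mean()     # DPO: logistic loss, keeps pushing margins up

tau = 0.1
ipo_loss = ((margin - 1 / (2 * tau)) ** 2).mean()  # IPO: squared loss regularizes margins toward 1/(2*tau)
```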
The research community is actively exploring new datasets and methodologies to refine DPO further. These efforts aim to make LMs even more reliable and aligned with human values, enhancing their practical utility.
Conclusion
Direct Preference Optimization represents a significant leap forward in aligning language models with human preferences. Its simplicity, efficiency, and effectiveness make it a valuable tool for developing advanced, user-aligned LMs. As research progresses, we can expect even more innovative solutions to emerge, further bridging the gap between machine intelligence and human expectations.