LoRA Learns Less and Forgets Less
Today's paper explores the performance of Low-Rank Adaptation (LoRA) compared to full finetuning for large language models on coding and math tasks. LoRA is a parameter-efficient fine-tuning method that trains low-rank perturbations to the model's weight matrices instead of the full weights.
LoRA Overview
LoRA involves freezing the pretrained weight matrices of a language model and only training low-rank perturbations to selected weight matrices. Specifically, for a weight matrix W_pretrained (size d x k), LoRA computes the finetuned weights as:
W_finetuned = W_pretrained + ΔW
Where ΔW = AB is a low-rank matrix product with A (size d x r) and B (size r x k) being smaller matrices of specified rank r. The user chooses which W_pretrained to adapt (“target modules”) and the rank r. This reduces the number of trainable parameters significantly compared to full finetuning of W_pretrained.
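To make this concrete, here is a minimal sketch (not the authors' code) of a LoRA-style wrapper around a PyTorch linear layer; the class name LoRALinear and the rank/alpha hyperparameters are illustrative choices, following the A (d x r), B (r x k) notation above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Illustrative LoRA-style layer: output uses W_pretrained + ΔW, with ΔW = AB of rank r."""
    def __init__(self, base_linear: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base_linear
        # Freeze the pretrained weights; only A and B are trained.
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d, k = base_linear.out_features, base_linear.in_features
        # Low-rank factors: A (d x r) starts at zero so ΔW = 0 at initialization, B (r x k) is random.
        self.A = nn.Parameter(torch.zeros(d, rank))
        self.B = nn.Parameter(torch.randn(rank, k) * 0.01)
        self.scale = alpha / rank

    def forward(self, x):
        delta_w = self.A @ self.B  # ΔW = AB, a rank-r matrix of size d x k
        return self.base(x) + F.linear(x, self.scale * delta_w)

For a 4096 x 4096 target module with r = 16, this trains about 2 * 4096 * 16 ≈ 131K parameters instead of the ~16.8M in the full weight matrix.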
The key idea is that by training fewer parameters, LoRA can provide a form of regularization that prevents the model from deviating too much from its initial pretrained behavior on non-target tasks, while still allowing specialization on the target task. In this paper, the authors thoroughly compare the performance of LoRA and full finetuning on coding and math tasks.
Results
LoRA substantially underperforms full finetuning in terms of accuracy on target domain evaluations like HumanEval for coding and GSM8K for math. The gap is more pronounced for coding tasks.
However, LoRA does exhibit less forgetting of general capabilities compared to full finetuning when evaluated on benchmarks like HellaSwag, WinoGrande and ARC-Challenge that test language understanding and reasoning.
The authors characterize a learning-forgetting tradeoff curve, showing that while LoRA learns less than full finetuning on the target task, it can sometimes achieve comparable target performance with less forgetting of general skills.
LoRA also provides stronger regularization compared to techniques like dropout and weight decay, helping maintain more diverse output generations.
Conclusion
The paper shows that LoRA performs worse than full finetuning on both coding and math tasks. Additionally, it shows that LoRA keeps the finetuned model's behavior closer to that of the base model, with less source-domain forgetting and more varied outputs at inference time, so a better trade-off between learning new information and forgetting existing capabilities can be achieved. For more information, please consult the full paper.
Congrats to the authors for their work!
Biderman, Dan, et al. "LoRA Learns Less and Forgets Less." arXiv preprint arXiv:2405.09673, 2024.