LoRA Learns Less and Forgets Less
Credit: https://arxiv.org/pdf/2405.09673

Today's paper explores the performance of Low-Rank Adaptation (LoRA) compared to full finetuning for large language models on coding and math tasks. LoRA is a parameter-efficient fine-tuning method that trains low-rank perturbations to the model's weight matrices instead of the full weights.

LoRA Overview

LoRA freezes the pretrained weight matrices of a language model and trains only low-rank perturbations to selected weight matrices. Specifically, for a weight matrix W_pretrained (size d x k), LoRA computes the finetuned weights as:

W_finetuned = W_pretrained + ΔW

Where ΔW = AB is a low-rank matrix product with A (size d x r) and B (size r x k) being smaller matrices of specified rank r. The user chooses which W_pretrained to adapt (“target modules”) and the rank r. This reduces the number of trainable parameters significantly compared to full finetuning of W_pretrained.
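
To make the update rule and the parameter savings concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: the helper names (matmul, lora_finetuned_weights, trainable_params), the toy dimensions, and the choice to zero-initialize B so that ΔW starts at zero are all assumptions made for this example.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_finetuned_weights(W_pretrained, A, B):
    """Return W_pretrained + A @ B, leaving W_pretrained untouched (frozen)."""
    delta_W = matmul(A, B)  # d x k, but rank at most r
    return [[w + dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W_pretrained, delta_W)]

def trainable_params(d, k, r):
    """Trainable parameter counts as (full finetuning, LoRA)."""
    return d * k, r * (d + k)

# Toy dimensions, chosen only for illustration.
d, k, r = 6, 4, 2
rng = random.Random(0)
W_pretrained = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]  # d x r
B = [[0.0] * k for _ in range(r)]  # r x k, zero-initialized (assumption)

# With B = 0, the perturbation is zero and the model starts at the pretrained weights.
W_finetuned = lora_finetuned_weights(W_pretrained, A, B)
assert W_finetuned == W_pretrained

# For a hypothetical 4096 x 4096 matrix with rank 16:
full, lora = trainable_params(4096, 4096, 16)
print(full, lora, full // lora)  # 16777216 131072 128
```

In this hypothetical setting, LoRA trains r(d + k) parameters instead of d·k, a 128-fold reduction for that one matrix. Only A and B would receive gradient updates during training; W_pretrained stays fixed.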

The key idea is that by training fewer parameters, LoRA can provide a form of regularization that prevents the model from deviating too much from its initial pretrained behavior on non-target tasks, while still allowing specialization on the target task. In this paper, the authors thoroughly compare the performance of LoRA and full finetuning on coding and math tasks.

Results

LoRA substantially underperforms full finetuning in terms of accuracy on target domain evaluations like HumanEval for coding and GSM8K for math. The gap is more pronounced for coding tasks.

However, LoRA does exhibit less forgetting of general capabilities compared to full finetuning when evaluated on benchmarks like HellaSwag, WinoGrande and ARC-Challenge that test language understanding and reasoning.

The authors characterize a learning-forgetting tradeoff curve, showing that while LoRA learns less than full finetuning on the target task, it can sometimes achieve comparable target performance with less forgetting of general skills.

LoRA also provides stronger regularization compared to techniques like dropout and weight decay, helping maintain more diverse output generations.

Conclusion

The paper shows that LoRA performs worse than full finetuning on both coding and math tasks. It also shows that LoRA keeps the finetuned model's behavior closer to that of the base model, with less source-domain forgetting and more varied outputs during inference, which can give a better trade-off between learning new information and forgetting. For more information, please consult the full paper.

Congrats to the authors for their work!

Biderman, Dan, et al. "LoRA Learns Less and Forgets Less." ArXiv Preprint ArXiv:2405.09673, 2024.
