Fine-Tuning LLMs: Parameter Efficient Fine Tuning (PEFT) — LoRA & QLoRA — Part 1
Vishwas N.
Sr. Solutions Engineer at Innatemetrics | Reinventing AI Acceleration for Enterprise (R2V2.ai) | Playing the Big Boy Sport (Startups)
In this blog, we'll explore Parameter Efficient Fine Tuning (PEFT) and two important PEFT methods: LoRA and QLoRA. These techniques allow us to fine-tune large language models (LLMs) for specific tasks with minimal resources and infrastructure.
Motivation
In the world of AI and NLP, achieving desired results from LLMs involves three main approaches:
1. Prompt Engineering: Crafting prompts to get desired responses.
2. Creating a New Model: Training a model from scratch, which is resource-intensive.
3. Fine-Tuning Existing Models: Adapting pre-trained models to specific tasks; full fine-tuning of every parameter can still be costly and resource-intensive.
PEFT offers a way to fine-tune models efficiently, reducing the need for extensive resources while maintaining high performance.
Parameter Efficient Fine Tuning (PEFT)
PEFT is a family of techniques that fine-tunes a model by training only a small number of new or existing parameters while keeping the rest of the pre-trained weights frozen, which sharply reduces compute and memory requirements. Two popular PEFT methods are LoRA and QLoRA.
Low-Rank Adaptation (LoRA)
LoRA is a modular approach to fine-tuning. Instead of updating the entire network, LoRA injects a small set of trainable low-rank matrices alongside the frozen pre-trained weights. This reduces the memory footprint and the number of trainable parameters, making fine-tuning far more efficient.
How LoRA Works:
1. Freeze Original Parameters: The pre-trained parameters of the original model are frozen.
2. Add Low-Rank Parameters: Two new matrices, WA and WB, are added alongside each frozen weight matrix. They have much smaller shapes, d x r and r x d respectively, where 'd' is the dimension of the original weight matrix and 'r' is the chosen low-rank dimension (with r much smaller than d).
3. Compute Results: The output of the frozen weights and the output of the low-rank path are computed and summed to produce the layer's final result (see the sketch after this list).
4. Train Low-Rank Parameters: During backpropagation, only WA and WB are updated based on the loss; the frozen weights receive no gradient updates.
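To make the four steps concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name, the rank r=8, and the alpha scaling are illustrative choices, not part of any particular library; it assumes a square d x d weight matrix to match the description above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen d x d linear layer with a trainable low-rank update WA @ WB."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # step 1: freeze original parameters
            p.requires_grad = False
        d = base.in_features
        self.W_A = nn.Parameter(torch.zeros(d, r))              # step 2: d x r, starts at zero
        self.W_B = nn.Parameter(torch.randn(r, d) * 0.01)       #         r x d, small random init
        self.scale = alpha / r

    def forward(self, x):
        # step 3: frozen path plus low-rank path, summed into the final output
        return self.base(x) + (x @ self.W_B.T @ self.W_A.T) * self.scale

# step 4: during training, only W_A and W_B receive gradients
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```

With r=8 and d=768, the adapter adds only 2 x 768 x 8 parameters per layer, versus 768 x 768 for the frozen weight it sits next to.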
Benefits of LoRA:
- Reduces the number of parameters that need to be trained.
- Avoids catastrophic forgetting by not altering the original weights.
- Achieves high modularity: trained LoRA adapters are preserved as small, distinct modules that can be swapped in and out of the base model (a configuration sketch follows this list).
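In practice, libraries such as Hugging Face's peft package this workflow. The sketch below assumes that library is installed; the model name and target_modules are illustrative and depend on the architecture being fine-tuned.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which linear layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base model

# The adapter is saved as a small, separate module; the base weights stay untouched.
model.save_pretrained("./lora-adapter")
```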
Quantized Low-Rank Adaptation (QLoRA)
QLoRA extends LoRA by quantizing the frozen weights of the original network from a higher-precision data type (such as Float32) to a lower-precision 4-bit data type (NormalFloat, NF4), while the LoRA adapters themselves are still trained in higher precision. This further reduces memory demands, making it possible to fine-tune much larger models on limited hardware.
Key Optimizations in QLoRA:
1. 4-bit NormalFloat (NF4) Quantization: Normalizes the weights and quantizes them to a 4-bit data type whose levels are matched to a normal distribution, sharply reducing the memory footprint.
2. Double Quantization: Quantizes the quantization constants to further optimize memory usage.
3. Paged Optimizers (Unified Memory Paging): Uses NVIDIA's unified memory feature to page optimizer state between GPU and CPU, avoiding out-of-memory failures during memory spikes (a loading sketch follows this list).
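These three optimizations map onto configuration options in the bitsandbytes / transformers / peft stack. The following sketch is one way to load a 4-bit base model for QLoRA-style training; the model name is an assumption, and exact flags may vary with library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (optimization 1)
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants (optimization 2)
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
# Paged optimizers (optimization 3) are enabled at training time, e.g. with
# transformers.TrainingArguments(optim="paged_adamw_32bit", ...).
```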
How QLoRA Works:
1. Normalization & Quantization: Weights in each block are rescaled (the method assumes they are roughly zero-mean and normally distributed), and each weight is mapped to the nearest of 16 discrete NF4 levels so it can be stored in 4 bits.
2. Dequantization: At compute time, the 4-bit values are mapped back through the stored scaling constants to recover an approximation of the original weights.
3. Double Quantization: Quantizes the quantization constants themselves (the per-block scaling factors) to squeeze out further memory savings (see the toy sketch below).
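As a toy illustration of the quantize/dequantize round trip, the sketch below uses 16 evenly spaced levels in place of the true NF4 quantile levels, and a per-block absolute-maximum scale as the quantization constant; double quantization would additionally quantize those stored constants.

```python
import numpy as np

def quantize_block(w, levels):
    """Blockwise quantization: rescale into [-1, 1], snap each weight to the nearest level."""
    absmax = np.abs(w).max()                        # quantization constant, stored per block
    idx = np.abs(levels[None, :] - (w / absmax)[:, None]).argmin(axis=1)
    return idx.astype(np.uint8), absmax             # 4-bit indices plus one float constant

def dequantize_block(idx, absmax, levels):
    """Dequantization: recover an approximation of the original weights."""
    return levels[idx] * absmax

# 16 evenly spaced levels stand in for the NF4 quantile levels used by QLoRA.
levels = np.linspace(-1.0, 1.0, 16)
w = np.random.randn(64).astype(np.float32)
idx, absmax = quantize_block(w, levels)
w_hat = dequantize_block(idx, absmax, levels)
print(np.abs(w - w_hat).max())                      # small, but nonzero, reconstruction error
```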
Conclusion
LoRA and QLoRA are powerful techniques for parameter-efficient fine-tuning. They allow us to adapt LLMs to specific tasks with minimal resources and infrastructure, making them highly efficient and cost-effective.
In the next part, we'll implement QLoRA. Until then, have fun with LLMs!
(Figure: a big-picture overview of fine-tuning methods.)