The Origination of Eight Major Methods for Fine-Tuning an LLM
I delve into the origination, as well as a brief description, of eight methods that use targeted-parameter fine-tuning of Large Language Models (LLMs). I discuss in detail gradient-based ranking, LoRA, QLoRA, and four others as advanced variations of ULMFiT's core idea: selecting a small subset of the available parameters in a trained LLM.
Overview of Large Language Models
Large Language Models (LLMs) are leading the AI movement.
These LLMs vary widely in the tasks that they can accomplish, but all of them, currently, are described in terms of the number of parameters and the amount of text they were trained on.
Fine-tuning LLMs has emerged as a crucial technique to adapt these models to specific tasks and improve their performance.
I review the evolution of targeted parameter fine-tuning of LLMs, describe in detail five of these fine-tuning methods, and ponder where we might be headed in fine-tuning.
Early Fine-tuning Methods
In the early days, fine-tuning was considered a finesse or trick to boost performance in data science competitions, such as those on Kaggle.
The earliest fine-tuning methods were simple and straightforward. They involved taking a pre-trained language model (the term at the time was NLP, Natural Language Processing) and fine-tuning it on a small dataset of labeled data. The goal was to improve the model's performance on the labeled data by adjusting its parameters.
As LLMs grew in size and were trained on vast amounts of text, they began to exhibit a general understanding of language tasks, including spelling, grammar, and contextual relationships between words.
However, LLMs did poorly on, or entirely lacked, the ability to perform tasks outside the realm of text comprehension, such as coding, image-related tasks, or mathematical calculations. This limitation sparked the need for further training, or fine-tuning, to equip LLMs with additional skills.
1. Universal Language Model Finetuning (ULMFiT)
One of the first papers I read on a fine-tuning method, published in May 2018, introduced Universal Language Model Fine-tuning (ULMFiT).
ULMFiT establishes a baseline of techniques from which all of the other fine-tuning methods in this article can be described and compared.
Let's delve into the three major steps of ULMFiT:
1. General-domain language model pretraining on a large corpus.
2. Target task fine-tuning of the language model, updating only a targeted subset of its layers (parameters).
3. Target task classifier fine-tuning, gradually unfreezing layers from the top down.
The latest LLM architectures, namely transformers and attention, make ULMFiT step 3 unnecessary.
ULMFiT Step 2: Target Model Layer (Parameter) Fine-tuning
It is step 2 of ULMFiT, selecting a subset of parameters, that is the parent of the three more sophisticated fine-tuning approaches discussed below: LoRA, QLoRA, and the umbrella of other fine-tuning methods, HuggingFace PEFT.
“Selective Parameter Subset fine-tuning” was born from the behavior of the neural network layers of vision models: the lower-layer parameters hold crude or general patterns of an image, while the top-layer parameters hold more complete patterns specific to the images on which the vision model was trained.
The goal of “Selective Parameter Subset fine-tuning” was that, by fine-tuning a language model in the same manner as a vision model, the model retains its general linguistic knowledge (analogous to the general features in a vision model) while adapting its higher layers to the target task.
The ULMFiT paper introduced three techniques that together produced “Selective Parameter Subset fine-tuning”: discriminative fine-tuning (a separate learning rate for each layer), slanted triangular learning rates, and gradual unfreezing of layers from the top down.
ULMFiT was established by training a language model on a large corpus and then fine-tuning it with task-specific data; a minimal sketch of the layer-freezing idea follows.
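Below is a minimal PyTorch sketch of that layer-freezing idea. It assumes a hypothetical model that exposes its stacked blocks as `model.layers`; it illustrates selective parameter subset fine-tuning in general, not the original ULMFiT implementation (which used AWD-LSTM models).

```python
import torch.nn as nn

def freeze_lower_layers(model: nn.Module, n_trainable_top: int = 2):
    """ULMFiT-style selective fine-tuning: keep the lower layers frozen and
    train only the top few layers on the target task."""
    layers = list(model.layers)                  # assumes an ordered stack of blocks
    for layer in layers[:-n_trainable_top]:
        for p in layer.parameters():
            p.requires_grad = False              # frozen: retains general linguistic knowledge
    for layer in layers[-n_trainable_top:]:
        for p in layer.parameters():
            p.requires_grad = True               # trainable: adapts to the target task
```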
2. Gradient-based parameter importance ranking or Random Forest-based importance ranking
Similar to ULMFiT, parameter importance ranking (PIR) chooses a subset of the parameters to update during the fine-tuning process.
The heart of PIR is its parameter selection process. It analyzes the task at hand, assigns an importance ranking to every parameter in the model, and then selects the ones that have the highest impact on the task's performance.
ULMFiT freezes layers, leaving the parameters in those layers unchanged. PIR instead freezes non-selected parameters, no matter which layer they are in. By freezing parameters rather than layers, PIR preserves the LLM's knowledge in areas that aren't directly relevant to the task.
The selection of parameters is THE crucial step in PIR. Through task analysis and importance ranking, PIR aims to identify the parameters that have the highest impact on the task's performance, although this remains unproven.
Two prominent ways to implement PIR are gradient-based importance ranking and Random Forest-based importance ranking, as named in the heading above; a minimal sketch of the gradient-based variant follows.
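Here is a minimal PyTorch sketch of the gradient-based variant, assuming a generic `model`, `loss_fn`, and `(inputs, targets)` batch; it illustrates the idea only and is not taken from any published implementation.

```python
import torch

def gradient_importance_masks(model, loss_fn, batch, top_fraction=0.01):
    """Score every parameter by |gradient| on a task batch and return
    boolean masks marking the top fraction as trainable."""
    model.zero_grad()
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Concatenate the absolute gradients of all parameters into one score vector.
    scores = torch.cat([p.grad.abs().flatten()
                        for p in model.parameters() if p.grad is not None])
    k = max(1, int(top_fraction * scores.numel()))
    threshold = torch.topk(scores, k).values.min()

    # True = keep trainable, False = freeze during fine-tuning.
    return [p.grad.abs() >= threshold if p.grad is not None
            else torch.zeros_like(p, dtype=torch.bool)
            for p in model.parameters()]
```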
In practice, PIR has been abandoned due to the number of parameters in an LLM.
3. LoRA: Low-Rank Adaptation of Parameters for LLMs, a Current Method
From transforming a high-rank matrix to a low-rank approximation.
Before discussing LoRA, I go through a brief refresher on methods for reducing a matrix to a low-rank approximation. Please skip this section if you don't feel you need it.
A weight matrix can be represented as a product of smaller matrices using a technique called matrix decomposition. There are several different types of matrix decomposition, but the most common are singular value decomposition (SVD), principal component analysis (PCA), and low-rank approximation.
SVD (singular value decomposition) decomposes a matrix into three smaller matrices: a matrix of left singular vectors (U), a diagonal matrix of singular values (Σ), and a matrix of right singular vectors (Vᵀ). The singular values in the central diagonal matrix represent the magnitudes of the principal components, the left singular vectors represent the directions of the principal components, and the right singular vectors represent the weights of the principal components.
PCA (principal component analysis) is closely related to SVD: it amounts to applying SVD to mean-centered data, so the principal components are the directions of greatest variance. PCA is used to reduce the dimensionality of a dataset by projecting the data onto the first few principal components.
Low-rank approximation is a technique for approximating a matrix as a product of smaller matrices with a lower rank. This can be done with a variety of methods, such as truncated SVD, principal component analysis, and matrix factorization; a minimal sketch follows.
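As a refresher only, here is a small NumPy sketch of a truncated-SVD low-rank approximation; the matrix size and rank are arbitrary illustrations.

```python
import numpy as np

def low_rank_approx(W, r):
    """Best rank-r approximation of W via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r, :]

W = np.random.randn(512, 512)                         # stand-in for a weight matrix
W_r = low_rank_approx(W, r=8)
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))    # relative approximation error
```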
I cannot find anywhere in the paper which low-rank approximation was used to get their results. If I had to guess, the authors used SVD.
Back to the subject: LoRA finetuning
This particular LoRA (Low-Rank Parametrized Update Matrices) is applied to neural networks, specifically the attention layers of LLMs. LoRA principles can apply to other deep learning models that use dense layers, such as transformer layers. However, because those layers are dense (independent layers of parameters), not sparse, the matrices don't always approximate well to lower ranks.
“We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity.” (https://arxiv.org/abs/2106.09685)
Note: The transformer architecture consists of an encoder and a decoder. The encoder takes the input data and transforms it into a sequence of hidden states. The decoder then takes the hidden states and generates the output data. Each encoder and decoder layer is made up of attention modules and MLP (feed-forward) modules: the attention modules learn the relationships between tokens, while the MLP modules further transform each token's representation.
Note: When I say a transformer "consists of an encoder and a decoder", I am referring to the internals of the LLM, not the interactive chat mode.
The premise of LoRA is that, during fine-tuning, the updates to the weights of the attention layers of the previously trained LLM can be represented with low-rank matrices added to the original weights.
The authors make the assumption that the information necessary to fine-tune for a task can be stored in a much smaller number of parameters; a minimal sketch of such a low-rank update follows.
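Here is a minimal PyTorch sketch of a LoRA-style low-rank update around a frozen linear layer; the rank and scaling values are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = W0 x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                     # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only `A` and `B` are trained, so the number of trainable parameters grows with the rank `r`, not with the size of the original weight matrix.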
The authors decide to apply LoRA to the attention weights and freeze the other 173.4 billion parameters in the MLP modules. Using the authors' configuration, they start off by targeting approximately 0.01, or about 1%, of GPT-3's parameters.
Note: GPT-3 has 1,658,880,000 (about 1.66 billion) attention parameters spread among its 96 transformer layers; each attention layer has 96 attention heads, and each attention head works over 128 dimensions.
The large number of attention parameters in GPT-3 is one of the reasons why it is able to generate human-quality text. The attention layers allow the model to learn long-range dependencies in the input data, which allows it to generate output that is both grammatically correct and semantically meaningful.
However, the large number of attention parameters also makes GPT-3 a very computationally expensive model to train and run. Training GPT-3 is estimated to have taken the equivalent of 355 GPU-years and roughly $4.6 million of compute, over hundreds of billions of tokens of text from the web.
If you only target parameters in the attention layers, fine-tuning GPT-3 on your own task data should cost roughly 1% of that, on the order of $46 thousand.
The selection of parameters through low-rank matrices is performed automatically. LoRA looks like THE answer to fine-tuning; unfortunately, it falls short in a number of situations.
Implementations of LoRA include:
GitHub: microsoft/LoRA, code for loralib, an implementation of “LoRA: Low-Rank Adaptation of Large…”. This repo contains the source code of the Python package loralib and several examples of how to integrate it with… (github.com)
GitHub: cloneofsimo/lora, using low-rank adaptation to quickly fine-tune diffusion models, e.g. using LoRA to fine-tune on an illustration dataset: $W = W_0 + \alpha \Delta W$, where $\alpha$ is the merging ratio… (github.com)
Quantization
Quantization is not a fine-tuning method but it is included in QLoRA and other fine-tuning methods (PEFT).
Quantization is a technique that can be used to reduce the size of an LLM. Quantization reduces the precision of the model's parameters, so they take less space.
Quantization usually reduces the accuracy of an LLM. The trick is to reduce the precision of parameters and activations, and even eliminate some parameters, while losing only a small amount of the accuracy of the LLM's responses.
If the LLM is much smaller, it can fit on smaller devices, such as your laptop or even mobile phones.
In general, how does quantization work?
Suppose we have an LLM with 100 billion parameters, each stored as a 32-bit floating-point number.
Imagine replacing these floating-point numbers, the parameters, with 8-bit integers. We immediately reduce the size by a factor of 4 (in this example, from roughly 400 GB to roughly 100 GB).
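As a toy illustration of the idea (not how production quantizers such as bitsandbytes work internally), here is a symmetric int8 quantization sketch in NumPy:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 -> int8 plus one scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)         # pretend these are LLM weights
q, scale = quantize_int8(w)
print(w.nbytes / q.nbytes)                           # 4x smaller
print(np.abs(w - dequantize(q, scale)).max())        # worst-case quantization error
```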
The authors of this paper report that they were able to quantize models ranging in size from 410M to 52B parameters with minimal degradation in performance, and that they quantized a GPT-3-scale model with only a 0.1% loss in accuracy.
Based on this result, you can slim a trained or fine-tuned LLM by a factor of 4 (8-bit weights) or 8 (4-bit weights).
In summary, Quantization can be used to reduce the size of any LLM. Each smaller LLM can be trained and then fine-tuned by any method for a specific task.
Quantization is a major breakthrough, as it opens up the possibility of using a stack of LLMs for a wider range of tasks on standard computing platforms.
Implementations of Quantization include bitsandbytes (used by QLoRA below), GPTQ, and the GGML format used by llama.cpp.
I will continue to follow quantization and distillation; you can expect both to remain active and significant fields of further development.
4. QLoRA: Efficient Finetuning of Quantized LLMs [Submitted on 23 May 2023]
The paper and implementation put forth a new method, QLoRA, that builds on the shoulders of LoRA and quantization.
“QLoRA uses bitsandbytes for quantization and is integrated with Huggingface's PEFT (ed. LoRA) and transformers libraries.” (University of Washington's UW NLP group)
We are seeing a new generation of fine-tuning methods built from open-source API (or package) building blocks.
The following is a first-blush understanding of the three key features of QLoRA's new quantization method:
1. 4-bit NormalFloat (NF4):
Recall quantization is a process in which we take a selected set of parameters in an LLM, such as 32-bit floating-point numbers, and transform them into a smaller set of output values such as 4-bit integers.
“Quantile Quantization” is fairly common; NumPy, for example, has a quantile function for computing the bin edges. In “Quantile Quantization”, we divide the data into bins in such a way that each bin contains an equal number of values. In the case of neural network weights, we'd divide the range of weight values into bins, each containing an equal number of weights.
However, there are some limitations to “Quantile Quantization”. The major one is that estimating the quantiles can be computationally expensive. A toy sketch of the binning idea follows.
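Here is a small NumPy sketch of plain quantile quantization with 16 levels (4 bits per weight); it illustrates the binning idea only, not the QLoRA implementation.

```python
import numpy as np

def quantile_quantize(weights, n_bins=16):
    """Quantile quantization: bin edges chosen so each bin holds an equal share of weights."""
    edges = np.quantile(weights, np.linspace(0.0, 1.0, n_bins + 1))
    centers = (edges[:-1] + edges[1:]) / 2           # one representative value per bin
    codes = np.digitize(weights, edges[1:-1]).astype(np.uint8)
    return codes, centers                            # dequantize with centers[codes]

w = np.random.randn(10_000).astype(np.float32)
codes, centers = quantile_quantize(w)                # 16 levels = 4 bits per weight
print(np.abs(w - centers[codes]).mean())             # mean quantization error
```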
The paper's solution to this limitation is the NormalFloat (NF4) data type, which builds on the quantile quantization concept. This technique is optimal when the input tensors come from a distribution that is fixed up to a constant (such as a zero-centered normal distribution). In such cases, all input tensors share the same quantiles, which makes them cheap to estimate.
The paper also takes advantage of the fact that most pre-trained neural networks have weights that follow a zero-centered normal distribution. By scaling these weights, we can transform them to fit a fixed distribution.
Note: The assumption that pre-trained neural network weights follow a zero-centered normal distribution makes QLoRA's NF4 less suitable for models whose weights do not follow a normal distribution; a heavily skewed distribution would be an example.
After transforming the weights, we can create the NF4 data type. With NF4, the authors realize a factor-of-eight reduction relative to 32-bit floating-point weights.
2. Double quantization
The paper introduces a second technique called “Double Quantization”. The idea is to quantize the quantization constants themselves, that is, to quantize the per-block scaling constants a second time. In the paper's configuration, one 32-bit constant per block of 64 weights adds 0.5 bits of overhead per parameter; quantizing those constants to 8 bits, with a second-level constant shared across 256 blocks, cuts the overhead to roughly 0.127 bits per parameter.
3. Paged optimizers
To efficiently handle memory spikes during training, QLoRA employs a method called paged optimizers. Memory spikes occur when the model needs to access a large amount of data simultaneously, for example while processing a long sequence. Paged optimizers use NVIDIA unified memory to automatically page optimizer states between GPU and CPU memory: when a spike would otherwise cause an out-of-memory error, the optimizer states are evicted to CPU RAM and paged back onto the GPU when needed. This prevents memory depletion and ensures smooth execution.
I have tried to provide an accurate summary of the QLoRA paper.
Each of QLoRA's three features involves some intricate details, like the specific equations used to estimate the quantiles and create the NF4 data type, and the considerations in Double Quantization to avoid performance degradation.
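For orientation, here is a minimal sketch of how these features surface as flags in the HuggingFace transformers/bitsandbytes/PEFT integration mentioned above; the model name and LoRA settings are illustrative choices, not the paper's exact configuration, and a CUDA GPU with bitsandbytes installed is assumed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

# NF4 and double quantization are exposed as flags on the 4-bit loading config.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Illustrative base model; substitute any causal LM you have access to.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", quantization_config=bnb_config, device_map="auto")

# LoRA adapters are then trained on top of the frozen 4-bit base model.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```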
QLoRA has been shown to be effective on a variety of tasks, including instruction following and chatbot performance.
“QLoRA outperforms all previously openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU.” (University of Washington's UW NLP group)
QLoRA is the latest method for fine-tuning LLMs that would otherwise be too large to fit on a single GPU, essentially reducing the memory footprint of fine-tuning to roughly 1/16 of what full-precision fine-tuning would require.
5–8. PEFT: Parameter-Efficient Fine-tuning methods
PEFT is the umbrella for targeted parameter selection methods. The HuggingFace implementation so far has the following (a minimal usage sketch follows the list):
5. Prefix Tuning: Prefix-Tuning: Optimizing Continuous Prompts for Generation, P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
6. P-Tuning: GPT Understands, Too
7. Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
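Here is a minimal sketch of configuring one of these methods through the HuggingFace peft library; the base model (GPT-2) and the number of virtual tokens are illustrative choices only.

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

# Small base model so the example runs anywhere; swap in your own checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt tuning: learn 20 virtual token embeddings while the base model stays frozen.
config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()   # a tiny fraction of the base model's parameters
```

The other methods in the list follow the same pattern, with PrefixTuningConfig for Prefix Tuning or PromptEncoderConfig for P-Tuning.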
Where Are We Headed in Fine-tuning?
We will see the continued development of new fine-tuning methods; expect a new one to appear every week or month through this year and next. These methods will be more sophisticated and complex than those currently available.
We are starting to see LLM specialists: smaller, specialized LLMs that are added to an existing LLM to improve its performance on a specific task.
Can n-brain LLMs be far away, getting closer to layered sub-brains of the mammal brain?