Finetuning large language models using novel techniques (PEFT).
Srikanth Machiraju
AI Architect @ Microsoft | AI Specialist & Innovator | Published Author AI/ML Expert Crafting the Future of Artificial Intelligence with Groundbreaking Innovations and Visionary Thought Leadership
Motivation
The motivation for writing this article is two-fold. With the advent of large language and vision models, developing general-purpose AI-enabled applications has become relatively easy for enterprises. However, enterprises still face several challenges when developing customized large models.
These challenges should be considered while fine-tuning large language or vision models. This article discusses a few novel techniques for optimizing the fine-tuning process.
Introduction
Large language models like GPT are pre-trained with web-scale data. The pre-trained models are general-purpose "few-shot learners" that apply to a variety of text-generation tasks. To generate specific responses, we can use prompt engineering techniques like few-shot learning, where we share a few examples of input and expected-output pairs to guide the model in the right direction (these pairs act like hints that refresh the model's acquired knowledge). The alternative to prompt engineering is fine-tuning on curated datasets. However, both prompt engineering and fine-tuning have their own set of challenges.
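For illustration only (the examples below are hypothetical and not from the original article), a few-shot prompt simply interleaves a handful of input/output pairs before the new input:

Classify the sentiment of the review as Positive or Negative.
Review: "The battery lasts all day." Sentiment: Positive
Review: "The screen cracked within a week." Sentiment: Negative
Review: "Setup was quick and painless." Sentiment: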
Prompting is a data-efficient technique; recent research has found that a well-designed prompt can be worth hundreds of data points. However, prompt engineering is less helpful where there are dependencies among tasks and multi-step reasoning is required. This can be addressed to an extent with advanced prompting techniques like chain-of-thought and tree-of-thought prompting, but one may hit the token limit or spend considerable money in the search for good prompts.
Fine-tuning is preferred for adapting an LLM to a specific downstream task and for avoiding the cost of adding long prompts to every input. The challenge with fine-tuning is the cost of tuning itself. The GPU time and memory required to fine-tune an LLM depend on the size of the model, the number of parameters, the sequence length, and the dataset. LLMs have billions of parameters (GPT-3 has 175 billion, and the more recent GPT-4 is reported to have around 1.75 trillion). Training and deploying individually fine-tuned models for discrete downstream tasks by updating all of those parameters is therefore expensive and impractical for enterprises and start-ups. Checkpointing LLMs during fine-tuning is also a concern, since for most LLMs each checkpoint is the same size as the original model.
This is where PEFT is helpful. PEFT stands for Parameter-Efficient Fine-Tuning. The approach recommended by PEFT is to freeze the parameters of the large model and fine-tune only a small subset of (often newly added) parameters, reducing both the training and memory cost.
PEFT also mitigates another key problem associated with neural networks: catastrophic forgetting. Catastrophic forgetting occurs when a network that has specialized in one task loses that ability after being trained on a second task.
PEFT is applied in environments where there are resource constraints or not enough data to train on. There are several approaches to implementing PEFT: LoRA, QLoRA, Adapters, Prompt Tuning, Prefix Tuning, and P-Tuning, to name a few. Let us look at some of these methods.
LoRA: Low-Rank Adaptation of large language models.
The authors of LoRA observe that the weight updates made when adapting a large language model have a low intrinsic rank. Hence, they propose to freeze the pre-trained weights and add trainable low-rank weight matrices to the transformer layers. This significantly reduces the number of trainable parameters and the memory requirements, with no additional latency during inference. For GPT-3, the number of trainable parameters is reduced by roughly 10,000 times and GPU memory requirements by 3 times.
In the equation below, which describes the forward pass of a LoRA-adapted layer, W0 is the pre-trained weight matrix that stays frozen, while B and A are the low-rank matrices that are fine-tuned. B and A have dimensions n x r and r x n respectively, where r is the rank.
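Reconstructed from the description above, the modified forward pass can be written as:

h = W_0 x + \Delta W x = W_0 x + B A x, \quad B \in \mathbb{R}^{n \times r}, \; A \in \mathbb{R}^{r \times n}, \; r \ll n

Only B and A receive gradient updates; W_0 stays fixed, and because r is much smaller than n, the number of trainable parameters shrinks dramatically.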
The rank value in the LoRA configuration plays a key role in the optimization process: the smaller the rank, the fewer weights there are to tune and the greater the savings.
Little evidence of model performance loss has been reported at the time of this writing. Anyscale has compared LoRA against baseline and full-parameter fine-tuning and observed some performance loss on specialized tasks like mathematical reasoning, but none on the others.
This approach can be applied to any transformer model in general; reference implementations are available in libraries such as Hugging Face peft.
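For instance, a minimal sketch using the peft library might look like the following (the base model, target modules, and hyperparameter values are illustrative placeholders, not prescriptions from this article):

# Minimal LoRA fine-tuning setup; model name and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank matrices B and A
    lora_alpha=32,              # scaling factor applied to the BA update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers to adapt (model-specific)
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable

The wrapped model can then be trained with a standard training loop; only the injected low-rank matrices receive gradients and optimizer state.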
IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
Using adapters is straightforward: the pre-trained weights of the model are frozen, and learned vectors are introduced in the attention and feed-forward layers of the transformer. During training, only the weights of these adapter vectors are updated. Compared to LoRA, the number of trainable parameters is even smaller. Another key advantage of adapters is portability: they can be reused across multiple downstream tasks.
This approach has proven to be very memory-efficient and consumes fewer resources for task-specific fine-tuning, with no impact on model performance or inference latency.
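As a rough sketch of the idea (not from the original article), the learned vectors can be thought of as element-wise rescaling factors applied on top of frozen projections; the module and dimensions below are hypothetical:

# Conceptual sketch of IA3-style rescaling in plain PyTorch; shapes are illustrative.
import torch
import torch.nn as nn

class IA3Scaling(nn.Module):
    def __init__(self, base_layer: nn.Linear):
        super().__init__()
        self.base_layer = base_layer
        for p in self.base_layer.parameters():
            p.requires_grad = False                                      # freeze pre-trained weights
        self.scale = nn.Parameter(torch.ones(base_layer.out_features))   # learned rescaling vector

    def forward(self, x):
        # Amplify or inhibit each activation channel of the frozen projection.
        return self.base_layer(x) * self.scale

# Example: wrap a (hypothetical) value projection of an attention block.
value_proj = nn.Linear(768, 768)
adapted_value_proj = IA3Scaling(value_proj)

Libraries such as Hugging Face peft also provide a configuration-driven IA3 implementation along the same lines.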
Prompt Tuning, Prefix Tuning, and P-Tuning.
Prompt tuning, prefix tuning, and P-tuning are similar concepts and are fundamentally different from the well-known practice of prompt engineering. In prompt engineering, we guide LLMs to generate task-specific responses by providing examples (few-shot learning) called prompts. In general, prompts can be broadly classified into hard prompts and soft prompts. The human-written prompts we all know are called hard prompts. In contrast, soft prompts are tunable embeddings that are added to the inputs.
In research conducted by Google, soft prompts outperformed hard prompts. Depending on how and where these tunable embeddings are added to the input, different techniques have evolved. One contrasting difference from the techniques explained above is that the tunable weights are attached to the inputs rather than to the model layers.
The common idea behind all these different techniques is that the model weights are frozen and only the soft prompt tokens are fine-tuned. Let us look at each of these techniques.
Link to the P-Tuning v2 paper, here.
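As a hedged illustration (not part of the original article), the Hugging Face peft library exposes soft-prompt methods through configuration classes; the base model and the number of virtual tokens below are placeholder assumptions:

# Minimal prompt-tuning sketch; the base model and soft-prompt length are illustrative.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # length of the trainable soft prompt prepended to every input
)

model = get_peft_model(base_model, prompt_config)
model.print_trainable_parameters()  # only the soft-prompt embeddings are trainable

Prefix tuning and P-tuning follow the same pattern through their respective configuration classes (PrefixTuningConfig and PromptEncoderConfig).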
Prompt tuning is not just applicable to large language models; it can also be considered for other transformer-based models, such as vision and vision-language models that work on sequential data. Soft prompts can be a sequence of text tokens or blocks of pixels.
Deep learning models lack interpretability, and this extends to prompt-tuning techniques. Even when good soft prompts are found that generate better task-specific responses than hard prompts, their embeddings are not human-readable and therefore cannot be explained.
Summary
In summary, pre-trained transformer models are composed of billions of parameters. They can be used without fine-tuning for a majority of tasks like next-word prediction, text analysis, and classification. Prompt engineering can guide the transformer models to generate specific responses by providing examples of input and output pairs; in this approach, the weights of the model are not changed. However, it has certain drawbacks, such as token-length limits and the effort required to identify good prompts. Fine-tuning reduces this effort by tuning the weights of the model on customized datasets, which avoids adding large prompts to every request. But fine-tuning needs a lot of data, compute, and memory to train.
PEFT is an approach that can be considered while fine-tuning large models; it allows you to train models faster with fewer compute resources and a smaller memory footprint, with very minimal (almost negligible) impact on model quality.
Research around fine-tuning of large models is rapidly evolving, and new techniques continue to appear. Fine-tuning models for task-specific and multi-task learning is going to become more effective with fewer compute/memory requirements and less data. Extracting the right knowledge from a pre-trained model with optimized resources is the overall goal of these approaches.