Finetuning large language models using novel techniques (PEFT).
Image source: https://pin.it/2Y6Gm55


Motivation

The motivation for writing this article is two-fold. With the advent of large language and vision models, developing general-purpose AI-enabled applications has become relatively easy for enterprises. However, enterprises face two main challenges when developing customized large models.

  1. Cost of customization: Large models are pre-trained with web-scale data. The owners of these models take measures to clean the data and fine-tune the models using RL and other optimization techniques. Yet the models can still be biased, hallucinate, and may not generate task-specific responses when applied to specific enterprise scenarios. This necessitates fine-tuning with custom datasets. That said, the cost of fine-tuning large models on cloud resources is considerably high.
  2. Availability of data: It is a known fact that fine-tuning only works well when cleaned data are abundant. Unfortunately, when fine-tuning a model for a specific task (say, a generative model fine-tuned to answer queries on HR policies in a non-toxic, unbiased fashion), the amount of data available to the model is nowhere near web-scale.

The above challenges should be considered while fine-tuning large language or vision models. This article discusses a few novel techniques for optimizing the fine-tuning process.


Introduction

Large language models like GPT are pre-trained with web-scale data. The pre-trained models are general-purpose "few-shot learners" that apply to a variety of text-generation tasks. To generate specific responses, we can use prompt engineering techniques such as few-shot learning, where we share a few examples of input and expected output pairs to guide the model in the right direction (these pairs act as hints that help the model recall its acquired knowledge). The alternative to prompt engineering is finetuning with curated datasets. However, both prompt engineering and finetuning come with their own set of challenges.
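To illustrate few-shot prompting, here is a small, made-up example (the task and review texts are invented for demonstration): the labelled pairs are simply concatenated ahead of the new input, and no model weights change.

# An illustrative few-shot prompt: the input/output pairs act as hints that
# steer the model toward the desired task and format without changing any weights.
few_shot_prompt = """Classify the sentiment of each employee review.

Review: "The onboarding process was smooth and quick."
Sentiment: positive

Review: "I waited two weeks and never got a reply from HR."
Sentiment: negative

Review: "The new leave policy is confusing and poorly documented."
Sentiment:"""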

Prompting is a data-efficient technique; recent research has found that a good prompt can be equivalent to hundreds of data points. Prompt engineering is less helpful where there are dependencies among tasks and multi-step reasoning is required. This can be addressed to an extent with advanced prompt engineering techniques like chain-of-thought prompting and tree-of-thought prompting. However, one may hit the token limit or spend more money in the process of searching for good prompts.

Finetuning is preferred for adapting an LLM to a specific downstream task and for avoiding the cost of adding long prompts to every input. The challenge with fine-tuning is the cost of the tuning itself. The amount of GPU time and memory required to fine-tune an LLM depends on the size of the model, the number of parameters, the sequence length, and the dataset. LLMs have billions of parameters (GPT-3 has 175 billion, and GPT-4 is reported to have around 1.75 trillion). Hence, training and deploying an individually fine-tuned model for each discrete downstream task by tuning all of those parameters is expensive and impractical for enterprises and start-ups. Checkpointing LLMs during the finetuning process is also a concern, since for most LLMs each checkpoint is the same size as the original model.

Image showing the difference between pre-trained and finetuned models (Image by Author)

This is where PEFT is helpful. PEFT stands for Parameter-Efficient Fine-Tuning. The approach recommended by PEFT is to freeze the parameters of the large model and finetune only a small number of additional or selected parameters, which reduces the training and memory cost.

PEFT also mitigates another key problem associated with neural networks called catastrophic forgetting, which occurs when a network specialized in one task forgets how to perform it after being trained on a second task.

PEFT is applied in environments where there are resource constraints or not enough data to train all parameters. There are several approaches to implementing PEFT: LoRA, QLoRA, Adapters, Prompt Tuning, Prefix Tuning, and P-Tuning, to name a few. Let us look at some of these methods.


LoRA: Low-Rank Adaptation of large language models.

The authors of LoRA hypothesize that the weight updates applied to a large language model during adaptation have a low intrinsic rank. Hence, they propose to freeze the pre-trained weights and add trainable low-rank weight matrices to the layers of the transformer. This significantly reduces the number of trainable parameters and the memory requirements, with no additional latency during inference. For GPT-3, the number of trainable parameters is reduced by roughly 10,000 times and the GPU memory requirement by 3 times.

In the equation below, which describes the forward pass through a LoRA-adapted layer, W0 is the frozen pre-trained weight matrix, and B and A are the low-rank matrices that are fine-tuned. Their dimensions are n × r and r × n respectively, where r is the rank and r is much smaller than n:

h = W0 x + ΔW x = W0 x + B A x

Forward pass in LoRA.

The rank value r in the LoRA configuration plays a key role in the optimization: the smaller the rank, the fewer weights there are to tune and the greater the savings.
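To make the arithmetic concrete, here is a small sketch with toy, assumed dimensions (a 4096-wide layer and rank 8) showing the LoRA forward pass and the reduction in trainable weights:

# A small numeric sketch of the LoRA forward pass with illustrative dimensions.
# W0 is frozen; only the low-rank matrices B (n x r) and A (r x n) are trained.
import numpy as np

n, r = 4096, 8                     # hidden size and LoRA rank (toy values)
W0 = np.random.randn(n, n)         # frozen pre-trained weight matrix
B = np.zeros((n, r))               # trainable, initialized to zero so BA starts at 0
A = np.random.randn(r, n) * 0.01   # trainable, small random initialization

x = np.random.randn(n)
h = W0 @ x + B @ (A @ x)           # forward pass: h = W0 x + B A x

full_params = n * n                # ~16.8M weights if W0 itself were tuned
lora_params = n * r + r * n        # ~65K trainable weights with r = 8
print(f"LoRA trainable params: {lora_params:,} vs full fine-tuning: {full_params:,}")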

Little evidence of model performance loss had been reported at the time of this writing. Anyscale conducted an analysis comparing LoRA against baseline, full-parameter training; they observed some performance loss on specialized tasks like mathematical reasoning, but none on the other tasks.

This approach can be applied to any transformer model. Here are a few reference implementations.

Finetuning benefits using LoRA for the GPT-2 model.

  • Finetuning LLMs using the Hugging Face PEFT library can be found here.

QLoRA is an extension of LoRA that applies quantization to the model weights. The proposal is to reduce the precision of the model weights from 16-bit to 4-bit NormalFloat (a new data type), which significantly reduces the memory footprint without a major sacrifice in model performance. A reference implementation of QLoRA can be found here.
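As a rough illustration of how this looks in practice, below is a minimal sketch using the Hugging Face PEFT library, with the frozen base model loaded in 4-bit for a QLoRA-style setup; the model name, rank, and target modules are illustrative assumptions rather than recommendations.

# Minimal sketch: LoRA fine-tuning with the Hugging Face PEFT library, loading
# the frozen base model in 4-bit NormalFloat for a QLoRA-style setup.
# The model name, rank, and target modules below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "gpt2", quantization_config=bnb_config
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices B and A
    lora_alpha=16,              # scaling factor applied to the BA update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection; varies per model
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights is trainable

The resulting model can then be trained as usual, and only the small adapter weights need to be saved at each checkpoint, which also addresses the checkpoint-size concern mentioned earlier.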


IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

Using adapters is straightforward: the pre-trained weights of the model are frozen, and learned vectors are introduced in the attention and feed-forward transformer layers, as shown in the image. During the training process, only the weights of these adapter vectors are updated. Compared to LoRA, the number of trainable parameters is even smaller. Another key advantage of adapters is that they are portable, so they can be reused across multiple downstream tasks.

An image showing the infusion of adapters into a transformer block.

This has proven to be very memory efficient and to consume fewer resources for task-specific finetuning, with no impact on model performance or inference latency.
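For reference, here is a minimal sketch of an IA3 setup with the PEFT library; the module names are illustrative assumptions and depend on the model architecture.

# A minimal sketch of IA3 fine-tuning via the PEFT library. The base weights are
# frozen; only the learned rescaling vectors are trained. Module names below are
# illustrative assumptions for GPT-2 and differ per architecture.
from transformers import AutoModelForCausalLM
from peft import IA3Config, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["c_attn", "mlp.c_proj"],  # layers that receive learned vectors
    feedforward_modules=["mlp.c_proj"],       # the subset in the feed-forward block
)

model = get_peft_model(base_model, ia3_config)
model.print_trainable_parameters()  # typically even fewer trainable weights than LoRA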


Prompt Tuning, Prefix Tuning, and P-Tuning.

Prompt tuning, prefix tuning, and P-tuning are similar concepts and are fundamentally different from the well-known concept of prompt engineering. In prompt engineering, we guide LLMs to generate task-specific responses by providing examples (few-shot learning) called prompts. In general, prompts can be broadly classified into hard prompts and soft prompts. The human-written prompts that we all know are called hard prompts. In contrast, soft prompts are tunable embeddings that are added to the inputs.

In research conducted by Google, soft prompts outperformed hard prompts. Depending on how and where these tunable embeddings are added to the input, different techniques have evolved. One contrasting difference between these and the other techniques explained above is that the tunable weights are attached to the inputs rather than to the model's internal layers.

The common idea behind all these different techniques is that the model weights are frozen and only the soft prompt tokens are fine-tuned. Let us look at each of these techniques.

Table showing the difference between various prompt tuning techniques. (Image by Author)

Link to the P-Tuning v2 paper here.

Prompt tuning is not just applicable to large language models; it can also be considered for other transformer-based models, such as vision and vision-language models, that work on sequential data. Soft prompts can be a sequence of text tokens or blocks of pixels.
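As a small illustration, here is a sketch of soft prompt tuning with the PEFT library; the model name, number of virtual tokens, and initialization text are assumptions chosen for the example.

# A minimal sketch of prompt tuning with the PEFT library. The base model is
# frozen; only the soft prompt embeddings (virtual tokens) are trained.
# The model name, token count, and init text are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                     # length of the tunable soft prompt
    prompt_tuning_init=PromptTuningInit.TEXT,  # initialize from a hard prompt
    prompt_tuning_init_text="Answer the HR policy question in a neutral, unbiased tone:",
    tokenizer_name_or_path="gpt2",
)

model = get_peft_model(base_model, prompt_config)
model.print_trainable_parameters()  # only the soft prompt embeddings are trainable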

Deep learning models lack interpretability, and this extends to prompt tuning techniques. When good soft prompts are found that generate better task-specific responses than hard prompts, the embeddings of those soft prompts are not in a human-readable format and hence cannot be explained.


Summary

In summary, pre-trained transformer models are composed of billions of parameters. They can be used without finetuning for a majority of tasks like next-word prediction, text analysis, and classification. Prompt engineering can guide the transformer models to generate specific responses by providing examples of input and output pairs; in this approach, the weights of the models are not changed. However, it has certain drawbacks, such as token-length limits and the effort required to identify good prompts. Fine-tuning reduces that effort by tuning the weights of the model with customized datasets, which avoids adding large prompts to every request. But fine-tuning needs a lot of data, compute, and memory resources to train.

PEFT is an approach to consider when fine-tuning large models: it allows you to train models faster with fewer compute resources and a smaller memory footprint, with a very minimal (almost negligible) impact on model quality.

Research around the fine-tuning of large models is rapidly evolving.

Fine-tuning models for task-specific learning and multi-task learning is going to become more effective with fewer compute/memory requirements and data. Extracting the right knowledge from a pre-trained model with optimized resources is the overall goal of these approaches.

