Fine-Tuning of Large Language Models (Instruction Finetuning & Parameter-Efficient Finetuning)

A fine-tuned LLM is much more powerful than the base LLM. Let's see what finetuning of an LLM is and how it works.

In this article I shed light on instruction finetuning and the Parameter-Efficient Finetuning method LoRA.


Prerequisites :

For this blog I am assuming that you are already familiar with the Transformer architecture, Generative AI, Large Language Models, and Prompt Engineering.


Cost of Training LLMs from scratch :

  • Large Language Models (LLMs) are trained on general-purpose text data for general-purpose tasks like text generation, summarization, question answering, code generation, translation, etc.
  • These LLMs are trained on enormous datasets, in the range of terabytes, with hundreds of billions of trainable parameters. Training data and models of this size requires thousands of advanced GPUs, and the cost runs into tens or hundreds of millions of dollars.
  • GPT-3 was trained on 45 TB of data (around 500B tokens) with 175B model parameters, using thousands of advanced GPUs for several weeks (around 34 days as per some sources) back in 2020, and it cost around 4.6 million dollars for computation alone.
  • The size and compute requirements of such LLMs keep growing as progress is made in the field.
  • Another important point is that this huge amount of computation directly leads to carbon emissions.


Why Finetuning of pretrained LLMs :

  • Even though LLMs are trained on massive amounts of data for a variety of tasks like translation, text generation, summarization, QnA, code generation and inference, they may not give relevant or consistent results on domain-specific data or tasks.
  • Some complex domains such as medical, BFSI, law, organization/government policies, and highly technical tasks need to be compliant with standards. Baseline LLMs may fail on these domain tasks, or they may hallucinate and give wrong or inconsistent outputs.
  • Finetuning of pretrained LLMs on domain- or task-specific data allows the LLM to specialize in a specific task while keeping its general intelligence intact.
  • When we finetune a model on task-specific data, it updates or adjusts the LLM parameters in order to become an expert on the given task. It's just like a cricket batsman practicing a specific shot to master it, while already knowing how to bat in general.


Advantages of Finetuning of LLM :

  • A fine-tuned model will give more consistent, relevant and accurate results for the specified task.
  • RAG can be minimized or eliminated to get faster results, as the LLM already has the domain knowledge. This reduces cost, because RAG or additional context in the prompt increases the token count, and the number of input tokens is directly proportional to cost. A finetuned model may not require additional context or a RAG pipeline, which minimizes overall cost.
  • A finetuned model can be stored in local or on-premise storage, making it more private and secure against data leakage.
  • Minimizing RAG ultimately improves the response latency of the LLM, as the number of input tokens reduces and the LLM has already learned the domain knowledge.


Challenges in Finetuning of LLM :

  • Supervised or labelled data preparation can be a time-consuming task.
  • A one-time compute cost and resources are required, depending on the amount of finetuning to be done and the data size.
  • More technical expertise is required to make finetuning successful.



Difference between Prompt Engineering, RAG & Finetuning :

Prompt Engineering :

  • Prompt engineering is a technique for communicating efficiently with an LLM.
  • In prompt engineering we pass an efficient and well-structured prompt or input query to the LLM in order to get relevant and desired output.


Retrieval Augmented Generation (RAG) :

  • In the RAG framework we retrieve relevant context from an external source and then use it along with the query as a prompt to get the result.
  • This allows us to connect to external, additional and up-to-date data sources to provide more context to the input prompt.



Finetuning of pretrained LLMs :

  • In finetuning we retrain the baseline LLM on domain- or task-specific data to adjust the model parameters.
  • A finetuned model will give more relevant output without using additional context in the input prompt.

In Prompt Engineering and RAG, the input prompt is larger, as we need to pass additional contextual information or examples in the prompt to get the desired and optimal results. This increases cost, as LLM pricing is based on the number of input and output tokens.
For domain-specific tasks with baseline LLMs, we may need to use long prompts or RAG, which ultimately increases the input prompt size. This leads to more cost in the long run, so finetuning the LLM is a good option to teach the model with a one-time investment, since a finetuned model may not need heavy additional information along with the input prompt.


Types of Finetuning LLMs:

Instruction Finetuning :

  • In instruction finetuning, we provide labelled or supervised instruction data (prompt-completion pairs) to the model and retrain/finetune the model on it to alter the behavior and knowledge of the LLM as per our task and domain.

  • Instruction or prompt-completion data should follow a proper template.

For example :

### Instruction: 
Classify sentiment of given review as Positive, Neutral or Negative:

### Input:
"I enjoyed the movie, it was fantastic!"

### Response:
Sentiment: Positive        
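
Below is a minimal Python sketch (my own illustration, not tied to any specific library) of how raw fields could be rendered into the template above; the function name and argument names are hypothetical:

def format_example(instruction: str, input_text: str, response: str) -> str:
    # Render one prompt-completion pair into the instruction template shown above.
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Response:\n{response}"
    )

example = format_example(
    instruction="Classify sentiment of given review as Positive, Neutral or Negative:",
    input_text="I enjoyed the movie, it was fantastic!",
    response="Sentiment: Positive",
)
print(example)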


Finetuning Process :

  • First, we prepare task/domain-specific instruction data to finetune our model.
  • Then we pass it to the LLM so it can learn from this new data.
  • At each epoch, a loss is calculated between the LLM completion and the ground-truth label/output. To minimize this loss, the weights or parameters are updated at each epoch, and after a certain number of epochs we get a finetuned model which is an expert on the specified task or domain.
  • The formula for the weight update is,

W' = W + ΔW 
Where, W = original pretrained weight matrix  
ΔW = Weight Update Matrix        

  • We can finetune the model on a single task or multiple tasks at a time using relevant instruction data. A minimal training sketch is shown below.
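
To make the loop concrete, here is a minimal, hedged sketch of instruction finetuning using the Hugging Face transformers library with plain PyTorch; the checkpoint name ("gpt2"), the single hard-coded example and the hyperparameters are placeholders, not a recipe:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One formatted prompt-completion pair following the template above.
texts = [
    "### Instruction:\nClassify sentiment of given review as Positive, Neutral or Negative:\n\n"
    "### Input:\nI enjoyed the movie, it was fantastic!\n\n"
    "### Response:\nSentiment: Positive"
]

model.train()
for epoch in range(3):                      # a few passes over the instruction data
    for text in texts:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LM finetuning the labels are the input ids themselves;
        # the model shifts them internally and returns the cross-entropy loss.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()             # gradients of the loss w.r.t. the weights
        optimizer.step()                    # apply the update: W' = W + ΔW
        optimizer.zero_grad()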



Catastrophic forgetting :

  • Catastrophic forgetting in LLMs is the phenomenon of forgetting prior knowledge after retraining or finetuning on new data.
  • It occurs because during finetuning the model has to adjust its parameters to learn and adapt to the new data.
  • It's just like the human brain, which usually forgets the school/college syllabus after some years, as our brain constantly evolves with new things (data) on a daily basis.


How to handle catastrophic forgetting :

  • Use regularization methods like dropout and weight decay.
  • Consistently monitor the performance of the LLM on older tasks while finetuning, to avoid overfitting on the new data.
  • Use multi-task instruction data and train on it simultaneously.
  • Use Parameter-Efficient Fine-tuning (PEFT).



Parameter Efficient Fine-tuning ( PEFT ) :

  • PEFT, when compared to full finetuning, is a more efficient and effective solution for finetuning an LLM.
  • In full finetuning, the entire model with all of its weights and layers is retrained, which is computationally very expensive: all trainable parameters have to be kept in memory, which can amount to hundreds of GBs, along with gradients, optimizer states and forward activations, which need additional memory on top of the trainable parameters. This is computationally costly as well as time-consuming.
  • There is also a chance of catastrophic forgetting, as all model weights are enabled for updating.
  • PEFT, on the other hand, trains only a small part of the pretrained model and keeps the rest of the model weights frozen and untouched, which makes it computationally more efficient and also helps avoid catastrophic forgetting.
  • There are several PEFT methods, which are categorized into 3 groups.
  • 1) Selective : a small subset of the initial LLM parameters is selected for finetuning while the rest of the parameters are frozen.
  • 2) Reparameterization : model weight updates are reparameterized using a low-rank representation, as in LoRA.
  • 3) Additive : additional layers of parameters, also called adapters, are added on top of the baseline model, and only those new parameters are trained during finetuning while the baseline model parameters stay frozen.


Low Rank Adaptation ( LoRA ) :

  • In LoRA, instead of updating the original weights directly, we track the changes in the weights.
  • This is done by decomposing the weight update matrix into low-rank (low-dimensional) matrices.
  • In full finetuning, the weight update is done using the formula,

W' = W + ΔW
Where, ΔW = α (−∇W L)
α = learning rate, ∇W L = gradient of the loss with respect to W

  • We can also keep the weight update matrix separate and compute the output using the alternative formula,

h = Wx + ΔWx
Where, h = output, x = input
W = original pretrained weight matrix
ΔW = Weight Update Matrix

  • In full finetuning, W and ΔW are full-rank matrices.
  • In LoRA, the idea is to use a low intrinsic dimension which can efficiently represent the new data without losing important information.
  • This is done by decomposing the weight update matrix for the adapted task into lower-rank (lower-dimensional) matrices.
  • It is based on the fact that multiplying two small matrices can reproduce a larger matrix.
  • For example, if we multiply a 5x1 matrix by a 1x5 matrix, we get a 5x5 matrix.

Here the rank of the decomposed matrices is 1.
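
To make the saving concrete, here is a tiny NumPy sketch (illustrative numbers only, not taken from the LoRA paper): a rank-1 factorization of a 5x5 update stores 10 numbers instead of 25, and the gap grows quickly at realistic layer sizes.

import numpy as np

# Rank-1 factorization of a 5x5 weight update: B is 5x1, A is 1x5.
B = np.random.randn(5, 1)
A = np.random.randn(1, 5)
delta_W = B @ A                              # 5x5 matrix of rank 1

print(delta_W.shape)                         # (5, 5)
print(B.size + A.size, "vs", delta_W.size)   # 10 trainable numbers vs 25

# For a hypothetical 4096x4096 layer with rank r = 8:
d, k, r = 4096, 4096, 8
full = d * k                                 # parameters updated in full finetuning
lora = r * (d + k)                           # parameters trained by LoRA
print(f"{full} vs {lora} ({full / lora:.0f}x fewer trainable parameters)")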

  • In the original LoRA setup, B is initialized to zeros and A to small random values, so the low-rank update starts at zero and is learned as training progresses.

  • B and A are the decomposed matrices with rank r.
  • Multiplying B and A gives a matrix of the same size as W.
  • So the new weights become W + B·A.
  • The rank used in LoRA decides how precisely the finetuned model fits the specific task.
  • If we use a very low rank, the model may underfit on the task-specific data; if we use a very high rank, there may be unnecessarily more parameters to train than actually needed, which decreases efficiency.
  • For small and simple data, a lower rank can be a good choice, while for huge and complex task data a higher rank can work better. In the end it's a hyperparameter, and we need to experiment to find the best value of the rank.
  • As per studies, ranks beyond 16 often give similar performance.
  • As per the LoRA paper and its authors, the LoRA method is around 3 times more memory-efficient. A configuration sketch is shown below.
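
Attaching LoRA adapters typically takes only a few lines with the Hugging Face peft library; in the sketch below, the base checkpoint ("gpt2"), the target modules and the hyperparameters (r, lora_alpha, dropout) are example assumptions, not recommendations:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder checkpoint

lora_config = LoraConfig(
    r=8,                        # rank of the decomposed matrices B and A
    lora_alpha=16,              # scaling factor applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # which layers get adapters; this is model-specific
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# Only the adapter parameters are trainable; the base weights stay frozen.
model.print_trainable_parameters()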


QLoRA ( Quantized LoRA ) :

  • QLoRA is a variant of LoRA in which the LLM is first quantized to 4-bit precision, and then LoRA adapters are applied as usual.
  • QLoRA reduces model size by reducing parameter precision.
  • QLoRA is much more computationally efficient: it requires less memory, increases speed, and can often be run on a single GPU. A short sketch follows below.
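
As a rough sketch, 4-bit loading plus LoRA adapters can look like the following with transformers, bitsandbytes and peft; the model id, quantization settings and LoRA hyperparameters are illustrative assumptions, and this path needs a CUDA GPU with bitsandbytes installed:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization settings (NF4 is the 4-bit data type used by QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training, then attach LoRA adapters as usual.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)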


Thank You !
