Continuing My AI Engineering Journey: Exploring Fine-Tuning and Quantizing Language Models
As part of my AI engineering bootcamp, I've been exploring fine-tuning LLMs, and this article is my attempt to stitch together insights from various sources. Coming from a statistics background, the concept of quantization was a bit of a puzzle for me initially. It may be the same for many of us, so I've made an effort to break these concepts down into digestible pieces.
What is Fine-tuning?
There are three primary forms of fine-tuning:
Let's use a few analogies to understand these forms of fine-tuning.
In this article we will cover Supervised Fine-Tuning (SFT), which involves adapting a pre-trained language model to perform specific tasks more effectively by training it further on a curated set of examples.
Why Should You Fine-Tune?
When Should You Use RAG vs. Fine-Tuning?
RAG is ideal when you want to add knowledge or facts. For example, the LLM was trained a while ago and you want your model to know about things happening today.
We should go for fine-tuning when we want to change the behavior of the LLM:
Also, using RAG and fine-tuning in tandem can significantly amplify your model's capabilities: the model stays factually current while also performing better across a variety of tasks and interactions.
How can we fine-tune an LLM?
First, we start with a base model, a pre-trained model that already knows a lot. We can think of it as a university student who has completed a bachelor's degree but now needs specific training tailored to a particular job or task.
Often, we focus on something called instruction-tuning, which means we specifically train the model to be good at following detailed instructions, much like training it to handle custom requests.
ProTip: A useful tip for starting your fine-tuning journey is to pick a model that has already been instruction-tuned, which ensures the model is already somewhat adept at following instructions before you further customise it for your needs.
So, by selecting an instruction-tuned model as your starting point, you make your fine-tuning process more effective.
Challenges of Fine Tuning LLMs
Making Large Language Models Manageable
What we are going to do in this article is show how to run these models on consumer hardware by leveraging an emerging technique called quantization.
Before diving into quantization, recall that a neural network has weights (its parameters) and activations (the values that propagate through the network). Both weights and activations can be quantized.
In this article, we will focus only on the weights (parameters).
Typically, the parameters of a pre-trained model are stored in 32-bit floating-point format.
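As a quick sanity check (a minimal sketch; GPT-2 is used only because it is a small, readily available checkpoint), you can inspect the dtype of a freshly loaded pre-trained model:
from transformers import AutoModelForCausalLM

# Any pre-trained checkpoint will do; weights load as 32-bit floats by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(next(model.parameters()).dtype)   # torch.float32 -> 4 bytes per parameter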
What is Quantization?
Why Quantization?
SOTA methods for model compression are:
The idea is to store the parameters at lower precision. For example, a model whose weights are stored as 32-bit floating-point numbers can be deployed with 16-bit floating-point weights instead. This halves the model size with an almost identical inference outcome.
This is achieved by converting the numerical values to a different data type.
The following shows how quantization of an input tensor X to a 2-bit data type works, as explained by Tim Dettmers.
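As a rough illustration (a simplified symmetric absmax scheme with a single scale per tensor, not the exact blocked method used in bitsandbytes), the values of X are rounded onto a small signed grid and the scale needed to undo the mapping is stored alongside:
import torch

def absmax_quantize(x, bits=2):
    # Symmetric absmax quantization: map x onto a signed `bits`-bit integer grid
    qmax = 2 ** (bits - 1) - 1            # largest positive grid value (1 for 2-bit)
    scale = x.abs().max() / qmax          # one scale for the whole tensor
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_q.to(torch.int8), scale

def dequantize(x_q, scale):
    return x_q.float() * scale            # approximate reconstruction of x

X = torch.randn(4)
X_q, scale = absmax_quantize(X, bits=2)
print(X, X_q, dequantize(X_q, scale))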
There are many recent SOTA quantization methods:
Even more recent SOTA methods for 2-bit quantization:
In brief, all of these methods are designed to make LLMs smaller and therefore faster, while minimising performance degradation.
All of them are "open-source"
Some quantization methods require calibration, which basically means running inference on a dataset and optimising the quantization parameters to minimise quantization error.
If you apply these quantization methods to models other than LLMs, you may need to adjust them.
Some of the quantization methods can be applied "Out of the box":
On Hugging Face, you can also find distributors of quantized models, such as TheBloke, who share quantized weights with the community in different formats (GGUF, GPTQ and AWQ). Most of these formats require some pre-calibration, which could be costly to run yourself.
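For example, a pre-quantized checkpoint can usually be loaded directly with transformers (a sketch; the repo id is illustrative, and GPTQ checkpoints additionally require the optimum and auto-gptq packages to be installed):
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repo id for a GPTQ-quantized checkpoint distributed by TheBloke
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)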
The following are some popularly used methods; we will see how to quantize 32-bit weights down to 16-bit weights (see the sketch after this list).
Allows for flexible use with LoRA adapters
Does not provide any inference benefits, though
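As a minimal sketch of the 32-bit to 16-bit case (plain half-precision loading via transformers, which simply halves the memory footprint without any calibration):
import torch
from transformers import AutoModelForCausalLM

# torch_dtype=torch.float16 stores the weights in 16-bit instead of the default 32-bit
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
print(next(model_fp16.parameters()).dtype)   # torch.float16 -> 2 bytes per parameter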
Fine-Tuned Quantized Models
Benefits of fine tuning a quantized model:
In this post, we are discussing the second use case, which leverages PEFT (Parameter-Efficient Fine-Tuning). PEFT significantly reduces the number of trainable parameters of a model while keeping performance comparable to full fine-tuning.
LoRA (Low-Rank Adaptation of Large Language Models) is one of the most widely adopted PEFT methods.
LoRA is inspired by the following research paper that discusses the intrinsic dimensionality of large language models. Essentially, it suggests that for fine-tuning, you don't need to adjust every single parameter in these models. Instead, you can just modify a small subset of the model's weights and still get great results for specific tasks.
How does LoRA work?
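A minimal sketch of the idea (hypothetical dimensions; the frozen pre-trained weight W receives a trainable low-rank update B·A scaled by alpha/r):
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update (illustrative only)."""
    def __init__(self, in_features, out_features, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                            # freeze the pre-trained weight W
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)    # low-rank factor A
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))          # low-rank factor B (starts at zero)
        self.scaling = alpha / r

    def forward(self, x):
        # y = W x + (alpha / r) * B A x, where only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 16 * 4096 = 131,072 trainable values vs ~16.8M in the frozen weight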
QLoRA
From LoRA to QLoRA
For a 70B foundation model, full fine-tuning would require about 840 GB of GPU memory, roughly 34 consumer GPUs, which is very expensive. With LoRA, the gradients and optimiser states become much smaller and only a few adapter weights are trained, bringing us down to roughly 17.6 bits per parameter and about 8 consumer GPUs. That is still expensive, and the largest remaining memory footprint is the weights themselves, which is exactly what QLoRA addresses.
In QLoRA, we start by compressing a transformer model into a smaller size, specifically using 4-bit quantization, and then we lock it in place. We add adapters on top of this compressed model. When fine-tuning, we only adjust these adapters, not the original model parts that are frozen.
This approach means that each parameter effectively uses only about 5.2 bits, significantly reducing the memory needed. The whole setup requires just 46 GB of GPU memory and can run on two standard consumer GPUs.
So, a model that would have needed roughly $250,000 of hardware to fine-tune can now be fine-tuned on a setup worth about $3,000. Much cheaper and more memory efficient.
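To make the arithmetic concrete, here is a back-of-the-envelope sketch using the bits-per-parameter figures above (the 96 bits for full fine-tuning assumes 16-bit weights and gradients plus 32-bit Adam optimiser states):
params = 70e9  # 70B-parameter model

def gpu_memory_gb(bits_per_param):
    # total memory in GB = parameters * bits per parameter / 8 bits per byte
    return params * bits_per_param / 8 / 1e9

print(gpu_memory_gb(96))    # full fine-tuning: ~840 GB (~34 consumer GPUs)
print(gpu_memory_gb(17.6))  # LoRA: ~154 GB (~8 consumer GPUs)
print(gpu_memory_gb(5.2))   # QLoRA: ~46 GB (2 consumer GPUs)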
Task: Fine-tuning Mistral-7B-Instruct on a summarization task
We fine-tuned the Mistral-7B-Instruct model for a summarization task using Google Colab, which provided a T4 GPU for computation.
Loading an Open LLM
We will start by loading the model Mistral-7B-Instruct-v0.2; as discussed earlier in this post, you always want to pull the instruct-tuned variant.
If you check the model card on Hugging Face, it says the tensor type is BF16, i.e., 16-bit brain floating-point precision.
Load the model
When loading the model, we are going to use a quantization strategy so that it fits on the GPU.
The idea is to take the 16-bit representation of our model and compress it down to a 4-bit representation.
So we store the model in 4-bit precision, and during inference, when computations are performed, the values are scaled back up to 16-bit precision (de-quantization).
Configure the quantization settings
# Configure how the model weights will be quantized at load time
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,   # de-quantize to 16-bit for computation
)
Load model and tokenizer
Using the transformers library from Hugging Face to load the pre-trained causal language model and its tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Setting pad_token to unk_token means the tokenizer will use the unknown token for padding
tokenizer.pad_token = tokenizer.unk_token
# Setting padding_side to "right" means padding tokens are added at the end of sequences
tokenizer.padding_side = "right"
Model construction
print(model)
We can see that we have a Mistral model with 32 decoder layers, and all of the linear layers are now in 4-bit format.
Dataset Preparation
We'll be using the samsum dataset, which contains ~16K messenger-style conversations along with their summaries.
The training split has 14,732 rows, test has 819 rows, and validation has 818 rows. For convenience, we will take just 1,000, 50, and 50 examples respectively to see fine-tuning in action.
from datasets import load_dataset
dataset_name = "samsum"
dataset = load_dataset(dataset_name)
dataset["train"] = dataset["train"].select(range(1000))
dataset["test"] = dataset["test"].select(range(50))
dataset["validation"] = dataset["validation"].select(range(50))
dataset["train"][0]
Let's look at one example of the training data, which has an id, a dialogue, and a summary:
{'id': '13818513',
'dialogue': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}
Then we define two functions, create_prompt and generate_response: the first creates a prompt given a row from the dataset, and the second generates a response given a prompt, model, and tokenizer.
def create_prompt(sample, include_response=True):
    """
    Parameters:
    - sample: dict representing a row of the dataset
    - include_response: bool

    Functionality:
    Builds the Python str `full_prompt`.
    If `include_response` is True, the summary is appended to the prompt;
    otherwise it is left out (useful for prompting and testing).

    Returns:
    - full_prompt: str
    """
    # Extract the text to be summarized from the sample dictionary
    text_to_summarize = sample['dialogue']

    # Start constructing the prompt
    full_prompt = "[INST]Provide a summary of the following text:\n\n[INPUT_TEXT_START]\n"
    full_prompt += text_to_summarize
    full_prompt += "\n[INPUT_TEXT_END]\n\n[/INST]\n\n"

    # Include the summary if include_response is True
    if include_response:
        summary = sample['summary']  # the 'summary' key holds the reference summary
        full_prompt += "SUMMARY: " + summary

    return full_prompt
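For example (assuming the dataset loaded earlier), we can inspect the formatted prompt for the first training row:
print(create_prompt(dataset["train"][0]))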
def generate_response(prompt, model, tokenizer):
    """
    Parameters:
    - prompt: str representing the formatted prompt
    - model: model object
    - tokenizer: tokenizer object

    Functionality:
    This will allow our model to generate a response to a prompt!

    Returns:
    - str response of the model
    """
    # Convert the str input into tokenized input
    encoded_input = tokenizer(prompt, return_tensors="pt")

    # Send the tokenized inputs to our GPU
    model_inputs = encoded_input.to('cuda')

    # Generate a response with the desired generation parameters
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=256,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the tokenized output back to a str
    decoded_output = tokenizer.batch_decode(generated_ids)

    # Return only the generated response (not the prompt) as output
    return decoded_output[0].split("[/INST]")[-1]
Setting up PEFT LoRA
from peft import prepare_model_for_kbit_training
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
from peft import LoraConfig, get_peft_model
# set our rank (higher value is more memory/better performance)
lora_r = 16
# set our dropout (default value)
lora_dropout = 0.1
# rule of thumb: alpha should be (lora_r * 2)
lora_alpha = 32
# construct our LoraConfig with the above hyperparameters
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(
    model,
    peft_config
)
print_trainable_parameters(model)
trainable params: 6815744 || all params: 3758886912 || trainable%: 0.18132346515244138
We notice here that less than 1% of parameters are trainable
Setting up Training
from transformers import TrainingArguments
# Configure the training arguments for the model
args = TrainingArguments(
    output_dir="mistral7binstruct_summarize",
    # num_train_epochs=5,
    max_steps=50,                    # comment out this line if you want to train in epochs
    per_device_train_batch_size=1,
    warmup_ratio=0.03,               # fraction of steps used for LR warmup (warmup_steps expects an integer count)
    logging_steps=10,
    # evaluation_strategy="epoch",
    evaluation_strategy="steps",
    eval_steps=25,                   # comment out this line if you want to evaluate at the end of each epoch
    learning_rate=2e-4,
    lr_scheduler_type='constant',
)
from trl import SFTTrainer
max_seq_length = 2048 # Maximum sequence length for model inputs
# Initialize the supervised fine-tuning trainer with the specified arguments
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,         # parameter-efficient fine-tuning configuration
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=create_prompt,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"]
)
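With the trainer configured as above, training is started with a single call (this is the step that produces the training and evaluation loss discussed below):
# Kick off supervised fine-tuning; loss is logged every `logging_steps` steps
trainer.train()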
When we fine-tune a language model, we aim to optimize its performance, which we can measure through "loss." Loss quantifies how far off a model's predictions are from the actual answers. Lower loss values indicate better model performance, showing that the model's predictions are becoming more accurate.
Now, push the model to the Hugging Face Hub.
from huggingface_hub import notebook_login
notebook_login()
trainer.push_to_hub("rajkstats/mistral-7binstruct-summary-100s")
Next, we merge the smaller LoRA adapter into the base model. Afterwards, the adapter is unloaded as it is no longer needed.
merged_model = model.merge_and_unload()
Now, let's take a look at one example to see how the model performs.
print(dataset["test"][3]["dialogue"])
Will: hey babe, what do you want for dinner tonight?
Emma: gah, don't even worry about it tonight
Will: what do you mean? everything ok?
Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry
Will: Well what time will you be home?
Emma: soon, hopefully
Will: you sure? Maybe you want me to pick you up?
Emma: no no it's alright. I'll be home soon, i'll tell you when I get home.
Will: Alright, love you.
Emma: love you too.
Let's look at the base model response:
Emma won't be home for dinner tonight, she has a problem, but she doesn't want Will to worry. She'll let him know when she's home. She'll be coming soon.
And the fine-tuned model, using the generate_response function we wrote:
generate_response(create_prompt(dataset["test"][3], include_response=False),
merged_model,
tokenizer)
Emma is not feeling well. She will be home soon. She doesn't want Will to cook anything for dinner.
We can see that the fine-tuned model performs the task better than the original un-fine-tuned model. However, there remains room for further refinement and optimization.