Continuing My AI Engineering Journey: Exploring Fine-Tuning and Quantizing Language Models
Credits: Microsoft Designer


As part of my AI engineering bootcamp, I've been exploring fine-tuning LLMs, and this article is my attempt to stitch together insights from various sources. Coming from a statistics background, the concept of quantization was a bit of a puzzle for me at first. It could be the same for many of us, so I've made an effort to break these concepts down into digestible pieces.

What is Fine-tuning?

  • Fine-tuning is the process of adapting an existing pre-trained LLM to a specific dataset and task through additional supervised training
  • Basically, you start with a pre-trained model that already knows a lot and then teach it a more particular task by training it on further task-specific examples

There are three primary forms of fine-tuning:

  • Task Training: Training the behavior of LLM response
  • Constraining I-O Schema: Training the format of LLM response
  • Language Training: Training the interpretation of new words

Let's take different analogies to understand these forms of fine-tuning

  • Task Training: Imagine you have a smart home assistant, say Google Assistant, that controls your lights at home. If you decide to host a lot of parties, you might want to train it to create the perfect party atmosphere, turning the lights down when it senses a party is starting.
  • Constraining I-O Schema: Imagine a programmer setting up a tool to only give results in a specific computer code format, such as JSON, Docker, or YAML.
  • Language Training: This is like teaching the tool new words from specific jobs like law, medicine, or banking that it didn't know before, making it smarter in those areas.

In this article we will cover Supervised Fine-tuning (SFT), which involves adapting a pre-trained language model to perform specific tasks more effectively by training it further on a special set of examples.

Why should you fine-tune?

  • Improve performance: Fine-tuning can significantly improve your model's performance. For instance, fine-tuning a model on court case documents helps it generate better legal language for lawyers.
  • Customization: Fine-tuning allows you to tailor the model to your specific needs.
  • Cost-effective: Fine-tuning is a cost-effective approach that saves time and resources by enhancing an existing model's capabilities to meet specific needs, rather than building a new one from scratch
  • Privacy: Fine-tuning enhances privacy by keeping your sensitive data within your control, reducing reliance on external systems and ensuring that confidential information remains secure in your environment.

When should you use RAG vs. fine-tuning?

RAG is ideal when you want to add knowledge or facts. For example, the LLM was trained a while ago and you want your model to know about things that are happening today.

We should go for fine-tuning when we want to change the behavior of the LLM:

  • Setting the Style, Tone, and Format: Tailor your model to align with a specific voice or style, making it suitable for branded content or particular communication standards
  • Enhancing Reliability: Improve the consistency with which the model produces the desired outputs, reducing unpredictability in its responses
  • Addressing complex prompts: Equip the model to better understand and respond to intricate or multi-step instructions that require a nuanced approach
  • Managing Edge cases: Helps the model handle unique or unusual scenarios in specific ways, enhancing its ability to deal with exceptions
  • Learning New Skills: Performing a new skill or task that’s hard to articulate in a prompt

Also, using RAG and fine-tuning in tandem can significantly amplify your model's capabilities. This powerful combination enables the model not only to stay current and factually accurate but also to excel across a variety of tasks and interactions.

How can we fine-tune an LLM?

First, we start with a base model, a pre-trained model that already knows a lot. We can think of it as a university student who has completed a bachelor's degree but now needs specific training tailored to a particular job or task.

Often, we focus on something called instruction-tuning which means we specifically train the model to be good at following detailed instructions, much like training it to handle custom requests.

Pro tip: A useful way to start your fine-tuning journey is to pick a model that has already been instruction-tuned, which ensures the model is already somewhat adept at following instructions before you further customise it for your needs.

So, by selecting an instruction-tuned model as your starting point, you make the fine-tuning process more effective.

Challenges of Fine Tuning LLMs

  • Hardware Limitations: Full fine-tuning of LLMs often requires substantial computational resources that may not be available on consumer hardware, which means individuals, or even researchers at universities, may find it impractical without access to significant compute
  • Cost of Deployment: Maintaining and deploying a separate fine-tuned model for each specific task can become costly, since each model may require its own storage and operational infrastructure, which quickly adds up

Making Large Language Models Manageable

  • When leveraging open-source models, our goal is to efficiently store, fine-tune and deploy them. However, the size of these models often poses significant challenges, and even downloading them is problematic because of their extensive storage requirements
  • Fortunately, there is a trend towards creating smaller, more efficient models. This not only reduces the storage burden but can also enhance transparency, improve accuracy and strengthen security

In this article, we will show how to run these models on consumer hardware, leveraging an emerging technique called quantization to make this possible.

Before diving into quantization, recall that a neural network has weights (its parameters) and activations (the values that propagate through the network). Both weights and activations can be quantized.

We will focus only on weights (parameters) in this article

Typically, the parameters of a pre-trained model are stored in 32-bit floating-point format.

What is Quantization?

  • Quantization refers to the process of mapping a larger set to a smaller set of values

Why Quantization?

  • Model Compression: It enables us to shrink models to a smaller size, so that anyone can run it on their own computer with little to no performance degradation.
  • Quantized models are easier to distribute. For example, Llama 2 70B takes about 280 GB of storage; stored in 4-bit precision it shrinks to roughly 40 GB, a 7x reduction (see the quick calculation after this list).
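As a rough sanity check on these numbers, here is a back-of-the-envelope estimate in Python (a sketch only; real checkpoints also carry metadata and quantization constants, which is why the quoted figure is closer to 40 GB than the raw 35 GB):

n_params = 70e9                  # Llama 2 70B
fp32_gb = n_params * 4 / 1e9     # 32-bit floats: 4 bytes per parameter
int4_gb = n_params * 0.5 / 1e9   # 4-bit weights: 0.5 bytes per parameter

print(f"fp32: ~{fp32_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")   # fp32: ~280 GB, 4-bit: ~35 GB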

SOTA methods for model compression are:

  1. Pruning: Removing layers (or weights) from the model that do not contribute much to its decisions
  2. Knowledge Distillation: Trains a smaller student model using the outputs of a larger teacher model. This process, though effective, requires substantial computing power and can be very costly, especially with larger models.
  3. Quantization

  • The idea is to store the parameters in a lower precision. For example, a model's 32-bit floating-point weights can be stored as 16-bit floating-point numbers, which halves the model size with an almost identical inference outcome.
  • This is achieved by converting numerical values to a different data type.

Here is how quantization of an input tensor X into a 2-bit data type works, as explained by Tim Dettmers:


  • Normalise X into the range [-1.0, 1.0] by dividing by its absolute maximum (absmax)
  • Find the closest value in the target data type (rounding for integers; in general, a binary search); a small sketch of these two steps follows this list
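To make these two steps concrete, here is a minimal sketch of absmax quantization and de-quantization in PyTorch (an illustration only, not any particular library's kernel; the 2-bit grid here is the symmetric set {-1, 0, 1}):

import torch

def absmax_quantize(X, n_bits=2):
    """Quantize a float tensor to a symmetric n-bit integer grid via absmax scaling."""
    # 1. Normalise X into [-1, 1] by dividing by its absolute maximum
    scale = X.abs().max()
    X_norm = X / scale
    # 2. Map each value to the closest representable level
    q_max = 2 ** (n_bits - 1) - 1
    X_q = torch.clamp(torch.round(X_norm * q_max), -q_max, q_max).to(torch.int8)
    return X_q, scale

def absmax_dequantize(X_q, scale, n_bits=2):
    """Approximately reconstruct the original tensor."""
    q_max = 2 ** (n_bits - 1) - 1
    return X_q.float() / q_max * scale

X = torch.tensor([0.9, -2.0, 0.1, 1.3])
X_q, scale = absmax_quantize(X)
print(X_q)                            # tensor([ 0, -1,  0,  1], dtype=torch.int8)
print(absmax_dequantize(X_q, scale))  # tensor([ 0., -2.,  0.,  2.]) - note the error from the coarse 2-bit grid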

There are many recent SOTA quantization methods. Some of the most recent ones target 2-bit quantization:

  • QuIP# - Jul 2023 - Tseng et al.
  • HQQ - Nov 2023 - Badri et al.
  • AQLM - Feb 2024 - Egiazarian et al.

In brief, all of these methods are designed to make LLMs smaller, and therefore faster, while minimising performance degradation.

All of them are "open-source"

Some quantization methods require calibration, which is basically running inference on a dataset and optimising the quantization parameters to minimise quantization error.

If you apply these quantization methods to other models (other than LLMs), you may need to make adjustments to the quantization methods.

Some of the quantization methods can be applied "Out of the box":

  • Linear quantization
  • LLM.int8() (only 8-bit)
  • QLoRA (only 4-bit)
  • HQQ (only 2-bit)

On Hugging Face, you can also find powerful quantized-model distributors such as TheBloke, who share quantized weights with the community in different formats (GGUF, GPTQ and AWQ). Most of these formats require some pre-calibration, which might be costly to run yourself.

The following methods are popularly used; they allow for flexible use with LoRA adapters, although they do not provide any inference speed-up on their own. Later in this article, we will see how to quantize the model's weights to a lower precision:

  • LoRA
  • QLoRA (only 4-bit)

Fine-Tuned Quantized Models

Benefits of fine-tuning a quantized model:

  1. Recover accuracy lost due to quantization
  2. Tailor your model for specific use-cases and applications

In this post, we are focusing on the second use case, which leverages PEFT (Parameter-Efficient Fine-Tuning). PEFT significantly reduces the number of trainable parameters of a model while keeping performance comparable to full fine-tuning.

LoRA (Low-Rank Adaptation of Large Language Models) is one of the most widely adopted PEFT methods.

LoRA is inspired by research on the intrinsic dimensionality of large language models. Essentially, it suggests that for fine-tuning you don't need to adjust every single parameter in these models. Instead, you can modify a small subset of the model's weights and still get great results for specific tasks.

How does LoRA work?

  • First, let's recall what the rank of a matrix tells us: how many linearly independent row or column vectors exist in the matrix
  • Low-rank means matrices whose rank is much smaller than the number of dimensions, and adaptation refers to fine-tuning the model

Credits: Pytorch Fine-tune LLM Blog


  • Update matrices: LoRA decomposes a large weight matrix into two smaller, low-rank matrices.
  • New matrices can be trained to adapt to the new data while keeping the overall number of changes low
  • Freezes the pre-trained model weights
  • To produce the final results, both the original and the adapted weights are combined.
  • The number of trainable parameters in a LoRA model depends on the size of the low-rank update matrices, which is determined mainly by the rank r and the shape of the original weight matrix (see the sketch after this list)
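The sketch below illustrates the idea with toy shapes (the dimensions are illustrative assumptions, not the exact PEFT implementation): the frozen weight W is combined with a trainable low-rank update B @ A, and the two small factors contain far fewer parameters than W itself.

import torch

d_out, d_in, r = 4096, 4096, 16     # original weight shape and LoRA rank
alpha = 32                          # scaling hyperparameter (lora_alpha)

W = torch.randn(d_out, d_in)        # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01     # trainable low-rank factor
B = torch.zeros(d_out, r)           # trainable low-rank factor, initialised to zero

x = torch.randn(d_in)
y = W @ x + (alpha / r) * (B @ (A @ x))   # frozen path + low-rank adapter path

print(W.numel())                    # 16,777,216 parameters in the full matrix
print(A.numel() + B.numel())        # 131,072 parameters in the LoRA factors (~0.8%)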

QLoRA

  • Quantized model weights + low rank adapters
  • Quantizes the weight parameters of the pre-trained LLM to 4-bit precision
  • Reduces the memory footprint of the LLM, making it possible to finetune it on a single GPU
  • QLoRA uses NF4 (4-bit Normal Float), double quantization and paged optimizers combined with LoRA to replicate 16-bit full fine-tuning performance at a 17x smaller memory footprint

From LoRA to QLoRA

For the 70B foundation model below, full fine-tuning would require about 840 GB of GPU memory, roughly 34 consumer GPUs, i.e. very expensive. LoRA makes the gradients and optimiser states much smaller and adds only a few adapter weights, bringing us to about 17.6 bits per parameter, or roughly 8 consumer GPUs. That is still expensive, and the largest memory footprint obviously comes from the weights themselves, which is exactly what QLoRA addresses.

Credits: StanfordMLSysSeminars

In QLoRA, we start by compressing a transformer model into a smaller size, specifically using 4-bit quantization, and then we lock it in place. We add adapters on top of this compressed model. When fine-tuning, we only adjust these adapters, not the original model parts that are frozen.

This approach means that each parameter effectively uses only about 5.2 bits, significantly reducing the memory needed. The whole setup requires just 46 GB of GPU memory and can run on two standard consumer GPUs.

Credits: StanfordMLSysSeminars
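Here is a quick back-of-the-envelope check of these figures (using the per-parameter costs quoted above rather than measured values):

n_params = 70e9                           # 70B foundation model

full_ft_gb = n_params * 12 / 1e9          # ~12 bytes/param: 16-bit weights + gradients + optimiser states
lora_gb    = n_params * 17.6 / 8 / 1e9    # ~17.6 bits/param with LoRA
qlora_gb   = n_params * 5.2 / 8 / 1e9     # ~5.2 bits/param with QLoRA

print(f"full fine-tuning: ~{full_ft_gb:.0f} GB")   # ~840 GB
print(f"LoRA:             ~{lora_gb:.0f} GB")      # ~154 GB
print(f"QLoRA:            ~{qlora_gb:.0f} GB")     # ~46 GB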

So, a model that would have needed hardware worth $250,000 to fine-tune can now be fine-tuned on a setup worth about $3,000. Much cheaper and more memory-efficient.

Task: Fine-tuning Mistral-7B-Instruct on a summarisation task

We fine-tuned the Mistral-7b-Instruct model for a summarization task using Google Colab, which provided a T4-GPU for computation.

Loading Open LLM

In this post, we start by loading the model Mistral-7B-Instruct-v0.2; as discussed above, you always want to pull the instruct variant.

If you check the model card on Hugging Face, it says the tensor type is BF16, i.e. 16-bit floating-point precision.

Load the model

When loading the model, we are going to use a quantization strategy to fit it on the GPU.

The idea is to take the 16-bit representation of the model we're going to use and compress it to a 4-bit representation.

So, we store the model in 4-bit precision and, during inference, when we do the computations, we scale the weights back up to 16-bit precision (de-quantization).

Configure the quantization libraries

# Configure the quantization settings
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NF4 (4-bit NormalFloat) quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,   # de-quantize to 16-bit for computation
)

Load model and tokenizer

Using the transformers library from Hugging Face, we load the pre-trained causal language model and its tokenizer.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Setting the pad_token to unk_token means the tokenizer will use the unknown token for padding
tokenizer.pad_token = tokenizer.unk_token

# Setting padding_side to "right" means padding tokens will be added at the end of sequences
tokenizer.padding_side = "right"

Model construction

print(model)        

We can see that the Mistral model has 32 decoder layers and that their linear layers are now in 4-bit (Linear4bit) format.

Dataset Preparation

We'll be using the samsum dataset, which contains ~16K messenger-style conversations along with their summaries.

The training split has 14,732 rows, the test split has 819 rows, and the validation split has 818 rows. For convenience, we will take just 1,000, 50 and 50 rows respectively to see fine-tuning in action.


from datasets import load_dataset

dataset_name = "samsum"
dataset = load_dataset(dataset_name)

dataset["train"] = dataset["train"].select(range(1000))
dataset["test"] = dataset["test"].select(range(50))
dataset["validation"] = dataset["validation"].select(range(50))

dataset["train"][0]        

Let's look at one example of training data which has id, dialogue and summary

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}        

Then we define two functions, create_prompt and generate_response: the first builds a prompt from a row of the dataset, and the second generates a response given a prompt, model and tokenizer.

def create_prompt(sample, include_response = True):
  """
  Parameters:
    - sample: dict representing row of dataset
    - include_response: bool

  Functionality:
    This function should build the Python str `full_prompt`.

    If `include_response` is True, the prompt includes the summary;
    otherwise it does not (useful for prompting and testing).

  Returns:
    - full_prompt: str
  """

  # Extract the text to be summarized from the sample dictionary
  text_to_summarize = sample['dialogue']

  # Start constructing the prompt
  full_prompt = "[INST]Provide a summary of the following text:\n\n[INPUT_TEXT_START]\n"
  full_prompt += text_to_summarize
  full_prompt += "\n[INPUT_TEXT_END]\n\n[/INST]\n\n"

  # Include the summary if include_response is True
  if include_response:
      summary = sample['summary']  # Assuming 'summary' key holds the summary in the sample
      full_prompt += "SUMMARY: " + summary



  return full_prompt        
def generate_response(prompt, model, tokenizer):
  """
  Parameters:
    - prompt: str representing formatted prompt
    - model: model object
    - tokenizer: tokenizer object

  Functionality:
    This will allow our model to generate a response to a prompt!

  Returns:
    - str response of the model
  """

  # convert str input into tokenized input
  encoded_input = tokenizer(prompt,  return_tensors="pt")

  # send the tokenized inputs to our GPU
  model_inputs = encoded_input.to('cuda')

  # generate response and set desired generation parameters
  generated_ids = model.generate(
      **model_inputs,
      max_new_tokens=256,
      do_sample=True,
      pad_token_id=tokenizer.eos_token_id
  )

  # decode output from tokenized output to str output
  decoded_output = tokenizer.batch_decode(generated_ids)

  # return only the generated response (not the prompt) as output
  return decoded_output[0].split("[/INST]")[-1]        
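As a quick sanity check (shown purely for illustration), we can print the prompt built from the first training example:

print(create_prompt(dataset["train"][0]))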

  • Apply some post-processing on the 4-bit model to enable training
  • Freeze all our layers
  • Cast the layer-norm in float32 for stability
  • Also, cast the output of the last layer in float32 for the same reasons

Setting up PEFT LoRA

from peft import prepare_model_for_kbit_training 
model.config.use_cache = False 
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

# set our rank (higher value is more memory/better performance)
lora_r = 16

# set our dropout (default value)
lora_dropout = 0.1

# rule of thumb: alpha should be (lora_r * 2)
lora_alpha = 32

# construct our LoraConfig with the above hyperparameters
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(
    model,
    peft_config
)

print_trainable_parameters(model)        

trainable params: 6815744 || all params: 3758886912 || trainable%: 0.18132346515244138

We notice here that less than 1% of parameters are trainable

Setting up Training

from transformers import TrainingArguments

# Configure the training arguments for the model
args = TrainingArguments(
  output_dir = "mistral7binstruct_summarize",
  #num_train_epochs=5,
  max_steps = 50,                  # comment out this line if you want to train in epochs
  per_device_train_batch_size = 1,
  warmup_ratio = 0.03,             # fraction of training steps used for learning-rate warm-up
  logging_steps=10,
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=25,                   # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2e-4,
  lr_scheduler_type='constant',
)

from trl import SFTTrainer

max_seq_length = 2048 # Maximum sequence length for model inputs

# Initialize the supervised fine-tuning trainer with the specified arguments
trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,          # parameter-efficient fine-tuning configuration
  max_seq_length=max_seq_length,    # maximum sequence length for packed examples
  tokenizer=tokenizer,
  packing=True,                     # pack multiple short examples into each sequence
  formatting_func=create_prompt,    # builds the training prompt from each sample
  args=args,
  train_dataset=dataset["train"],
  eval_dataset=dataset["validation"]
)
             

When we fine-tune a language model, we aim to optimize its performance, which we can measure through "loss." Loss quantifies how far off a model's predictions are from the actual answers. Lower loss values indicate better model performance, showing that the model's predictions are becoming more accurate.

  • At step 25, the model had a training loss of 1.732500 and a validation loss of 1.537009.
  • By step 50, both losses had decreased, with training loss at 1.491500 and validation loss at 1.453956.

Now, push the model to the Hugging Face Hub:

from huggingface_hub import notebook_login

notebook_login()

trainer.push_to_hub("rajkstats/mistral-7binstruct-summary-100s")        

Next, we merge the smaller LoRA adapter into the base model; merge_and_unload then unloads the adapter as it is no longer needed.

merged_model = model.merge_and_unload()        

Now, let's take a look at one example to see how the model performs:

print(dataset["test"][3]["dialogue"])        
Will: hey babe, what do you want for dinner tonight?

Emma:  gah, don't even worry about it tonight

Will: what do you mean? everything ok?

Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry

Will: Well what time will you be home?

Emma: soon, hopefully

Will: you sure? Maybe you want me to pick you up?

Emma: no no it's alright. I'll be home soon, i'll tell you when I get home. 

Will: Alright, love you. 

Emma: love you too.         

Let's look at the base model response:

Emma won't be home for dinner tonight, she has a problem, but she doesn't want Will to worry. She'll let him know when she's home. She'll be coming soon.

And the fine-tuned model, using the generate_response function we wrote earlier:

generate_response(create_prompt(dataset["test"][3], include_response=False),
                  merged_model,
                  tokenizer)        
Emma is not feeling well. She will be home soon. She doesn't want Will to cook anything for dinner.

We can see that the fine-tuned model performs the task better than the original un-fine-tuned model, though there remains room for further refinement and optimisation.
