Step-by-Step Guide to Fine-Tuning Mistral 7B for Indian Languages
Sreeshti Singh
Cloud Consultant at E2E Networks
What Is Fine-Tuning?
Fine-tuning a model involves adjusting a pre-trained model so that it performs better on a specific task. The process is akin to customizing a general-purpose tool to suit a particular job.
Initially, the model is trained on a large, diverse dataset to learn a wide range of features and patterns. During fine-tuning, this model is further trained on a smaller, task-specific dataset, which helps it refine its knowledge and improve its predictions or performance on tasks closely related to this dataset.
This technique leverages the broad understanding the model has already developed, allowing it to apply this knowledge with greater precision to a narrower task, thereby enhancing its accuracy and efficiency in specific applications.
In this blog post, we will show you the step-by-step process to fine-tune the model Mistral 7B on an Indic-language dataset. We’ll be using the indic_glue dataset from Hugging Face. The dataset has many different modules for various Indic Languages. We are going to select the Telugu language module to fine-tune our model.
E2E Networks: An Overview
Since fine-tuning an LLM requires significant compute resources, we need a powerful GPU that can handle our requirements. E2E Networks offers a wide range of cloud GPU nodes, such as the NVIDIA H100, A100, and V100 series.
Head over to the E2E Networks website to sign up for the GPU offerings. For this blog post, we will spin up a V100 GPU node.
Step-by-Step Process to Fine-Tune Mistral 7B on a Telugu-Language Dataset
First, install all the necessary libraries in your Python environment.
%pip install -U bitsandbytes transformers peft accelerate trl datasets
Next, import the modules that are going to be needed for the fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os
import torch
from datasets import load_dataset
from trl import SFTTrainer
Log in to your Hugging Face account.
!huggingface-cli login --token 'YOUR_HUGGING_FACE_TOKEN'
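If you prefer to stay inside Python rather than shelling out, the huggingface_hub library exposes the same login (a small alternative, not in the original post):

from huggingface_hub import login

login(token='YOUR_HUGGING_FACE_TOKEN')  # same token as the CLI command above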
Initialize some variables and load the dataset.
base_model = "mistralai/Mistral-7B-v0.1"
dataset_name = "indic_glue"
new_model = "mistral_7b_telugu"
We load the training dataset and the validation dataset separately.
train_dataset = load_dataset('indic_glue','actsa-sc.te', split='train')
eval_dataset = load_dataset('indic_glue','actsa-sc.te', split='validation')
Here’s how the dataset looks:
train_dataset['text']
[… a list of Telugu sentences; the Telugu script did not survive the encoding of the original page, so the sentences are not reproduced here …]
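Beyond eyeballing the raw text, it is worth checking the structure of the split. The actsa-sc.te subset is a Telugu sentiment-classification corpus, so each record carries a text field and a label field; we only use text here, because the goal is language adaptation rather than classification. A quick inspection (not in the original post):

print(train_dataset)     # Dataset({features: ['text', 'label'], num_rows: ...})
print(train_dataset[0])  # one record: a Telugu sentence plus its sentiment label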
Load the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
    bnb_4bit_use_double_quant=False,        # skip nested quantization of the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False          # the KV cache is incompatible with gradient checkpointing during training
model.config.pretraining_tp = 1         # disable the tensor-parallel pretraining behavior
model.gradient_checkpointing_enable()   # recompute activations in the backward pass to save memory
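For a rough sense of why 4-bit loading matters: at 4 bits per weight, the 7 billion parameters occupy about 7e9 × 0.5 bytes ≈ 3.5 GB, versus roughly 14 GB in fp16, leaving headroom on a single GPU for activations, gradients, and the LoRA adapters.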
Load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'              # pad on the right for causal LM training
tokenizer.pad_token = tokenizer.eos_token     # Mistral has no pad token, so reuse EOS
tokenizer.add_eos_token = True                # append EOS to every training example
tokenizer.add_bos_token, tokenizer.add_eos_token  # quick check of the BOS/EOS flags
Next, we fine-tune the model with PEFT (Parameter-Efficient Fine-Tuning), specifically using Low-Rank Adaptation (LoRA) to optimize the model for our task.
LoRA is a technique used to fine-tune large pre-trained models in a parameter-efficient manner. Instead of updating all the model parameters during the fine-tuning process, LoRA focuses on modifying only a small subset. It does this by introducing low-rank matrices to adapt specific weight matrices within the model, typically in the attention mechanism of Transformer-based architectures.
The key idea is to keep the original pre-trained weights mostly unchanged while using these additional, smaller matrices to capture the adjustments needed for the model to perform well on a specific task. This approach significantly reduces the number of parameters that need to be trained, making the fine-tuning process faster and less resource-intensive, while still leveraging the powerful capabilities of the original large model.
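Concretely, for a frozen weight matrix W of shape (d, k), LoRA learns two small matrices B (d × r) and A (r × k) with rank r much smaller than d or k, and the adapted layer computes Wx + (alpha / r)·BAx. Here is a minimal sketch of that idea in plain PyTorch; it is illustrative only, not the actual PEFT internals:

import torch

d, k, r, alpha = 4096, 4096, 64, 16   # dimensions and scaling mirror our LoraConfig below
W = torch.randn(d, k)                 # frozen pre-trained weight (never updated)
A = torch.randn(r, k) * 0.01          # trainable low-rank factor
B = torch.zeros(d, r)                 # trainable; zero-init so training starts from the base model
x = torch.randn(k)

y = W @ x + (alpha / r) * (B @ (A @ x))   # base output plus the low-rank correction

With r=64 against a 4096 × 4096 weight, the adapter adds 2 × 4096 × 64 ≈ 524k parameters per matrix, about 3% of the original 16.8 million.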
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,          # scaling factor for the LoRA update
    lora_dropout=0.1,
    r=64,                   # rank of the low-rank matrices
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],  # attention and gate projections
)
model = get_peft_model(model, peft_config)
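To confirm how little LoRA actually trains, PEFT models expose a helper (this call is not in the original post):

model.print_trainable_parameters()  # prints trainable vs. total parameters, typically a few percent here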
Now we define the training arguments that configure the training run with the Hugging Face Transformers library.
These arguments specify parameters such as the directory to save results (`output_dir`), the number of training epochs (`num_train_epochs`), the batch size per device (`per_device_train_batch_size`), and the optimizer (`optim`), here the memory-efficient `paged_adamw_32bit`.
They also set how often to save checkpoints and log metrics (`save_steps` and `logging_steps`), the learning rate, the weight decay used for regularization, and whether to use mixed-precision training (`fp16`, `bf16`).
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,                   # -1 means train for the full num_train_epochs
    warmup_ratio=0.03,
    logging_dir="./logs",
    group_by_length=True,           # batch similar-length sequences to reduce padding
    lr_scheduler_type="constant",
)
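A quick sanity check on these numbers: with per_device_train_batch_size=4 and gradient_accumulation_steps=1 on a single GPU, the effective batch size is 4, so one epoch takes (number of training examples) / 4 optimizer steps; that matches the 1082 steps per epoch (≈4328 examples) in the training log below.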
The TRL library from Hugging Face provides the SFTTrainer, an accessible API for building Supervised Fine-Tuning (SFT) runs on your own dataset in just a few lines of code. We supply it with the model, the datasets, the LoRA configuration, the tokenizer, and the training arguments.
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    max_seq_length=None,          # fall back to the default maximum sequence length
    dataset_text_field="text",    # the dataset column that holds the raw text
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,                # one example per sequence; no packing
)
Now we are ready to train our model.
trainer.train()
[1082/1082 50:13, Epoch 1/1]

Step    Training Loss
25      1.131000
50      1.111600
75      1.055100
100     1.057000
125     1.029700
150     1.006700
175     0.979600
200     0.981900
225     0.945400
250     0.932800
275     0.939000
300     0.934500
325     0.909600
350     0.922100
375     0.920900
400     0.911200
425     0.883600
450     0.879200
475     0.913900
500     0.874800
525     0.871300
550     0.857300
575     0.845700
600     0.859300
625     0.870500
650     0.845700
675     0.854000
700     0.821600
725     0.866700
750     0.852200
775     0.852300
800     0.835900
825     0.838200
850     0.841400
875     0.827100
900     0.824800
925     0.831100
950     0.811400
975     0.795400
1000    0.837800
1025    0.811300
1050    0.776900
1075    0.789600
TrainOutput(global_step=1082, training_loss=0.8960528351683273, metrics={'train_runtime': 3025.3236, 'train_samples_per_second': 1.431, 'train_steps_per_second': 0.358, 'total_flos': 3.622690119047578e+16, 'train_loss': 0.8960528351683273, 'epoch': 1.0})
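The training loss falls steadily from about 1.13 to about 0.79 over the epoch, averaging 0.896, with no sign of divergence, which suggests the model is adapting to the Telugu text.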
Save the trained model into our workspace. Note that save_pretrained on a PEFT model saves only the adapter weights, not the full base model.
trainer.model.save_pretrained(new_model)
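It is also handy to save the tokenizer alongside the adapter so the checkpoint is self-contained (a small addition, not in the original post):

tokenizer.save_pretrained(new_model)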
Now we attach our newly trained adapter on top of the model so that we can test the fine-tuning.
model_fine_tuned = PeftModel.from_pretrained(model, new_model)  # wrap the in-memory model with the saved adapter
Create a pipeline for text-generation.
pipe = pipeline(
    "text-generation",
    model=model_fine_tuned,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
Let’s give it a simple prompt: ‘Write a paragraph in Telugu’. (The Telugu characters of the prompt were lost to encoding in the original page, so a placeholder stands in for them below.)
sequences = pipe(
    "<Telugu prompt: 'Write a paragraph in Telugu'>",  # the original Telugu string did not survive encoding
    do_sample=True,           # sample instead of greedy decoding
    max_new_tokens=100,
    temperature=0.7,          # soften the token distribution
    top_k=50,                 # restrict sampling to the 50 most likely tokens
    top_p=0.95,               # nucleus sampling
    num_return_sequences=1,
)
print(sequences[0]['generated_text'])
Output:
(A generated Telugu paragraph; the Telugu script did not survive the encoding of the original page. Its English translation follows.)
Translation:
We have always had big dreams. We have always tried to fulfill those dreams ourselves. What are we doing for that? We love, care for, and help our loved ones. We have love for our loved ones, and we give them blessings. Do we love small people and care for them? We love and care for our small loved ones as much as our big loved ones.