Evaluating GPT-2 Language Model: A Step-By-Step Guide
David Adamson MSc.
Founder - Abriella Care. / AI Solutions Expert / eCommerce / Software Engineering #nlp #machinelearning #artificialintelligence #mentalhealth
In the realm of Natural Language Processing (NLP), the second iteration of the Generative Pretrained Transformer model, popularly known as GPT-2, has been an instrumental tool for many researchers and developers. Its wide array of applications ranges from text generation, translation, summarisation, to question-answering. However, once you have fine-tuned GPT-2 on your specific task or dataset, it's crucial to understand how to effectively evaluate its performance.
In this article, I will walk you through a comprehensive guide on how to evaluate a GPT-2 Language Model. The process involves fine-tuning the model, implementing an evaluation metric, and finally interpreting the evaluation output.
Let's dive in.
Pre-processing your Dataset
Before we can train our model, we need to prepare the dataset, then create our training and validation datasets to be used.
Our dialogue dataset is stored in a JSON file. We'll preprocess this data, converting it to a .txt format where each line represents an interaction, either a 'User' input or an 'Assistant' output. (You will need to adjust this script for your own dataset, or you can use the same one I used which can be found here).
import json

def preprocess_intents_json(intents_file):
    with open(intents_file, "r") as f:
        data = json.load(f)

    preprocessed_data = []

    # Each intent groups user 'patterns' with assistant 'responses';
    # write them out as alternating 'User:' and 'Assistant:' lines.
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            preprocessed_data.append(f"User: {pattern}\n")
            for response in intent["responses"]:
                preprocessed_data.append(f"Assistant: {response}\n")

    return "".join(preprocessed_data)

def save_preprocessed_data(preprocessed_data, output_file):
    with open(output_file, "w") as f:
        f.write(preprocessed_data)

intents_file = "intents/intents.json"
output_file = "intents/mental_health_data.txt"

preprocessed_data = preprocess_intents_json(intents_file)
save_preprocessed_data(preprocessed_data, output_file)
This script reads a JSON file (intents_file) which contains the dialogue data. Each dialogue is a sequence of 'patterns' (user inputs) and 'responses' (assistant outputs), grouped by 'intent'. Each of these dialogues is converted into a text format of 'User: [input]' and 'Assistant: [output]'. This text representation of the dialogue is then saved to the specified output_file (in this case, intents/mental_health_data.txt).
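For reference, the script assumes an intents JSON file structured as a list of intents, each with its own 'patterns' and 'responses'. Below is a minimal, hypothetical example of that structure, written to a file and passed through the preprocess_intents_json function defined above (the name example_intents.json is only for illustration; your real intents.json will contain many more intents):

import json
import os

# A minimal, hypothetical intents structure for illustration only
example_intents = {
    "intents": [
        {
            "tag": "greeting",
            "patterns": ["Hi there", "How are you?"],
            "responses": ["Hello! How can I help you today?"]
        }
    ]
}

os.makedirs("intents", exist_ok=True)
with open("intents/example_intents.json", "w") as f:
    json.dump(example_intents, f, indent=2)

# Each pattern becomes a 'User:' line, followed by the intent's
# responses as 'Assistant:' lines
print(preprocess_intents_json("intents/example_intents.json"))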
Creating the Training and Validation Datasets
The next step is to split our pre-processed dataset into the training and validation datasets. These datasets should also be saved as .txt files, with each line representing a separate data sample. For language modelling tasks like the one GPT-2 is used for, these files should contain a large amount of diverse text data.
We can split the dataset using the following script:
import numpy as np

# Read the entire dataset into a list
with open('intents/mental_health_data.txt', 'r') as f:
    data = f.readlines()

# Shuffle the dataset
np.random.seed(1)
np.random.shuffle(data)

# Split the dataset into training and validation sets (80% - 20%)
split_index = int(len(data) * 0.8)
train_data = data[:split_index]
val_data = data[split_index:]

# Save the training and validation sets as separate files
with open('intents/train_data.txt', 'w') as f:
    f.writelines(train_data)

with open('intents/validation_data.txt', 'w') as f:
    f.writelines(val_data)
This script first reads the entire preprocessed dataset from the .txt file into a list. It then shuffles the data to ensure that the training and validation sets are representative of the overall data distribution.
After shuffling the data, it splits it into a training set (the first 80% of the data) and a validation set (the remaining 20% of the data). These percentages are typical for many machine learning tasks, but can be adjusted based on your specific needs.
Finally, the script saves the training and validation sets to separate .txt files. These files will be used as input to the fine_tune_gpt2 function we'll create later on.
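If you want a quick sanity check before fine-tuning, you can confirm how the lines were divided between the two files; a small sketch, assuming the file paths used above:

# Confirm the 80/20 split by counting lines in each file
with open('intents/train_data.txt', 'r') as f:
    n_train = len(f.readlines())
with open('intents/validation_data.txt', 'r') as f:
    n_val = len(f.readlines())

print(f"Training lines: {n_train}")
print(f"Validation lines: {n_val}")
print(f"Validation fraction: {n_val / (n_train + n_val):.2f}")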
With that, all the steps required to prepare the data are complete.
Evaluating GPT-2
Next, we'll start the evaluation process. The Python script below uses the transformers library by Hugging Face, which offers an array of powerful tools for working with transformer-based models like GPT-2. The script fine-tunes the GPT-2 model on a given dataset and then computes an evaluation metric to measure the performance of the fine-tuned model.
Here's the provided code for fine-tuning GPT-2 and evaluating it:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, EvalPrediction
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import numpy as np
from scipy.special import softmax
from sklearn.metrics import log_loss

def compute_metrics(p: EvalPrediction):
    # p.predictions holds the raw logits for every token position;
    # p.label_ids holds the corresponding target token ids.
    logits = p.predictions
    labels = p.label_ids
    # Convert logits to probabilities and compute the average cross-entropy
    # across all token positions and the full vocabulary.
    probabilities = softmax(logits, axis=-1)
    loss = log_loss(
        labels.flatten(),
        probabilities.reshape(-1, probabilities.shape[-1]),
        labels=list(range(logits.shape[-1])),
    )
    # Perplexity is the exponential of the cross-entropy loss
    perplexity = np.exp(loss)
    return {"perplexity": perplexity}

def fine_tune_gpt2(model_name, train_file, validation_file, output_dir):
    # Load GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    # Load training dataset
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128)

    # Load validation dataset
    val_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=validation_file,
        block_size=128)

    # Create data collator for causal language modeling (mlm=False)
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False)

    # Set training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_strategy='epoch',
        evaluation_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
        save_total_limit=2,
    )

    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()

    # Save the fine-tuned model and tokenizer
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
# Fine-tune the model on the training and validation files we created earlier
fine_tune_gpt2("gpt2", "intents/train_data.txt", "intents/validation_data.txt", "output")
Code Walkthrough
Importing Necessary Libraries
The code first imports the necessary libraries and modules from the transformers package, such as GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments, and Trainer. In addition to these, other necessary modules and functions from numpy, scipy, and sklearn are imported.
Defining Evaluation Metrics
The compute_metrics function is used to calculate the perplexity of the model's predictions. Perplexity is a common evaluation metric in language modeling and is used to measure how well the model predicts a sample.
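As a cross-check on this metric, perplexity can also be read straight off the Trainer's own 'eval_loss': the model computes a shifted next-token cross-entropy internally, and perplexity is simply the exponential of that loss. A minimal sketch, using the eval_loss value you'll see in the output later in this article:

import math

# The Trainer reports the validation cross-entropy as 'eval_loss';
# perplexity is exp(cross-entropy), so for an eval_loss of about 3.0034:
eval_loss = 3.0034
print(f"Perplexity from eval_loss: {math.exp(eval_loss):.2f}")  # roughly 20.2

Note that the compute_metrics function above compares each position's logits against the unshifted labels, so the 'eval_perplexity' it reports comes out much larger than exp(eval_loss); keep that difference in mind when reading the evaluation output below.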
Loading the Model and Tokenizer
In the fine_tune_gpt2 function, the GPT-2 model and its associated tokenizer are loaded using the from_pretrained method.
Fine-tuning the Model
The model is fine-tuned using the Trainer class provided by the transformers library. This class is designed to handle the training process, making it as simple as possible to train transformer models. The trainer uses the TrainingArguments for setting the hyperparameters for the training process, such as the number of epochs, batch size, and the evaluation strategy.
After training, the fine-tuned model and the tokenizer are saved using the save_pretrained method. These can be loaded later for further training, evaluation, or for making predictions.
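For example, reloading the fine-tuned model and tokenizer later is just another from_pretrained call pointed at the output directory; a minimal sketch, assuming the 'output' directory used in this article:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Reload the fine-tuned weights and tokenizer saved by save_pretrained
fine_tuned_model = GPT2LMHeadModel.from_pretrained("output")
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained("output")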
The Output
Upon running the script, you should see an output similar to the one below:
| 6/18 [00:50<01:01, 5.15s/it]Saving model checkpoint to output\checkpoint-
Configuration saved in output\checkpoint-6\config.json
Model weights saved in output\checkpoint-6\pytorch_model.bin
67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 12/18 [01:26<00:40, 6.71s/it]***** Running Evaluation *****
Num examples = 33
Batch size = 8
{'eval_loss': 3.0034492015838623, 'eval_perplexity': 49288.363346034486, 'eval_runtime': 20.3534, 'eval_samples_per_second': 1.621, 'eval_steps_per_second': 0.246, 'epoch': 2.0}
67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 12/18 [01:46<00:40, 6.71s/it]Saving model checkpoint to output\checkpoint-12
Configuration saved in output\checkpoint-12\config.json
Model weights saved in output\checkpoint-12\pytorch_model.bin
Configuration saved in output\checkpoint-18\config.json
Model weights saved in output\checkpoint-18\pytorch_model.bin
Deleting older checkpoint [output\checkpoint-6] due to args.save_total_limit
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from output\checkpoint-12 (score: 3.0034492015838623).
{'train_runtime': 169.978, 'train_samples_per_second': 0.424, 'train_steps_per_second': 0.106, 'train_loss': 1.620137956407335, 'epoch': 3.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [02:49<00:00, 9.44s/it]
Configuration saved in output\config.json
Model weights saved in output\pytorch_model.bin
tokenizer config file saved in output\tokenizer_config.json
Special tokens file saved in output\special_tokens_map.json
The script prints progress logs while training. After each epoch it runs an evaluation pass and prints metrics including 'eval_loss', 'eval_perplexity', and others. The 'eval_loss' is the model's cross-entropy loss on the validation set, while 'eval_perplexity' is the perplexity computed by our compute_metrics function on the validation predictions.
The script also saves the model weights and configuration after each epoch in the specified output directory, keeping at most two checkpoints on disk (save_total_limit=2), which is why the older checkpoint-6 is deleted. At the end of training, it loads the checkpoint that performed best on the validation set, as judged by 'eval_loss'.
What can I do with this new model?
Looking at the output from our training process, it appears that the best model is saved at checkpoint 12. As you can see from the output:
Loading best model from output\checkpoint-12 (score: 3.0034492015838623)
This model achieved the lowest evaluation loss, which makes it our top performer.
Having fine-tuned our GPT-2 model, we can now integrate it into our application. It's important to remember that, while our model might perform well during training and validation, real-world performance can sometimes differ. Therefore, it's crucial to continually monitor and evaluate the model's performance once it's deployed.
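As a rough sketch of what that integration might look like, here is a minimal generation loop using the fine-tuned model from the 'output' directory. The prompt mirrors the 'User:'/'Assistant:' format we trained on, and the example prompt and decoding settings are illustrative rather than tuned:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("output")
tokenizer = GPT2Tokenizer.from_pretrained("output")

# Prompt in the same 'User:'/'Assistant:' format used for fine-tuning
prompt = "User: I have been feeling anxious lately.\nAssistant:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation; these decoding parameters are a starting point only
output_ids = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))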
If the model's performance drops or if it starts to produce undesirable results, you may need to gather new training data to address these issues and retrain your model. It's also important to consider regularly updating your model with new data to ensure it stays current with any changes in the language or the domain it's being used in.
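Because fine_tune_gpt2 accepts any model name or local path that from_pretrained understands, one way to refresh the model is to start from the previously saved output directory and feed it the newly gathered data; the '_v2' file and directory names below are hypothetical placeholders:

# Continue fine-tuning from the previously saved model with fresh data
# ('..._v2' names are placeholders for your updated files and output directory)
fine_tune_gpt2(
    "output",
    "intents/train_data_v2.txt",
    "intents/validation_data_v2.txt",
    "output_v2",
)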
Training and fine-tuning language models is an ongoing process that requires regular evaluation and adjustment. However, the payoff can be significant, enabling highly engaging and human-like interactions in your applications.
Conclusion
Fine-tuning and evaluating transformer-based models like GPT-2 can be a complex task. However, with the powerful tools provided by libraries like transformers, it becomes significantly more approachable. This article should serve as a guide to fine-tune and evaluate GPT-2 models for your specific tasks. Remember that while 'eval_loss' and 'eval_perplexity' are commonly used metrics, the choice of evaluation metrics should depend on the specific requirements and nature of your task.
Remember, machine learning is a continuous journey, not a destination. Keep refining, experimenting, and improving.
Thanks as always for reading.
David.