Evaluating GPT-2 Language Model: A Step-By-Step Guide
David Adamson MSc.
Founder - Abriella Care. / AI Solutions Expert / eCommerce / Software Engineering #nlp #machinelearning #artificialintelligence #mentalhealth
In the realm of Natural Language Processing (NLP), the second iteration of the Generative Pretrained Transformer model, popularly known as GPT-2, has been an instrumental tool for many researchers and developers. Its wide array of applications ranges from text generation, translation, summarisation, to question-answering. However, once you have fine-tuned GPT-2 on your specific task or dataset, it's crucial to understand how to effectively evaluate its performance.
In this article, I will walk you through a comprehensive guide on how to evaluate a GPT-2 Language Model. The process involves fine-tuning the model, implementing an evaluation metric, and finally interpreting the evaluation output.
Let's dive in.
Pre-processing your Dataset
Before we can train our model, we need to prepare the dataset, then create our training and validation datasets to be used.
Our dialogue dataset is stored in a JSON file. We'll preprocess this data, converting it to a .txt format where each line represents an interaction, either a 'User' input or an 'Assistant' output. (You will need to adjust this script for your own dataset, or you can use the same one I used which can be found here).
import json

def preprocess_intents_json(intents_file):
    with open(intents_file, "r") as f:
        data = json.load(f)

    preprocessed_data = []

    # Each intent groups user 'patterns' with assistant 'responses';
    # write them out as alternating 'User:' and 'Assistant:' lines.
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            preprocessed_data.append(f"User: {pattern}\n")
            for response in intent["responses"]:
                preprocessed_data.append(f"Assistant: {response}\n")

    return "".join(preprocessed_data)

def save_preprocessed_data(preprocessed_data, output_file):
    with open(output_file, "w") as f:
        f.write(preprocessed_data)

intents_file = "intents/intents.json"
output_file = "intents/mental_health_data.txt"

preprocessed_data = preprocess_intents_json(intents_file)
save_preprocessed_data(preprocessed_data, output_file)
This script reads a JSON file (intents_file) which contains the dialogue data. Each dialogue is a sequence of 'patterns' (user inputs) and 'responses' (assistant outputs), grouped by 'intent'. Each of these dialogues is converted into a text format of 'User: [input]' and 'Assistant: [output]'. This text representation of the dialogue is then saved to the specified output_file (in this case, intents/mental_health_data.txt).
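For reference, the script assumes an intents JSON file structured as a list of intents, each with its own 'patterns' and 'responses'. Below is a minimal, hypothetical example of that structure, written to a file and passed through the preprocess_intents_json function defined above (the name example_intents.json is only for illustration; your real intents.json will contain many more intents):

import json
import os

# A minimal, hypothetical intents structure for illustration only
example_intents = {
    "intents": [
        {
            "tag": "greeting",
            "patterns": ["Hi there", "How are you?"],
            "responses": ["Hello! How can I help you today?"]
        }
    ]
}

os.makedirs("intents", exist_ok=True)
with open("intents/example_intents.json", "w") as f:
    json.dump(example_intents, f, indent=2)

# Each pattern becomes a 'User:' line, followed by the intent's
# responses as 'Assistant:' lines
print(preprocess_intents_json("intents/example_intents.json"))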
Creating the Training and Validation Datasets
The next step is to split our pre-processed dataset into the training and validation datasets. These datasets should also be saved as .txt files, with each line representing a separate data sample. For language modelling tasks like the one GPT-2 is used for, these files should contain a large amount of diverse text data.
We can split the dataset using the following script:
import numpy as np

# Read the entire dataset into a list
with open('intents/mental_health_data.txt', 'r') as f:
    data = f.readlines()

# Shuffle the dataset
np.random.seed(1)
np.random.shuffle(data)

# Split the dataset into training and validation sets (80% - 20%)
split_index = int(len(data) * 0.8)
train_data = data[:split_index]
val_data = data[split_index:]

# Save the training and validation sets as separate files
with open('intents/train_data.txt', 'w') as f:
    f.writelines(train_data)

with open('intents/validation_data.txt', 'w') as f:
    f.writelines(val_data)
This script first reads the entire preprocessed dataset from the .txt file into a list. It then shuffles the data to ensure that the training and validation sets are representative of the overall data distribution.
After shuffling the data, it splits it into a training set (the first 80% of the data) and a validation set (the remaining 20% of the data). These percentages are typical for many machine learning tasks, but can be adjusted based on your specific needs.
Finally, the script saves the training and validation sets to separate .txt files. These files will be used as input to the fine_tune_gpt2 function we'll create later on.
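If you want a quick sanity check before fine-tuning, you can confirm how the lines were divided between the two files; a small sketch, assuming the file paths used above:

# Confirm the 80/20 split by counting lines in each file
with open('intents/train_data.txt', 'r') as f:
    n_train = len(f.readlines())
with open('intents/validation_data.txt', 'r') as f:
    n_val = len(f.readlines())

print(f"Training lines: {n_train}")
print(f"Validation lines: {n_val}")
print(f"Validation fraction: {n_val / (n_train + n_val):.2f}")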
With that, all the steps required to prepare the data are complete.
Evaluating GPT-2
Next, we'll start the evaluation process. The Python script below uses the transformers library by Hugging Face, which offers an array of powerful tools for working with transformer-based models like GPT-2. The script fine-tunes the GPT-2 model on a given dataset and then computes an evaluation metric to measure the performance of the fine-tuned model.
Here's the provided code for fine-tuning GPT-2 and evaluating it:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, EvalPrediction
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import numpy as np
from scipy.special import softmax
from sklearn.metrics import log_loss

def compute_metrics(p: EvalPrediction):
    # p.predictions holds the raw logits for every token position;
    # p.label_ids holds the corresponding target token ids.
    logits = p.predictions
    labels = p.label_ids
    # Convert logits to probabilities and compute the average cross-entropy
    # across all token positions and the full vocabulary.
    probabilities = softmax(logits, axis=-1)
    loss = log_loss(
        labels.flatten(),
        probabilities.reshape(-1, probabilities.shape[-1]),
        labels=list(range(logits.shape[-1])),
    )
    # Perplexity is the exponential of the cross-entropy loss
    perplexity = np.exp(loss)
    return {"perplexity": perplexity}

def fine_tune_gpt2(model_name, train_file, validation_file, output_dir):
    # Load GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    # Load training dataset
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128)

    # Load validation dataset
    val_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=validation_file,
        block_size=128)

    # Create data collator for causal language modeling (mlm=False)
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False)

    # Set training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_strategy='epoch',
        evaluation_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
        save_total_limit=2,
    )

    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()

    # Save the fine-tuned model and tokenizer
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
# Fine-tune the model on the training and validation files we created earlier
fine_tune_gpt2("gpt2", "intents/train_data.txt", "intents/validation_data.txt", "output")
Code Walkthrough
Importing Necessary Libraries
The code first imports the necessary libraries and modules from the transformers package, such as GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments, and Trainer. In addition to these, other necessary modules and functions from numpy, scipy, and sklearn are imported.
Defining Evaluation Metrics
The compute_metrics function is used to calculate the perplexity of the model's predictions. Perplexity is a common evaluation metric in language modeling and is used to measure how well the model predicts a sample.
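As a cross-check on this metric, perplexity can also be read straight off the Trainer's own 'eval_loss': the model computes a shifted next-token cross-entropy internally, and perplexity is simply the exponential of that loss. A minimal sketch, using the eval_loss value you'll see in the output later in this article:

import math

# The Trainer reports the validation cross-entropy as 'eval_loss';
# perplexity is exp(cross-entropy), so for an eval_loss of about 3.0034:
eval_loss = 3.0034
print(f"Perplexity from eval_loss: {math.exp(eval_loss):.2f}")  # roughly 20.2

Note that the compute_metrics function above compares each position's logits against the unshifted labels, so the 'eval_perplexity' it reports comes out much larger than exp(eval_loss); keep that difference in mind when reading the evaluation output below.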
Loading the Model and Tokenizer
In the fine_tune_gpt2 function, the GPT-2 model and its associated tokenizer are loaded using the from_pretrained method.
Fine-tuning the Model
The model is fine-tuned using the Trainer class provided by the transformers library. This class is designed to handle the training process, making it as simple as possible to train transformer models. The trainer uses the TrainingArguments for setting the hyperparameters for the training process, such as the number of epochs, batch size, and the evaluation strategy.
After training, the fine-tuned model and the tokenizer are saved using the save_pretrained method. These can be loaded later for further training, evaluation, or for making predictions.
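For example, reloading the fine-tuned model and tokenizer later is just another from_pretrained call pointed at the output directory; a minimal sketch, assuming the 'output' directory used in this article:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Reload the fine-tuned weights and tokenizer saved by save_pretrained
fine_tuned_model = GPT2LMHeadModel.from_pretrained("output")
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained("output")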
The Output
Upon running the script, you should see an output similar to the one below:
| 6/18 [00:50<01:01, 5.15s/it]Saving model checkpoint to output\checkpoint-
Configuration saved in output\checkpoint-6\config.json
Model weights saved in output\checkpoint-6\pytorch_model.bin
67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 12/18 [01:26<00:40, 6.71s/it]***** Running Evaluation *****
Num examples = 33
Batch size = 8
{'eval_loss': 3.0034492015838623, 'eval_perplexity': 49288.363346034486, 'eval_runtime': 20.3534, 'eval_samples_per_second': 1.621, 'eval_steps_per_second': 0.246, 'epoch': 2.0}
67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 12/18 [01:46<00:40, 6.71s/it]Saving model checkpoint to output\checkpoint-12
Configuration saved in output\checkpoint-12\config.json
Model weights saved in output\checkpoint-12\pytorch_model.bin
Configuration saved in output\checkpoint-18\config.json
Model weights saved in output\checkpoint-18\pytorch_model.bin
Deleting older checkpoint [output\checkpoint-6] due to args.save_total_limit
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from output\checkpoint-12 (score: 3.0034492015838623).
{'train_runtime': 169.978, 'train_samples_per_second': 0.424, 'train_steps_per_second': 0.106, 'train_loss': 1.620137956407335, 'epoch': 3.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [02:49<00:00, 9.44s/it]
Configuration saved in output\config.json
Model weights saved in output\pytorch_model.bin
tokenizer config file saved in output\tokenizer_config.json
Special tokens file saved in output\special_tokens_map.json
The script prints progress logs while training. After each epoch it runs an evaluation pass and prints metrics including 'eval_loss', 'eval_perplexity', and others. The 'eval_loss' is the model's cross-entropy loss on the validation set, while 'eval_perplexity' is the perplexity computed by our compute_metrics function on the validation predictions.
The script also saves the model weights and configuration after each epoch in the specified output directory, keeping at most two checkpoints on disk (save_total_limit=2), which is why the older checkpoint-6 is deleted. At the end of training, it loads the checkpoint that performed best on the validation set, as judged by 'eval_loss'.
What can I do with this new model?
Looking at the output from our training process, it appears that the best model is saved at checkpoint 12. As you can see from the output:
Loading best model from output\checkpoint-12 (score: 3.0034492015838623)
This model achieved the lowest evaluation loss, which makes it our top performer.
Having fine-tuned our GPT-2 model, we can now integrate it into our application. It's important to remember that, while our model might perform well during training and validation, real-world performance can sometimes differ. Therefore, it's crucial to continually monitor and evaluate the model's performance once it's deployed.
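As a rough sketch of what that integration might look like, here is a minimal generation loop using the fine-tuned model from the 'output' directory. The prompt mirrors the 'User:'/'Assistant:' format we trained on, and the example prompt and decoding settings are illustrative rather than tuned:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("output")
tokenizer = GPT2Tokenizer.from_pretrained("output")

# Prompt in the same 'User:'/'Assistant:' format used for fine-tuning
prompt = "User: I have been feeling anxious lately.\nAssistant:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation; these decoding parameters are a starting point only
output_ids = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))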
If the model's performance drops or if it starts to produce undesirable results, you may need to gather new training data to address these issues and retrain your model. It's also important to consider regularly updating your model with new data to ensure it stays current with any changes in the language or the domain it's being used in.
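Because fine_tune_gpt2 accepts any model name or local path that from_pretrained understands, one way to refresh the model is to start from the previously saved output directory and feed it the newly gathered data; the '_v2' file and directory names below are hypothetical placeholders:

# Continue fine-tuning from the previously saved model with fresh data
# ('..._v2' names are placeholders for your updated files and output directory)
fine_tune_gpt2(
    "output",
    "intents/train_data_v2.txt",
    "intents/validation_data_v2.txt",
    "output_v2",
)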
Training and fine-tuning language models is an ongoing process that requires regular evaluation and adjustment. However, the payoff can be significant, enabling highly engaging and human-like interactions in your applications.
Conclusion
Fine-tuning and evaluating transformer-based models like GPT-2 can be a complex task. However, with the powerful tools provided by libraries like transformers, it becomes significantly more approachable. This article should serve as a guide to fine-tune and evaluate GPT-2 models for your specific tasks. Remember that while 'eval_loss' and 'eval_perplexity' are commonly used metrics, the choice of evaluation metrics should depend on the specific requirements and nature of your task.
Remember, machine learning is a continuous journey, not a destination. Keep refining, experimenting, and improving.
Thanks as always for reading.
David.