Building a Custom Question-Answering System with GPT-2
Author: Rakesh Sheshadri
No GPU required, a simple CPU is all you need!
Train with a few thousand records!
Training on a CPU: Fine-tuning a GPT-2 model can run comfortably on a CPU, especially if you are working with a small dataset. The training loop is straightforward and does not require extensive computational resources, so for simple applications and smaller datasets a CPU can handle the training without significant delays.
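If you want to confirm which device PyTorch will see before you start, here is a minimal check (a quick sketch, assuming torch is already installed):
import torch

# The examples in this article run on the CPU; this just reports what is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training will run on: {device}")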
In this article, we will explore how to create a simple question-answering system using the GPT-2 model from Hugging Face’s Transformers library. We will cover two main parts: training a GPT-2 model on a custom dataset and then using that trained model to answer questions. This will give you a solid foundation for building more complex natural language processing applications.
Part 1: Training the GPT-2 Model
Setting Up the Environment
Before we start coding, make sure you have the necessary libraries installed. You can do this using pip:
pip install torch transformers
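If you want to verify the installation before moving on, printing the library versions is a quick check:
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"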
Defining the Dataset
We begin by creating a simple dataset containing questions and their corresponding answers. This dataset will be used to fine-tune our GPT-2 model. Each question-answer pair will be joined into a single training string of the form "Question: ... Answer: ..." so that training matches the prompt format we use later for inference.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader
# Define a simple dataset for training
class SimpleDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=max_length)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
Loading Pre-trained Model and Tokenizer
Next, we load the pre-trained GPT-2 model and tokenizer. This gives us a solid starting point for our training.
# Load pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
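Before wiring everything together, it can help to see what the tokenizer produces for our prompt format. This is purely an illustrative check, not part of the training script:
# Illustrative check of the tokenizer output
sample = tokenizer("Question: What is AstroSynth? Answer:", return_tensors="pt")
print(sample["input_ids"])       # token IDs for the prompt
print(sample["attention_mask"])  # 1 for every real token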
Preparing the Data Loader
We’ll create a list of question-answer pairs, join each pair into a single training string, and prepare the DataLoader for batching our data.
qa_data = [
    ("What is AstroSynth?", "AstroSynth is an innovative program designed to synthesize oils from celestial materials."),
    # Add more Q&A pairs...
]
# Join each pair into a single training string that matches the inference prompt format
texts = [f"Question: {q} Answer: {a}" for q, a in qa_data]
dataset = SimpleDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
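If you want to make sure batching works as expected, an optional peek at one batch looks like this:
# Optional: inspect the shapes of one batch from the DataLoader
batch = next(iter(dataloader))
print(batch["input_ids"].shape)       # (batch_size, sequence_length)
print(batch["attention_mask"].shape)  # same shape as input_ids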
Training the Model
Now we can set up the training loop. In this minimal example we make a single pass over the DataLoader, perform a forward pass, compute the loss, and update the model weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Set model to training mode
model.train()
# Training loop (single pass over the data)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    print(f"Loss: {loss.item()}")  # Print loss for monitoring
# Save the trained model
model.save_pretrained("trained_gpt2_model")
tokenizer.save_pretrained("trained_gpt2_model") # Save the tokenizer as well
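The loop above makes only a single pass over the data. In practice you will usually want several passes; a minimal sketch of that extension (the epoch count of 3 is just an illustrative value) is shown below:
# Sketch: repeat the same training loop for multiple epochs
num_epochs = 3  # illustrative value, tune for your dataset
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1} - last batch loss: {loss.item()}")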
Part 2: Using the Trained Model for Question Answering
After training our model, we can load it and use it to answer questions.
Loading the Trained Model
# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("trained_gpt2_model")
tokenizer = GPT2Tokenizer.from_pretrained("trained_gpt2_model")
model.eval()
Creating a Function to Ask Questions
We will create a function that takes a question as input and generates an answer using our trained model.
def ask_question(question):
    input_text = f"Question: {question} Answer:"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=50,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id  # Avoid the missing pad_token_id warning
        )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer.split("Answer:")[-1].strip()  # Return the generated answer
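With the function in place, asking a question is a single call, for example:
# Example usage
print(ask_question("What is AstroSynth?"))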
Step 2: Creating the Files
Create two Python files in your project directory: train_gpt2.py and ask_gpt2.py.
File 1: train_gpt2.py
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader
# Define a simple dataset for training
class SimpleDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=max_length)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
# Load pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Prepare question-answer data
qa_data = [
    ("What is AstroSynth?", "AstroSynth is an innovative program designed to synthesize oils from celestial materials."),
    ("How does AstroSynth work?", "The AstroSynth program utilizes advanced technologies to extract and synthesize oils from meteorites."),
    # Add more Q&A pairs...
]
# Join each pair into a single training string that matches the inference prompt format
texts = [f"Question: {q} Answer: {a}" for q, a in qa_data]
dataset = SimpleDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
# Set up the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Set model to training mode
model.train()
# Training loop (single pass over the data)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Zero the gradients
    optimizer.zero_grad()
    # Forward pass
    outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
    loss = outputs.loss
    # Backward pass
    loss.backward()
    # Update weights
    optimizer.step()
    print(f"Loss: {loss.item()}")  # Print loss for monitoring
# Save the trained model
model.save_pretrained("trained_gpt2_model")
tokenizer.save_pretrained("trained_gpt2_model") # Save the tokenizer as well
File 2: ask_gpt2.py
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("trained_gpt2_model")
tokenizer = GPT2Tokenizer.from_pretrained("trained_gpt2_model")
model.eval()
# Function to ask a question and get an answer
def ask_question(question):
    input_text = f"Question: {question} Answer:"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=50,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id  # Avoid the missing pad_token_id warning
        )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer.split("Answer:")[-1].strip()  # Return the generated answer
# Example questions
print(ask_question("What is the capital of France?"))
print(ask_question("How does AstroSynth use artificial intelligence?"))
Step 3: Running the Code
Train the Model:
python train_gpt2.py
Ask Questions:
python ask_gpt2.py
Example Questions
You can now ask questions and get answers:
print(ask_question("What is the capital of France?"))
print(ask_question("How does AstroSynth use artificial intelligence?"))
Conclusion
In this tutorial, we demonstrated how to train a GPT-2 model for a specific task: answering questions based on a custom dataset. We also covered how to load the trained model and make predictions. With this foundation, you can further enhance your model by adding more data, fine-tuning hyperparameters, or even experimenting with different architectures.
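For example, one simple way to gauge how well the model fits your data is to compute perplexity over a held-out set of question-answer strings. The sketch below assumes an eval_dataloader built the same way as the training DataLoader; it is an approximation that averages per-batch losses rather than weighting by token count:
import math
import torch

# Sketch: approximate perplexity over a held-out DataLoader (eval_dataloader is assumed to exist)
model.eval()
total_loss, num_batches = 0.0, 0
with torch.no_grad():
    for batch in eval_dataloader:
        outputs = model(batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["input_ids"])
        total_loss += outputs.loss.item()
        num_batches += 1
perplexity = math.exp(total_loss / num_batches)
print(f"Perplexity: {perplexity:.2f}")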
Feel free to explore and expand on this work. The possibilities are endless!