Building a Custom Question-Answering System with GPT-2

Author: Rakesh Sheshadri

No GPU required: a simple CPU is all you need!

You can train with just a few thousand records!

Training on a CPU: Fine-tuning a GPT-2 model can run efficiently on a CPU, especially if you're working with a small dataset. The training loop is straightforward and does not require extensive computational resources. For simple applications and smaller datasets, a CPU can handle the training without significant delays.
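If you want to make the device explicit, here is a minimal sketch (the thread count of 4 is my own assumption; tune it to your machine):

import torch

torch.set_num_threads(4)  # cap the number of CPU threads PyTorch uses
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")  # prints "cpu" on a machine without a GPU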

In this article, we will explore how to create a simple question-answering system using the GPT-2 model from Hugging Face’s Transformers library. We will cover two main parts: training a GPT-2 model on a custom dataset and then using that trained model to answer questions. This will give you a solid foundation for building more complex natural language processing applications.

Part 1: Training the GPT-2 Model

Setting Up the Environment

Before we start coding, make sure you have the necessary libraries installed. You can do this using pip:

pip install torch transformers        
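To confirm the installation, you can run a quick version check (the printed versions will vary with your environment):

import torch
import transformers

print(torch.__version__)
print(transformers.__version__)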

Defining the Dataset

We begin by creating a simple dataset containing questions and their corresponding answers. This dataset will be used to fine-tune our GPT-2 model.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader

# Define a simple dataset for training
class SimpleDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=max_length)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}        

Loading Pre-trained Model and Tokenizer

Next, we load the pre-trained GPT-2 model and tokenizer. This gives us a solid starting point for our training.

# Load pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")        

Preparing the Data Loader

We’ll create a list of question-answer pairs and prepare the DataLoader for batching our data.

qa_data = [
    ("What is AstroSynth?", "AstroSynth is an innovative program designed to synthesize oils from celestial materials."),
    # Add more Q&A pairs...
]

# Format each pair into a single training string that matches the
# "Question: ... Answer: ..." prompt we will use at inference time
texts = [f"Question: {q} Answer: {a}" for q, a in qa_data]

dataset = SimpleDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

Training the Model

Now we can set up the training loop. In this mock training phase, we will iterate over our DataLoader, perform a forward pass, and optimize the model.

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Set model to training mode
model.train()

# Training loop (mock training)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]

    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
    loss = outputs.loss
    loss.backward()
    optimizer.step()

    print(f"Loss: {loss.item()}")  # Print loss for monitoring

# Save the trained model
model.save_pretrained("trained_gpt2_model")
tokenizer.save_pretrained("trained_gpt2_model")  # Save the tokenizer as well        
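Two refinements worth considering, neither of which is in the mock loop above: because we reuse the EOS token as padding, the loss is also computed on padding positions, and Hugging Face's causal-LM loss ignores any label set to -100, so padding can be masked out; you would also normally make several passes over the data. A sketch (the epoch count of 3 is an arbitrary assumption):

num_epochs = 3  # arbitrary; tune for your dataset
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]

        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # exclude padding positions from the loss

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1}, loss: {outputs.loss.item():.4f}")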

Part 2: Using the Trained Model for Question Answering

After training our model, we can load it and use it to answer questions.

Loading the Trained Model

# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("trained_gpt2_model")
tokenizer = GPT2Tokenizer.from_pretrained("trained_gpt2_model")
model.eval()        

Creating a Function to Ask Questions

We will create a function that takes a question as input and generates an answer using our trained model.

def ask_question(question):
    input_text = f"Question: {question} Answer:"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long)

    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=50,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to avoid a warning
        )

    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer.split("Answer:")[-1].strip()  # Return only the text after "Answer:"
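The sampling flags (do_sample, top_k, top_p, temperature) trade determinism for variety, so the same question can yield different answers on each call. If you need reproducible output while testing, a greedy variant of the generate call (my own sketch, not part of the original) would be:

with torch.no_grad():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=50,
        do_sample=False,  # greedy decoding: always picks the most likely next token
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id
    )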

Step 1: Creating the Files

Create two Python files in your project directory: train_gpt2.py and ask_gpt2.py.

File 1: train_gpt2.py

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader

# Define a simple dataset for training
class SimpleDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=max_length)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

# Load pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Prepare question-answer data
qa_data = [
    ("What is AstroSynth?", "AstroSynth is an innovative program designed to synthesize oils from celestial materials."),
    ("How does AstroSynth work?", "The AstroSynth program utilizes advanced technologies to extract and synthesize oils from meteorites."),
    # Add more Q&A pairs...
]

# Format each pair into a single training string that matches the
# "Question: ... Answer: ..." prompt used at inference time
texts = [f"Question: {q} Answer: {a}" for q, a in qa_data]

dataset = SimpleDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Set up the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Set model to training mode
model.train()

# Training loop (mock training)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]

    # Zero the gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
    loss = outputs.loss

    # Backward pass
    loss.backward()

    # Update weights
    optimizer.step()

    print(f"Loss: {loss.item()}")  # Print loss for monitoring

# Save the trained model
model.save_pretrained("trained_gpt2_model")
tokenizer.save_pretrained("trained_gpt2_model")  # Save the tokenizer as well        

File 2: ask_gpt2.py

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("trained_gpt2_model")
tokenizer = GPT2Tokenizer.from_pretrained("trained_gpt2_model")
model.eval()

# Function to ask a question and get an answer
# Function to ask a question and get an answer
def ask_question(question):
    input_text = f"Question: {question} Answer:"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long)

    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=50,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to avoid a warning
        )

    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer.split("Answer:")[-1].strip()  # Return only the text after "Answer:"

# Example questions
print(ask_question("What is the capital of France?"))
print(ask_question("How does AstroSynth use artificial intelligence?"))        

Step 2: Running the Code

  1. Train the Model:

  • Open your terminal or command prompt.
  • Navigate to the directory where your files are located.
  • Run the training script:

python train_gpt2.py        

  2. Ask Questions:

  • After the training is complete, run the question-asking script:

python ask_gpt2.py        

Example Questions

You can now ask questions and get answers:

print(ask_question("What is the capital of France?"))
print(ask_question("How does AstroSynth use artificial intelligence?"))        

Conclusion

In this tutorial, we demonstrated how to train a GPT-2 model for a specific task: answering questions based on a custom dataset. We also covered how to load the trained model and make predictions. With this foundation, you can further enhance your model by adding more data, fine-tuning hyperparameters, or even experimenting with different architectures.

Feel free to explore and expand on this work. The possibilities are endless!


