Building a Custom Question-Answering System with GPT-2
Author: Rakesh Sheshadri
No GPU required, a simple CPU is all you need!
Train with a few thousand records!
Training on a CPU: Fine-tuning a GPT-2 model can run comfortably on a CPU, especially if you are working with a small dataset. The training loop is straightforward and does not require extensive computational resources, so for simple applications and smaller datasets a CPU can handle the training without significant delays.
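If you want to confirm which device PyTorch will see before you start, here is a minimal check (a quick sketch, assuming torch is already installed):
import torch

# The examples in this article run on the CPU; this just reports what is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training will run on: {device}")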
In this article, we will explore how to create a simple question-answering system using the GPT-2 model from Hugging Face’s Transformers library. We will cover two main parts: training a GPT-2 model on a custom dataset and then using that trained model to answer questions. This will give you a solid foundation for building more complex natural language processing applications.
Part 1: Training the GPT-2 Model
Setting Up the Environment
Before we start coding, make sure you have the necessary libraries installed. You can do this using pip:
pip install torch transformers
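If you want to verify the installation before moving on, printing the library versions is a quick check:
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"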
Defining the Dataset
We begin by creating a simple dataset containing questions and their corresponding answers. This dataset will be used to fine-tune our GPT-2 model. Each question-answer pair will be joined into a single training string of the form "Question: ... Answer: ..." so that training matches the prompt format we use later for inference.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader
# Define a simple dataset for training
class SimpleDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=max_length)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
Loading Pre-trained Model and Tokenizer
Next, we load the pre-trained GPT-2 model and tokenizer. This gives us a solid starting point for our training.
# Load pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
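Before wiring everything together, it can help to see what the tokenizer produces for our prompt format. This is purely an illustrative check, not part of the training script:
# Illustrative check of the tokenizer output
sample = tokenizer("Question: What is AstroSynth? Answer:", return_tensors="pt")
print(sample["input_ids"])       # token IDs for the prompt
print(sample["attention_mask"])  # 1 for every real token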
Preparing the Data Loader
We’ll create a list of question-answer pairs, join each pair into a single training string, and prepare the DataLoader for batching our data.
qa_data = [
    ("What is AstroSynth?", "AstroSynth is an innovative program designed to synthesize oils from celestial materials."),
    # Add more Q&A pairs...
]
# Join each pair into a single training string that matches the inference prompt format
texts = [f"Question: {q} Answer: {a}" for q, a in qa_data]
dataset = SimpleDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
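If you want to make sure batching works as expected, an optional peek at one batch looks like this:
# Optional: inspect the shapes of one batch from the DataLoader
batch = next(iter(dataloader))
print(batch["input_ids"].shape)       # (batch_size, sequence_length)
print(batch["attention_mask"].shape)  # same shape as input_ids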
Training the Model
Now we can set up the training loop. In this minimal example we make a single pass over the DataLoader, perform a forward pass, compute the loss, and update the model weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Set model to training mode
model.train()
# Training loop (single pass over the data)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    print(f"Loss: {loss.item()}")  # Print loss for monitoring
# Save the trained model
model.save_pretrained("trained_gpt2_model")
tokenizer.save_pretrained("trained_gpt2_model") # Save the tokenizer as well
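The loop above makes only a single pass over the data. In practice you will usually want several passes; a minimal sketch of that extension (the epoch count of 3 is just an illustrative value) is shown below:
# Sketch: repeat the same training loop for multiple epochs
num_epochs = 3  # illustrative value, tune for your dataset
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1} - last batch loss: {loss.item()}")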
Part 2: Using the Trained Model for Question Answering
After training our model, we can load it and use it to answer questions.
Loading the Trained Model
# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("trained_gpt2_model")
tokenizer = GPT2Tokenizer.from_pretrained("trained_gpt2_model")
model.eval()
Creating a Function to Ask Questions
We will create a function that takes a question as input and generates an answer using our trained model.
def ask_question(question):
    input_text = f"Question: {question} Answer:"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=50,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id  # Avoid the missing pad_token_id warning
        )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer.split("Answer:")[-1].strip()  # Return the generated answer
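With the function in place, asking a question is a single call, for example:
# Example usage
print(ask_question("What is AstroSynth?"))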
Step 2: Creating the Files
Create two Python files in your project directory: train_gpt2.py and ask_gpt2.py.
File 1: train_gpt2.py
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader
# Define a simple dataset for training
class SimpleDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=max_length)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
# Load pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Prepare question-answer data
qa_data = [
    ("What is AstroSynth?", "AstroSynth is an innovative program designed to synthesize oils from celestial materials."),
    ("How does AstroSynth work?", "The AstroSynth program utilizes advanced technologies to extract and synthesize oils from meteorites."),
    # Add more Q&A pairs...
]
# Join each pair into a single training string that matches the inference prompt format
texts = [f"Question: {q} Answer: {a}" for q, a in qa_data]
dataset = SimpleDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
# Set up the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Set model to training mode
model.train()
# Training loop (single pass over the data)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Zero the gradients
    optimizer.zero_grad()
    # Forward pass
    outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
    loss = outputs.loss
    # Backward pass
    loss.backward()
    # Update weights
    optimizer.step()
    print(f"Loss: {loss.item()}")  # Print loss for monitoring
# Save the trained model
model.save_pretrained("trained_gpt2_model")
tokenizer.save_pretrained("trained_gpt2_model") # Save the tokenizer as well
File 2: ask_gpt2.py
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load the trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("trained_gpt2_model")
tokenizer = GPT2Tokenizer.from_pretrained("trained_gpt2_model")
model.eval()
# Function to ask a question and get an answer
def ask_question(question):
    input_text = f"Question: {question} Answer:"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=50,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id  # Avoid the missing pad_token_id warning
        )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer.split("Answer:")[-1].strip()  # Return the generated answer
# Example questions
print(ask_question("What is the capital of France?"))
print(ask_question("How does AstroSynth use artificial intelligence?"))
Step 3: Running the Code
Train the Model:
python train_gpt2.py
Ask Questions:
python ask_gpt2.py
Example Questions
You can now ask questions and get answers:
print(ask_question("What is the capital of France?"))
print(ask_question("How does AstroSynth use artificial intelligence?"))
Conclusion
In this tutorial, we demonstrated how to train a GPT-2 model for a specific task: answering questions based on a custom dataset. We also covered how to load the trained model and make predictions. With this foundation, you can further enhance your model by adding more data, fine-tuning hyperparameters, or even experimenting with different architectures.
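For example, one simple way to gauge how well the model fits your data is to compute perplexity over a held-out set of question-answer strings. The sketch below assumes an eval_dataloader built the same way as the training DataLoader; it is an approximation that averages per-batch losses rather than weighting by token count:
import math
import torch

# Sketch: approximate perplexity over a held-out DataLoader (eval_dataloader is assumed to exist)
model.eval()
total_loss, num_batches = 0.0, 0
with torch.no_grad():
    for batch in eval_dataloader:
        outputs = model(batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["input_ids"])
        total_loss += outputs.loss.item()
        num_batches += 1
perplexity = math.exp(total_loss / num_batches)
print(f"Perplexity: {perplexity:.2f}")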
Feel free to explore and expand on this work. The possibilities are endless!