Crafting Coherent and Contextually Relevant Text with GPT-2: A Technical Exploration
Designed by Ddhruv Arora


Introduction

In the rapidly evolving field of natural language processing (NLP), the ability to generate coherent and contextually relevant text has become increasingly important. From chatbots to content creation, the applications of text generation models are vast and varied. Among the many models that have emerged, GPT-2, developed by OpenAI, stands out for its remarkable ability to generate human-like text across a wide range of topics.

GPT-2, with its robust architecture and vast training data, is capable of producing text that not only flows naturally but also aligns with the context provided by the user. However, generating high-quality text that meets specific requirements often necessitates a deeper understanding of the model's inner workings and the application of various techniques during text generation.

In this blog, we will dive into the technical aspects of crafting coherent and contextually relevant text using GPT-2. We will explore the architecture of GPT-2, the process of fine-tuning it on custom datasets, and the different strategies that can be employed to enhance the quality of the generated text. By the end of this exploration, you'll have a comprehensive understanding of how to harness the power of GPT-2 for your own text generation needs.

Understanding GPT-2

Overview of the GPT-2 Architecture

GPT-2, short for Generative Pre-trained Transformer 2, is part of a family of models known as Transformers, which have revolutionized the field of NLP. At its core, GPT-2 is built upon the Transformer architecture, a model introduced in the seminal paper "Attention is All You Need" by Vaswani et al. Unlike traditional recurrent neural networks (RNNs) or long short-term memory (LSTM) networks, Transformers do not require sequential data processing, allowing for faster training and more effective parallelization.

The architecture of GPT-2 consists of multiple layers of Transformer blocks, each containing two key components: multi-head self-attention mechanisms and position-wise feedforward neural networks. These components work together to model the relationships between words in a sentence, regardless of their position, enabling the model to capture long-range dependencies in the text.


[Image: Architecture of GPT-2]

Key aspects of the GPT-2 architecture include:

  • Self-Attention Mechanism: The self-attention mechanism allows GPT-2 to weigh the importance of different words in a sentence relative to one another. This is crucial for capturing context, as it helps the model understand which parts of the input are most relevant for generating the next word (a minimal code sketch of this computation follows this list).
  • Multi-Head Attention: By splitting the attention mechanism into multiple "heads", GPT-2 can focus on different parts of the sentence simultaneously. This parallel processing enhances the model's ability to understand complex relationships and generate more nuanced text.
  • Layer Normalization: GPT-2 applies layer normalization at the input of each attention and feedforward sub-block (with an additional normalization after the final block), helping to stabilize the training process and improve the model's convergence.
  • Position Embeddings: Since Transformers do not inherently understand the order of words, GPT-2 uses position embeddings to encode the position of each word in the sequence. This allows the model to maintain an understanding of word order, which is essential for producing coherent text.
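
To make the self-attention computation above concrete, here is a minimal, single-head sketch in NumPy. The random projection matrices, toy shapes, and causal mask are illustrative assumptions; GPT-2 itself uses learned weights, multiple heads, and much larger dimensions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Project the inputs into queries, keys, and values
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # Score every token against every other token, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: a token may only attend to itself and earlier tokens
    scores += np.triu(np.full_like(scores, -1e9), k=1)
    weights = softmax(scores)   # attention weights per token pair
    return weights @ V          # context-aware representation of each token

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 token embeddings of dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)    # -> (4, 8)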

Training Process and Large-Scale Datasets

The effectiveness of GPT-2 is largely due to its pre-training on vast amounts of text data. During pre-training, the model is exposed to a diverse range of text from the internet, allowing it to learn a broad spectrum of language patterns, structures, and general knowledge.

The training process involves predicting the next word in a sequence, given the preceding words. This is known as language modeling, and it enables GPT-2 to generate text by sampling from the probability distribution of possible next words. The model's large-scale pre-training endows it with the ability to generate text on a wide array of topics, even without fine-tuning.
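
As a quick illustration of this next-word objective, the snippet below (a minimal sketch using the Hugging Face transformers library) inspects the probability distribution GPT-2 assigns to the next token for a given prefix; which tokens come out on top will vary with the prompt.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of artificial intelligence is", return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits              # shape: (1, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token
top_probs, top_ids = next_token_probs.topk(5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")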

However, while pre-training provides a strong foundation, fine-tuning GPT-2 on a domain-specific dataset is often necessary to generate text that aligns closely with specific requirements. Fine-tuning allows the model to adapt to the nuances of specialized content, making it more relevant and accurate for targeted applications.

Key Concepts: Transformers, Attention Mechanism, and Language Modeling

  • Transformers: Transformers are the backbone of GPT-2's architecture. They are designed to handle sequential data without relying on recurrence, making them more efficient than traditional models. The self-attention mechanism within Transformers allows GPT-2 to weigh the importance of different words, capturing context and relationships in the text.
  • Attention Mechanism: Attention is a technique that enables the model to focus on specific parts of the input sequence when generating text. In GPT-2, self-attention allows the model to consider all words in the input when predicting the next word, ensuring that the generated text is contextually relevant and coherent.
  • Language Modeling: Language modeling is the process of predicting the next word in a sequence, given the preceding words. GPT-2 is pre-trained as a language model, allowing it to generate text by sampling from the learned probability distribution. This process enables the model to produce text that flows naturally and aligns with the given context.

Understanding these key concepts is essential for effectively using GPT-2. By combining the power of Transformers, attention mechanisms, and language modeling, GPT-2 can generate text that is not only coherent but also contextually relevant, making it a valuable tool for a wide range of applications.

Techniques for Generating High-Quality Text

Generating high-quality text with GPT-2 involves not only understanding the model's architecture but also applying various techniques during the text generation process. These techniques influence how the model selects the next word, ultimately affecting the coherence, fluency, and relevance of the generated output. Below, we'll explore some key techniques: Greedy Search, Beam Search, N-gram Penalty, and Sampling Strategies, along with code examples.

Greedy Search

Greedy search is the simplest approach for generating text. At each step, the model selects the word with the highest probability as the next word. While this approach is straightforward, it often leads to repetitive or overly simplistic text, as it doesn't explore alternative words that might lead to more diverse or interesting sentences.


[Image: Greedy search example]

Starting from the word "The", the algorithm greedily chooses the next word with the highest probability, "nice", and so on, so that the final generated word sequence is ("The", "nice", "woman"), with an overall probability of 0.5 × 0.4 = 0.2.

In the following, we will generate word sequences with GPT-2 from the context "The future of artificial intelligence is". Let's see how greedy search can be used in transformers:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode input prompt
input_text = "The future of artificial intelligence is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text using greedy search
greedy_output = model.generate(input_ids, max_length=50)

# Decode and print the generated text
output_text = tokenizer.decode(greedy_output[0], skip_special_tokens=True)
print("Greedy Search Output:\n", output_text)        

The Output in my case:

Greedy Search Output: The future of artificial intelligence is uncertain. "We're not sure what the future will look like," said Dr. Michael S. Schoenfeld, a professor of computer science at the University of California, Berkeley. "But we're not

We have generated our first text with GPT-2. The words following the context are reasonable, but the model quickly starts repeating itself (the output is cut off because max_length is set to 50). This is a very common problem in language generation, and it tends to be even more pronounced with greedy search. To mitigate it, we can use beam search.

Beam Search

Beam search improves upon greedy search by considering multiple potential sequences at each step. Instead of selecting just the highest probability word, beam search maintains a set of the most probable sequences (beams) and expands them in parallel. This approach allows for more exploration and often results in more coherent and diverse text. The trade-off is that beam search is more computationally intensive and can still suffer from repetition if not properly tuned.


[Image: Beam search example]

At time step 1, besides the most likely hypothesis ("The", "nice"), beam search also keeps track of the second most likely one ("The", "dog"). At time step 2, beam search finds that the sequence ("The", "dog", "has") has a probability of 0.36, higher than the 0.2 of ("The", "nice", "woman"). It has found the more likely word sequence!

Beam search will always find an output sequence with probability at least as high as greedy search, but it is not guaranteed to find the most likely output overall.
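
To see the difference concretely, here is a small self-contained sketch of greedy search versus a width-2 beam search over the toy probability tree from the figures above (the transition probabilities are invented to match the illustration, not taken from GPT-2):

# Toy next-word probabilities matching the illustration
probs = {
    "The":  {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog":  {"has": 0.9, "runs": 0.05, "and": 0.05},
}

def greedy(start, steps=2):
    seq, p = [start], 1.0
    for _ in range(steps):
        word, q = max(probs[seq[-1]].items(), key=lambda kv: kv[1])  # pick the single best word
        seq.append(word)
        p *= q
    return seq, p

def beam_search(start, steps=2, width=2):
    beams = [([start], 1.0)]
    for _ in range(steps):
        # Expand every beam with every possible next word, then keep the best `width` sequences
        candidates = [(seq + [w], p * q)
                      for seq, p in beams
                      for w, q in probs[seq[-1]].items()]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]

g_seq, g_p = greedy("The")
b_seq, b_p = beam_search("The")
print("Greedy:", g_seq, round(g_p, 2))   # ['The', 'nice', 'woman'] 0.2
print("Beam:  ", b_seq, round(b_p, 2))   # ['The', 'dog', 'has'] 0.36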

In the following, we will generate word sequences with GPT-2 from the context "The future of artificial intelligence is". Let's see how beam search can be used in transformers:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode input prompt
input_text = "The future of artificial intelligence is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text using beam search with beam width of 2
beam_output = model.generate(input_ids, max_length=50, num_beams=2, early_stopping=True)

# Decode and print the generated text
output_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)
print("Beam Search Output:\n", output_text)        

The Output in my case:

Beam Search Output: The future of artificial intelligence is uncertain, but it's clear that it's going to change the way we think about the future of computing. In the next few years, we'll be able to build machines that are smarter than humans, that

The result is arguably more fluent, but the output still includes repetitions of the same word sequences. One of the available remedies is to introduce an n-gram penalty.

N-gram Penalty

One common issue in text generation is the model repeating the same phrases or words, especially when using greedy or beam search. N-gram penalty is a technique used to penalize the generation of repeated n-grams (sequences of n words) within the generated text. By applying a penalty to repeated sequences, the model is encouraged to produce more varied and creative outputs.

In the following, we will generate word sequences with GPT-2 from the context "The future of artificial intelligence is". Let's see how the n-gram penalty works:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode input prompt
input_text = "The future of artificial intelligence is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text using N-gram Penalty
beam_output = model.generate(input_ids, max_length=50, num_beams=2, no_repeat_ngram_size=2, early_stopping=True)

# Decode and print the generated text
output_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)
print("Beam Search Output:\n", output_text)        

The Output in my case:

Beam Search Output with Penalty: The future of artificial intelligence is uncertain, but it's clear that the future is bright. In the meantime, we need to keep an eye out for the next big thing. We're going to have to figure out how to make it work

We can see that the repetition no longer appears. Nevertheless, n-gram penalties have to be used with care: an article generated about the city of New York should not use a 2-gram penalty, or the name of the city would appear only once in the whole text!

This visualization shows how the N-gram penalty reduces the frequency of repeated phrases in the generated text. We can generate text with and without the penalty and compare the frequency of n-grams.


[Image: Visualizing n-gram repetitions]

The chart shows a clear reduction in repeated phrases, confirming the effectiveness of the n-gram penalty in generating more diverse text.
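
If you want to reproduce a comparison like this yourself, here is a minimal sketch using only the standard library; the two text variables are placeholders for outputs generated with and without no_repeat_ngram_size:

from collections import Counter

def repeated_ngrams(text, n=2):
    # Count n-grams that occur more than once in the generated text
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {ng: c for ng, c in Counter(ngrams).items() if c > 1}

without_penalty = "..."  # paste the plain beam search output here
with_penalty = "..."     # paste the output generated with no_repeat_ngram_size=2

print("Repeated bigrams without penalty:", repeated_ngrams(without_penalty))
print("Repeated bigrams with penalty:", repeated_ngrams(with_penalty))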

Sampling Strategies

But first, why is sampling required at all? In open-ended generation, a few reasons have been put forward for why beam search might not be the best option:

  • Beam search can work very well in tasks where the length of the desired generation is more or less predictable as in machine translation or summarization - see Murray et al. (2018) and Yang et al. (2018). But this is not the case for open-ended generation where the desired output length can vary greatly, e.g. dialog and story generation.
  • We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with n-gram or other penalties in story generation since finding a good trade-off between inhibiting repetition and repeating cycles of identical n-grams requires a lot of fine tuning.
  • As argued in Ari Holtzman et al. (2019), high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want generated text to surprise us and not be boring or predictable. The authors show this nicely by plotting the probability a model assigns to human-written text versus the probability of beam search output.


[Image: Beam search vs. human text]

So, let's introduce some Randomness using Sampling Strategies:


Sampling involves introducing randomness into the text generation process, which can lead to more diverse and interesting outputs. Popular strategies include temperature scaling, top-k sampling, and top-p (nucleus) sampling.

  • Temperature Sampling: The temperature parameter controls the randomness of predictions by scaling the logits before applying the softmax function. Lower temperatures make the model more confident (less random), while higher temperatures increase diversity by allowing the model to explore less likely words.

The code to experiment with the temperature setting is as follows:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode input prompt
input_text = "The future of artificial intelligence is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text using beam search (beam width 2) with sampling and a custom temperature
beam_output = model.generate(input_ids, max_length=50, num_beams=2, no_repeat_ngram_size=2, early_stopping=True, temperature=0.1, do_sample=True)

# Decode and print the generated text
output_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)
print("Temp Var Output:\n", output_text)        

Note: make sure to add do_sample=True, otherwise the temperature setting has no effect.

The Output in my case:

With temp=0.1

Temp Var Output: The future of artificial intelligence is uncertain, but it's clear that the future is bright. In the meantime, we need to keep an eye out for the next big thing. We're going to have to figure out how to make it work

With temp=90.0

Temp Var Output: The future of artificial intelligence is an interesting matter – for one, many advanced systems in society will evolve in this time when these intelligent agents work tirelessly to adapt to changing societal realities as they understand and accept a world-views we humans had developed during nearly

The difference is clearly visible: with a higher temperature, the text becomes more dynamic and random.
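
The effect is easy to see on a toy distribution: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. A minimal sketch with made-up logits (not taken from GPT-2):

import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.array(logits) / temperature    # temperature rescales the logits
    e = np.exp(scaled - scaled.max())          # subtract max for numerical stability
    return e / e.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate next words

for t in (0.1, 1.0, 5.0):
    print(f"T={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# A low temperature concentrates almost all probability on the top word,
# while a high temperature spreads it out, which is why the text becomes more random.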

  • Top-k Sampling: In top-k sampling, only the top k most probable next words are considered at each step, and the next word is chosen randomly from this subset. This approach balances randomness and control, allowing for more diverse outputs without completely sacrificing quality.

Let's see how Top-K can be used in the library by setting top_k=50:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode input prompt
input_text = "The future of artificial intelligence is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text using top-k sampling
top_k_output = model.generate(input_ids, max_length=50, top_k=50, do_sample=True)

# Decode and print the generated text
output_text = tokenizer.decode(top_k_output[0], skip_special_tokens=True)
print("Top-k Sampling Output:\n", output_text)

The Output in my case:

Top-k Sampling Output: The future of artificial intelligence is in its very early stages. But a computer is not yet the solution to our problem. It's too early to predict much from it and to decide what to work hard for. In the last few years, AI and

Not bad at all! The text is arguably the most human-sounding so far. One concern with top-k sampling, though, is that it does not dynamically adapt the number of words that are filtered from the next-word probability distribution. This can be problematic because some words might be sampled from a very sharp distribution whereas others come from a much flatter one. Limiting the sample pool to a fixed size K therefore risks producing gibberish for sharp distributions and limiting the model's creativity for flat ones. This intuition led Ari Holtzman et al. (2019) to create top-p, or nucleus, sampling.
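
The difference is easy to see on two hypothetical next-word distributions, one sharp and one flat. The sketch below filters each with a fixed top-k and with a top-p threshold; the probabilities are invented for illustration and are not GPT-2's actual values:

import numpy as np

def top_k_indices(probs, k):
    # Keep the k most probable candidates
    order = np.argsort(probs)[::-1]
    return sorted(order[:k].tolist())

def top_p_indices(probs, p):
    # Keep the smallest set of candidates whose cumulative probability reaches p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    return sorted(order[:cutoff].tolist())

sharp = np.array([0.85, 0.08, 0.04, 0.02, 0.01])  # one clearly dominant continuation
flat = np.array([0.22, 0.21, 0.20, 0.19, 0.18])   # many equally plausible continuations

for name, dist in (("sharp", sharp), ("flat", flat)):
    print(name, "top-k=3:", top_k_indices(dist, 3), "top-p=0.9:", top_p_indices(dist, 0.9))
# A fixed k=3 still admits low-probability words under the sharp distribution and
# cuts off plausible ones under the flat distribution, while top-p=0.9 adapts:
# it keeps 2 candidates in the sharp case and all 5 in the flat case.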

  • Top-p (Nucleus) Sampling: Top-p sampling, also known as nucleus sampling, is similar to top-k sampling but instead of limiting to a fixed number of words, it considers all words whose cumulative probability adds up to a threshold p. This allows for a dynamic number of choices, balancing diversity and coherence.

Let's implement Top-p sampling:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode input prompt
input_text = "The future of artificial intelligence is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text using top-p sampling
top_p_output = model.generate(input_ids, max_length=50, do_sample=True, top_p=0.9)

# Decode and print the generated text
output_text = tokenizer.decode(top_p_output[0], skip_special_tokens=True)
print("Top-p Sampling Output:\n", output_text)        

The Output in my case:

Top-p Sampling Output: The future of artificial intelligence is unclear, and it's unclear whether the future of machine learning will ever come to pass. That means a future where machines are smarter and more efficient, with more control over their performance, are probably not possible. But if

Great, that sounds like it could have been written by a human. Well, maybe not quite yet. Now, we can use all these techniques in combination with some experimentation to get the desired output.

Here is a quick chart, visualizing the effects of sampling strategies:

[Image: Visualizing the effects of sampling strategies]

Fine-Tuning GPT-2 on Custom Data

Fine-tuning GPT-2 on custom data is a crucial step when you want the model to generate text that is tailored to a specific domain or topic. By fine-tuning, you adapt the pre-trained GPT-2 model to better understand the nuances of your target domain, resulting in more relevant and accurate text generation.

Why Fine-Tuning is Necessary for Specialized Tasks

While GPT-2 is pre-trained on a large and diverse dataset, it might not perform optimally for specialized tasks or niche domains. For instance, if we want to generate text related to Arduino Boards, fine-tuning on a dataset rich in Arduino-specific content helps the model generate text that is more relevant, accurate, and contextually appropriate.

Fine-tuning narrows down the model's focus, allowing it to generate outputs that better align with the style, terminology, and context of a specific dataset.

Preparing Dataset

To fine-tune GPT-2, we need a dataset that closely matches the kind of text we want the model to generate. The dataset should be clean, well-structured, and representative of the domain you're targeting. For example, if you're working with Arduino-related content, your dataset could include tutorials, documentation, code snippets, and articles related to Arduino programming.

I have made a very simple dataset, containing a basic overview of Arduino and some information gathered from the Arduino Education website and Gemini AI.
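
As an illustration, one minimal way to assemble such a dataset is to concatenate your cleaned source files into a single plain-text training file; the file names below are hypothetical:

from pathlib import Path

# Hypothetical cleaned source files collected from tutorials, docs, and articles
source_files = ["arduino_overview.txt", "arduino_boards.txt", "arduino_projects.txt"]

with open("arduino_train.txt", "w", encoding="utf-8") as out:
    for name in source_files:
        out.write(Path(name).read_text(encoding="utf-8").strip() + "\n\n")  # blank line between documents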

Fine-Tuning GPT-2

Fine-tuning GPT-2 typically requires a GPU for faster training. The process involves using the Hugging Face transformers library, which provides a straightforward API for fine-tuning GPT-2.

The Code can be found here: Fine Tuning GPT 2

Explanation (a condensed code sketch of these steps follows the list):

  1. Load Pre-trained Model and Tokenizer: We start by loading the pre-trained GPT-2 model and its corresponding tokenizer from the Hugging Face model hub.
  2. Prepare Dataset for Fine-Tuning: The load_dataset function loads your custom dataset (e.g., Arduino-specific text) into a format suitable for training. The block_size parameter determines how much text is fed into the model at a time.
  3. Set Up Data Collator: The DataCollatorForLanguageModeling is used to batch and dynamically pad the input sequences during training. We set mlm to False because GPT-2 is not a masked language model.
  4. Configure Training Arguments: The TrainingArguments class allows you to configure various aspects of the training process, such as the number of epochs, batch size, and output directory.
  5. Initialize Trainer: The Trainer class handles the training loop, including the optimization and evaluation processes.
  6. Fine-Tune the Model: The trainer.train() method fine-tunes GPT-2 on your custom dataset. This process may take time depending on the size of the dataset and the computing resources available.
  7. Save the Fine-Tuned Model: After training, save the fine-tuned model and tokenizer for future use.
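
Since the full notebook is linked above, here is only a condensed sketch of those steps using the Hugging Face transformers API. The file path, hyperparameters, and the use of TextDataset are assumptions for illustration (newer transformers versions recommend the datasets library instead of TextDataset):

from transformers import (GPT2Tokenizer, GPT2LMHeadModel, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# 1. Load pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# 2. Prepare the dataset (a plain-text file with your Arduino content)
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="arduino_train.txt",  # hypothetical path
                            block_size=128)

# 3. Data collator for causal language modeling (mlm=False because GPT-2 is not a masked LM)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# 4. Training configuration
training_args = TrainingArguments(output_dir="./gpt2-arduino",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=4,
                                  save_steps=500)

# 5. Trainer handles the training loop
trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=train_dataset)

# 6. Fine-tune the model
trainer.train()

# 7. Save the fine-tuned model and tokenizer for later use
model.save_pretrained("./gpt2-arduino")
tokenizer.save_pretrained("./gpt2-arduino")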

Generating Text:

After fine-tuning, it's important to evaluate the model's performance. This can involve generating text from various prompts and manually assessing the relevance, fluency, and coherence of the output.

Here, to generate text, we combine the techniques used above to produce more human-like output:

prompt = "What is Arduino ?"

# Encode the input prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate text with all techniques combined
outputs = model.generate(
    input_ids,
    max_length=100,            # Maximum length of the generated text
    num_beams=5,               # Beam search width
    temperature=0.7,           # Temperature for controlling randomness
    top_k=50,                  # Top-k sampling
    top_p=0.9,                 # Top-p (nucleus) sampling
    no_repeat_ngram_size=2,    # n-gram penalty to prevent repetition
    early_stopping=True        # Stop when a beam hits an end token
)

# Decode and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)        

The Output in my case:

What is Arduino? Arduino is an open-source electronics platform based on easy-to-use hardware and software. It's designed to make electronics more accessible to artists, designers, hobbyists, and anyone interested in creating interactive projects. The beauty of Arduino lies in its ability to bridge the gap between software and hardware, enabling even those without a computer to create projects and projects that are easy to learn and use. In conclusion, Arduino is a versatile and accessible platform that has revolution

It looks good and reads as if written by a human. We have finally generated coherent and contextually relevant text. Congratulations!


Conclusion

In this exploration of crafting coherent and contextually relevant text with GPT-2, we dived into the intricacies of model architecture, key concepts, and various sampling techniques that influence text generation. By understanding and fine-tuning parameters such as temperature, beam width, and sampling strategies, we can steer GPT-2 to produce text that aligns with specific goals—be it creativity, coherence, or a balance of both.

Through the visualizations provided, we've seen how adjusting parameters like temperature can dramatically alter the output's diversity, predictability, and overall quality. These insights allow us to make informed decisions when deploying GPT-2 in real-world applications, whether for generating creative content, drafting technical documents, or developing conversational agents.

As AI continues to evolve, mastering these techniques will be crucial in harnessing the full potential of language models like GPT-2. By experimenting with different settings and understanding their effects, we can push the boundaries of what's possible in text generation, creating outputs that are not only coherent and contextually relevant but also tailored to meet the specific needs of diverse applications.

Ultimately, the key to effective text generation lies in the balance between randomness and control. Whether you're generating text for creative writing, technical documentation, or conversational AI, understanding these concepts will empower you to craft outputs that resonate with your audience and achieve your desired outcomes.

Thank You!!

Thank you for taking the time to explore the nuances of text generation with GPT-2. I hope this deep dive into the technical aspects enhances your understanding and inspires your future projects. Your feedback and thoughts are always welcome!

#AI #GPT2 #TextGeneration #DeepLearning #NLP

