Building a Simple Language Model With PyTorch
From chatbots to content creation, language models have become indispensable in the field of Artificial Intelligence. And even though building one from scratch can seem daunting, you can build a simple AI language model with the help of Python and a few libraries. This tutorial will walk you through the process step by step, from setting up the environment to generating text. We will be using PyTorch, so basic familiarity with the library is recommended in order to follow along.
Step 1: Setting Up your Environment
You can use any IDE of your choice; just make sure you have Python installed on your system. There are a few other libraries required to set up your environment.
pip install numpy pandas torch torchvision torchaudio
You will also need to import the necessary libraries to finish setting up.
import numpy as np
import pandas as pd
Step 2: Data Collection
You can use a variety of publicly available text datasets for this process, depending on your requirements. I will be using the text of "Pride and Prejudice" from Project Gutenberg for demonstration purposes.
import requests
url = "https://www.gutenberg.org/files/1342/1342-0.txt"
response = requests.get(url)
text = response.text
with open("pride_and_prejudice.txt", "w") as file:
file.write(text)
Step 3: Data Preprocessing
For any machine learning model, this is the most crucial step: it gets the data into a form the model can be trained on. Data preprocessing for a language model involves methods such as tokenization, lemmatization, and stemming.
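This tutorial sticks to simple word-level tokenization, but for reference, lemmatization and stemming can be done with a library such as NLTK. A minimal, optional sketch (NLTK is not used in the rest of the tutorial):

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("wordnet", quiet=True)  # corpus data needed by the lemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
print(lemmatizer.lemmatize("running", pos="v"))  # "run"
print(stemmer.stem("studies"))                   # "studi"

With that aside, here is the preprocessing code used in this tutorial: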
import re
from collections import Counter
import torch
from torch.nn.utils.rnn import pad_sequence
# Read the text
with open("pride_and_prejudice.txt", "r") as file:
text = file.read()
# Clean the text
text = re.sub(r'[^A-Za-z0-9 ]+', '', text).lower()
# Tokenize the text
words = text.split()
word_counts = Counter(words)
vocab = sorted(word_counts, key=word_counts.get, reverse=True)
vocab_to_int = {word: i+1 for i, word in enumerate(vocab)}
# Create input sequences
sequences = []
for i in range(1, len(words)):
    seq = words[max(0, i-10):i+1]
    sequences.append([vocab_to_int[word] for word in seq])
# Pad sequences
sequences = [torch.tensor(seq) for seq in sequences]
padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=0)
First, we use a regular expression to remove all punctuation and other non-alphanumeric characters from the text and convert it to lowercase. After that, the entire text is tokenized into words, and a vocabulary list is created by sorting the words in descending order of frequency. Finally, we create a dictionary, vocab_to_int, mapping each word to a unique integer index starting from 1 (index 0 is reserved for padding).
Creating input sequences is the final part of preprocessing: sliding windows of up to 11 words are converted into integer sequences using the vocab_to_int dictionary, and padding them to a consistent length means the model can process and learn from variable-length input data.
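As a quick sanity check (a minimal sketch, assuming the preprocessing code above has been run; the exact number of sequences depends on the downloaded text), you can inspect the result:

print(padded_sequences.shape)   # e.g. torch.Size([number_of_words - 1, 11])
print(padded_sequences[0])      # first sequence, right-padded with zeros
print(vocab_to_int["the"])      # a very frequent word gets a small integer index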
Step 4: Creating the Model
We are using a simple neural network with an embedding layer, an LSTM layer, and a fully connected (linear) output layer. The architecture of our language model is provided below:
import torch.nn as nn
import torch.nn.functional as F
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super(LanguageModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x, (hidden, cell) = self.lstm(x)
        x = self.fc(x[:, -1, :])
        return x
vocab_size = len(vocab_to_int) + 1
embed_dim = 100
hidden_dim = 150
model = LanguageModel(vocab_size, embed_dim, hidden_dim)
print(model)
Here, vocab_size, embed_dim, and hidden_dim are the hyperparameters that define the vocabulary size, embedding dimension, and hidden dimension of the model, respectively. You can play around with these numbers to see how they impact your results.
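For example, you can see how these choices affect the model's size (a quick sketch using standard PyTorch calls):

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")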
Step 5: Training the Model
from torch.utils.data import DataLoader, TensorDataset
# Create predictors and labels
xs = padded_sequences[:, :-1]
labels = padded_sequences[:, -1]
dataset = TensorDataset(xs, labels)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
# Training the model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
model.train()
epochs = 10
for epoch in range(epochs):
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
Step 6: Generating the Text
Now that we have trained the model, let's see what the generated text looks like.
def generate_text(seed_text, next_words, max_len):
    model.eval()
    words = seed_text.split()
    # Hidden and cell states have shape (num_layers, batch_size, hidden_dim)
    state_h, state_c = torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim)
    for _ in range(next_words):
        # Use the last (max_len - 1) words as context; unknown words fall back to the padding index 0
        seq = torch.tensor([[vocab_to_int.get(w, 0) for w in words[-(max_len - 1):]]])
        with torch.no_grad():
            output, (state_h, state_c) = model.lstm(model.embedding(seq), (state_h, state_c))
            last_word_logits = model.fc(output[0, -1])
            p = F.softmax(last_word_logits, dim=0).detach().cpu().numpy()
        # Exclude the padding index from sampling and renormalize
        p = p.astype("float64")
        p[0] = 0.0
        p /= p.sum()
        word_index = np.random.choice(len(p), p=p)
        words.append(vocab[word_index - 1])  # indices start at 1, so shift back into the vocab list
    return ' '.join(words)
print(generate_text("your seed text here", 50, max_len=11))
With this basic model in place, you can start experimenting with more complex architectures, a variety of datasets, and different preprocessing techniques to improve your language model's performance. You can also adjust the hyperparameters and learning rate and compare the results. For more advanced tutorials, stay tuned for my upcoming guides.
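As one possible direction to experiment with (a hedged sketch, not part of the tutorial above; the class name is hypothetical), a deeper variant could stack two LSTM layers with dropout while keeping the same interface as LanguageModel:

import torch.nn as nn

class StackedLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)            # ignore the final hidden/cell states
        return self.fc(x[:, -1, :])    # predict the next word from the last time step

# Drop-in replacement for the model above:
# model = StackedLanguageModel(vocab_size, embed_dim, hidden_dim)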