EP 5: Language Modeling | Paper 1: A Neural Probabilistic Language Model

Continued from: Paper 1: A Neural Probabilistic Language Model

Hello Readers,

What is a Language Model?


Language modeling is a fundamental task in natural language processing (NLP) that involves predicting the next word in a sequence of words given the context of the preceding words. It's a crucial component for various applications such as speech recognition, machine translation, and text generation.
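To make "predicting the next word given the preceding words" a little more concrete, here is a standard way to write it down using the chain rule of probability. The symbols $w_t$ (the word at position $t$) and $T$ (the length of the sequence) are notation I am assuming for this sketch.

```latex
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

For a three-word sequence such as "how are you", this reads as P(how) × P(are | how) × P(you | how, are). A language model is anything that gives us an estimate of each of those conditional probabilities.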

Here's a simple example using PyTorch to build a basic language model. We'll create a character-level language model that predicts the next character in a sequence. The same idea can then be extended to a word-level language model, and so on.

Problem Statement: Given the text

text = "hello, how are you doing today?"

Build a character-level model to predict the next character. Use the above text to train the model.

Test scenario:
start_text = "hello, how"
generated_text = "hello, how are you doing today?"        

Note: If you are new to PyTorch, do not worry if you do not understand all of the code. I will explain every line once we get to the technical implementation of the concepts. Stay tuned for that.

Complete PyTorch code: Google Colab source.

Ref: PyTorch Course

import torch
import torch.nn as nn
import torch.optim as optim

# Define the training data
text = "hello, how are you doing today?"

# Create a mapping from characters to indices
chars = sorted(list(set(text)))
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for idx, char in enumerate(chars)}

# Convert the text into a sequence of indices
data = [char_to_idx[char] for char in text]

# Define the model
class CharLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        output, _ = self.rnn(x)
        output = self.fc(output)
        return output

# Instantiate the model
vocab_size = len(chars)
embedding_dim = 10
hidden_dim = 20
model = CharLanguageModel(vocab_size, embedding_dim, hidden_dim)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Convert data into PyTorch tensors
data = torch.tensor(data).view(1, -1)

# Training loop
epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()          # Clear gradients left over from the previous step
    output = model(data[:, :-1])   # Exclude the last character from the input
    target = data[:, 1:]           # Exclude the first character from the target
    loss = criterion(output.view(-1, vocab_size), target.view(-1))
    loss.backward()                # Backpropagate the loss
    optimizer.step()               # Update the model parameters

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

# Generate text using the trained model
def generate_text(model, seed_text, length=50):
    with torch.no_grad():
        input_seq = [char_to_idx[char] for char in seed_text]
        input_seq = torch.tensor(input_seq).view(1, -1)
        generated_text = seed_text

        for _ in range(length):
            output = model(input_seq)
            predicted_idx = torch.argmax(output[:, -1, :]).item()
            generated_text += idx_to_char[predicted_idx]
            input_seq = torch.cat([input_seq[:, 1:], torch.tensor([[predicted_idx]])], dim=1)

    return generated_text

# Run generation from a seed string
seed_text = "hello, how"
print('--^--' * 20)
generated_text = generate_text(model, seed_text, length=21)  # After the 21st character, the model starts generating garbage.
print("Generated Text:", generated_text)
print('--_--' * 20)
generated_text = generate_text(model, seed_text, length=40)
print("Generated Text:", generated_text)
print("You can see some garbage output characters or repetitions")


Epoch [10/100], Loss: 2.4456
Epoch [20/100], Loss: 1.8544
Epoch [30/100], Loss: 1.1256
Epoch [40/100], Loss: 0.5862
Epoch [50/100], Loss: 0.2965
Epoch [60/100], Loss: 0.1504
Epoch [70/100], Loss: 0.0879
Epoch [80/100], Loss: 0.0583
Epoch [90/100], Loss: 0.0424
Epoch [100/100], Loss: 0.0328
Generated Text: hello, how are you doing today?
Generated Text: hello, how are you doing today??oday?u doing today
You can see some garbage output characters or repetitions        
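To see what the training loop is actually learning from, here is a small sketch in plain Python (no PyTorch needed) that rebuilds the same vocabulary and prints the input/target pairs. The names `inputs`, `targets` and `pairs` are mine, introduced only for this illustration.

```python
# Same text and vocabulary construction as in the code above.
text = "hello, how are you doing today?"
chars = sorted(set(text))
char_to_idx = {char: idx for idx, char in enumerate(chars)}
data = [char_to_idx[char] for char in text]

# Inputs drop the last character; targets drop the first one.
inputs = data[:-1]
targets = data[1:]

# Each input character is paired with the character that follows it.
pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]

print(len(chars))   # vocabulary size: 17
print(len(inputs))  # number of training positions: 30
print(pairs[:5])    # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o'), ('o', ',')]

# The seed has 10 characters and the full text has 31,
# so 21 generated characters reach the end of the training text.
seed_text = "hello, how"
print(len(text) - len(seed_text))  # 21
```

This also suggests why the output degrades after 21 characters: the final "?" only ever appears as a target and never as an input, so the model has seen no example of what should follow it.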


So far, we have understood the following 4 concepts:

  1. Language Model (this post).
  2. Curse of Dimensionality.
  3. Random Variables.
  4. Joint probability distribution.

Using these four concepts, we can now understand what the authors say:

“A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100,000, there are potentially $100000^{10} - 1 = 10^{50} - 1$ free parameters. When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties. For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in hamming distance.”
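Two pieces of this paragraph are easy to check with a few lines of Python: the parameter count and the Hamming distance. The sketch below is only an illustration. The helper `hamming_distance` and the two example sentences are my own choices for this post.

```python
vocab_size = 100_000   # size of the vocabulary V
num_words = 10         # number of consecutive words being modeled

# One probability for every possible 10-word sequence,
# minus one because all the probabilities must sum to 1.
free_parameters = vocab_size ** num_words - 1
print(free_parameters == 10 ** 50 - 1)  # True


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Two illustrative sentences with a similar meaning but almost no shared words.
s1 = "the cat is walking in the bedroom".split()
s2 = "a dog was running in a room".split()

print(hamming_distance(s1, s2), "of", len(s1))  # 6 of 7
```

The two sentences mean similar things, yet they differ in 6 of 7 positions. When words are treated as unrelated discrete symbols, they are almost as far apart as two sentences can be.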

You can now put these concepts together and try to explain the paragraph above. Please leave a comment.

The post continues here: https://mathx.substack.com/p/ep-5-language-modeling-paper-1-a

We have both math-based and plain-English explanations of the concepts, and much more.

Thank you for your time.


