EP 6: Integrated Realms: Dimensionality, Variables, Probability | Paper 1: A Neural Probabilistic Language Model.

In continuation of: Paper 1: A Neural Probabilistic Language Model

Hello Readers,

In this post, we will finally bring together the following concepts, which we have been learning over the last four posts:

  1. Curse of Dimensionality.
  2. Random Variables.
  3. Joint Probability Distribution.
  4. Language Models.


“A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100,000, there are potentially 100,000^10 − 1 = 10^50 − 1 free parameters. When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties. For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in hamming distance.”


Let's consider a small English vocabulary (word list) V of three words: {cat, dog, mouse}. Now, we want to find the number of possible combinations for a sequence of 2 consecutive words from this vocabulary.

So the number of possible combinations is V^2 (V to the power 2) where V is the size of the vocabulary. In this case, V^2 is equivalent to 3^2, which is 9.

Now, let's list all the possible combinations:

  1. cat cat
  2. cat dog
  3. cat mouse
  4. dog cat
  5. dog dog
  6. dog mouse
  7. mouse cat
  8. mouse dog
  9. mouse mouse

So, there are indeed 9 possible combinations for 2 consecutive words from this small English vocabulary.
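If you want to verify this count quickly, here is a small Python snippet (purely illustrative, not part of the original post) that enumerates the pairs:

from itertools import product

vocabulary = ['cat', 'dog', 'mouse']

# All ordered pairs of 2 consecutive words: V^2 = 3^2 = 9
pairs = list(product(vocabulary, repeat=2))
for i, (w1, w2) in enumerate(pairs, start=1):
    print(i, w1, w2)

print(len(pairs))  # 9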

In the original statement from the paper, imagine a vocabulary of size 100,000 (V = 100,000) and a sequence length of 10 words. Then V^10 (V to the power of 10) is the number of possible combinations for a sequence of 10 consecutive words from this much larger vocabulary.


The minus 1 in the expression V^10 − 1 comes from the fact that the joint distribution assigns a probability to each of the V^10 possible 10-word sequences, and those probabilities must sum to 1. Once you have chosen V^10 − 1 of the probabilities, the last one is fully determined by the sum-to-one constraint, so it is not a free parameter. In other words, for a sequence of 10 words w1, w2, …, w10 drawn from a vocabulary of 100,000 words, there are 100,000^10 possible sequences, and specifying their probabilities requires 100,000^10 − 1 free parameters.
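To get a feel for the scale, here is a tiny Python check using the numbers from the paper's example (the variable names are just for illustration):

V = 100_000   # vocabulary size
n = 10        # sequence length

combinations = V ** n            # number of possible 10-word sequences
free_params = combinations - 1   # probabilities must sum to 1

print(combinations == 10 ** 50)  # True, since (10^5)^10 = 10^50
print(free_params)               # 10^50 - 1, an astronomically large number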

You can now start to appreciate GPT-type models. The number of parameters in a machine learning model, such as GPT (Generative Pre-trained Transformer), is influenced both by the complexity of its architecture (we will look at different architectures later in this newsletter) and by the size of the vocabulary it is trained on.

Okay! But where does the curse of dimensionality come into the picture?

Let's first create a simple example using PyTorch to represent a vocabulary of 3 words in 2 dimensions. We'll use the torch library to create word embeddings and visualize them.

Attention: We will learn all of these programming concepts once we start implementing our own models. For now, understanding the concepts is what we should focus on!

Ref: PyTorch Course

import torch
import matplotlib.pyplot as plt

# Vocabulary
vocabulary = {'cat', 'dog', 'mouse'}

# Embeddings in 2 dimensions
embedding_dim = 2

# Create random embeddings for each word
word_embeddings = {
    'cat': torch.rand(embedding_dim),
    'dog': torch.rand(embedding_dim),
    'mouse': torch.rand(embedding_dim)
}

# Plot the embeddings
fig, ax = plt.subplots()

for word, embedding in word_embeddings.items():
    ax.scatter(embedding[0], embedding[1])
    ax.text(embedding[0], embedding[1], word)

plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Word Embeddings in 2D')
plt.grid(True)
plt.show()        
Complete PyTorch code: Google Colab source.


In this example:

  • We have a vocabulary of three words: 'cat', 'dog', and 'mouse'.
  • We represent each word with a 2-dimensional vector (embedding).
  • The code creates random vectors for each word (in a real-world scenario, these would be learned during training).
  • We then plot these vectors in a 2D space, with each point representing a word.

Results (2D): a scatter plot of the three word embeddings in two-dimensional space.

In 3D:

Ref: PyTorch Course

import torch
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Vocabulary
vocabulary = {'cat', 'dog', 'mouse'}

# Embeddings in 3 dimensions
embedding_dim = 3

# Create random embeddings for each word
word_embeddings = {
    'cat': torch.rand(embedding_dim),
    'dog': torch.rand(embedding_dim),
    'mouse': torch.rand(embedding_dim)
}

# Plot the embeddings in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for word, embedding in word_embeddings.items():
    ax.scatter(embedding[0], embedding[1], embedding[2])
    ax.text(embedding[0], embedding[1], embedding[2], word)

ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Dimension 3')
ax.set_title('Word Embeddings in 3D')
plt.show()


Results (3D): a scatter plot of the three word embeddings in three-dimensional space.


In the provided PyTorch example, if you were to increase the dimensionality significantly, you would start to encounter the “curse of dimensionality”. For instance, if the dimensionality were in the hundreds or thousands (in our example above we used 2D and 3D to represent 3 words), you would need an immense amount of data to avoid sparsity issues. Additionally, interpreting the meaning of specific dimensions or understanding relationships between words would become more challenging.
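One way to see this numerically (a small illustrative experiment, not from the paper or the original post): as the number of dimensions grows, the pairwise distances between random points become nearly identical, so “near” and “far” stop being informative.

import torch

# As dimensionality grows, the spread of pairwise distances shrinks
# relative to their mean -- one symptom of the curse of dimensionality.
torch.manual_seed(0)
for dim in (2, 10, 100, 1000):
    points = torch.rand(100, dim)                        # 100 random points
    dists = torch.cdist(points, points)                  # 100 x 100 distance matrix
    off_diag = dists[~torch.eye(100, dtype=torch.bool)]  # drop the zero diagonal
    print(dim, (off_diag.std() / off_diag.mean()).item())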

Now, as we move to a larger vocabulary, say 100,000 words, we need a higher-dimensional space to represent the words well. NLP applications often use embedding spaces of 50 to 300 dimensions.
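As a rough illustration of why dense embeddings help (the sizes below are assumptions for the example, not numbers from the paper): an embedding table for a 100,000-word vocabulary in 300 dimensions has about 30 million parameters, which is tiny compared with the 100,000^10 sequence combinations above.

import torch
import torch.nn as nn

vocab_size = 100_000
embedding_dim = 300

# A learned lookup table mapping word indices to dense vectors
embedding = nn.Embedding(vocab_size, embedding_dim)

print(embedding.weight.shape)    # torch.Size([100000, 300])
print(embedding.weight.numel())  # 30,000,000 parameters

# Look up dense vectors for a few word indices
word_ids = torch.tensor([0, 42, 99_999])
print(embedding(word_ids).shape)  # torch.Size([3, 300])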

Okay! So as the vocabulary grows, the number of dimensions required to represent it grows too. But where do “discrete random variables” come in?

In this case, each word in a sequence of 10 consecutive words is a discrete random variable, because it can take on one of the 100,000 possible values from the vocabulary. The sequence as a whole is a collection of 10 such variables, and we are modeling their joint distribution.
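To make that concrete, here is a tiny sketch of a single word position treated as a discrete random variable (the probabilities are made up purely for illustration):

import torch

vocabulary = ['cat', 'dog', 'mouse']

# A discrete random variable over the vocabulary, with assumed probabilities
probs = torch.tensor([0.5, 0.3, 0.2])

# Draw one realization of the random variable
idx = torch.multinomial(probs, num_samples=1).item()
print(vocabulary[idx])  # e.g. 'cat' about half the time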

My go-to book: Probability & Statistics for Engineers & Scientists.


Ok! Now let’s summarize.

In language modeling, the goal is to understand and predict the likelihood of different combinations of words occurring together: given a few words, which word is likely to come next? The "curse of dimensionality" becomes apparent when dealing with a large vocabulary, as we saw above, while modeling the joint distribution of multiple words.

Example: Modeling 3 Consecutive Words

Let's consider a small vocabulary with only 3 words: {cat, dog, mouse}. We want to model the joint distribution of 3 consecutive words in a sentence.

  1. Vocabulary Size (V): 3 words (cat, dog, mouse)
  2. Sequence Length: 3 consecutive words

Now, let's calculate the potential number of parameters for modeling the joint distribution.

V^3 − 1 = 3^3 − 1 = 27 − 1 = 26

  1. cat cat cat
  2. cat cat dog
  3. cat cat mouse
  4. cat dog cat
  5. cat dog dog
  6. cat dog mouse
  7. cat mouse cat
  8. cat mouse dog … and so on, 27 combinations in total.

So, even with a small vocabulary and a sequence length of 3 words, there are 27 possible combinations and hence 26 free parameters to model the joint distribution (one probability is fixed by the others, since they must all sum to 1). Each parameter represents the probability associated with a specific combination of three consecutive words.

In the context of the example, the joint probability distribution would involve assigning probabilities to all possible combinations of 3 consecutive words. For instance:

  • P(cat, dog, mouse)
  • P(dog, mouse, cat)
  • P(mouse, cat, dog) and so on.

Each of these represents the probability of observing the specific sequence of words in the language.

To specify the joint probability distribution, we need to assign a probability to each of the 27 combinations. Let's assume a simple scenario where every combination is equally likely; the probability of each combination would then be 1/27. In the real world, of course, the distribution would not be uniform.

We'll use PyTorch to represent and calculate the probabilities.


Ref: PyTorch Course
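The full code appears in the continuation of the post linked below; as a minimal sketch (assuming the uniform 1/27 probabilities from above, with illustrative variable names), the joint distribution can be stored as a 3 × 3 × 3 tensor:

import torch

vocabulary = ['cat', 'dog', 'mouse']
word_to_idx = {w: i for i, w in enumerate(vocabulary)}
V = len(vocabulary)

# Joint distribution over all 3-word sequences: one entry per combination.
# Uniform assumption: each of the 27 sequences gets probability 1/27.
joint = torch.full((V, V, V), 1.0 / V**3)

print(joint.sum())  # tensor(1.) -- probabilities sum to 1, hence 26 free parameters

# Probability of a specific sequence, e.g. P(cat, dog, mouse)
p = joint[word_to_idx['cat'], word_to_idx['dog'], word_to_idx['mouse']]
print(p)  # tensor(0.0370)

# A language model predicts the next word from such a distribution:
# P(w3 | w1 = cat, w2 = dog) is the slice joint[cat, dog, :], renormalized.
context = joint[word_to_idx['cat'], word_to_idx['dog'], :]
print(context / context.sum())  # tensor([0.3333, 0.3333, 0.3333])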

The post continues here: https://mathx.substack.com/p/ep-6-integrated-realms-dimensionality

It has both math-based and plain-English explanations of the concepts, and much more.

Thank you for your time. Please leave a comment.

