EP 6: Integrated Realms: Dimensionality, Variables, Probability | Paper 1: A Neural Probabilistic Language Model.

In continuation of: Paper 1: A Neural Probabilistic Language Model

Hello Readers,

In this post, we will finally bring together the following concepts, which we have been learning over the last four posts:

  1. Curse of Dimensionality.
  2. Random Variables.
  3. Joint Probability Distribution.
  4. Language Models.


“A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100,000, there are potentially 100,000^10 − 1 = 10^50 − 1 free parameters. When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties. For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in hamming distance.”


Let's consider a small English vocabulary (word list) V of three words: {cat, dog, mouse}. Now, we want to find the number of possible combinations for a sequence of 2 consecutive words from this vocabulary.

So the number of possible combinations is V^2 (V to the power 2) where V is the size of the vocabulary. In this case, V^2 is equivalent to 3^2, which is 9.

Now, let's list all the possible combinations:

  1. cat cat
  2. cat dog
  3. cat mouse
  4. dog cat
  5. dog dog
  6. dog mouse
  7. mouse cat
  8. mouse dog
  9. mouse mouse

So, there are indeed 9 possible combinations for 2 consecutive words from this small English vocabulary.
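If you want to verify this count quickly, here is a small Python snippet (purely illustrative, not part of the original post) that enumerates the pairs:

from itertools import product

vocabulary = ['cat', 'dog', 'mouse']

# All ordered pairs of 2 consecutive words: V^2 = 3^2 = 9
pairs = list(product(vocabulary, repeat=2))
for i, (w1, w2) in enumerate(pairs, start=1):
    print(i, w1, w2)

print(len(pairs))  # 9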

In the original statement from the paper, imagine a vocabulary of size 100,000 (V = 100,000) and a sequence length of 10 words. Then V^10 (V to the power of 10) is the number of possible combinations for a sequence of 10 consecutive words from this much larger vocabulary.


The minus 1 in the expression V^10 − 1 comes from the fact that the joint distribution assigns a probability to each of the V^10 possible 10-word sequences, and those probabilities must sum to 1. Once you have chosen V^10 − 1 of the probabilities, the last one is fully determined by the sum-to-one constraint, so it is not a free parameter. In other words, for a sequence of 10 words w1, w2, …, w10 drawn from a vocabulary of 100,000 words, there are 100,000^10 possible sequences, and specifying their probabilities requires 100,000^10 − 1 free parameters.
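To get a feel for the scale, here is a tiny Python check using the numbers from the paper's example (the variable names are just for illustration):

V = 100_000   # vocabulary size
n = 10        # sequence length

combinations = V ** n            # number of possible 10-word sequences
free_params = combinations - 1   # probabilities must sum to 1

print(combinations == 10 ** 50)  # True, since (10^5)^10 = 10^50
print(free_params)               # 10^50 - 1, an astronomically large number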

You can now start to appreciate GPT-type models. The number of parameters in a machine learning model, such as GPT (Generative Pre-trained Transformer), is influenced both by the complexity of its architecture (we will look at different architectures later in this newsletter) and by the size of the vocabulary it is trained on.

Okay! But where does the curse of dimensionality come into the picture?

Let's first create a simple example using PyTorch to represent a vocabulary of 3 words in 2 dimensions. We'll use the torch library to create word embeddings and visualize them.

Attention: We will learn all of these programming concepts once we start implementing our own models. For now, understanding the concepts is what we should focus on!

Ref: PyTorch Course

import torch
import matplotlib.pyplot as plt

# Vocabulary
vocabulary = {'cat', 'dog', 'mouse'}

# Embeddings in 2 dimensions
embedding_dim = 2

# Create random embeddings for each word
word_embeddings = {
    'cat': torch.rand(embedding_dim),
    'dog': torch.rand(embedding_dim),
    'mouse': torch.rand(embedding_dim)
}

# Plot the embeddings
fig, ax = plt.subplots()

for word, embedding in word_embeddings.items():
    ax.scatter(embedding[0], embedding[1])
    ax.text(embedding[0], embedding[1], word)

plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Word Embeddings in 2D')
plt.grid(True)
plt.show()        
Complete PyTorch code: Google Colab source.


In this example:

  • We have a vocabulary of three words: 'cat', 'dog', and 'mouse'.
  • We represent each word with a 2-dimensional vector (embedding).
  • The code creates random vectors for each word (in a real-world scenario, these would be learned during training).
  • We then plot these vectors in a 2D space, with each point representing a word.

Results (2D): a scatter plot of the three word embeddings in two-dimensional space.

In 3D:

Ref: PyTorch Course

import torch
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Vocabulary
vocabulary = {'cat', 'dog', 'mouse'}

# Embeddings in 3 dimensions
embedding_dim = 3

# Create random embeddings for each word
word_embeddings = {
    'cat': torch.rand(embedding_dim),
    'dog': torch.rand(embedding_dim),
    'mouse': torch.rand(embedding_dim)
}

# Plot the embeddings in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for word, embedding in word_embeddings.items():
    ax.scatter(embedding[0], embedding[1], embedding[2])
    ax.text(embedding[0], embedding[1], embedding[2], word)

ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Dimension 3')
ax.set_title('Word Embeddings in 3D')
plt.show()


Results (3D): a scatter plot of the three word embeddings in three-dimensional space.


In the provided PyTorch example, if you were to increase the dimensionality significantly, you would start to encounter the “curse of dimensionality”. For instance, if the dimensionality were in the hundreds or thousands (in our example above we used 2D and 3D to represent 3 words), you would need an immense amount of data to avoid sparsity issues. Additionally, interpreting the meaning of specific dimensions or understanding relationships between words would become more challenging.
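One way to see this numerically (a small illustrative experiment, not from the paper or the original post): as the number of dimensions grows, the pairwise distances between random points become nearly identical, so “near” and “far” stop being informative.

import torch

# As dimensionality grows, the spread of pairwise distances shrinks
# relative to their mean -- one symptom of the curse of dimensionality.
torch.manual_seed(0)
for dim in (2, 10, 100, 1000):
    points = torch.rand(100, dim)                        # 100 random points
    dists = torch.cdist(points, points)                  # 100 x 100 distance matrix
    off_diag = dists[~torch.eye(100, dtype=torch.bool)]  # drop the zero diagonal
    print(dim, (off_diag.std() / off_diag.mean()).item())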

Now, as we move to a larger vocabulary, say 100,000 words, we need a higher-dimensional space to represent the words well. NLP applications often use embedding spaces of 50 to 300 dimensions.
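As a rough illustration of why dense embeddings help (the sizes below are assumptions for the example, not numbers from the paper): an embedding table for a 100,000-word vocabulary in 300 dimensions has about 30 million parameters, which is tiny compared with the 100,000^10 sequence combinations above.

import torch
import torch.nn as nn

vocab_size = 100_000
embedding_dim = 300

# A learned lookup table mapping word indices to dense vectors
embedding = nn.Embedding(vocab_size, embedding_dim)

print(embedding.weight.shape)    # torch.Size([100000, 300])
print(embedding.weight.numel())  # 30,000,000 parameters

# Look up dense vectors for a few word indices
word_ids = torch.tensor([0, 42, 99_999])
print(embedding(word_ids).shape)  # torch.Size([3, 300])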

Okay! So as the vocabulary grows, the number of dimensions required to represent it grows too. But where do “discrete random variables” come in?

In this case, each word in a sequence of 10 consecutive words is a discrete random variable, because it can take on one of the 100,000 possible values from the vocabulary. The sequence as a whole is a collection of 10 such variables, and we are modeling their joint distribution.
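To make that concrete, here is a tiny sketch of a single word position treated as a discrete random variable (the probabilities are made up purely for illustration):

import torch

vocabulary = ['cat', 'dog', 'mouse']

# A discrete random variable over the vocabulary, with assumed probabilities
probs = torch.tensor([0.5, 0.3, 0.2])

# Draw one realization of the random variable
idx = torch.multinomial(probs, num_samples=1).item()
print(vocabulary[idx])  # e.g. 'cat' about half the time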

My go-to book: Probability & Statistics for Engineers & Scientists.


Ok! Now let’s summarize.

In language modeling, the goal is to understand and predict the likelihood of different combinations of words occurring together: given a few words, which word is likely to come next? The "curse of dimensionality" becomes apparent when dealing with a large vocabulary, as we saw above, while modeling the joint distribution of multiple words.

Example: Modeling 3 Consecutive Words

Let's consider a small vocabulary with only 3 words: {cat, dog, mouse}. We want to model the joint distribution of 3 consecutive words in a sentence.

  1. Vocabulary Size (V): 3 words (cat, dog, mouse)
  2. Sequence Length: 3 consecutive words

Now, let's calculate the potential number of parameters for modeling the joint distribution.

V^3 − 1 = 3^3 − 1 = 27 − 1 = 26

  1. cat cat cat
  2. cat cat dog
  3. cat cat mouse
  4. cat dog cat
  5. cat dog dog
  6. cat dog mouse
  7. cat mouse cat
  8. cat mouse dog … and so on, 27 combinations in total.

So, even with a small vocabulary and a sequence length of 3 words, there are 27 possible combinations and hence 26 free parameters to model the joint distribution (one probability is fixed by the others, since they must all sum to 1). Each parameter represents the probability associated with a specific combination of three consecutive words.

In the context of the example, the joint probability distribution would involve assigning probabilities to all possible combinations of 3 consecutive words. For instance:

  • P(cat, dog, mouse)
  • P(dog, mouse, cat)
  • P(mouse, cat, dog) and so on.

Each of these represents the probability of observing the specific sequence of words in the language.

To specify the joint probability distribution, we need to assign a probability to each of the 27 combinations. Let's assume a simple scenario where every combination is equally likely; the probability of each combination would then be 1/27. In the real world, of course, the distribution would not be uniform.

We'll use PyTorch to represent and calculate the probabilities.


Ref: PyTorch Course
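The full code appears in the continuation of the post linked below; as a minimal sketch (assuming the uniform 1/27 probabilities from above, with illustrative variable names), the joint distribution can be stored as a 3 × 3 × 3 tensor:

import torch

vocabulary = ['cat', 'dog', 'mouse']
word_to_idx = {w: i for i, w in enumerate(vocabulary)}
V = len(vocabulary)

# Joint distribution over all 3-word sequences: one entry per combination.
# Uniform assumption: each of the 27 sequences gets probability 1/27.
joint = torch.full((V, V, V), 1.0 / V**3)

print(joint.sum())  # tensor(1.) -- probabilities sum to 1, hence 26 free parameters

# Probability of a specific sequence, e.g. P(cat, dog, mouse)
p = joint[word_to_idx['cat'], word_to_idx['dog'], word_to_idx['mouse']]
print(p)  # tensor(0.0370)

# A language model predicts the next word from such a distribution:
# P(w3 | w1 = cat, w2 = dog) is the slice joint[cat, dog, :], renormalized.
context = joint[word_to_idx['cat'], word_to_idx['dog'], :]
print(context / context.sum())  # tensor([0.3333, 0.3333, 0.3333])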

The post continues here: https://mathx.substack.com/p/ep-6-integrated-realms-dimensionality

It has both math-based and plain-English explanations of the concepts, and much more.

Thank you for your time. Please leave a comment.

