Let's build a GPT-style LLM from scratch - Part 2a, a quick intro to the Transformer and Self-Attention
Image credit: self-hosted Stable Diffusion XL


Alright, so by now I am sure most of you have prepped the infrastructure and data for us to train our LLM. But before we go there, I think it will be good to explain the two key concepts that power everything GPT does: the Transformer architecture and self-attention.

Even before that, I forgot to mention: if you don't have a GPU, you can easily use Google's Colab to train the model. It will be slow, but it works well.

With that out of the way, let's start with a quick introduction:

We all understand what "Generative" and "Pre-trained" mean. In simple words, generative refers to the model's ability to generate text, while pre-trained means it has already been trained on a massive amount of text (the "pre" emphasizing that there is still enough room to fine-tune it with additional data and give it deeper, more specialized knowledge).

The real deal is the "Transformer". Let me try to explain it using a simple "dinner table" analogy. Imagine you are sitting at a dinner table with your friends. Every time someone speaks, you focus on them: your brain tunes (attends) into the words they are speaking while maintaining the overall context of the conversation at the table. Transformers do much the same in the digital world. They process massive amounts of data (text, audio, images) by paying attention to the important parts of it, learning to understand the context and nuances of the information they are given.

Let's try to explain the key concepts of the transformer and how it works in a GPT model, using a well-known sentence:

"The quick brown fox jumps over the lazy dog."

Input Embedding

Every sentence fed to the model first needs to be tokenized into individual words. Thus the sentence above becomes: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
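
If you want to see this step in code, here is a minimal Python sketch using naive whitespace splitting (real GPT models use subword tokenizers, as discussed in the sidebar further below):

# A minimal sketch: naive whitespace tokenization of the example sentence.
sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.split()
print(tokens)   # ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']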

Once this is done, it's important to capture the semantic meaning of the sentence and of each word. Thus, each word is converted into a dense vector representation (embedding) of a fixed size. Confusing, right?


SIDEBAR: Vocabulary

Let's take a step back and explain one key concept of NLP. For a model to understand and work properly, it's important we "provide" it a unique set of words or tokens which it recognizes and works with. This set of unique words is called the "vocabulary"; it is essentially a dictionary that maps each unique word to a corresponding integer index. Now, this vocabulary is created from an extremely large set of words, so the number of unique words can be in the hundreds of thousands.

During the model's training process, each word in the input sentence is converted into its corresponding integer index based on the vocabulary. These indices are then used to look up the word embeddings, which are dense vector representations of the words.
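
As a rough illustration of this lookup chain, here is a toy Python/numpy sketch. The tiny vocabulary, the 4-dimensional embedding size, and the random values are all assumptions made purely for this example; a real model has a far larger vocabulary and learns its embedding table during training:

import numpy as np

# Toy vocabulary: each unique (lower-cased) word maps to an integer index.
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3,
         "jumps": 4, "over": 5, "lazy": 6, "dog": 7}

# Toy embedding table: one dense 4-dimensional vector per vocabulary entry.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

sentence = "The quick brown fox jumps over the lazy dog"
token_ids = [vocab[w.lower()] for w in sentence.split()]   # words -> indices
embeddings = embedding_table[token_ids]                    # indices -> dense vectors
print(token_ids)          # [0, 1, 2, 3, 4, 5, 0, 6, 7]
print(embeddings.shape)   # (9, 4)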

GPT type models use subword tokenization techniques (e.g., byte-pair encoding) to handle out-of-vocabulary words and reduce the size of the vocabulary. This breaks down rare or unknown words into smaller subword units, allowing the model to handle a wider range of words with a fixed-size vocabulary.
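
If you want to see subword tokenization in action, and assuming you have the open-source tiktoken library installed (it implements the byte-pair encodings used by GPT-2/GPT-3 style models), a quick sketch looks like this:

# Assumes `pip install tiktoken` has been run.
import tiktoken

enc = tiktoken.get_encoding("gpt2")       # the byte-pair encoding used by GPT-2
ids = enc.encode("The quick brown fox jumps over the lazy dog")
print(ids)                                # integer token ids
print([enc.decode([i]) for i in ids])     # the subword pieces they map back to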

Anyhow, we digress. Tokenization and embeddings are a topic of their own and are fundamental to a model's effectiveness; we shall cover them in a separate, dedicated post later.


Now that we have the vector representation, it's important to identify and preserve the position of the words/tokens as they occur. Hence, positional encoding is added to the embeddings to represent the position of each word in the sentence. I know, just mentioning these terms doesn't really help, so here is another sidebar:


SIDEBAR: Positional Encoding

So, to preserve the semantic meaning of the sentence and the context which every word carries, it's important to record where a particular word occurred in the sentence. Thus, while the words are encoded using the vocabulary, their position is also encoded into the calculated embeddings. Let's take the example of "The quick brown fox". For simplicity, let's imagine our vocabulary has only 4 words:

 {the:0, quick:1, brown:2, fox:3}        

Let's use a simple one-hot encoding vector of size 4 to encode the above as below:

{"The": [1, 0, 0, 0];  "quick": [0, 1, 0, 0] ; "brown": [0, 0, 1, 0] ; "fox": [0, 0, 0, 1]}        

To incorporate positional information, let's create positional encoding vectors of the same size as the one-hot encoding vectors. For simplicity, let's use a scheme where we assign a unique value to each position:

Position 0: [0.1, 0.1, 0.1, 0.1]
Position 1: [0.2, 0.2, 0.2, 0.2]
Position 2: [0.3, 0.3, 0.3, 0.3]
Position 3: [0.4, 0.4, 0.4, 0.4]        

Now, once we add the positional encoding vectors element-wise to the corresponding one-hot encoding vectors, the representation of each word with its position encoded becomes:

"The"   (Position 0): [1, 0, 0, 0] + [0.1, 0.1, 0.1, 0.1] = [1.1, 0.1, 0.1, 0.1]
"quick" (Position 1): [0, 1, 0, 0] + [0.2, 0.2, 0.2, 0.2] = [0.2, 1.2, 0.2, 0.2]
"brown" (Position 2): [0, 0, 1, 0] + [0.3, 0.3, 0.3, 0.3] = [0.3, 0.3, 1.3, 0.3]
"fox"   (Position 3): [0, 0, 0, 1] + [0.4, 0.4, 0.4, 0.4] = [0.4, 0.4, 0.4, 1.4]        

When these positionally encoded vectors are passed through the transformer layers, the self-attention mechanism (which we will discuss shortly) can take into account the relative positions of the words while computing the attention weights. This allows the model to understand and capture the order of, and relationships between, the words in the sequence.
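
Here is the sidebar's toy calculation as a small numpy sketch; the vocabulary, one-hot vectors, and position values are exactly the illustrative ones above, not a real positional-encoding scheme:

import numpy as np

# Toy vocabulary and one-hot vectors from the sidebar (words lower-cased).
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
one_hot = np.eye(len(vocab))              # each row is a one-hot vector

# Toy positional encodings, one vector per position.
positions = np.array([
    [0.1, 0.1, 0.1, 0.1],   # position 0
    [0.2, 0.2, 0.2, 0.2],   # position 1
    [0.3, 0.3, 0.3, 0.3],   # position 2
    [0.4, 0.4, 0.4, 0.4],   # position 3
])

for pos, word in enumerate(["the", "quick", "brown", "fox"]):
    encoded = one_hot[vocab[word]] + positions[pos]   # element-wise addition
    print(f"{word:>5} (position {pos}): {encoded}")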

Well, this was a simple example to show what all these things mean, but in reality, when SOTA (state-of-the-art) models are built, a much more sophisticated methodology is used. There are two standard approaches:

Sine and Cosine Functions: Introduced in the original "Attention Is All You Need" paper by Vaswani et al., this approach uses sine and cosine functions of different frequencies to generate the positional encodings (a small code sketch follows below).

Learned Positional Embeddings: Here, the positional encodings are learned during the training process, similar to how word embeddings are learned. Instead of using predefined functions, a positional embedding matrix (one row per position in the sequence) is initialized randomly. These positional embeddings are then updated during training via backpropagation, allowing the model to learn the optimal positional representations for the task at hand.
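
For reference, here is a minimal numpy sketch of the sine/cosine scheme from the original paper; the sequence length and embedding size below are toy values, and d_model is assumed to be even:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)    # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(seq_len=9, d_model=8).round(3))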


Phew! I wanted to write a short introduction, as it's important to know these concepts, and given that my fingers already hurt, I am quite sure the post has gotten long. But let me push through and complete this, because once these key fundamental concepts become clear, a lot of the training and building becomes easier.

Ok, so that completes input embedding. Let's focus on the next item on the list.

Multi-Head Attention

For a moment, let's focus on the word "fox" and assume that word by itself is the query. The model computes the attention weights between "fox" and all other words in the sentence; the goal is to determine which words are most relevant to "fox" based on their similarity. These attention weights are then used to compute a weighted sum of the embeddings, creating a new representation for "fox" that incorporates information from the relevant words.

See, these attention weights reflect the importance or relevance of each word to the word "fox" in the given context. Words that are more relevant to "fox" will have higher attention weights, while less relevant words will have lower weights. Finally, by computing the weighted sum of embeddings, the model effectively incorporates information from the most relevant words into the new representation of "fox."

All of the above, allows the model to capture the contextual information and relationships between "fox" and other words in the sentence.

The weighted sum of embeddings results in a representation of the word "fox" that is specific to the current context: it now includes information from the other relevant words in the sentence, capturing the meaning of "fox" in relation to its surroundings, with each word's contribution determined by its attention weight.

All this to say that, the significance of this process lies in the model's ability to create context-aware representations for words. By incorporating information from relevant words, the model can better understand the meaning and role of each word within the specific context of the sentence.

For example, consider the following sentences:

  1. "The quick brown fox jumps over the lazy dog."
  2. "The sly fox crept into the chicken coop at night."

In sentence 1, the attention weights might highlight the relevance of words like "quick," "jumps," and "dog" to the word "fox," indicating its role as an active and agile animal in the context of the sentence.

In sentence 2, the attention weights might focus on words like "sly," "crept," and "chicken coop," suggesting a different aspect of the fox's behavior and characteristics in that specific context.

Anyways, you get the point.

Now, just in case you don't, let me take the same example, "The quick brown fox jumps over the lazy dog", and explain what happens in simplistic terms:

For each word in the sentence, the model creates query, key, and value vectors by multiplying the input word embeddings with learned weight matrices (WQ, WK, WV). Let's assume we have the following vectors for the words in the sentence "The quick brown fox jumps over the lazy dog":

Query vector of "fox" (Q_fox): [0.2, 0.3, 0.1, 0.4]
Key vector of "The" (K_The): [0.1, 0.2, 0.3, 0.4]
Key vector of "quick" (K_quick): [0.4, 0.1, 0.3, 0.2]
Key vector of "brown" (K_brown): [0.2, 0.3, 0.4, 0.1]
Key vector of "fox" (K_fox): [0.3, 0.4, 0.2, 0.1]
Key vector of "jumps" (K_jumps): [0.1, 0.3, 0.2, 0.4]
Key vector of "over" (K_over): [0.4, 0.2, 0.1, 0.3]
Key vector of "the" (K_the): [0.2, 0.1, 0.4, 0.3]
Key vector of "lazy" (K_lazy): [0.3, 0.2, 0.1, 0.4]
Key vector of "dog" (K_dog): [0.1, 0.4, 0.3, 0.2]        
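
In code, these projections are just matrix multiplications. The sketch below uses random toy matrices purely to show the shapes involved; in a real model, W_Q, W_K, and W_V are learned during training:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 9, 4                    # 9 words, 4-dimensional embeddings (toy sizes)
X = rng.normal(size=(seq_len, d_model))    # stand-in for the positionally encoded embeddings

W_Q = rng.normal(size=(d_model, d_model))  # learned query projection (random here)
W_K = rng.normal(size=(d_model, d_model))  # learned key projection
W_V = rng.normal(size=(d_model, d_model))  # learned value projection

Q = X @ W_Q     # query vectors, one row per word
K = X @ W_K     # key vectors
V = X @ W_V     # value vectors
print(Q.shape, K.shape, V.shape)           # (9, 4) (9, 4) (9, 4)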

Next, we compute the attention scores by taking the dot product between the query vector of "fox" (Q_fox) and the key vectors of all the words in the sentence.

Attention score between "fox" and "The": Q_fox · K_The = (0.2 * 0.1) + (0.3 * 0.2) + (0.1 * 0.3) + (0.4 * 0.4) = 0.27
Attention score between "fox" and "quick": Q_fox · K_quick = (0.2 * 0.4) + (0.3 * 0.1) + (0.1 * 0.3) + (0.4 * 0.2) = 0.22
Attention score between "fox" and "brown": Q_fox · K_brown = (0.2 * 0.2) + (0.3 * 0.3) + (0.1 * 0.4) + (0.4 * 0.1) = 0.21
Attention score between "fox" and "fox": Q_fox · K_fox = (0.2 * 0.3) + (0.3 * 0.4) + (0.1 * 0.2) + (0.4 * 0.1) = 0.24
Attention score between "fox" and "jumps": Q_fox · K_jumps = (0.2 * 0.1) + (0.3 * 0.3) + (0.1 * 0.2) + (0.4 * 0.4) = 0.29
Attention score between "fox" and "over": Q_fox · K_over = (0.2 * 0.4) + (0.3 * 0.2) + (0.1 * 0.1) + (0.4 * 0.3) = 0.27
Attention score between "fox" and "the": Q_fox · K_the = (0.2 * 0.2) + (0.3 * 0.1) + (0.1 * 0.4) + (0.4 * 0.3) = 0.23
Attention score between "fox" and "lazy": Q_fox · K_lazy = (0.2 * 0.3) + (0.3 * 0.2) + (0.1 * 0.1) + (0.4 * 0.4) = 0.29
Attention score between "fox" and "dog": Q_fox · K_dog = (0.2 * 0.1) + (0.3 * 0.4) + (0.1 * 0.3) + (0.4 * 0.2) = 0.25
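
You can verify these dot products with a few lines of numpy, using exactly the toy Q_fox and key vectors listed above:

import numpy as np

Q_fox = np.array([0.2, 0.3, 0.1, 0.4])
K = np.array([
    [0.1, 0.2, 0.3, 0.4],   # The
    [0.4, 0.1, 0.3, 0.2],   # quick
    [0.2, 0.3, 0.4, 0.1],   # brown
    [0.3, 0.4, 0.2, 0.1],   # fox
    [0.1, 0.3, 0.2, 0.4],   # jumps
    [0.4, 0.2, 0.1, 0.3],   # over
    [0.2, 0.1, 0.4, 0.3],   # the
    [0.3, 0.2, 0.1, 0.4],   # lazy
    [0.1, 0.4, 0.3, 0.2],   # dog
])

scores = K @ Q_fox          # dot product of Q_fox with every key vector
print(scores.round(2))      # [0.27 0.22 0.21 0.24 0.29 0.27 0.23 0.29 0.25]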

We apply the softmax function to the attention scores to convert them into attention weights, i.e., positive values that sum to 1. (In practice, the scores are first divided by the square root of the key dimension, giving "scaled" dot-product attention, but we skip that here to keep the numbers simple.) This gives us the following weights:

Attention weight between "fox" and "The": 0.113
Attention weight between "fox" and "quick": 0.108
Attention weight between "fox" and "brown": 0.106
Attention weight between "fox" and "fox": 0.110
Attention weight between "fox" and "jumps": 0.115
Attention weight between "fox" and "over": 0.113
Attention weight between "fox" and "the": 0.109
Attention weight between "fox" and "lazy": 0.115
Attention weight between "fox" and "dog": 0.111
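
The weights above can be reproduced by applying a softmax to the attention scores:

import numpy as np

scores = np.array([0.27, 0.22, 0.21, 0.24, 0.29, 0.27, 0.23, 0.29, 0.25])

# Softmax: exponentiate and normalize so the weights are positive and sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))   # [0.113 0.108 0.106 0.11 0.115 0.113 0.109 0.115 0.111]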

These attention weights indicate the relative importance of each word to the word "fox" in the given context. With our made-up numbers, "jumps" and "lazy" end up with slightly higher attention weights, with "The" and "over" close behind, suggesting their relevance to the word "fox" in the sentence.

Now, the transformer model will use the above weights to create a new representation for the word "fox" that incorporates information from the relevant words in the sentence. This new representation will then be used to generate the model's response or output related to the query word "fox".

The model computes this as the weighted sum of the value vectors (V), using the attention weights (I am assuming simplified value representations here just to illustrate):

new_fox_representation = (0.113 * V_The) + (0.108 * V_quick) + (0.106 * V_brown) + (0.110 * V_fox) + (0.115 * V_jumps) + (0.113 * V_over) + (0.109 * V_the) + (0.115 * V_lazy) + (0.111 * V_dog)

Using this new representation, the transformer model can generate a response or output related to the query word "fox." The specific response depends on the task and the training of the model.
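
Putting the whole mechanism together, here is a compact single-head, scaled dot-product self-attention sketch in numpy. The shapes and random weights are toy assumptions; multi-head attention simply runs several of these in parallel on smaller slices of the embedding and concatenates the results:

import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (minimal sketch)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query with every key
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                     # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 4))                # 9 toy word embeddings of size 4
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)                           # (9, 4): one context-aware vector per word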


I wanted to make sure these two concepts are properly explained before we start writing our own simple IndieLLM. The process doesn't end here though. Once the representations are calculated, they are passed through a feed-forward neural network, which applies non-linear transformations to them, allowing the model to learn more complex patterns.

The output of the feed-forward network is added to the original embedding of "fox" through a residual connection. A ton of matrix multiplications and weight updates are happening throughout all of this, so layer normalization is applied to normalize the activations and stabilize the training process.

Remember, everything I explained in the above steps (multi-head attention and the feed-forward network) forms only a single transformer layer. Multiple transformer layers are stacked on top of each other, where each layer takes its input from the output of the previous one. As the information flows through the layers, the model learns increasingly abstract representations of the sentence.
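
For a concrete (and deliberately simplified) picture, here is what one such layer, and a small stack of them, might look like in PyTorch. This is a hedged sketch, not the exact architecture we will build later: dropout and the causal mask are omitted, and the layer sizes are arbitrary:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention
        x = self.ln1(x + attn_out)         # residual connection + layer normalization
        x = self.ln2(x + self.ff(x))       # feed-forward network + residual + layer norm
        return x

# Stack several layers: each layer's output is the next layer's input.
layers = nn.ModuleList([TransformerBlock() for _ in range(4)])
x = torch.randn(1, 9, 64)                  # (batch, sequence length, embedding size)
for layer in layers:
    x = layer(x)
print(x.shape)                             # torch.Size([1, 9, 64])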

The final transformer layer produces an output embedding for each word in the sentence. These output embeddings capture the contextual information learned by the model and can ultimately be used for various tasks, such as predicting the next word in the sentence (language modeling) or classifying the sentiment of the sentence.
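
As a hedged sketch of the language-modeling case: the final output embeddings are projected onto the vocabulary, and a softmax gives a probability distribution over the next token. The vocabulary size below is GPT-2's; the embedding size and random inputs are toy assumptions:

import torch
import torch.nn as nn

vocab_size, d_model = 50257, 64            # GPT-2-style vocabulary size, toy embedding size
lm_head = nn.Linear(d_model, vocab_size, bias=False)

final_hidden = torch.randn(1, 9, d_model)  # stand-in for the last transformer layer's output
logits = lm_head(final_hidden)             # (1, 9, vocab_size)
next_token_probs = logits[:, -1, :].softmax(dim=-1)   # distribution over the next word
print(next_token_probs.shape)              # torch.Size([1, 50257])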

All this to say, it takes a ton of computation and meticulous design to develop a large SOTA model which can generate coherent and contextually relevant text by understanding the relationships between words and their positions in the sentence.

While these models understand the context and relative weight of tokens, they don't really possess intellectual knowledge or wisdom; but the representations are so accurate that the overall sentences make coherent sense.

Hopefully, this gave you the basic fundamentals of what we will design in the next leg of our journey! Looking forward to it.
