Transformer in LLM - Encoder Block
Transformer - Where the Magic happens in LLM


What is an Encoder in a Transformer?

  • A stack of N encoder blocks
  • The output of one encoder block is the input of the next encoder block
  • The final encoder block returns the representation
  • The paper "Attention Is All You Need" used 6 encoder blocks, but you can have as many as you want

First, let us take a look at the encoder as a whole.

Encoder Block Simple Understanding

The Encoder is made up of a stack of encoder blocks: each encoder block performs an encoding step and passes its output to the next one. The stack as a whole is called the Encoder, and I refer to the individual modules that do the encoding as encoder blocks. You can have as many encoder blocks as you want; the output of one encoder block is the input for the next encoder block.

The moment we arrive at the final block, its output is the representation itself, and this is what we feed into the decoder. In the paper (Attention Is All You Need), the authors used 6 encoder blocks.

So, you may have a question here –

Why stacking encoder blocks? Can we just have a single encoder block, isn’t that enough?

Well, potentially a single block is good enough, but usually you want more. This is very similar to what happens with Convolutional Neural Networks (CNNs): the more convolutional layers you have, the more complex the representation you can generate by passing information through them. The same thing happens with multiple encoder blocks: the deeper you go, the more complex the representation becomes. Different encoder blocks can learn different aspects at each layer, and the deeper blocks can represent more abstract features such as semantics, morphology, and metaphor.

This is exactly what happens with Convolutional Neural Networks: the deeper you go into the convolutional layers, the higher the level of abstraction of the features you get, i.e., it is a sort of hierarchical extraction of features from least abstract to most abstract.

Complex Representation

Each layer learns different aspects, which improves contextualization. In short, the more encoder blocks you have, the better the contextualization.

What is contextualization in this particular space?

It is the ability to build a representation in which each word knows about all the other words, so it knows the context and its relationships with those words.

Architecture of an Encoder Block

Encoder Block Architecture

Let us look at the different components, or sublayers, of an encoder block. As you can see here, we have three sublayers: Multi-Head Attention, Add & Norm, and the Feedforward sublayer, followed by a second Add & Norm. Each encoder block is always the same: it does the encoding through these sublayers. We will start with Multi-Head Attention, but you can't understand Multi-Head Attention if you don't know Self-Attention.

Self-Attention Mechanism

If you understand Self-Attention, then you can understand any form of transformer architecture. This is really where the magic happens. In order to understand Self-Attention, I first need to introduce some context and the problem we had with previous models, and we will do that with an example sentence.

"MY FRIENDS LIKE TOMATOES BECAUSE THEY ARE TASTY"

There is some ambiguity in this sentence, and it has to do with "they": we need to understand what "they" points to. One reading would be "my friends like tomatoes because my friends are tasty", which does not make sense; the other, correct reading is "my friends like tomatoes because tomatoes are tasty". So "they" refers to tomatoes here. For us, understanding this is quite intuitive, something we have done since we were small children, but for a model it is difficult to grasp these nuances. This is where long-term dependencies need to be accounted for, and we will do that with self-attention.

So, how can we resolve references between words?

The model can indeed compute such a representation, and we are going to do that by relating each word to all the other words through attention. We want to compute an embedding that contains not only information about each word, but about each word in relationship with all the other words in the sequence, so that it knows the overall context. The mechanism that produces this embedding, an attention matrix, is the Self-Attention Mechanism.

Here we have our nice little sentence. If we take "they", we can draw a connection from it to every other word in the sequence, and the strength of each relationship is represented by the thickness of the line. "They" is particularly related to "tomatoes", as that line is very thick. This is attention in a nutshell: we want to understand the relationship of each word with all the other words in the sequence.

Self Attention Mechanism Score ("they")


The attention mechanism is simple but the mathematics behind it is quite complex.

Self-Attention as the Protagonist:

Input Embedding Matrix (I) → Query Matrix (Q), Key Matrix (K), Value Matrix (V)

Input Embedding Matrix (I)

The dimensions will be the number of words we have by the embedding dimension. The embedding dimension is somewhat arbitrary; you can set it to 64, 128, and so on. Each row is a word vector: in other words, each row is the embedding for a word, and it provides all the features, by which I mean all the values of the embedding for that particular word.

How a word is denoted by a vector in AI models - the numbers are simply made up.

"I like cats": we take this natural-language sentence and convert it into a matrix in which each row is a word and the columns hold the embedding values/features for that word. For example, 0.2, 1.2, ... is the word vector for "I". This is how we let the model understand the original sentence: we convert it into an input matrix like this.

Input, Query, Key and Value Embedding Matrices

They all have the same dimensions as the input embedding matrix, although you could in principle choose a different embedding dimension. In other words, these embeddings have the same number of features (columns), and the number of rows is the same because each row represents a word of the sentence.

How do we derive Q, K, V?

Multiply the input matrix by three weight matrices: I·Wq = Q, I·Wk = K, and I·Wv = V.

We derive them by multiplying the input matrix by three weight matrices: one multiplication gives the Query matrix, and the same happens for the K and V matrices. Now, the weight matrices are the ones we learn when training the transformer; in other words, among the things we learn during training are the weights Wq, Wk, and Wv.

So, why do we need all these matrices? The reason is that by using them we can relate a word to all the other words in a sequence (the sequence could be a sentence, a paragraph, a chapter, an entire book, etc.), and to compute this relationship we need Q, K, and V. The formula from the paper for the attention matrix is Z = softmax(Q · K^T / √d_k) · V, where the softmax term X = softmax(Q · K^T / √d_k) is the attention score.

Self Attention Mechanism Score

So, Z is a function of Q, K, and V; it is like an overall embedding that holds the information of each word in relation to all the other words.

X is the attention score; it gives the relevance of the different parts of the sequence to each other. For each word it tells us how relevant all the other words are in relation to that target word, i.e., how important the different connections between the words are.

Next, we multiply the attention score by the value matrix V. The attention matrix Z then holds the relationship of each word with every other word.

Let us assume you are at a party. It is very noisy and a lot of different people are speaking. The equivalent of the attention score would be your brain searching for the people who are most relevant to you: you are searching for them, but you do not yet know what they are talking about.

So you just search for the speakers you are interested in, but you do not know the context. The equivalent of the value matrix would be the context of the discussion. The overall attention matrix you get is then the mediation between your brain searching for things that are relevant to you and the context of the discussion those speakers are having. When you combine those two things, you get the Self-Attention Mechanism.

How can the transformer make sense of this phrase and understand that "they" is connected to "tomatoes"? We simply go into the attention matrix and look at the attention vector associated with "they":

Z(they) = 0.01·v(My) + 0.02·v(friends) + 0.03·v(like) + 0.9·v(tomatoes) + 0.05·v(because) + 0.0·v(they) + 0.01·v(are) + 0.0·v(tasty)

These numbers are completely made up, but they illustrate the point.

"Friends" has a score of almost 0, so it barely affects the attention vector of "they", whereas "tomatoes" has a score of 90%. This shows how powerful the attention mechanism is: by looking at these attention vectors, we can see what a word is related to, in this case that "they" is related to "tomatoes". Thanks to this self-attention score, the transformer can understand long-term relationships, and this applies even to very long texts or entire chapters. By using the attention mechanism, you can find the relationship between each word and all the other words.

Isn't it just fascinating?

Positional Encoding

But there is a problem: the Transformer does not know anything about the order of the input sequence, and order is quite important in sequential data such as text or music. So we need to encode position. Our goal is to find a strategy to encode the order directly in the embedding, and that strategy is called Positional Encoding.

So, now we start with "I like cats", we generate an input embedding, we add the positional encoding to it, and that combined information is fed into the first encoder block.

Encoder Block – Architecture Components (Layers)

Multi-Head Attention runs multiple instances of the self-attention mechanism in parallel: rather than having a single self-attention mechanism, you have many in parallel, and you compute as many Q, K, V, and Z matrices as there are heads.

Each attention head has its own independent weight matrices, so we end up with a number of different attention matrices, one for each attention head. We then take all of them and concatenate them.

Z = Concat(Z1, Z2, Z3, …, Zn) · Wo

Why multiple heads? Can't we have a single head per encoder block?

Each attention head can focus on different parts of the input sequence: perhaps one head focuses mainly on syntax, another on morphology, and a third on semantics. Different heads can take different perspectives on the same sequence of words. Because of that, you increase the complexity of the representation, so you can represent more complex constructs, and at the same time you reduce the risk of overfitting.

An LSTM or recurrent neural network processes words sequentially: you feed one word at a time into the LSTM. That is not true for a Transformer, where all the words are fed in parallel through the input embedding. This is a great thing because everything works in parallel, which decreases training time and makes learning more efficient. At the same time, parallel processing helps with learning long-term dependencies, since the entire input is in one batch: you have all the words at once, so the attention mechanism can capture the relationships between each word and all the others.

The Feedforward layer is actually a simple dense (fully connected) layer, and it processes each data point separately. By data point I mean each word embedding, which is processed on its own, independently of all the other data points.

Why do we apply a Feedforward layer? Because it adds nonlinearity, which increases the complexity of the overall representation so that we can learn more sophisticated features.

Add & Norm

The first Add & Norm layer connects the input of the attention layer to its output, and the second Add & Norm layer, after the Feedforward layer, connects the input of the Feedforward layer to its output. If you are familiar with residual networks, these are simply skip connections, or residual connections. They are called skip connections because they skip one step and then add the output of that subcomponent to its input.

We use this kind of skip connection because it helps mitigate a major problem: vanishing gradients. When you have many layers, the gradients tend to become very small, and it becomes very difficult to continue training, because with tiny gradients you cannot effectively update the weights. By using skip connections, we help mitigate the vanishing-gradient problem.

The Norm layer normalizes the data across all the features in the embedding for each position, i.e., for each word we take its embedding and apply normalization so that we get a mean of 0 and a standard deviation of 1. We do this because normalized data leads to faster convergence: learning becomes easier because we prevent the values from changing dramatically from one layer to the next.

Encoder Block : Quick Summary

Multi-Head Attention: it provides context, i.e., all the long-term relationships between each word in the input sequence and all the other words in the same sequence.

Feedforward: it provides nuance, because it adds a degree of complexity to the kind of representation the model can learn.

Add & Norm: a utility layer that streamlines learning and makes it easier to train a very deep network with many encoder blocks stacked on top of each other.

Encoder Step-by-Step Guide

  • Create Input Embedding.
  • Sum the embedding and the positional encoding, so that we now have an input embedding that also carries information about the order of the words in the sequence.
  • Feed this transformed input into the first encoder block and compute Q, K, and V by multiplying the input embedding by the weight matrices.
  • Calculate the attention scores and the local attention matrices in each attention head.
  • Compute the global attention matrix Z by concatenating the matrices produced by the different attention heads.
  • Add Z to the input of the Multi-Head Attention layer and normalize the result (the first Add & Norm).
  • Feed the output of this Add & Norm into the Feedforward layer.
  • Add the output of the Feedforward layer to its input (another skip connection) and apply normalization again. At this point, we are at the end of the first encoder block.
  • Repeat for all the encoder blocks, starting again from Multi-Head Attention. Once we reach the last encoder block, we get the final representation, which is extremely important because it holds a lot of abstract, complex, and contextual information; it is very rich.
  • Get the final representation as the output of the last encoder block.
  • Feed the final representation into the decoder (a compact end-to-end sketch follows this list).

Key Takeaways

  • Transformers capture long-term dependencies
  • Encoder/decoder architecture
  • Self-attention = relationships between words
  • Multiple attention heads capture different information
  • Positional encoding injects word order into the embedding
  • The Feedforward layer adds nuance and complexity
  • Add & Norm facilitates learning
  • Multiple encoder blocks are stacked
  • The final representation is fed to the decoder.
