The Transformer Model - A Neural Network That Uses Attention


Prerequisites :

For this blog I assume that you are already familiar with deep learning, the attention mechanism, and the encoder-decoder architecture.


Problems with RNN (Recurrent Neural Network) :

• RNNs cannot remember things across long sentences or long-range dependencies.

• RNNs cannot be parallelized, because they are sequential models that use recurrence.


Problems with LSTM (Long Short Term Memory) :

• LSTMs cannot be parallelized either, as they are also sequential.

• LSTMs take more time to train.


Transformer Model :

• To overcome these problems and to achieve human-like NLP outputs, the Transformer model was introduced in 2017 by Ashish Vaswani and his team at Google Brain in the paper "Attention Is All You Need".

• The Transformer model (specifically multi-headed self-attention) has now become the foundation for many state-of-the-art LLMs such as BERT, ChatGPT, and Google Bard.

• The Transformer uses only the attention mechanism to remember things.

• It has no recurrence, so it is faster to train using a parallel approach.

• Attention refers to the features the model focuses on most in order to find correlations or to generate human-like output.


Architecture of Transformer :

• The Transformer has an encoder-decoder architecture. There are 6 encoders and 6 decoders in the original paper.

• Each encoder has one multi-headed self-attention layer and one feed-forward layer.

• Each decoder has two self-attention layers: one masked multi-headed layer and one multi-headed layer similar to the one used in the encoder. Apart from these two self-attention layers, each decoder also has one feed-forward layer.

• Parallelization in the Transformer comes from how we feed data to the network. We feed all the words of the sentence to the network at the same time, and from there they are passed through the encoders and decoders.

• The input to the Transformer is embedded, and on top of this word embedding, positional encoding is applied. The Transformer does not use recurrence, so to understand which word comes first and which comes second, or basically to understand the order or sequence of the words, positional encoding is used. With each word we pass some information that tells the model where that word is located in the sentence; this is positional encoding.

• The encoders and decoders are connected to each other. We pass the input to the encoders, and after the data has passed through all the encoders, the output is passed to all the decoders. The output of the decoders is then passed to a linear layer and a softmax layer, which gives us the final output.

• This is the basic structure of the Transformer. In addition, to improve performance, the network uses normalization layers (Layer Normalization, an improvement on batch normalization) and skip connections, which help the model not forget things and forward important information further along the network (see the sketch after this list).

• The two most important things in the Transformer architecture are the self-attention layers and positional encoding.
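To make the skip connection and Layer Normalization pattern concrete, here is a minimal NumPy sketch of the "add and normalize" wrapper around a sublayer, following the post-norm form used in the original paper. This is illustrative only: the sublayer is a placeholder, and the learnable gain and bias of real layer normalization are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row (one word's vector) to zero mean and unit variance.
    Real Layer Normalization also has a learnable gain and bias (omitted here)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x)).
    The skip connection (x + ...) lets information bypass the sublayer."""
    return layer_norm(x + sublayer(x))

# Toy usage: 3 words with dimension 4, and an identity placeholder sublayer.
x = np.random.randn(3, 4)
out = residual_sublayer(x, lambda h: h)
print(out.shape)  # (3, 4)
```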

[Figure: Transformer Architecture]


Self Attention Layers :

• There are two types of self-attention layers: the multi-headed self-attention layer and the masked multi-headed self-attention layer.

• In the normal multi-headed self-attention layer, every input word is compared with every other word in the sentence.

• In the masked multi-headed self-attention layer, each word is compared only with the words that come before it (see the mask sketch after this list).
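One standard way to implement this masking is a causal (look-ahead) mask: future positions get a large negative score before the softmax, so their weights become approximately zero. A small NumPy sketch, with random stand-in scores:

```python
import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)        # stand-in attention scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s above the diagonal = "future" words
scores = np.where(mask == 1, -1e9, scores)        # block attention to future words

# Softmax turns the -1e9 entries into (near) zero weights, so each word
# attends only to itself and to the words before it.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # the upper triangle is ~0
```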


Multi Headed Self Attention :

• In more technical terms, self-attention uses scaled dot-product attention. It is computed multiple times in parallel to create the multi-headed effect, and everything is done with matrices to make it faster.

• In the multi-headed self-attention layer, the input embeddings are multiplied with some matrices: the Query, Key, and Value matrices. These matrices are initialized randomly in the beginning and get trained during training.

• As a result of these multiplications we get query, key, and value vectors.

• Now we calculate the score of each word against all other words in the sentence. For that, we take the dot product of the query vector of each word with the key vectors of all the other words. For example, if we want the score of the first word against the second word, we take the dot product of the query vector of the first word with the key vector of the second word; if we want the score of the first word against the third word, we take the dot product of the query vector of the first word with the key vector of the third word.

• Similarly, we calculate the scores for all words against all other words. Then we divide each score by 8, which is the square root of 64, the length of the query, key, and value vectors in the original paper.

• After dividing the scores (dot products) by 8, we pass them to a softmax layer. We do this to normalize the scores, so that the scores of one word against all other words sum to 1.

• The output of the softmax layer acts like weights: we multiply these weights with the value vector of each word.

• Then we sum up all the weighted value vectors for one word to create the output of the multi-headed self-attention layer for that word. The same is done for all the words simultaneously.

• This is done 8 times to create the multi-headed effect, so there are 8 different sets of Query, Key, and Value matrices. This way the Transformer is able to pay attention to many words at once.

• In the end we get 8 different output matrices, and each row of these matrices corresponds to one word. So we concatenate the 8 matrices together and multiply the result with yet another matrix, which is also trainable (the whole computation is sketched in code after this list).
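Putting these steps together, here is a minimal NumPy sketch of multi-headed scaled dot-product attention using the paper's dimensions (d_model = 512, h = 8 heads, d_k = d_v = 64). The weight matrices here are random stand-ins for trained parameters, and batching and masking are omitted for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot products, divided by sqrt(64) = 8
    weights = softmax(scores)        # each row (one word's scores) sums to 1
    return weights @ V               # weighted sum of the value vectors

# Dimensions from the original paper.
d_model, h, d_k = 512, 8, 64
seq_len = 5
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))  # word embeddings (+ positions)

# Randomly initialized projection matrices (trained in a real model).
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))
W_O = rng.normal(size=(h * d_k, d_model))  # the final trainable matrix

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]  # per-head q/k/v vectors
    heads.append(scaled_dot_product_attention(Q, K, V))

# Concatenate the 8 head outputs and project back to d_model.
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)  # (5, 512): one row per word
```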



Positional Encoding :

• Positional encoding is used to pass the location information of the words in the sentence.

• In the paper they used a fixed positional encoding.

• For positional encoding, the original paper uses sine and cosine functions at different frequencies (see the sketch after this list).

• We add the word embeddings and the positional encodings together and then input the result to the encoders.
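Concretely, the paper's fixed encoding is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch of computing it and adding it to toy embeddings:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encoding from "Attention Is All You Need":
    even dimensions use sin, odd dimensions use cos, at different frequencies."""
    pos = np.arange(max_len)[:, None]      # word positions: (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even embedding dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added (not concatenated) to the word embeddings.
d_model, seq_len = 512, 5
embeddings = np.random.randn(seq_len, d_model)  # stand-in word embeddings
encoder_input = embeddings + positional_encoding(seq_len, d_model)
print(encoder_input.shape)  # (5, 512)
```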


Working of Transformer :

• First we get the word embeddings, then we add the positional encodings to them, then we pass the result through all 6 encoders sequentially, and at the last encoder we get the output of the encoder stack.

• This encoder output is fed to all 6 decoders simultaneously; specifically, the encoder output is fed to the multi-headed self-attention sublayer of each decoder, while the input to the masked multi-headed self-attention layer is the decoders' output from the previous time step.

• The decoders work together to create the stacked output.

• This stacked decoder output is fed to a linear layer, where a linear transformation is done to create logit vectors. The length of a logit vector is the same as the vocabulary size.

• Then we pass this output to a softmax layer, which gives us the final output: the probabilities of the words (see the sketch below).
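As a minimal sketch of this final step, here is the linear projection to logits followed by a softmax over the vocabulary. The sizes are toy values (a real model projects d_model = 512 to a vocabulary of tens of thousands of words), and the weights and decoder output are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 512, 10              # vocab_size = 10 keeps the demo small
W = np.random.randn(d_model, vocab_size)   # trainable linear-layer weights
decoder_output = np.random.randn(d_model)  # one decoder output vector

logits = decoder_output @ W        # logit vector: one score per vocabulary word
probs = softmax(logits)            # probabilities that sum to 1
next_word = int(np.argmax(probs))  # index of the most likely next word
print(round(probs.sum(), 6), next_word)
```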



Image Reference: https://jalammar.github.io/illustrated-transformer/

