An Illustrative Introduction to Transformers (Part 1/3)

Link: Part 2


“Attention Is All You Need” was the name of the research paper that came out in 2017, revolutionizing the field of Natural Language Processing (NLP) and significantly impacting Deep Learning.

So, what was so special about the paper that led to such a profound impact?

The answer to that question is that the paper introduced a new architecture called the Transformer.

While the paper and the model itself were created for the purpose of machine translation, it turns out this specific approach and the use of the "Attention Mechanism" can be used to address various kinds of problems (with a few changes, of course). Attention mechanisms themselves weren't entirely new, but the Transformer placed them at the forefront.

This is what the architecture looks like:

We're going to get technical in a bit, but before that, let me outline the major components of the architecture:

  1. Input Embedding
  2. Positional Embedding
  3. Encoder - consists of Multi-headed Attention, 'Add & Norm' layers, 'Residual' connections, and a Feed Forward layer (also called a Dense layer)
  4. Decoder - consists of 'Masked' Multi-headed Attention, Multi-headed Attention, 'Add & Norm' layers, 'Residual' connections, and a Feed Forward layer

The Nx that you see in the diagram corresponds to the number of Encoders and Decoders used (Nx was equal to 6 in the paper). These encoders and decoders are stacked one above another: the output of the first encoder is fed to the second, the output of the second encoder is fed to the third, and so on. The output of the last encoder is fed to all of the decoders, while the output of the first decoder is fed to the second, the output of the second decoder is fed to the third, and so on - same as in the case of the encoders. (If this doesn't make sense as of now, that's fine - keep reading and come back to this paragraph after you've read how the decoder uses the output of the encoder to calculate its outputs.)
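
As a rough sketch of that data flow in Python (the `encoders` and `decoders` lists and their call signatures here are hypothetical, just to show how the outputs are chained):

```python
# Hypothetical sketch of how the stacked encoders and decoders pass outputs along.
# `encoders` and `decoders` are assumed to be lists of layer-like callables.

def run_encoder_stack(x, encoders):
    # The output of each encoder is the input of the next one.
    for encoder in encoders:
        x = encoder(x)
    return x  # output of the last encoder

def run_decoder_stack(y, last_encoder_output, decoders):
    # Every decoder receives the output of the *last* encoder,
    # while the decoder outputs themselves are chained like the encoders.
    for decoder in decoders:
        y = decoder(y, last_encoder_output)
    return y
```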

Now, instead of defining each component separately as an independent entity, I'll try to explain how the model works as a whole for the task of machine translation. For example, suppose we have an English sentence, "My name is Pratyush", and we want to translate it to Hindi, which will be "मेरा नाम प्रत्यूष है".

We'll start with the English sentence, and we'll tokenize it.

What does it mean to tokenize a sentence? Simply, to break the sentence into tokens or parts. We can do that by dividing the sentence into its constituent characters, words, or sub-words. Here, we're going to divide the sentence by words.

After tokenizing our sentence, we'll end up with an array of strings or tokens, consisting of words:

['My', 'name', 'is', 'Pratyush']
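
A minimal sketch of that word-level tokenization (real tokenizers also handle punctuation, casing, and sub-words, but splitting on spaces is enough for this example):

```python
sentence = "My name is Pratyush"

# Word-level tokenization: split the sentence on whitespace.
tokens = sentence.split()
print(tokens)  # ['My', 'name', 'is', 'Pratyush']
```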

This will be the input that is fed to the "Input Embedding" block in the diagram shown above.

Now, just before the list of tokens is processed further, the list of tokens (also known as an input sequence) is converted to a list of IDs (numbers), one corresponding to each token. The purpose of these IDs is to uniquely identify a token (in this case, a word).

If the input sequence has the same token repeated, all of those occurrences will have the same ID. Also, the total number of IDs corresponds to the size of the vocabulary, i.e. in the model's eyes these are the only tokens/IDs/words that exist.
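
For illustration, here is that token → ID step with a toy vocabulary (the IDs and vocabulary below are made up; a real vocabulary holds tens of thousands of entries):

```python
# Toy vocabulary: every distinct token in the vocabulary gets one unique ID.
vocab = {'My': 0, 'name': 1, 'is': 2, 'Pratyush': 3}

tokens = ['My', 'name', 'is', 'Pratyush']
input_ids = [vocab[token] for token in tokens]
print(input_ids)  # [0, 1, 2, 3]

# A repeated token always maps to the same ID:
print([vocab[t] for t in ['name', 'is', 'name']])  # [1, 2, 1]
```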

Now this input sequence, which consists of IDs, is fed to the Embedding Layer.

The Embedding Layer maintains a dictionary from an ID → a "list of numbers" that is used to represent that specific token or ID. The size of that list of numbers is a hyperparameter, and in the paper it was set to 512, which means that for each ID, the Embedding Layer has a list of 512 numbers representing that token/ID. We call this list of numbers an Embedding. Most importantly, the Embedding corresponding to each ID is not fixed; it is learnable, it is a parameter. The Transformer learns this representation of the tokens/IDs.

So now we know what an Embedding Layer is, but what happens when you feed the input sequence to this layer? It replaces each input ID with its corresponding Embedding, and so we end up with a list of lists of numbers, that is, a matrix / 2-D tensor.

Can you guess the dimensions of this matrix, given that the length of the input sequence is seq_len and the size of the Embedding is d_model?

It'll be (seq_len, d_model).

This is just for illustration
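
A small NumPy sketch of that lookup (the random numbers below just stand in for the learned embeddings, and vocab_size is an assumption):

```python
import numpy as np

vocab_size, d_model = 10000, 512   # d_model = 512, as in the paper

# The embedding "dictionary": one row of d_model numbers per ID.
# Randomly initialised here; in a real model these rows are learned.
embedding_table = np.random.randn(vocab_size, d_model)

input_ids = [0, 1, 2, 3]                 # seq_len = 4
embedded = embedding_table[input_ids]    # each ID is replaced by its row
print(embedded.shape)                    # (4, 512) -> (seq_len, d_model)
```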

This 2-D tensor (let's still call it the input sequence) passes through Positional Encoding.

What is Positional Encoding and why do we need it? Basically, we want to be able to carry some information related to the relative positions of words (tokens/IDs) with respect to other words, i.e. where the different words are located with respect to each other.

To achieve this, another matrix (let's call it p) is prepared with the same dimensions (seq_len, d_model), and this matrix is added to the input sequence matrix.

So, how are the values of p calculated? You can use the formula given below:
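
For reference, this is the sinusoidal formula from the "Attention Is All You Need" paper, written with the symbols defined just below (using 0-based column indexing):

$$
p_{i,j} =
\begin{cases}
\sin\!\left(\dfrac{i}{10000^{\,j/\text{d\_embed\_dim}}}\right) & \text{if } j \text{ is even} \\[8pt]
\cos\!\left(\dfrac{i}{10000^{\,(j-1)/\text{d\_embed\_dim}}}\right) & \text{if } j \text{ is odd}
\end{cases}
$$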

i → corresponds to the row, i.e. the i-th token/ID,

j → corresponds to the column, which goes from 1 to 512 (or 0 to 511, if we consider 0-based indexing), corresponding to each dimension of the Embedding,

d_embed_dim → corresponds to the size of the Embedding (d_model).

But then why do we use trigonometric functions to add positional information? The answer is simple: because they are continuous and periodic (the values repeat after a certain period, and this corresponds to a pattern that the model can learn from). Theoretically, we could use other functions too, provided they have these desirable properties.
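
Here is a small NumPy sketch of how p could be computed and added to the input sequence (the random `embedded` matrix just stands in for the Embedding Layer's output):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Build the (seq_len, d_model) matrix p: sin for even columns, cos for odd
    # columns, with the frequency decreasing as the column index grows.
    p = np.zeros((seq_len, d_model))
    i = np.arange(seq_len)[:, np.newaxis]                   # row index (token position)
    denom = 10000 ** (np.arange(0, d_model, 2) / d_model)   # one frequency per sin/cos pair
    p[:, 0::2] = np.sin(i / denom)   # even columns
    p[:, 1::2] = np.cos(i / denom)   # odd columns
    return p

seq_len, d_model = 4, 512
embedded = np.random.randn(seq_len, d_model)   # stand-in for the Embedding Layer output
encoder_input = embedded + positional_encoding(seq_len, d_model)
print(encoder_input.shape)                     # still (seq_len, d_model)
```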

So, we started with an input sentence and converted it into a 2-D matrix that contains representations of the sentence along with their positional information. This matrix is now ready to be fed to the Encoder.

An important thing I'd like to mention here is that usually we don't input a single sentence / input sequence to the model; we input a batch of sequences. So the models are designed to work with batches of sequences and expect 3-D tensors (batch_size, seq_len, d_model). But for illustration purposes, I am not considering the batch dimension.
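
In other words (shapes only; the batch_size below is an arbitrary example and the values don't matter):

```python
import numpy as np

batch_size, seq_len, d_model = 32, 4, 512
batch = np.random.randn(batch_size, seq_len, d_model)
print(batch.shape)   # (32, 4, 512) -> (batch_size, seq_len, d_model)
```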

Let us continue this discussion in the next post.
