"Transformers in Machine Learning: A Deep Dive (Part 2)"
Chamindu Lakshan
The Decoder Segment
Okay, so far we understand how the encoder segment works, i.e. how inputs are converted into an intermediate representation. Let's now take a look at the decoder segment. This segment of a Transformer is responsible for converting that intermediate, high-dimensional representation into predictions for output tokens. The decoder segment is composed of several individual components:
- an output embedding layer;
- positional encoding;
- N stacked decoder blocks, each containing a masked multi-head attention sub-segment, a multi-head attention sub-segment over the encoder output, and a feed-forward network, each followed by an Add & Norm step;
- a final linear layer and Softmax activation that turn the decoder output into a token prediction.
Let’s now take a look at each of the decoder’s individual components in more detail.
Output embedding
Like the encoder's inputs, the inputs to the decoder segment are also embedded first. Of course, this happens with the outputs: the target phrases from the sentence pairs with which vanilla Transformers are trained. Here, too, learned embeddings are used, and Vaswani et al. (2017) share the weight matrix between both embedding layers and the pre-Softmax linear layer discussed below.
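To make that weight sharing concrete, here is a minimal NumPy sketch of the idea: one matrix backs both embedding lookups and the pre-Softmax linear layer. The vocabulary size, the random initialization and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

vocab_size, d_model = 37000, 512   # sizes comparable to Vaswani et al. (2017); illustrative here
rng = np.random.default_rng(0)

# A single shared weight matrix backs both embedding layers and the pre-Softmax linear layer.
W_shared = rng.normal(scale=d_model ** -0.5, size=(vocab_size, d_model))

def embed(token_ids):
    # Embedding lookup; the paper additionally scales embeddings by sqrt(d_model).
    return W_shared[token_ids] * np.sqrt(d_model)

def logits(decoder_output):
    # The pre-Softmax linear layer reuses the same matrix, transposed.
    return decoder_output @ W_shared.T

tokens = np.array([5, 42, 7])            # toy token ids
print(embed(tokens).shape)               # (3, 512)
print(logits(embed(tokens)).shape)       # (3, 37000)
```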
Positional Encoding
Exactly the same sine- and cosine-based positional encoding is performed in the decoder segment as in the encoder segment.
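For reference, the sine/cosine encoding can be written down in a few lines of NumPy. This is a sketch of the formula from the paper, assuming an even d_model:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sine/cosine positional encoding from Vaswani et al. (2017); assumes an even d_model."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine
    return pe

# The encoding is simply added to the (position-less) output embeddings, e.g.:
# decoder_input = embed(target_ids) + positional_encoding(len(target_ids), d_model)
```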
N times the decoder segment
The first two elements of the decoder segment are equal in functionality to the first two elements of the encoder segment. Now is where the differences appear, because we're going to look at the decoder block itself, which is also replicated N times (with N = 6 in Vaswani et al.'s work).
The decoder segment is composed of three sub-segments:
- a masked multi-head attention sub-segment;
- a regular multi-head attention sub-segment that attends over the encoder output;
- a position-wise feed-forward network.
Each of these is followed by a residual connection and Layer Normalization.
Finally, there is a small additional appendix: a linear layer and a Softmax activation function. These take the output of the decoder segment and transform it into a logits output (i.e. a value-based output for each of the tokens in the vocabulary) and a pseudoprobability output, which assigns probabilities to each of the possible token outputs given the logit values. By simply taking the argmax of these outputs, we can identify the most likely word prediction.
We’ll take a look at all these aspects in more detail now.
Masked Multi-head Attention
The first sub-segment to which the position-encoded embedded input is fed is called the masked multi-head attention segment. It is a regular attention segment with one irregularity:
It is regular in the sense that here, too, we have queries, keys and values. The queries and keys are matrix multiplied, yielding a score matrix, which is then combined with the values matrix in order to apply self-attention to the target values, i.e. to determine which of the output values are most important.
In other words, the flow is really similar to the flow of the multi-head attention segment in the encoder:
Except for one key difference, which is that this segment is part of the decoder, which is responsible for predicting which target must be output next.
If I'm constructing a phrase as a human being, I cannot rely on future words to produce the next word; I can only rely on the words that I have produced before. The same applies here, which is why the classic multi-head attention block does not work in the decoder segment: when predicting a token, the decoder should not be aware of future outputs (and especially their attention values). Otherwise it would be able to glimpse into the future when predicting for the present.
The flow above will therefore not work and must be adapted. Vaswani et al. (2017) do so by adding a mask into the flow of the multi-head attention layer, making it a masked multi-head attention layer.
But what is this mask about?
Recall that the matrix multiplication (MatMul) between queries and keys yields a score matrix, which is scaled and then put through a SoftMax layer. When this happens, we get (conditional) pseudoprobabilities for each token/word that tell us something about the word importance given another word (or token). But as you can see, this is problematic if we don’t want to look into the future: if we are predicting the next token after <I>, which should be <am>, we don't want to know that <doing> comes after it; humans simply don't know this when they are producing words on the fly.
That’s why a mask is applied to the scaled score matrix prior to generating pseudoprobabilities with Softmax. That is, if this is our score matrix…
…we apply what is known as a look-ahead mask. It is a simple matrix addition: we add another matrix to the scores matrix, whose values are either zero or minus infinity. All values that may be visible for a token (i.e. the token itself and all previous positions) are set to zero, so they remain the same. The others (Vaswani et al. (2017) call them illegal connections) have minus infinity added to them and hence become minus infinity.
If we then apply Softmax, we can see that the importance of all values that lie in the future is set to zero. They're no longer important. When masked, the model learns to attend only to values from the past when predicting for the present. This preserves the auto-regressive property of the decoder, which is what allows a trained Transformer to generate unseen sequences token by token.
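A minimal NumPy sketch of the masked score computation may help here. It builds the look-ahead mask, adds it to a toy scaled score matrix and applies Softmax; a large negative number stands in for minus infinity, and the shapes and random values are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def look_ahead_mask(size):
    """0 for visible (past and present) positions, a large negative number for 'illegal' future ones."""
    return np.triu(np.ones((size, size)), k=1) * -1e9

def masked_attention(Q, K, V):
    """Scaled dot-product attention with the look-ahead mask added to the score matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled score matrix
    scores = scores + look_ahead_mask(len(Q))  # mask out future positions
    weights = softmax(scores)                  # future positions end up with ~0 probability
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))            # toy self-attention: 4 tokens, d_k = 8
_, weights = masked_attention(Q, K, V)
print(np.round(weights, 2))                    # row i attends only to positions 0..i
```

Printing the weights shows a lower-triangular matrix: each token attends only to itself and the tokens before it.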
Adding residual and Layer Normalization
As is common in the Transformer architecture, the masked multi-head attention segment also makes use of residuals and layer normalization. In other words, a residual connection is added from the position-encoded output embedding to the addition layer, combining the output of the masked multi-head attention segment with its original input. This allows gradients to flow more freely, benefiting the training process. Layer normalization stabilizes training further, yielding better results.
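As a rough sketch (with the learned gain and bias of Layer Normalization omitted for brevity), the Add & Norm step boils down to:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over its feature dimension; learned gain/bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # Residual connection followed by Layer Normalization: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer_output)

# Usage, e.g.: add_and_norm(position_encoded_embedding, masked_attention_output)
```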
Regular Multi-Head Attention with Encoder Output
The second sub-segment in the decoder segment is the multi-head attention segment. This is a regular multi-head attention segment which computes a non-masked score matrix between queries and keys and then applies it to the values, yielding an attention-based outcome.
Contrary to the encoder segment, which computes self-attention over the inputs, this segment performs attention slightly differently. Here, the keys and values are derived from the output of the encoder segment, while the queries come from the masked multi-head attention sub-segment below it (Vaswani et al., 2017). In other words, the scores that determine how much attention is paid to certain words in the source phrase are computed against the inputs that were encoded before.
And this makes a lot of sense, because as we shall see, vanilla Transformers are trained on datasets with phrase pairs in different languages (Vaswani et al., 2017). For example, if the goal is to translate I am doing okay into German, attention over the encoded source phrase tells the decoder which input words matter for the next output word, actually spawning sequence-to-sequence abilities for a Transformer model.
That this actually happens can also be seen in the figure below: the keys and values derived from the encoder output are combined, via the score matrix, with the queries generated from the masked multi-head attention segment and its residual. In other words, this segment combines encoder output with target output, and hence provides the ability to make the 'spillover' from source language into target language (or, more generally, from source text into target text).
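The sketch below shows this encoder-decoder attention in NumPy: queries are projected from the decoder's masked self-attention output, keys and values from the encoder output. The projection matrices would be learned in practice; here they are random and single-headed purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k, src_len, tgt_len = 512, 64, 10, 7

encoder_output = rng.normal(size=(src_len, d_model))   # output of the encoder stack
decoder_hidden = rng.normal(size=(tgt_len, d_model))   # output of masked self-attention + Add & Norm

# Learned projections in practice; random placeholders here.
W_q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
W_k = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
W_v = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))

Q = decoder_hidden @ W_q   # queries come from the decoder
K = encoder_output @ W_k   # keys come from the encoder
V = encoder_output @ W_v   # values come from the encoder

scores = Q @ K.T / np.sqrt(d_k)    # (tgt_len, src_len): target positions attend over source positions
weights = softmax(scores)          # no look-ahead mask needed here
cross_attended = weights @ V       # (tgt_len, d_k)
print(cross_attended.shape)
```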
Adding residual and Layer Normalization
Here, too, we add the residual and perform Layer Normalization before we move forward.
Feed-Forward Layer
As in the encoder, a Feed Forward network composed of two linear layers and a ReLU activation function (discussed in a future blog) is applied position-wise, i.e. to each position separately and identically.
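In code, the position-wise feed-forward network is just two matrix multiplications with a ReLU in between, applied to every position independently. The weights below are random placeholders, with the layer sizes taken from the paper:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear layer + ReLU
    return hidden @ W2 + b2                 # second linear layer

d_model, d_ff, seq_len = 512, 2048, 7       # layer sizes from Vaswani et al. (2017)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
print(feed_forward(x, W1, b1, W2, b2).shape)   # (7, 512)
```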
Adding residual and Layer Normalization
The results of this network are added with another residual and subsequently a final Layer Normalization operation is performed.
Generating a token prediction
After the residual has been added and the layer has been normalized (visible in the figure as Add & Norm), we can start working towards the actual prediction of a token (i.e., a word). This is achieved by means of a linear layer and a Softmax activation function. In this linear layer, which shares its weight matrix with the embedding layers, logits are generated, i.e. a score for each token given the encoded inputs and the decoded outputs. With a Softmax function, we can then generate output (pseudo)probabilities for all the tokens in our vocabulary.
Selecting the token prediction is then really simple. By taking the maximum argument (argmax) value, we can select the token that should be predicted next given the inputs and outputs sent into the model.
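Putting these last two steps together, a greedy prediction of the next token looks roughly like this; the shared weight matrix and the decoder output are random placeholders for illustration:

```python
import numpy as np

def predict_next_token(decoder_output, W_shared):
    """Greedy decoding step: shared linear layer, Softmax, argmax."""
    logits = decoder_output @ W_shared.T           # one logit per vocabulary token
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # (pseudo)probabilities
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
vocab_size, d_model = 37000, 512
W_shared = rng.normal(scale=d_model ** -0.5, size=(vocab_size, d_model))
last_position = rng.normal(size=(d_model,))        # decoder output at the most recent position

token_id, probs = predict_next_token(last_position, W_shared)
print(token_id, float(probs[token_id]))
```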
Et voila, that’s the architecture of a vanilla Transformer!
Training a Transformer
Vanilla Transformers are so-called sequence-to-sequence models, converting input sequences to target sequences. This means that they should be trained on bilingual datasets if the task is machine translation.
For example, Vaswani et al. (2017) have trained the vanilla Transformer on the WMT 2014 English-to-German translation dataset, i.e. training for a translation task.
The training set of this dataset contains about 4.5 million sentence pairs (Stanford, n.d.). Every English phrase has a corresponding phrase in German, or at least German-like text.
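Conceptually, the training data is just a list of (source, target) phrase pairs; the examples below are made up for illustration and are not taken from the WMT 2014 dataset:

```python
# Hypothetical (source, target) pairs; a real training set contains millions of these.
training_pairs = [
    ("I am doing okay.",      "Mir geht es gut."),
    ("The weather is nice.",  "Das Wetter ist schön."),
    ("Where is the station?", "Wo ist der Bahnhof?"),
]
```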
Summary
Transformers are taking the world of Natural Language Processing by storm. But their architectures are relatively complex and it takes quite some time to understand them sufficiently. That’s why in this article we have looked at the architecture of vanilla Transformers, as proposed by Vaswani et al. in a 2017 paper.
This architecture, which lies at the basis of all Transformer-related work today, has solved one of the final problems in sequence-to-sequence models: that of sequential processing. No recurrent segments are necessary anymore, meaning that networks can benefit from parallelism, significantly speeding up training. In fact, today's Transformers are trained with millions of sequences, if not more.
To provide the necessary context, we first looked at what Transformers are and why they are necessary. We then moved forward looking at the encoder and decoder segments.
We saw that in the encoder segment, inputs are first passed through a (learned) input embedding, which converts integer-based tokens into lower-dimensional vectors. These are then position-encoded by means of sine and cosine functions, to add information about the relative position of tokens to the embedding: information naturally available in traditional models due to the sequential nature of processing, but now lost given the parallelism. After these preparation steps, the inputs are fed to the encoder segment, which learns to apply self-attention. In other words, the model itself learns which parts of a phrase are important when a particular word is looked at. This is achieved with multi-head attention and a feed-forward network.
The decoder segment works in a similar way, albeit a bit differently. First of all, the outputs are embedded and position-encoded, after which they are passed through a masked multi-head attention block. This block applies a look-ahead mask when generating the scores matrix, to ensure that the model cannot look at words down the line when predicting a word in the present. In other words, it can only use past words in doing so. Subsequently, another multi-head attention block is added, using the encoded inputs as keys and values and the attended output values as queries. This combination is passed to a feed-forward segment, which finally allows us to generate a token prediction by means of an additional Linear layer and a Softmax activation function.
Vanilla Transformers are trained on bilingual datasets if they are used for translation tasks. An example of such a dataset is the WMT 2014 English-to-German dataset, which contains English and German phrases; it was used by Vaswani et al. (2017) for training their Transformer.
I hope you have learned something from this 2 part series on Transformers in Machine Learning. If you have any questions, comments or suggestions, please leave them in the comment section below. Thanks for reading!!
References
Wikipedia. (2005, April 7). Recurrent neural network. Wikipedia, the free encyclopedia. Retrieved December 23, 2020, from https://en.wikipedia.org/wiki/Recurrent_neural_network
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30, 5998–6008.
Nuric. (2018). What does Keras Tokenizer method exactly do? Stack Overflow. https://stackoverflow.com/a/51956230
KDNuggets. (n.d.). Data representation for natural language processing tasks. KDnuggets. https://www.kdnuggets.com/2018/11/data-representation-natural-language-processing.html
Wikipedia. (2014, August 14). Word embedding. Wikipedia, the free encyclopedia. Retrieved December 24, 2020, from https://en.wikipedia.org/wiki/Word_embedding
Ncasas. (2020). Weights shared by different parts of a transformer model. Data Science Stack Exchange. https://datascience.stackexchange.com/a/86363
Dontloo. (2019). What exactly are keys, queries, and values in attention mechanisms? Cross Validated. https://stats.stackexchange.com/a/424127
Wikipedia. (2002, October 22). Matrix multiplication. Wikipedia, the free encyclopedia. Retrieved December 24, 2020, from https://en.wikipedia.org/wiki/Matrix_multiplication
Stanford. (n.d.). The Stanford natural language processing group. The Stanford Natural Language Processing Group. https://nlp.stanford.edu/projects/nmt/