"Transformers in Machine Learning: A Deep Dive (Part 2)"
Chamindu Lakshan
The Decoder Segment
Okay, so far we understand how the encoder segment works, i.e. how inputs are converted into an intermediate representation. Let's now take a look at the decoder segment. This segment of a Transformer is responsible for converting that intermediate, high-dimensional representation into predictions for output tokens. The decoder segment is composed of several individual components:
- an output embedding layer;
- positional encoding;
- N stacked decoder blocks, each containing a masked multi-head attention sub-segment, a multi-head attention sub-segment over the encoder output, and a feed-forward network, each followed by an Add & Norm step;
- a final linear layer and Softmax activation that turn the decoder output into a token prediction.
Let’s now take a look at each of the decoder’s individual components in more detail.
Output embedding
Like the encoder's inputs, the inputs to the decoder segment are also embedded first. Of course, this happens with the outputs: the target phrases from the sentence pairs with which vanilla Transformers are trained. Here, too, learned embeddings are used, and Vaswani et al. (2017) share the weight matrix between both embedding layers and the pre-Softmax linear layer discussed below.
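To make that weight sharing concrete, here is a minimal NumPy sketch of the idea: one matrix backs both embedding lookups and the pre-Softmax linear layer. The vocabulary size, the random initialization and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

vocab_size, d_model = 37000, 512   # sizes comparable to Vaswani et al. (2017); illustrative here
rng = np.random.default_rng(0)

# A single shared weight matrix backs both embedding layers and the pre-Softmax linear layer.
W_shared = rng.normal(scale=d_model ** -0.5, size=(vocab_size, d_model))

def embed(token_ids):
    # Embedding lookup; the paper additionally scales embeddings by sqrt(d_model).
    return W_shared[token_ids] * np.sqrt(d_model)

def logits(decoder_output):
    # The pre-Softmax linear layer reuses the same matrix, transposed.
    return decoder_output @ W_shared.T

tokens = np.array([5, 42, 7])            # toy token ids
print(embed(tokens).shape)               # (3, 512)
print(logits(embed(tokens)).shape)       # (3, 37000)
```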
Positional Encoding
Exactly the same sine- and cosine-based positional encoding is performed in the decoder segment as in the encoder segment.
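For reference, the sine/cosine encoding can be written down in a few lines of NumPy. This is a sketch of the formula from the paper, assuming an even d_model:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sine/cosine positional encoding from Vaswani et al. (2017); assumes an even d_model."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine
    return pe

# The encoding is simply added to the (position-less) output embeddings, e.g.:
# decoder_input = embed(target_ids) + positional_encoding(len(target_ids), d_model)
```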
N times the decoder segment
The first two elements of the decoder segment are equal in functionality to the first two elements of the encoder segment. Now is where the differences appear, because we're going to look at the decoder block itself, which is also replicated N times (with N = 6 in Vaswani et al.'s work).
The decoder segment is composed of three sub-segments:
- a masked multi-head attention sub-segment;
- a regular multi-head attention sub-segment that attends over the encoder output;
- a position-wise feed-forward network.
Each of these is followed by a residual connection and Layer Normalization.
Finally, there is a small additional appendix: a linear layer and a Softmax activation function. These take the output of the decoder segment and transform it into a logits output (i.e. a value-based output for each of the tokens in the vocabulary) and a pseudoprobability output, which assigns probabilities to each of the possible token outputs given the logit values. By simply taking the argmax of these outputs, we can identify the most likely word prediction.
We’ll take a look at all these aspects in more detail now.
Masked Multi-head Attention
The first sub-segment to which the position-encoded embedded input is fed is called the masked multi-head attention segment. It is a regular attention segment with one irregularity:
It is regular in the sense that here, too, we have queries, keys and values. The queries and keys are matrix multiplied, yielding a score matrix, which is then combined with the values matrix in order to apply self-attention to the target values, i.e. to determine which of the output values are most important.
In other words, the flow is really similar to the flow of the multi-head attention segment in the encoder:
Except for one key difference, which is that this segment is part of the decoder, which is responsible for predicting which target must be output next.
If I'm constructing a phrase as a human being, I cannot rely on future words to produce the next word; I can only rely on the words that I have produced before. The same applies here, which is why the classic multi-head attention block does not work in the decoder segment: when predicting a token, the decoder should not be aware of future outputs (and especially their attention values). Otherwise it would be able to glimpse into the future when predicting for the present.
The flow above will therefore not work and must be adapted. Vaswani et al. (2017) do so by adding a mask into the flow of the multi-head attention layer, making it a masked multi-head attention layer.
But what is this mask about?
Recall that the matrix multiplication (MatMul) between queries and keys yields a score matrix, which is scaled and then put through a SoftMax layer. When this happens, we get (conditional) pseudoprobabilities for each token/word that tell us something about the word importance given another word (or token). But as you can see, this is problematic if we don’t want to look into the future: if we are predicting the next token after <I>, which should be <am>, we don't want to know that <doing> comes after it; humans simply don't know this when they are producing words on the fly.
That’s why a mask is applied to the scaled score matrix prior to generating pseudoprobabilities with Softmax. That is, if this is our score matrix…
…we apply what is known as a look-ahead mask. It is a simple matrix addition: we add another matrix to the scores matrix, whose values are either zero or minus infinity. All values that may be visible for a token (i.e. the token itself and all previous positions) are set to zero, so they remain the same. The others (Vaswani et al. (2017) call them illegal connections) have minus infinity added to them and hence become minus infinity.
If we then apply Softmax, we can see that the importance of all values that lie in the future is set to zero. They're no longer important. When masked, the model learns to attend only to values from the past when predicting for the present. This preserves the auto-regressive property of the decoder, which is what allows a trained Transformer to generate unseen sequences token by token.
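A minimal NumPy sketch of the masked score computation may help here. It builds the look-ahead mask, adds it to a toy scaled score matrix and applies Softmax; a large negative number stands in for minus infinity, and the shapes and random values are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def look_ahead_mask(size):
    """0 for visible (past and present) positions, a large negative number for 'illegal' future ones."""
    return np.triu(np.ones((size, size)), k=1) * -1e9

def masked_attention(Q, K, V):
    """Scaled dot-product attention with the look-ahead mask added to the score matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled score matrix
    scores = scores + look_ahead_mask(len(Q))  # mask out future positions
    weights = softmax(scores)                  # future positions end up with ~0 probability
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))            # toy self-attention: 4 tokens, d_k = 8
_, weights = masked_attention(Q, K, V)
print(np.round(weights, 2))                    # row i attends only to positions 0..i
```

Printing the weights shows a lower-triangular matrix: each token attends only to itself and the tokens before it.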
Adding residual and Layer Normalization
As is common in the Transformer architecture, the masked multi-head attention segment also makes use of residuals and layer normalization. In other words, a residual connection is added from the position-encoded output embedding to the addition layer, combining the output of the masked multi-head attention segment with its original input. This allows gradients to flow more freely, benefiting the training process. Layer normalization stabilizes training further, yielding better results.
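As a rough sketch (with the learned gain and bias of Layer Normalization omitted for brevity), the Add & Norm step boils down to:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over its feature dimension; learned gain/bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # Residual connection followed by Layer Normalization: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer_output)

# Usage, e.g.: add_and_norm(position_encoded_embedding, masked_attention_output)
```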
Regular Multi-Head Attention with Encoder Output
The second sub-segment in the decoder segment is the multi-head attention segment. This is a regular multi-head attention segment which computes a non-masked score matrix between queries and keys and then applies it to the values, yielding an attention-based outcome.
Contrary to the encoder segment, which computes self-attention over the inputs, this segment performs attention slightly differently. Here, the keys and values are derived from the output of the encoder segment, while the queries come from the masked multi-head attention sub-segment below it (Vaswani et al., 2017). In other words, the scores that determine how much attention is paid to certain words in the source phrase are computed against the inputs that were encoded before.
And this makes a lot of sense, because as we shall see, vanilla Transformers are trained on datasets with phrase pairs in different languages (Vaswani et al., 2017). For example, if the goal is to translate I am doing okay into German, attention over the encoded source phrase tells the decoder which input words matter for the next output word, actually spawning sequence-to-sequence abilities for a Transformer model.
That this actually happens can also be seen in the figure below: the keys and values derived from the encoder output are combined, via the score matrix, with the queries generated from the masked multi-head attention segment and its residual. In other words, this segment combines encoder output with target output, and hence provides the ability to make the 'spillover' from source language into target language (or, more generally, from source text into target text).
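The sketch below shows this encoder-decoder attention in NumPy: queries are projected from the decoder's masked self-attention output, keys and values from the encoder output. The projection matrices would be learned in practice; here they are random and single-headed purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k, src_len, tgt_len = 512, 64, 10, 7

encoder_output = rng.normal(size=(src_len, d_model))   # output of the encoder stack
decoder_hidden = rng.normal(size=(tgt_len, d_model))   # output of masked self-attention + Add & Norm

# Learned projections in practice; random placeholders here.
W_q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
W_k = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
W_v = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))

Q = decoder_hidden @ W_q   # queries come from the decoder
K = encoder_output @ W_k   # keys come from the encoder
V = encoder_output @ W_v   # values come from the encoder

scores = Q @ K.T / np.sqrt(d_k)    # (tgt_len, src_len): target positions attend over source positions
weights = softmax(scores)          # no look-ahead mask needed here
cross_attended = weights @ V       # (tgt_len, d_k)
print(cross_attended.shape)
```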
Adding residual and Layer Normalization
Here, too, we add the residual and perform Layer Normalization before we move forward.
Feed-Forward Layer
As in the encoder, a Feed Forward network composed of two linear layers and a ReLU activation function (discussed in a future blog) is applied position-wise, i.e. to each position separately and identically.
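In code, the position-wise feed-forward network is just two matrix multiplications with a ReLU in between, applied to every position independently. The weights below are random placeholders, with the layer sizes taken from the paper:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear layer + ReLU
    return hidden @ W2 + b2                 # second linear layer

d_model, d_ff, seq_len = 512, 2048, 7       # layer sizes from Vaswani et al. (2017)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
print(feed_forward(x, W1, b1, W2, b2).shape)   # (7, 512)
```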
Adding residual and Layer Normalization
The results of this network are added with another residual and subsequently a final Layer Normalization operation is performed.
Generating a token prediction
After the residual has been added and the layer has been normalized (visible in the figure as Add & Norm), we can start working towards the actual prediction of a token (i.e., a word). This is achieved by means of a linear layer and a Softmax activation function. In this linear layer, which shares its weight matrix with the embedding layers, logits are generated, i.e. a score for each token given the encoded inputs and the decoded outputs. With a Softmax function, we can then generate output (pseudo)probabilities for all the tokens in our vocabulary.
Selecting the token prediction is then really simple. By taking the maximum argument (argmax) value, we can select the token that should be predicted next given the inputs and outputs sent into the model.
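Putting these last two steps together, a greedy prediction of the next token looks roughly like this; the shared weight matrix and the decoder output are random placeholders for illustration:

```python
import numpy as np

def predict_next_token(decoder_output, W_shared):
    """Greedy decoding step: shared linear layer, Softmax, argmax."""
    logits = decoder_output @ W_shared.T           # one logit per vocabulary token
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # (pseudo)probabilities
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
vocab_size, d_model = 37000, 512
W_shared = rng.normal(scale=d_model ** -0.5, size=(vocab_size, d_model))
last_position = rng.normal(size=(d_model,))        # decoder output at the most recent position

token_id, probs = predict_next_token(last_position, W_shared)
print(token_id, float(probs[token_id]))
```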
Et voila, that’s the architecture of a vanilla Transformer!
Training a Transformer
Vanilla Transformers are so-called sequence-to-sequence models, converting input sequences to target sequences. This means that they should be trained on bilingual datasets if the task is machine translation.
For example, Vaswani et al. (2017) have trained the vanilla Transformer on the WMT 2014 English-to-German translation dataset, i.e. training for a translation task.
The training set of this dataset contains about 4.5 million sentence pairs (Stanford, n.d.). Every English phrase has a corresponding phrase in German, or at least German-like text.
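Conceptually, the training data is just a list of (source, target) phrase pairs; the examples below are made up for illustration and are not taken from the WMT 2014 dataset:

```python
# Hypothetical (source, target) pairs; a real training set contains millions of these.
training_pairs = [
    ("I am doing okay.",      "Mir geht es gut."),
    ("The weather is nice.",  "Das Wetter ist schön."),
    ("Where is the station?", "Wo ist der Bahnhof?"),
]
```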
Summary
Transformers are taking the world of Natural Language Processing by storm. But their architectures are relatively complex and it takes quite some time to understand them sufficiently. That’s why in this article we have looked at the architecture of vanilla Transformers, as proposed by Vaswani et al. in a 2017 paper.
This architecture, which lies at the basis of all Transformer-related work today, has solved one of the final problems in sequence-to-sequence models: that of sequential processing. No recurrent segments are necessary anymore, meaning that networks can benefit from parallelism, significantly speeding up training. In fact, today's Transformers are trained with millions of sequences, if not more.
To provide the necessary context, we first looked at what Transformers are and why they are necessary. We then moved forward looking at the encoder and decoder segments.
We saw that in the encoder segment, inputs are first passed through a (learned) input embedding, which converts integer-based tokens into lower-dimensional vectors. These are then position-encoded by means of sine and cosine functions, to add information about the relative position of tokens to the embedding: information naturally available in traditional models due to the sequential nature of processing, but now lost given the parallelism. After these preparation steps, the inputs are fed to the encoder segment, which learns to apply self-attention. In other words, the model itself learns which parts of a phrase are important when a particular word is looked at. This is achieved with multi-head attention and a feed-forward network.
The decoder segment works in a similar way, albeit a bit differently. First of all, the outputs are embedded and position-encoded, after which they are passed through a masked multi-head attention block. This block applies a look-ahead mask when generating the scores matrix, to ensure that the model cannot look at words down the line when predicting a word in the present. In other words, it can only use past words in doing so. Subsequently, another multi-head attention block is added, using the encoded inputs as keys and values and the attended output values as queries. This combination is passed to a feed-forward segment, which finally allows us to generate a token prediction by means of an additional Linear layer and a Softmax activation function.
Vanilla Transformers are trained on bilingual datasets if they are used for translation tasks. An example of such a dataset is the WMT 2014 English-to-German dataset, which contains English and German phrases; it was used by Vaswani et al. (2017) for training their Transformer.
I hope you have learned something from this 2 part series on Transformers in Machine Learning. If you have any questions, comments or suggestions, please leave them in the comment section below. Thanks for reading!!
References
Wikipedia. (2005, April 7). Recurrent neural network. Wikipedia, the free encyclopedia. Retrieved December 23, 2020, from https://en.wikipedia.org/wiki/Recurrent_neural_network
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30, 5998–6008.
Nuric. (2018). What does Keras Tokenizer method exactly do? Stack Overflow. https://stackoverflow.com/a/51956230
KDNuggets. (n.d.). Data representation for natural language processing tasks. KDnuggets. https://www.kdnuggets.com/2018/11/data-representation-natural-language-processing.html
Wikipedia. (2014, August 14). Word embedding. Wikipedia, the free encyclopedia. Retrieved December 24, 2020, from https://en.wikipedia.org/wiki/Word_embedding
Ncasas. (2020). Weights shared by different parts of a transformer model. Data Science Stack Exchange. https://datascience.stackexchange.com/a/86363
Dontloo. (2019). What exactly are keys, queries, and values in attention mechanisms? Cross Validated. https://stats.stackexchange.com/a/424127
Wikipedia. (2002, October 22). Matrix multiplication. Wikipedia, the free encyclopedia. Retrieved December 24, 2020, from https://en.wikipedia.org/wiki/Matrix_multiplication
Stanford. (n.d.). The Stanford natural language processing group. The Stanford Natural Language Processing Group. https://nlp.stanford.edu/projects/nmt/