Attention
Mukesh Manral
DataScience Specialist (Consultant) - Generative AI | MLOps | Data & AI Architect | Product Development | Cloud - AI + Education
01. Understanding Attention
In this lesson, we'll be talking about one of the most important innovations in deep learning in the last few years, Attention.
Attention started out in the field of computer vision as an attempt to mimic human perception.
This is a quote from a paper on Visual Attention from 2014.
It says that, "One important property of human perception is that one does not tend to process a scene in its entirety all at once.
Instead humans focus attention selectively on parts of the visual space to acquire information when and where it's needed, and then combine information from different fixations over time to build up an internal representation of the entire scene, guiding future eye movement and decision-making."
What that means is that when we look at a scene in our daily lives, our brains do not just process a visual snapshot all at once.
Instead we selectively focus on different parts of the image, and we sequentially collect and process that visual information over time.
Say, for example, you're shopping in a shopping mall.
Now, the footage that we're going to see here is from an eye-tracking device.
If you haven't seen those before, it's a device that records both what's in front of you and it also records your eye movement.
Then, we can overlay these two recordings so we can have an idea of where you were looking at each time in the video.
So, what we're seeing here is footage from a person wearing the eye-tracking device, where the orange circle is highlighting where the person is looking at each moment.
So, we can see attention in general visual perception, but you can also see it in reading and trying to process text one word at a time.
This type of device is used, for example, in user experience testing.
In machine learning, attention methods give us a mechanism for adding selective focus into a machine learning model, typically one that does its processing sequentially.
Attention is a concept that powers up some of the best performing models spanning both natural language processing and computer vision.
These models include: neural machine translation, image captioning, speech recognition, and text summarization, as well as others.
Take image classification and captioning as an example.
Before the use of attention, convolutional neural networks were able to classify images by looking at the whole image and outputting a class label.
But not all of this image is necessary to produce that classification; only some of these pixels are needed to identify a bird, and attention came out of the desire to attend to these most important pixels.
Now, not only that, but attention also improved our ability to describe images with full sentences by focusing on different parts of the image as we generate our output sentence.
Attention achieved its rise to fame, however, from how useful it became in tasks like neural machine translation.
As sequence to sequence models started to exhibit impressive results, they were held back by certain limitations that made it difficult for them to process long sentences, for example.
Classic sequence to sequence models, without attention, have to look at the original sentence that you want to translate one time and then use that entire input to produce every single outputted word.
Attention, however, allows the model to look at the small, relevant parts of the input as it generates the output over time.
When attention was incorporated in sequence to sequence models, they became the state of the art in neural machine translation.
This is what led Google to adopt neural machine translation with attention as the translation engine for Google Translate at the end of 2016.
In this lesson, we'll look at how attention works and how and where it can be applied.
02. Sequence To Sequence Recap
A sequence to sequence model takes in an input that is a sequence of items, and then it produces another sequence of items as an output.
In a machine translation application, the input sequence is a series of words in one language, and the output is the translation in another language.
In text summarization, the input is a long sequence of words, and the output is a short one.
A sequence to sequence model usually consists of an encoder and a decoder. It works by the encoder first processing all of the inputs, turning them into a single representation, typically a single vector.
This is called the context vector, and it contains whatever information the encoder was able to capture from the input sequence.
This vector is then sent to the decoder which uses it to formulate an output sequence.
In machine translation scenarios, the encoder and decoder are both recurrent neural networks,
typically LSTM cells in practice, and in this scenario, the context vector is a vector of numbers encoding the information that the encoder captured from the input sequence.
In real-world scenarios, this vector can have a length of 256 or 512 or more.
As a visual representation, we'll start showing the hidden states as this vector of length four.
Just think of the brightness of the cells corresponding to how high or low the value of that cell is.
Let's look at our basic example again, but this time we will look at the hidden states of the encoder as they develop.
In the first step, we process the first word and generate the first hidden state.
In the second step, we take the second word and the first hidden state as inputs to the RNN, and produce a second hidden state.
In the third step, we process the last word and generate the last hidden state.
This last hidden state is the context vector that the encoder sends to the decoder.
Now, this here is the limitation of sequence to sequence models.
The encoder is confined to sending a single vector no matter how long or short the input sequence is.
Choosing a reasonable size for this vector makes the model have problems with long input sequences.
Now, one can say, let's just use a very large number of hidden units in the encoder, so that the context is very large.
But then your model overfits with short sequences, and you take a performance hit as you increase the number of parameters.
This is the problem that attention solves.
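To make the bottleneck concrete, here is a minimal sketch, assuming a PyTorch-style GRU encoder; the vocabulary and hidden sizes are illustrative, not taken from the lesson:

```python
import torch
import torch.nn as nn

# Toy classic seq2seq encoder: the whole input sequence gets squeezed
# into a single fixed-size context vector (the final hidden state).
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=256)
encoder = nn.GRU(input_size=256, hidden_size=512, batch_first=True)

tokens = torch.randint(0, 10_000, (1, 7))           # one sentence, 7 words
outputs, final_hidden = encoder(embedding(tokens))   # outputs: (1, 7, 512)

context_vector = final_hidden[-1]                    # shape (1, 512)
# Whether the sentence has 7 words or 70, the decoder only ever sees this
# one 512-dimensional vector -- the bottleneck that attention removes.
```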
03. Encoding -- Attention Overview
A Sequence to Sequence Model with attention works in the following way.
First, the encoder processes the input sequence just like the model without attention, one word at a time, producing a hidden state and using that hidden state in the next step.
Next, the model passes a context vector to the decoder, but unlike the context vector in the model without attention, this one is not just the final hidden state; it's all of the hidden states.
This gives us the benefit of flexibility in the context size, so longer sequences can have longer context vectors that better capture the information from the input sequence.
One additional point that's important for the intuition of attention is that each hidden state is associated most strongly with the word of the input sequence that was being processed when that hidden state was generated.
So, the first hidden state was outputted after processing the first word, so it captures the essence of the first word the most.
So when we focus on this vector, we will be focusing on that word the most; the same holds for the second hidden state and the second word, and for the third, even though that third and last vector incorporates a little bit of everything that preceded it as well.
04. Decoding -- Attention Overview
Now, let's look at the attention decoder and how it works at a very high level.
At every time step, an attention decoder pays attention to the appropriate part of the input sequence using the context vector.
How does the attention decoder know which of the parts of the input sequence to focus on at each step?
That process is learned during the training phase; it's not simply stepping sequentially from the first word to the second to the third.
It can learn some sophisticated behavior.
Let's look at this example of translating a French sentence to an English one.
So let's say we have this input sentence in French.
Let's say we pass this to our encoder and now we're ready to look at each step in the decoding phase.
In the first step, the attention decoder would pay attention to the first part of the sentence.
This is a trained model.
The lighter the square, the more attention the model paid to that word in particular.
So it pays attention to the first word and it outputs a first English word.
In the second step, it pays attention to the second word in the input sequence and translates that word as well.
It goes on sequentially for about four steps and produces a reasonable English translation so far.
Then something different happens here in the fifth step.
So, when we're generating the fifth word of the output, the attention actually jumps ahead two words to translate "European."
We have "zone économique européenne," and on the English side it's not going to be in the same order.
So, "européenne" is translated as "European"; in the next step the model focuses on the word before that, "économique," and outputs "economic"; then it focuses on "zone" and outputs "area."
This is a case where the order of these words in the French language does not follow how it would be ordered in the English language and the model was able to learn that just from a training data set.
The rest of the sentence goes on pretty much sequentially.
So, this is a really cool example of how attention is able to make these models focus on the right parts at the right moments based on what dataset we have.
05. Attention Overview
- A seq2seq model works by feeding one element of the input sequence at a time to the encoder
Which of the following is a limitation of seq2seq models which can be solved using attention methods?
SOLUTION:
The single fixed-size context vector passed from the encoder to the decoder is a bottleneck, especially for long input sequences
How large is the context matrix in an attention seq2seq model?
SOLUTION:
Depends on the length of the input sequence
06. Attention Encoder
Now that we've taken a high level look at how attention works in a sequence to sequence model, let's look into it in more detail.
We'll use machine translation as the example as that's the application the main papers on attention tackled.
But whatever we do here translates into other applications as well.
It's important to note that there are a number of variations of attention algorithms.
We'll be looking at a simple one here.
Let's start from the Encoder.
In this example, the Encoder is a recurrent neural network.
When creating an RNN, we have to declare the number of hidden units in the RNN cell.
This applies whether we have a vanilla RNN or an LSTM or GRU cell.
Before we start feeding our input sequence words to the Encoder, they have to pass through an embedding process which translates each word into a vector.
Here we can see the vector representing each of these words.
Now, this is a toy embedding of size four just for the purpose of easier visualization.
In real-world applications, a size like 200 or 300 is more appropriate.
We'll continue to use these color-coded boxes to represent the vectors, just so we don't have a lot of numbers plastered all over the screen.
Now that we have our words and their embeddings, we're ready to feed that into our Encoder.
Feeding the first word into the first time step of the RNN produces the first hidden state.
This is what's called an unrolled view of the RNN, where we can see the RNN at each time step.
We'll hold onto this state and the RNN would continue to process the next time step.
So, it would take the second word and pass it to the RNN at the second time step, and then it would do that with the third word as well.
Now that we have processed the entire input sequence, we're ready to pass the hidden states to the attention decoder.
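As a rough sketch of this encoder, assuming PyTorch and the toy sizes used in this lesson (embeddings of size four, a three-word input):

```python
import torch
import torch.nn as nn

# Toy attention encoder: embed each word, run the RNN, and keep *every*
# hidden state rather than only the last one.
vocab_size, embed_dim, hidden_dim = 20, 4, 4       # toy sizes for illustration
embedding = nn.Embedding(vocab_size, embed_dim)
encoder_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

input_words = torch.tensor([[3, 7, 11]])            # three-word input sequence
encoder_states, _ = encoder_rnn(embedding(input_words))

print(encoder_states.shape)  # (1, 3, 4): one hidden state per input word,
                             # all of which are passed to the attention decoder
```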
07. Attention Decoder
Let's now look at things on the decoder side.
In models without attention, we'd only feed the last context vector to the decoder RNN, in addition to the embedding of the end token, and it would begin to generate an element of the output sequence at each time-step.
The case is different in an attention decoder, however.
An attention decoder has the ability to look at the inputted words and the decoder's own hidden state,
and then it would do the following.
It would use a scoring function to score each hidden state in the context matrix.
We'll talk later about the scoring function, but after scoring, each context vector ends up with a certain score, and if we feed these scores into a softmax function, we end up with scores that are all positive, all between zero and one, and that all sum up to one.
These values are how much each vector will be expressed in the attention vector that the decoder will look at before producing an output.
Simply multiplying each vector by its softmax score and then summing up these vectors produces an attention context vector; this is a basic weighted sum operation.
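Here is a minimal NumPy sketch of that softmax-and-weighted-sum step; the raw scores are made up purely to show the mechanics:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_states = np.random.randn(3, 4)   # 3 encoder hidden states of size 4
scores = np.array([7.0, 1.0, 0.5])       # raw scores from some scoring function

weights = softmax(scores)                 # positive, between 0 and 1, sum to 1
attention_context = (weights[:, None] * encoder_states).sum(axis=0)

print(weights)            # ~[0.996, 0.002, 0.001]: the first state dominates
print(attention_context)  # weighted sum of the hidden states, shape (4,)
```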
The context vector is an important milestone in this process, but it's not the end goal.
In a later writeup, we'll explain how the context vector merges with the decoder’s hidden state to create the real output of the decoder at the time-step.
The decoder has now looked at the input word and at the attention context vector, which focused its attention on the appropriate place in the input sequence.
So, it produces a hidden state and it produces the first word in the output sequence.
Now, this is still an over-simplified look; that's why we have the asterisks here.
There is still a step, which we'll talk about in a later writeup, between the RNN and the final output.
In the next time-step, the RNN takes its previous output as an input, along with the hidden state from the previous time-step, and it generates its own attention context vector for that time-step; that produces a new hidden state for the decoder and a new word in the output sequence, and this goes on until we've completed our output sequence.
08. Attention Encoder & Decoder
In machine translation applications, the encoder and decoder are typically
SOLUTION:
Recurrent Neural Networks (Typically vanilla RNN, LSTM, or GRU)
Word Embeddings
What's a more reasonable embedding size for a real-world application?
SOLUTION:
200
What are the steps that require calculating an attention vector in a seq2seq model with attention?
SOLUTION:
Every time step in the decoder only
09. Multiplicative Attention
Earlier in this lesson, we looked at how the key concept of attention is to calculate an attention weight vector, which is used to amplify the signal from the most relevant parts of the input sequence and, at the same time, drown out the irrelevant parts.
In this writeup, we'll begin to look at the scoring functions that produce these attention weights.
An attention scoring function tends to be a function that takes in the hidden state of the decoder and the set of hidden states of the encoder.
Since this is something we'll do at each timestep on the decoder side, we only use the hidden state of the decoder at that timestep (or, in some scoring methods, the previous timestep).
Given these two inputs, this vector and this matrix, it produces a vector that scores each of these columns.
Before looking at the matrix version, which calculates the scores for all the encoder hidden states in one step, let's simplify it by looking at how to score a single encoder hidden state.
The first scoring method and the simplest is to just calculate the dot product of the two input vectors.
The dot product of two vectors produces a single number, so that's good.
But the important thing is the significance of this number.
Geometrically, the dot product of two vectors is equal to multiplying the lengths of the two vectors by the cosine of the angle between them,
and we know that cosine has this convenient property that it equals one if the angle is zero and decreases the wider the angle becomes.
What this means is that if we have two vectors with the same length, the smaller the angle between them, the larger the dot product becomes.
This dot product is a similarity measure between vectors.
The dot product produces a larger number the smaller the angle between the vectors is.
In practice, however, we want to speed up the calculation by scoring all the encoder hidden states at once, which leads us to the formal mathematical definition of dot product attention.
That's what we have here.
It is the hidden state of the current decoder timestep, transposed, times the matrix of the encoder hidden states.
That looks like this, and it produces the vector of scores.
With the simplicity of this method comes the drawback of assuming the encoder and decoder have the same embedding space.
This might work for text summarization, for example, where the encoder and decoder use the same language and the same embedding space.
For machine translation, however, each language tends to have its own embedding space.
This is a case where we might want to use the second scoring method, which is a slight variation on the first.
It simply introduces a weight matrix between the multiplication of the decoder hidden state and the encoder hidden states.
This weight matrix is a linear transformation that allows the inputs and outputs to use different embeddings and the result of this multiplication would be the weights vector.
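A small NumPy sketch of both multiplicative scoring variants, dot and "general", using the toy dimension of four from this lesson (the weight matrix stands in for a learned parameter):

```python
import numpy as np

h_t = np.random.randn(4)                # decoder hidden state at this timestep
H_s = np.random.randn(3, 4)             # matrix of 3 encoder hidden states

# 1. Dot scoring: assumes encoder and decoder share the same space.
dot_scores = H_s @ h_t                  # shape (3,): one score per encoder state

# 2. "General" scoring: a learned weight matrix W_a lets the two sides
#    live in different embedding spaces.
W_a = np.random.randn(4, 4)             # learned during training
general_scores = H_s @ (W_a @ h_t)      # shape (3,)
```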
Let us now look back at this animation and incorporate everything that we know about attention.
The first time step in the attention decoder starts by taking an initial hidden state as well as the embedding for the end symbol.
It does its calculation and generates the hidden state at that timestep; here, we are ignoring the actual outputs of the RNN and just using the hidden states.
Then we do our attention step.
We do that by taking in the matrix of the hidden states of the encoder.
We produce a scoring as we've mentioned.
So, if we're doing multiplicative attention, we'll use the dot product or maybe the general,
we produce the scores,
we do a softmax,
we multiply the softmax scores by each corresponding hidden state from the encoder,
we sum them up producing our attention context vector
and then what we do next is this:
we concatenate the attention context vector
with the hidden state of the decoder at that timestep, so h4.
So this would be c4 concatenated with h4.
We basically glue them together as one vector,
and then we pass them through a fully connected neural network, which is basically multiplying by the weights matrix WC and applying a tanh activation.
The output of this fully connected layer would be our first outputted word in the output sequence.
We can now proceed to the second step,
passing the hidden state to it and taking the output from the first decoder timestep.
We produce h5,
we start our attention at this step as well,
we score,
we produce a weights vector,
we do softmax, we multiply,
we add them up, producing c5,
the attention context vector at step five,
we glue it together with the hidden state,
we pass it through the same fully-connected network with tanh activation producing
the second word in our output and this goes on
until we have completed outputting the output sequence.
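Putting those steps together, here is a hedged NumPy sketch of a single Luong-style decoder timestep; all matrices are random stand-ins for parameters that would be learned during training:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden_dim = 4
H_s = np.random.randn(3, hidden_dim)         # encoder hidden states
h_t = np.random.randn(hidden_dim)            # decoder hidden state at this step (e.g. h4)
W_c = np.random.randn(hidden_dim, 2 * hidden_dim)  # learned output projection

scores = H_s @ h_t                           # multiplicative (dot) scoring
weights = softmax(scores)                    # attention weights
c_t = weights @ H_s                          # attention context vector (e.g. c4)

concat = np.concatenate([c_t, h_t])          # glue c4 and h4 together
attentional_hidden = np.tanh(W_c @ concat)   # tanh(W_c [c_t; h_t])
# A final learned projection over the vocabulary (not shown) would turn
# this vector into the actual output word for this timestep.
```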
This is pretty much the full view of how attention works in sequence-to-sequence models.
Next, we'll touch on additive attention.
10. Additive Attention
In this writeup, we'll look at the third commonly used scoring method.
It's called concat, and the way to do it is to use a feedforward neural network.
To take a simple example, let's say we're scoring this encoder hidden state at the fourth time step of the decoder.
Again, this is an oversimplified example scoring only one hidden state; in practice, we'd actually use a matrix and score them all in one step.
The concat scoring method is commonly done by concatenating the two vectors and making that the input to a feedforward neural network.
Let's see how that works. So, we merge them,
we concat them into one vector, and then we pass them through a neural network.
This network has a single hidden layer, and outputs this score.
The parameters of this network, are learned during the training process.
Namely the WA weights matrix, and the VA weights matrix.
To look at how the calculation is done, this is our concatenated vector,
we simply multiply it by W of a,
we apply a tanh activation, producing this two-by-one matrix.
We multiply that by the V of a weights matrix,
and we get the score for this encoder hidden state.
Formally, it is expressed like this,
where h_t, as we've mentioned, is the hidden state at the current time step,
and h_s is the set of encoder hidden states.
This is the concatenation; then we multiply it by W_a, apply a tanh activation, and then multiply by v_a transposed.
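As a sketch, the concat score for a single encoder hidden state could be computed like this, with made-up sizes for the learned W_a and v_a:

```python
import numpy as np

h_t = np.random.randn(4)                 # decoder hidden state (current timestep)
h_s = np.random.randn(4)                 # one encoder hidden state

W_a = np.random.randn(2, 8)              # learned weights of the hidden layer
v_a = np.random.randn(2)                 # learned weights of the output layer

concat = np.concatenate([h_t, h_s])      # (8,): the two vectors glued together
score = v_a @ np.tanh(W_a @ concat)      # v_a^T tanh(W_a [h_t; h_s]) -> one number
```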
One thing to note is the difference between the papers.
Concat is very similar to the scoring method from the Bahdanau paper, but what we've shown here
is the concat method from the Luong paper, where there's only one weights matrix.
In the Bahdanau paper there are two major differences that we can look at.
One of them is that the weights matrix is split into two,
so we don't have just W_a; we have W_a and U_a, and each is applied to its respective vector: W_a to the decoder hidden state and U_a to the encoder hidden state.
Another thing to note is that the Bahdanau paper uses the decoder hidden state from the previous time step,
while the Luong paper uses the one from the current time step at the decoder.
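For comparison, a sketch of the Bahdanau-style score, with the weights matrix split into W_a and U_a and the decoder hidden state taken from the previous timestep:

```python
import numpy as np

s_prev = np.random.randn(4)              # decoder hidden state from the *previous* timestep
h_s = np.random.randn(4)                 # one encoder hidden state

W_a = np.random.randn(2, 4)              # applied to the decoder hidden state
U_a = np.random.randn(2, 4)              # applied to the encoder hidden state
v_a = np.random.randn(2)

score = v_a @ np.tanh(W_a @ s_prev + U_a @ h_s)   # additive (Bahdanau) attention score
```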
Let's make a note on notation here, in case you're planning to read the papers.
Here we've used the notation mainly from the Luong paper where we referred to the encoder and the decoder hidden states as H.
So, H of T for the decoder, and H of S for the encoder.
So H is for hidden state, and T is for target; that's the target sequence that we're going to output, which is associated with the decoder.
S is for source.
In the Bahdanau paper, the decoder hidden state is not called H; it's called S.
So now the picture is complete. Now, we've gone over the entire attention process.
11. Additive and Multiplicative Attention
Which of the following are valid scoring methods for attention?
SOLUTION:
What's the intuition behind using dot product as a scoring method?
SOLUTION:
The dot product of two vectors in word-embedding space is a measure of similarity between them
12. Computer Vision Applications
In this concept, we'll go over some of the computer vision applications and tasks that attention empowers.
In the text below the video, we'll link to a number of papers in case you want to go deeper into any specific application or task.
In this video, we'll focus on image captioning and one of the key papers, from 2015, titled "Show, Attend and Tell."
This paper presented a model that achieved state-of-the-art performance in caption generation on a number of datasets.
For example, when presented with an image like this, the generated caption was,
"A woman is throwing a frisbee in a park."
When presented with an image like this, the generated caption was,
"A giraffe standing in a forest with trees in the background."
Models like these are trained on a dataset like MS COCO,
which has a set of about 200,000 images, each with five captions written by people.
This is sourced through something like Amazon's Mechanical Turk Service.
For example, this image from the dataset has these five captions as its labels, and this is the kind of dataset used to train a model like this.
If we look under the hood, the model is very similar to the sequence to sequence models we've looked at earlier in the lesson.
In this case, the model takes the image as an input to its encoder; the encoder generates a context,
passes it to the decoder.
The decoder then proceeds to output a caption.
The model generates the caption sequentially and uses attention to focus on the appropriate place of the image as it generates each word of the caption.
For example, when presented with this image, in the first step, the trained model focuses on this region.
So, this is the thumbnail of this image.
The white areas are where the model is paying the most attention right now.
So, we can see that it's mainly focused on the wings.
It then outputs the first element in the output sequence, or the caption, which is the word "a".
At the next step of the decoder, it focuses on this region, mainly the body of the bird as you see, and the output would be "bird"; then it expands its focus area to the region around the bird to try to figure out what to describe next, and the output at this step is "flying".
This goes on. We can see how its attention
is now starting to radiate out of the bird and focus on things behind it or around it.
So, it's generating "a bird flying over a body of", and then the focus here almost completely ignores the bird, looking everywhere else in the image, and outputs "water."
The first time I looked at something like this, image captioning specifically, it was mind-blowing to me, but now, we have an idea of how it works.
The model here is made of an encoder and a decoder, as we've mentioned.
The encoder in this case is a convolutional neural network that produces a set of feature vectors, each of which corresponds to a part of the image or a feature of the image.
To be more exact, the paper used a VGGNet convolutional network trained on ImageNet.
The annotation vectors were created from this feature map.
This feature volume has dimensions of 14 x 14 x 512, meaning that it has 512 features,
each of them has the dimensions of 14 x 14.
To create our annotation vector, we need to flatten each feature, turning it from 14 x 14 to 196 x 1.
So, this is simply reshaping the matrix.
After we reshape it, we end up with a matrix of 196 x 512.
So, we have 512 features, each one of them is a vector of 196 numbers.
So, this is our context vector.
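One way to do that reshaping in NumPy, assuming the 14 x 14 x 512 VGG feature volume described above:

```python
import numpy as np

feature_map = np.random.randn(14, 14, 512)     # stand-in for the VGG feature volume
annotations = feature_map.reshape(196, 512)    # flatten each 14 x 14 feature map

print(annotations.shape)  # (196, 512): 512 features, each flattened from
                          # 14 x 14 into a column of 196 numbers that the
                          # decoder can attend over
```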
We can proceed to use it just like we've used the context vector in the previous videos, where we score each of these features and then we merge them to produce our attention context vector.
The decoder is a recurrent neural network, which uses attention to focus on the appropriate annotation vector at each time step.
We plug this into the attention process we've outlined before and that's our image captioning model.
Be sure to check the text below the video for some very exciting applications in computer vision for attention.
Super interesting computer vision applications using attention:
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
Visual Question Answering: A Survey of Methods and Datasets
13. NLP Application: Google Neural Machine Translation
Google Neural Machine Translation
The best demonstration of an application is by looking at real-world systems that are in production right now. In late 2016, Google released the following paper describing Google’s Neural Machine Translation System:
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation [pdf]
This system later went into production powering up Google Translate.
Take a stab at reading the paper and connecting it to what we've discussed in this lesson so far. Below are a few questions to guide this external reading:
Text Summarization:
Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
14. Other Attention Methods
Since the two main Attention papers were published in 2014 and '15, Attention has been an active area of research with many developments.
While the two mechanisms continue to be commonly used, there have been significant developments over the years.
In this video, we will look at one of these developments published in a paper titled Attention Is All You Need.
This paper noted that the complexity of encoder-decoder with Attention models can be simplified by adopting a new type of model that only uses Attention, no RNNs.
They called this new model the Transformer.
In two of their experiments on machine translation tasks, the model proved superior in quality as well as requiring significantly less time to train.
The Transformer takes a sequence as an input and generates a sequence, just like the sequence-to-sequence models we've seen so far.
The difference here, however, is that it does not take the inputs one by one, as in the case of an RNN; it can process all of them together in parallel.
Perhaps each element is processed by a separate GPU if we want.
It then produces the output one by one but also not using an RNN.
The Transformer model also breaks down into an encoder and a decoder.
But instead of RNNs, they use feed-forward neural networks and a concept called self-attention.
This combination allows the encoder and decoder to work without RNNs, which vastly improves performance since it allows parallelization of processing that was not possible with RNNs.
The Transformer contains a stack of identical encoders and decoders.
Six is the number the paper proposes.
Let's focus on one encoder layer and look at it more closely.
Each encoder layer contains two sublayers: a multi-headed self-attention layer and a feed-forward layer.
As you might notice, this Attention component is completely on the encoder side as opposed to being a decoder component like the previous Attention mechanisms we've seen.
This Attention component helps the encoder comprehend its inputs by focusing on other parts of the input sequence that are relevant to each input element it processes.
This idea is an extension of work previously done on the concept of self-attention and how it can aid comprehension.
In one paper, for example, this type of Attention is used in the context of machine reading, where the experiments on this technique matched or outperformed the state of the art at that time in tasks like language modeling, sentiment analysis and natural language inference.
They still used RNNs, but they augmented them with this idea that later became self-attention.
The example they used in this machine reading paper shows where the trained model pays attention as it reads each word.
So, for example, when the model reads the sentence using an LSTM, it learns which other parts of the input to pay attention to as it processes each word of the input.
So, the red is where it's reading and the blue is where it's paying attention as it's reading this word.
At each step, it reads a word and it pays attention to the relevant previous words that would aid in comprehending that word.
The structure of the Transformer, however, allows the encoder to not only focus on previous words in the input sequence, but also on words that appeared later in the input sequence.
This, however, is not the only Attention component in the Transformer.
The decoder contains two Attention components.
One that allows it to focus on the relevant part of the inputs and another that only pays attention to previous decoder outputs, and there you have it.
A high-level view of the components of the Transformer.
We can see how extensively this model uses Attention.
We can see three Attention components here.
They don't all work exactly the same way, but they all boil down pretty much to multiplicative attention, which we already understand.
Paper: Attention Is All You Need
15. The Transformer and Self-Attention
Let's look at how self-attention works in a little bit more detail.
Let's say, we have these words that we want our encoder to read and create a representation of.
As always, we begin by embedding them into vectors.
Since the transformer gives us a lot of flexibility for parallelization, this example assumes we're looking at the processor or GPU tasked with encoding the second word of the input sequence.
First step is to compare them.
So, we score the embeddings against each other.
So, we have a score here and a score here, comparing this word that we're currently reading or encoding with the other words in the input sequence.
We scale the scores by dividing by the square root of the dimension of the keys;
since we're using a toy dimension of four here, that means dividing by two.
We do a softmax for these, and then, we multiply the softmax score with the embedding to get the level of expression of each of these vectors.
The embedding of the current word is just passed along as it is.
We add them up, and that produces the self-attention context vector, if that's something we'd like to call it.
This is the image the authors showed when they first presented this paper at the NIPS conference.
Here, we're looking at the second word.
So, we have the words here, and then their embeddings,
and then these are the vectors of the embeddings.
So, this word is compared, or scored, against each of the other words in the input sequence.
This score is then multiplied by the embedding of that relevant word, and then, all of these are added up.
We did not score the current word.
We scored all the other words.
After we add them up, we just pass this up to the feedforward neural network.
If we implement it like this, however, we could see that the model is mainly focusing on other similar words, if we judge it only on the embedding of the word.
So, there's a little modification that we need to do here.
We need to create queries out of each embedding.
We do that by just multiplying by a query matrix or just passing it through a query feedforward neural network.
We also create keys.
So, we have another separate matrix of keys.
We can calculate that again.
So, we have our embeddings, we create the queries.
We're only processing the second word here, so that's the word we created the query for.
Then, we have our keys here.
The scoring is comparing the query versus the key.
So, that's where we get these numbers here, 40, and then, 26.
We scale, then softmax, and then, we multiply the softmax score with the key.
That gets us the self-attention context vector after we add all of these together.
This is an acceptable way of doing it, but there is a variation that we need to look at as well.
So, these are our embeddings.
We have our queries, which are made by multiplying the embedding by the Q matrix,
which is learned from the training process.
We have our keys, which are created by multiplying the embeddings by the K matrix,
and we have our values, which are produced the same way by multiplying by the V matrix, which is also learned in the training process.
This is a graphic from the authors as well in their NIPS presentation, where they outline how to create the key, the query, and the value.
So, this is the embedding.
We multiply it by V to get the value,
we multiply it by Q to get the query,
and we multiply it by K to get the key.
So, the final form of self-attention, as presented in this paper, is this: we have our embeddings.
We've calculated our values, keys, and queries.
We score the queries against the keys,
and then, that softmax score is multiplied by the values.
These we add up and pass to the feedforward neural network.
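A hedged NumPy sketch of that final form of self-attention for a toy three-word sequence; the Q, K, and V matrices stand in for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_k = 4                                    # toy key dimension, as in the example above
embeddings = np.random.randn(3, d_k)       # one embedding per input word

Q_mat = np.random.randn(d_k, d_k)          # learned query projection
K_mat = np.random.randn(d_k, d_k)          # learned key projection
V_mat = np.random.randn(d_k, d_k)          # learned value projection

queries = embeddings @ Q_mat
keys = embeddings @ K_mat
values = embeddings @ V_mat

scores = queries @ keys.T / np.sqrt(d_k)   # score queries against keys, then scale
weights = softmax(scores, axis=-1)         # one row of attention weights per word
self_attention_output = weights @ values   # weighted sum of values, passed on
                                           # to the feed-forward layer
```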
This is a very high-level view at this model here and discussion of the self-attention concept.