Attention Mechanism - Part 2: English Version
A new chapter has just been published, and this time Michael Erlihson and I provide a comprehensive analysis of the attention mechanism, from its definition to how Transformers implement it and why.
As always, here is the link to the previous article.
How has the attention mechanism evolved?
The attention mechanism was first introduced in natural language processing through the use of an encoder-decoder architecture, as depicted in Figure 3, for English-to-French translation in the following article. To shed light on why the attention mechanism is needed, let's explore an essential concept in natural language processing known as 'alignment.' In the context of translation, alignment refers to the relationship or matching between a word or words in the original language and the corresponding words in the target language. Figure 1 on the left demonstrates this alignment between English and French. In essence, it represents the degree of linkage between word groups in the destination language and the corresponding words in the original language.
In this architecture, the encoder receives a sentence (or a text segment) in language A as a series of tokens and represents the information as a hidden vector, known as a latent representation. The decoder then extracts the relevant information from this vector and translates it into a sentence in language B. The network is trained as a single entity, learning to encode and decode simultaneously. In the paper, both the encoder and decoder were implemented using chained Gated Recurrent Units (GRUs), although Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs) could also be used. The encoder consists of bidirectional GRUs: a bidirectional GRU architecture comprises two chains of GRU units, with the first chain processing the input from beginning to end and the second chain processing it from end to beginning (as seen in Figure 4). The motivation for a bidirectional architecture is the desire to capture comprehensive information about the input, from both its beginning and its end, since the entire source sentence is available at the time the model is run after training (inference). In contrast, the decoder's output is created autoregressively during inference (word by word, with the current output becoming the next input once created), which makes a bidirectional network in the decoder logically redundant.
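As a rough illustration of such an encoder, here is a minimal PyTorch sketch. The class name, layer sizes, and vocabulary size are illustrative assumptions, not values from the paper; the point is that a bidirectional GRU reads the embedded input tokens and, for each position j, yields the concatenation of the forward and backward states.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Minimal bidirectional GRU encoder (illustrative sizes, not the paper's)."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One GRU chain reads the input left-to-right, the other right-to-left.
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):                   # tokens: (batch, T_x)
        x = self.embed(tokens)                   # (batch, T_x, emb_dim)
        # annotations[:, j] is the concatenation of the forward and backward states at position j.
        annotations, _ = self.gru(x)             # (batch, T_x, 2 * hidden_dim)
        return annotations

encoder = BiGRUEncoder()
h = encoder(torch.randint(0, 10000, (1, 7)))     # 7 input tokens -> 7 annotations
print(h.shape)                                    # torch.Size([1, 7, 512])
```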
In previous encoder-decoder architectures that predated the paper, the input that the decoder receives in each iteration (when creating a new data unit) is its internal state and the output \( y_{i-1} \) from the previous iteration (of the decoder). To incorporate the information that the encoder has learned from the input, the decoder also receives the encoder's output vector, henceforth referred to as C. In basic encoder-decoder architectures, C is the concatenation of the internal states calculated in the last iteration by each of the two chains that make up the bidirectional encoder, \( h_i = [\overrightarrow{h_i};\, \overleftarrow{h_i}] \), as it contains information about the entire input. Figure 5 illustrates the architecture explained in this paragraph.
How is the attention mechanism expressed in the paper?
As previously mentioned, the use of attention is meant to address the alignment problem between the input and output. The Achilles' heel of the architectures that preceded the paper was their reliance on the encoder's last internal states alone (the outputs of the last iteration of both directions of the bidirectional GRU, or BiGRU). Since these states contained compressed information about the entire input, it was not possible to model the local dependencies between the internal states of the decoder and those of the encoder. In other words, the context vector received at the entrance of the decoder compressed all the information from the encoder's input sequence without considering the relationship between the information units in sequence A and the information unit being built in sequence B.
In contrast, in the method proposed in the paper, the context vector that the decoder creates while building the output unit i receives information on all the data units of the input that were built by the encoder. The context vector is created as a weighted sum of all the internal states of the encoder, with the weights modelling the relationships between each input unit and output unit i.
Attention Mechanism: The Mathematical Modeling
The first concept that the paper defines is the alignment score, which represents the relationship between internal state \( i-1 \) in the decoder and some internal state \( j \) in the encoder. To avoid confusion, we will denote the internal states of the decoder by \( s \) (as in the paper), and we'll leave the internal states of the encoder as \( h \). The indices \( i, j \) are the sequential numbers of the output and input units, respectively. Now, for output unit \( i \), we define the scores \( e_{ij},\ j = 1, \ldots, T_x \) as follows:
(1) \( e_{ij} = \text{attention}(s_{i-1}, h_j) = v_a^{\top}\,\tanh\!\big(W\,[s_{i-1};\, h_j]\big), \quad j = 1, \ldots, T_x \)
Equation 1 - Calculation of Alignment Strength (Attention Mechanism)
Where:
- \( s_{i-1} \) is the decoder's internal state from the previous iteration;
- \( h_j \) is the encoder's internal state for input unit \( j \);
- \( W \) and \( v_a \) are learned (trained) parameters of the model;
- \( T_x \) is the length of the input sequence.

As mentioned, the alignment score mechanism is a learned function that calculates the strength of the connection between the internal state of the decoder and the internal states of the encoder. Since the multiplication inside the hyperbolic tangent function tanh produces a vector of size n × 1, and the attention value between two data units must be a scalar, the final multiplication by the vector \( v_a \) reduces the output to a scalar.
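A minimal sketch of this scoring function in PyTorch might look as follows. The class name and the dimensions (decoder state size, encoder annotation size, attention size) are illustrative assumptions; W and v_a correspond to the learned parameters in Equation 1.

```python
import torch
import torch.nn as nn

class AlignmentScore(nn.Module):
    """Additive alignment score: e_ij = v_a^T tanh(W [s_{i-1}; h_j]) (Equation 1)."""
    def __init__(self, dec_dim=256, enc_dim=512, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(dec_dim + enc_dim, attn_dim, bias=False)   # learned matrix W
        self.v_a = nn.Linear(attn_dim, 1, bias=False)                 # learned vector v_a

    def forward(self, s_prev, h):
        # s_prev: (batch, dec_dim)       previous decoder state s_{i-1}
        # h:      (batch, T_x, enc_dim)  all encoder annotations h_1 .. h_{T_x}
        s_rep = s_prev.unsqueeze(1).expand(-1, h.size(1), -1)  # repeat s_{i-1} for every j
        concat = torch.cat([s_rep, h], dim=-1)                  # [s_{i-1}; h_j]
        e = self.v_a(torch.tanh(self.W(concat)))                # (batch, T_x, 1): one scalar per j
        return e.squeeze(-1)                                    # (batch, T_x)
```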
The second concept that the paper defines is the attention weight. The purpose of the attention mechanism is to produce an importance weight for each token relative to the other tokens, as a continuous and differentiable function. Continuity in this context means that the connection strength \( e_{ij} \) (linked to the internal state \( h_j \) of the encoder) is weighted relative to all the other connection strengths, which represent the connection between all the other internal states of the encoder and the given internal state of the decoder. To do this, we apply the softmax function to the connection strengths. This calculation effectively solves the alignment problem we opened with, since softmax converts the stand-alone alignment scores into a probability distribution that depends on all the internal states of the encoder. We perform this operation for each internal state of the encoder against the same internal state of the decoder, thereby obtaining the continuous connection strength we mentioned.
(2) \( \alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \)
Equation 2 - Calculation of the attention weight for the pair of data units i, j.
Here, \( T_x \) is the number of computation units in the encoder, meaning the maximum length of a data sequence that can be entered as a single input.
The last part of our explanation is the construction of the dynamic context vector C. We take the attention weights we calculated and compose the vector as a sum of each weight multiplied by the internal state of the encoder associated with it. This construction of the context vector enables us to give the decoder the most relevant information in relation to its current internal state. Internal states of the encoder with low connection strength relative to the internal state of the decoder will barely affect vector C (since if their attention weight is low, their relative contribution to vector C will also be low).
(3) \( C_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j \)
Equation 3 - Constructing the Dynamic Context Vector Ci
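Continuing the sketch above, Equations 2 and 3 reduce to a softmax over the alignment scores followed by a weighted sum of the encoder annotations. The function name and shapes below are illustrative assumptions, not from the paper.

```python
import torch

def context_vector(e, h):
    # e: (batch, T_x)          alignment scores e_{i1} .. e_{iT_x} for decoder step i (Equation 1)
    # h: (batch, T_x, enc_dim) encoder annotations h_1 .. h_{T_x}
    alpha = torch.softmax(e, dim=-1)          # attention weights alpha_{ij} (Equation 2)
    c_i = torch.bmm(alpha.unsqueeze(1), h)    # weighted sum over j (Equation 3)
    return c_i.squeeze(1), alpha              # c_i: (batch, enc_dim), alpha: (batch, T_x)
```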
The input of the decoder is a combination of its previous internal state, the context vector, and the previous output fed into the network, similar to the recurrent networks we have seen before (LSTM/RNN). The computation consists of the internal activation functions of the GRU block (which has update and reset gates, analogous to the gates of the LSTM). The construction of the decoder output is described in Figure 6.
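As a hedged sketch of a single decoder step (the exact gating and output layers in the paper differ in detail, and the names and sizes here are illustrative), one can feed the embedding of the previous output concatenated with the dynamic context vector into a GRU cell:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive decoder step: s_i = GRU([y_{i-1}; C_i], s_{i-1}) (illustrative)."""
    def __init__(self, emb_dim=128, enc_dim=512, dec_dim=256, vocab_size=10000):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim + enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim, vocab_size)   # projects s_i to output-token scores

    def forward(self, y_prev_emb, c_i, s_prev):
        # y_prev_emb: (batch, emb_dim) embedding of the previously generated token
        # c_i:        (batch, enc_dim) dynamic context vector for step i
        # s_prev:     (batch, dec_dim) previous decoder state s_{i-1}
        s_i = self.cell(torch.cat([y_prev_emb, c_i], dim=-1), s_prev)
        return self.out(s_i), s_i                   # scores over the vocabulary, new state
```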
If we go back to Figure 1, we can see that along most of the diagonal, only one word from the output is aligned (related) in a significant way to a word in the input, and therefore the corresponding internal state from the encoder will be transferred almost entirely to the context vector \( C_i \) (Equation 3). In contrast, in the parts where a word in the output depends (according to the attention weights) on several words from the input, the context vector will consist of a weighted sum of the internal states of the encoder. Figure 6 presents the architecture as it is detailed in the full paper.
So how does the attention mechanism help us?
Besides the ability to focus the decoder on specific parts of the input when predicting the current word, the attention mechanism works similarly to the skip-connection mechanism in deep networks. We provide direct access from the internal states of the encoder to the decoder through the dynamic vector we create, thus allowing the information in them to "flow" almost unchanged, similar to how a skip connection operates. This differs from encoder-decoder architectures without this mechanism, where some information is lost when creating the encoder's output vector. This mechanism addresses the two main drawbacks of recurrent networks: the bottleneck of representing long data sequences and vanishing gradients. Additionally, the attention mechanism allows for better interpretability of the network. Interpretability is a concept that describes our ability as humans to understand the processes occurring within learning networks. In architectures that use an attention mechanism, it is possible to understand how the model generates the output from the weight that each information unit received, and from that to learn about the network's limitations and find ways to improve it.
What were the drawbacks of the architecture we've seen so far?
Even though the architecture we presented was a tremendous advancement in the field of natural language processing, it had several limitations.
The first limitation stemmed from the amount of resources required to compute the context vector C, which added to the already considerable computational burden of recurrent networks. O(n·m) invocations of the alignment score function are required, where m represents the number of tokens in the input and n the number of tokens in the output. This led to extended training and inference times.
The second limitation of the network is its limited context representation. The network learns the context between the input and output; however, it does not learn the relationships that information units create with one another (intra-dependencies) within the input and within the output. Therefore, it cannot "understand" complex semantics such as slang, sarcasm, double meanings, and indirect relationships hidden in the input (which we as humans easily understand). As a result, the network's performance degrades as the sequence grows longer.
How can we resolve this issue?
The concept of attention we've discussed until now is referred to as cross attention. Essentially, cross attention integrates the information from the encoder with what's retrieved from the decoder. Let's introduce another form of attention, known as self attention, which was initially introduced in this paper. Self attention examines the relationship between each token and every other token within the same sequence. It's worth noting that in encoder-decoder architectures, both parts receive a sequence during training, allowing the self-attention mechanism to be embedded into either part. One of the key innovations in the transformer architecture was the fusion of these two attention mechanisms.
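To make the idea concrete, here is a minimal single-head self-attention sketch in PyTorch, written in the scaled dot-product form later popularized by the Transformer rather than the exact formulation of the cited paper; the class name and dimension are illustrative assumptions. Every token of the sequence produces a weight over every other token of the same sequence.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention: each token attends to every token
    of the same sequence (scaled dot-product sketch, Transformer-style)."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                                        # x: (batch, T, dim), one sequence
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5     # (batch, T, T): token-vs-token scores
        weights = torch.softmax(scores, dim=-1)                  # each row is a distribution over the sequence
        return weights @ v                                       # contextualized token representations
```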
Why do we need self-attention?
Previously, we talked about the constraints of the previous architecture, particularly its limitations in handling intricate texts. The network was primarily focusing on finding the connection between a word in one language and a cluster of words in another for translation purposes. Essentially, it was building the context vector only from the encoder's internal states. But translation isn't that simple, especially when dealing with complex sentence structures.
When we translate, we need to understand the nuanced relationships within the source sentence to get the translation right. That's where self-attention comes in. Unlike the cross-attention that just considers the encoder-decoder relationship, self-attention digs deeper. It looks at the interrelationships between each internal state, not just how the encoder connects to the decoder.
Think of it like trying to translate a sentence from English to Hebrew. An architecture with both self-attention and cross-attention mechanisms will likely provide a more nuanced translation than one with cross-attention alone. It's akin to understanding the full context of a conversation rather than just responding to the last statement. It adds depth and complexity, making the translation more accurate and sophisticated.
Suppose the sentence we want to translate is:
"Malgré les retards causés par le temps orageux, le couple, accompagné de leurs amis proches, a réussi à atteindre le sommet de la montagne et à profiter de la vue à couper le souffle."
An architecture using both mechanisms would translate the sentence as follows:
"Despite the delays caused by the stormy weather, the couple, who were accompanied by their close friends, managed to reach the mountaintop and enjoy the breathtaking view."
In contrast, an architecture that only uses the cross-attention mechanism might translate the sentence as:
"The couple, who were accompanied by their close friends, managed to reach the mountaintop, despite the delays and the stormy weather, and enjoy the breathtaking view."
Although both translations are coherent and adhere to the rules of syntax and grammar, the second translation fails to capture the connection between the delays and the weather. In contrast, the architecture that does use self-attention succeeded in finding this connection and in expressing it in the translation.
Summary Points for the Chapter: