Attention Mechanism - Part 2: English Version


A new chapter has just been published, and this time Michael Erlihson and I provide a comprehensive analysis of the attention mechanism, from its definition to how Transformers implement it and why.

As always, here is the link to the previous article.

How has the attention mechanism evolved?

The attention mechanism was first introduced to natural language processing in the following article, which used an encoder-decoder architecture, depicted in Figure 3, for English-to-French translation. To shed light on why the attention mechanism is needed, let's explore an essential concept in natural language processing known as 'alignment.' In the context of translation, alignment refers to the matching between a word or group of words in the source language and the corresponding words in the target language; Figure 1 on the left demonstrates this alignment between English and French. In essence, it captures the degree of linkage between word groups in the target language and the corresponding words in the source language.

Figure 3 - A streamlined version of the encoder-decoder architecture.

In this architecture, the encoder receives a sentence (or a text segment) in language A as a series of tokens and represents the information as a hidden vector, known as a latent representation. The decoder then extracts the relevant information from that vector and translates it into a sentence in language B. The network is trained as a single entity, learning to encode and decode simultaneously. In the paper, both the encoder and decoder were implemented using chained Gated Recurrent Units (GRUs), although vanilla Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs) could also be used.

The encoder consists of bidirectional GRUs. A bidirectional GRU architecture comprises two chains of GRU units: the first chain processes the input from beginning to end, and the second processes it from end to beginning (as seen in Figure 4). The motivation for a bidirectional architecture is to capture comprehensive information about the input, from both its beginning and its end, since the entire source sentence is available when the trained model is run (at inference time). In contrast, the decoder's output is created autoregressively during inference (word by word, with each output becoming part of the future input), rendering a bidirectional network in the decoder logically redundant.

Figure 4 - Bidirectional GRU
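
To make this concrete, here is a minimal PyTorch sketch of a bidirectional GRU encoder. All names and sizes (Encoder, emb_dim, hidden_dim, the vocabulary size) are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a bidirectional GRU encoder (illustrative sizes, not the paper's).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs one GRU chain start-to-end and a second end-to-start
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):                 # tokens: (batch, T_x)
        embedded = self.embed(tokens)          # (batch, T_x, emb_dim)
        states, _ = self.gru(embedded)         # (batch, T_x, 2 * hidden_dim)
        # states[:, j] is h_j = [forward h_j ; backward h_j] for source position j
        return states

encoder = Encoder()
src = torch.randint(0, 10000, (1, 7))          # a toy "sentence" of 7 token ids
h = encoder(src)
print(h.shape)                                 # torch.Size([1, 7, 512])
```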

In encoder-decoder architectures that predated the paper, the input that the decoder receives in each iteration (when creating a new data unit) is its internal state and the output Y_(i-1) from the previous iteration (of the decoder). To incorporate the information that the encoder has learned from the input, the decoder also receives the encoder's output vector, henceforth referred to as C. In basic encoder-decoder architectures, C is the concatenation of the internal states computed in the last iteration of each of the two chains that make up the bidirectional encoder (h_i = [→h_i; ←h_i]), as it contains information about the entire input. Figure 5 illustrates the architecture explained in this paragraph.

Figure 5 - Encoder-decoder architecture
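
As a hedged illustration of this baseline (before attention), the fixed context vector C can be taken as the concatenation of the final state of each direction of the bidirectional encoder; the tensors below are random stand-ins for real encoder states.

```python
# Sketch of the fixed context vector in a pre-attention encoder-decoder.
# forward_states / backward_states stand in for the two GRU chains' outputs.
import torch

T_x, hidden_dim = 7, 256
forward_states = torch.randn(T_x, hidden_dim)    # →h_1 ... →h_Tx
backward_states = torch.randn(T_x, hidden_dim)   # ←h_1 ... ←h_Tx (indexed by source position)

# The "last iteration" of each chain: position T_x for the forward pass,
# position 1 for the backward pass. Their concatenation summarizes the whole input.
C = torch.cat([forward_states[-1], backward_states[0]])   # shape: (2 * hidden_dim,)
print(C.shape)                                             # torch.Size([512])
```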

How is the attention mechanism expressed in the paper?

As previously mentioned, the use of attention is meant to address the alignment problem between the input and output. The Achilles heel of the architectures that preceded the paper was their use of only the encoder's last internal states (the outputs of the final iteration of both directions of the bidirectional GRU, or BiGRU). Because these states compress information about the entire input, they cannot model the local dependencies between the internal states of the decoder and those of the encoder. In other words, the context vector fed to the decoder compresses all the information from the encoder's input sequence without considering the relationship between the information units in sequence A and the information unit currently being built in sequence B.


In contrast, in the method proposed in the paper, the context vector that the decoder creates while building the output unit i receives information on all the data units of the input that were built by the encoder. The context vector is created as a weighted sum of all the internal states of the encoder, with the weights modelling the relationships between each input unit and output unit i.

Attention Mechanism: The Mathematical Modeling

The first concept that the paper defines is the alignment score, which represents the relationship between internal state i-1 in the decoder and some internal state j in the encoder. To avoid confusion, we will denote the internal states of the decoder by s (as in the paper) and leave the internal states of the encoder as h. The indices i, j represent the sequential numbers of the output and input units, respectively. Now, for output unit i, we define the scores e_ij, j = 1, ..., T_x, as follows:

(1)  e_ij = attention(s_(i-1), h_j) = v_a^T · tanh(W · [s_(i-1); h_j]),   j = 1, ..., T_x

Equation 1 - Calculation of Alignment Strength (Attention Mechanism)

Where:

  • e_ij - Unnormalized alignment strength
  • h_j - The internal state of unit j of the encoder.
  • s_(i-1) - The internal state of unit i-1 of the decoder.
  • W - Weight matrix of the attention mechanism.
  • v_a - Weight vector of the attention function.
  • T_x - Number of data units (internal states) in the encoder.

As mentioned, the alignment score mechanism is a learned function that calculates the strength of the connection between an internal state of the decoder and the internal states of the encoder. The multiplication inside the hyperbolic tangent (tanh) produces a vector of size n×1, but the attention value between two data units must be a scalar, so the final multiplication by the vector v_a reduces the output to a scalar. Note that the matrix W and the vector v_a are learned (trained) parameters of the model.
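
A hedged sketch of Equation 1 follows; the dimensions and the wrapping of W and v_a in linear layers are illustrative choices that mirror the concatenation-based form above, not the paper's exact parameterization.

```python
# Sketch of the additive alignment score of Equation 1: e_ij = v_a^T · tanh(W · [s_(i-1); h_j]).
import torch
import torch.nn as nn

dec_dim, enc_dim, attn_dim = 256, 512, 128                # illustrative sizes
W = nn.Linear(dec_dim + enc_dim, attn_dim, bias=False)    # the learned matrix W
v_a = nn.Linear(attn_dim, 1, bias=False)                  # the learned vector v_a

def alignment_score(s_prev, h_j):
    """Unnormalized alignment strength between decoder state s_(i-1) and encoder state h_j."""
    concat = torch.cat([s_prev, h_j], dim=-1)       # [s_(i-1); h_j]
    return v_a(torch.tanh(W(concat))).squeeze(-1)   # a single scalar e_ij

s_prev = torch.randn(dec_dim)    # stand-in for the decoder's previous internal state
h_j = torch.randn(enc_dim)       # stand-in for one encoder internal state
print(alignment_score(s_prev, h_j))
```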

The second concept that the paper defines is the attention weight. Its purpose is to express the importance of one token relative to every other token, as a continuous and differentiable function. Concretely, each connection strength e_ij (associated with internal state h_j of the encoder) is weighted relative to all the other connection strengths, which represent the connection between the remaining internal states of the encoder and the given internal state of the decoder. To do this, we apply the softmax function to the connection strengths. This calculation is what actually solves the alignment problem we opened with, since the softmax converts each standalone alignment score into a value of a probability distribution that depends on all the internal states of the encoder. We perform this operation for every internal state of the encoder against the same internal state of the decoder, thereby obtaining the normalized connection strengths we mentioned.

(2)  α_ij = exp(e_ij) / Σ_(k=1..T_x) exp(e_ik)

Equation 2 - Calculation of the attention weight for the pair of data units i, j.

Here, T_x is the number of computation units in the encoder, meaning the maximum length of a data sequence that can be entered as a single input.
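
The normalization in Equation 2 is simply a softmax over the T_x scores computed for output step i; a minimal sketch (with made-up scores) is shown below.

```python
# Sketch of Equation 2: α_ij = exp(e_ij) / Σ_k exp(e_ik), i.e. a softmax over the scores.
import torch

e_i = torch.tensor([2.1, -0.3, 0.7, 1.5])        # made-up scores e_i1 ... e_iTx (T_x = 4)
alpha_i = torch.softmax(e_i, dim=-1)             # attention weights α_i1 ... α_iTx
print(alpha_i, alpha_i.sum())                    # the weights form a probability distribution (sum = 1)
```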

The last part of our explanation is the construction of the dynamic context vector C. We take the attention weights we calculated and compose the vector as a weighted sum: each encoder internal state is multiplied by its corresponding attention weight. This construction of the context vector lets us give the decoder the most relevant information relative to its current internal state. Encoder internal states with low connection strength relative to the decoder's internal state will hardly affect the vector C (since if their attention weight is low, their relative contribution to C is also low).

(3)  C_i = Σ_(j=1..T_x) α_ij · h_j

Equation 3 - Constructing the Dynamic Context Vector Ci
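
Putting the pieces together, Equation 3 is just a weighted sum of the encoder states; the sketch below uses random stand-ins for the encoder states and illustrative attention weights.

```python
# Sketch of Equation 3: C_i = Σ_j α_ij · h_j (a weighted sum of encoder states).
import torch

T_x, enc_dim = 4, 512
h = torch.randn(T_x, enc_dim)                        # encoder states h_1 ... h_Tx (stand-ins)
alpha_i = torch.softmax(torch.randn(T_x), dim=-1)    # attention weights for output step i

C_i = (alpha_i.unsqueeze(-1) * h).sum(dim=0)         # weighted sum over the T_x positions
print(C_i.shape)                                     # torch.Size([512])
```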


The input of the decoder is a combination of its previous internal state, the context vector, and the previous output fed back into the network, similar to the recurrent networks we have seen before (LSTM/RNN). The computation itself consists of the internal activation functions of the GRU block (similar to an LSTM, there are reset and update gates). The construction of the decoder output is described in Figure 6.
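
As a hedged sketch of that decoder step, one common way to wire it up is to feed a GRU cell the concatenation of the previous output's embedding and the context vector, with s_(i-1) as the hidden state; the paper's exact gating and output layers differ slightly, so treat this as an approximation with illustrative sizes.

```python
# Approximate sketch of one decoder step: s_i = GRU([y_(i-1); C_i], s_(i-1)).
import torch
import torch.nn as nn

emb_dim, enc_dim, dec_dim = 128, 512, 256                 # illustrative sizes
cell = nn.GRUCell(emb_dim + enc_dim, dec_dim)             # reset/update gates live inside
output_layer = nn.Linear(dec_dim, 10000)                  # projects s_i onto the target vocabulary

y_prev = torch.randn(1, emb_dim)      # embedding of the previously generated token
C_i = torch.randn(1, enc_dim)         # dynamic context vector from Equation 3
s_prev = torch.randn(1, dec_dim)      # decoder's previous internal state s_(i-1)

s_i = cell(torch.cat([y_prev, C_i], dim=-1), s_prev)      # new internal state s_i
logits = output_layer(s_i)                                # scores for the next output token
print(s_i.shape, logits.shape)                            # torch.Size([1, 256]) torch.Size([1, 10000])
```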


If we go back to Figure 1, we can see that along most of the diagonal, only one word from the output is aligned (related) in a significant way to a word in the input, and therefore the corresponding internal state from the encoder will be transferred almost entirely to the context vector C_i (Equation 3). In contrast, in the parts where a word in the output depends (according to the attention weights) on several words from the input, the context vector will consist of a weighted sum of the internal states of the encoder. Figure 6 presents the architecture as it is detailed in the full paper.

Figure 6 - Description of the complete architecture.

So how does the attention mechanism help us?

Besides the ability to focus the decoder on specific parts of the input when predicting the current word, the attention mechanism works similarly to the skip connections in deep networks. The dynamic context vector gives the decoder direct access to the encoder's internal states, allowing the information in them to "flow" almost unchanged, much like a skip connection. This differs from encoder-decoder architectures without this mechanism, where some information is lost when the encoder's output vector is created. The mechanism thereby addresses the two main drawbacks of recurrent networks: the bottleneck of representing long data sequences and vanishing gradients. Additionally, the attention mechanism allows for better interpretability of the network. Interpretability describes our ability as humans to understand the processes occurring within learning networks. In architectures that use an attention mechanism, it is possible to understand how the model generates the output from the weight that each information unit received, and from that to learn about the network's limitations and find ways to improve it.

What were the drawbacks of the architecture we've seen so far?

Even though the architecture we presented was a tremendous advancement in the field of natural language processing, it had several limitations.

The first limitation stemmed from the amount of resources required to compute the context vector C, which added to the already substantial computational burden of recurrent networks. O(n·m) invocations of the alignment score function are required, where m is the number of tokens in the input and n is the number of tokens in the output. This led to extended training and inference times.

The second limitation of the network is its limited context representation. The network learns the context between the input and output, but it does not learn the relationships that information units form with one another (intra-dependencies) within the input and within the output. Therefore, it cannot "understand" complex semantics such as slang, sarcasm, double meanings, and indirect relationships hidden in the input (which we as humans understand easily). As a result, the network's performance degrades as the sequence length grows.

How can we resolve this issue?

The concept of attention we've discussed until now is referred to as cross attention. Essentially, cross attention integrates the information from the encoder with what's retrieved from the decoder. Let's introduce another form of attention, known as self attention, which was initially introduced in this paper. Self attention examines the relationship between each token and every other token within the same sequence. It's worth noting that in encoder-decoder architectures, both parts receive a sequence during training, allowing the self-attention mechanism to be embedded into either part. One of the key innovations in the transformer architecture was the fusion of these two attention mechanisms.

Why do we need self-attention?

Previously, we talked about the constraints of that architecture, particularly its limitations in handling intricate texts. The network focused primarily on finding the connection between a word in one language and a cluster of words in another for translation purposes; essentially, it built the context vector only from the encoder's internal states. But translation isn't that simple, especially when dealing with complex sentence structures.

When we translate, we need to understand the nuanced relationships within the source sentence to get the translation right. That's where self-attention comes in. Unlike the cross-attention that just considers the encoder-decoder relationship, self-attention digs deeper. It looks at the interrelationships between each internal state, not just how the encoder connects to the decoder.
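
To make "each token attends to every other token in the same sequence" concrete, here is a minimal sketch using scaled dot-product scoring, the formulation the Transformer later adopted; the paper that introduced self-attention used a different scoring function, so this is an illustration of the idea rather than that paper's exact method, and the sizes are illustrative.

```python
# Minimal self-attention sketch (scaled dot-product form): every position in one
# sequence is re-represented as a weighted sum of all positions in that same sequence.
import math
import torch
import torch.nn as nn

d_model = 64
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

x = torch.randn(1, 7, d_model)                     # one sequence of 7 token representations
Q, K, V = W_q(x), W_k(x), W_v(x)                   # queries, keys, values from the SAME sequence

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # (1, 7, 7): token-vs-token strengths
weights = torch.softmax(scores, dim=-1)                  # each row is a distribution over positions
out = weights @ V                                        # (1, 7, 64): context-aware representations
print(out.shape)
```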

Think of it like trying to translate a sentence from English to Hebrew. An architecture with both self-attention and cross-attention mechanisms will likely provide a more nuanced translation than one with cross-attention alone. It's akin to understanding the full context of a conversation rather than just responding to the last statement. It adds depth and complexity, making the translation more accurate and sophisticated.

Suppose the sentence we want to translate is:

"Malgré les retards causés par le temps orageux, le couple, accompagné de leurs amis proches, a réussi à atteindre le sommet de la montagne et à profiter de la vue à couper le souffle."

An architecture using both mechanisms would translate the sentence as follows:

"Despite the delays caused by the stormy weather, the couple, who were accompanied by their close friends, managed to reach the mountaintop and enjoy the breathtaking view."

In contrast, an architecture that only uses the cross-attention mechanism might translate the sentence as:

"The couple, who were accompanied by their close friends, managed to reach the mountaintop, despite the delays and the stormy weather, and enjoy the breathtaking view."

Although both translations are coherent and adhere to the rules of syntax and grammar, the second translation fails to capture the connection between the delay and the weather. In contrast, the architecture that does use self-attention succeeded in finding this connection and expressing it in the translation.

Summary Points for the Chapter:

  • There are two main types of attention mechanisms. The self-attention mechanism evaluates the intensity of the connections among various information units within the input. On the other hand, cross-attention explores the relationships between information units in the input and corresponding units in the output.
  • The first scholarly article to present the use of attention mechanisms for natural language processing leveraged an encoder-decoder architecture. This model included an attention mechanism that dynamically linked the decoder to the encoder through a context vector.
  • This context vector considers the strength of the connection between the output unit currently being constructed in the decoder (represented by the internal state s_(i-1)) and all the units in the input (the internal states of the encoder). If a particular information unit in the encoder is deemed highly relevant to the output unit, it receives a greater weight when forming its context vector.
  • The transformer architecture was a pioneering model that combined both cross- and self-attention. This allowed for the sophisticated transfer of dependencies from the encoder to the decoder, a crucial step in generating high-quality translation.


