How Does the Self-Attention Layer Work in the Transformer Model?

What Is The Transformer Neural Network?

The transformer neural network represents a revolutionary architecture crafted to handle sequence-to-sequence tasks with exceptional efficiency, particularly those involving extensive dependencies across sequences. Originating from the paper "Attention Is All You Need," it has evolved into a leading-edge technique in natural language processing (NLP).
Attention mechanisms have become indispensable for efficient sequence modeling and transduction tasks because they can manage dependencies regardless of the distance between elements in the input or output sequences. Before the Transformer, they were typically employed in conjunction with a recurrent network; the Transformer dispenses with recurrence entirely and relies on attention alone.

Self-Attention at a High Level

Don't be misled by my casual mention of "self-attention" as if it's a universally understood concept. Personally, I first encountered the idea when reading the "Attention Is All You Need" paper. Let's break down how it works.

Consider this sentence we want to translate: "The animal didn't cross the street because it was too tired."

What does "it" refer to in this sentence? Is it referring to the street or the animal? While this is a straightforward question for a human, it's not as simple for an algorithm to answer.

When the model processes the word "it," self-attention enables it to associate "it" with "animal." As the model processes each word in the input sequence, self-attention allows it to examine other positions in the sequence for clues that can help produce a better encoding for the current word.

If you're familiar with RNNs, consider how maintaining a hidden state allows an RNN to integrate its representation of previously processed words/vectors with the current one. Self-attention is the technique used by the Transformer to incorporate the "understanding" of other relevant words into the word currently being processed. [1]

The Transformer - model architecture

Self-Attention in Detail

The Attention layer operates using three parameters: Query, Key, and Value. In the Encoder's self-attention mechanism, the Encoder's input is provided to all three parameters. [2]
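A minimal sketch of that wiring, assuming PyTorch's nn.MultiheadAttention as the attention layer (the batch size, sequence length, and the paper's model width of 512 with 8 heads are illustrative choices):

import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# x: a batch of encoder inputs, shape (batch, seq_len, embed_dim)
x = torch.randn(2, 10, embed_dim)

# Encoder self-attention: the SAME tensor is passed as Query, Key, and Value.
out, weights = self_attn(query=x, key=x, value=x)
print(out.shape)       # torch.Size([2, 10, 512])
print(weights.shape)   # torch.Size([2, 10, 10]): one weight per word pair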

Encoder and Decoder Stacks

Similarly, in the Decoder's self-attention mechanism, the Decoder's input is fed into the Query, Key, and Value parameters.

For the Decoder's encoder-decoder attention, the output from the final Encoder in the stack is assigned to the Value and Key parameters, while the output from the self-attention (and Layer Norm) module below it is assigned to the Query parameter.
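Continuing in the same vein, a sketch of the two Decoder wirings (enc_out and tgt are hypothetical stand-ins for the final Encoder's output and the Decoder's input):

import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

enc_out = torch.randn(2, 10, embed_dim)  # stand-in for the final Encoder's output
tgt = torch.randn(2, 7, embed_dim)       # stand-in for the Decoder's input

# Decoder self-attention: the Decoder's input feeds Query, Key, and Value.
dec_self, _ = self_attn(query=tgt, key=tgt, value=tgt)

# Encoder-Decoder attention: Query comes from the sub-layer below;
# Key and Value come from the final Encoder's output.
dec_cross, _ = cross_attn(query=dec_self, key=enc_out, value=enc_out)
print(dec_cross.shape)  # torch.Size([2, 7, 512]): one vector per target position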

Self-attention and Feed-forward sub-layers

Both the Self-attention and Feed-forward sub-layers have a residual skip-connection around them, followed by a Layer-Normalization.
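As a sketch, the wrapping pattern looks like this (post-norm, as in the original paper; the feed-forward sizes are the paper's, and ResidualBlock is a name chosen here for illustration):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wrap any sub-layer as LayerNorm(x + sublayer(x))."""
    def __init__(self, sublayer: nn.Module, embed_dim: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual skip-connection around the sub-layer, then Layer-Normalization.
        return self.norm(x + self.sublayer(x))

# Example: wrapping the position-wise feed-forward sub-layer.
embed_dim = 512
ffn = nn.Sequential(nn.Linear(embed_dim, 2048), nn.ReLU(), nn.Linear(2048, embed_dim))
block = ResidualBlock(ffn, embed_dim)
print(block(torch.randn(2, 10, embed_dim)).shape)  # torch.Size([2, 10, 512])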

The output of the last Encoder is fed into each Decoder in the Decoder Stack through the encoder-decoder attention described above.

Attention

We talked about why Attention is so important while processing sequences. In the Transformer, Attention is used in three places:

  • Self-attention in the Encoder: the input sequence pays attention to itself
  • Self-attention in the Decoder: the target sequence pays attention to itself
  • Encoder-Decoder attention in the Decoder: the target sequence pays attention to the input sequence


Multi-head Attention

In the Transformer architecture, each attention processor is termed an Attention Head, and this process is repeated several times in parallel, a concept known as Multi-head attention. This technique enhances the attention mechanism's discriminative capability by combining several similar attention calculations.

The Query, Key, and Value each pass through a dedicated Linear layer with its own weights, producing three distinct outputs referred to as Q, K, and V. These outputs are then combined using the Attention formula to compute the final Attention Score. This lets the model capture and integrate relevant information from across the input sequence.
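A condensed sketch of that flow (the head-splitting convention follows the original paper; PyTorch 2's F.scaled_dot_product_attention handles the per-head score computation, and all shapes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Dedicated Linear layers with unique weights produce Q, K, and V.
        self.w_q = nn.Linear(embed_dim, embed_dim)
        self.w_k = nn.Linear(embed_dim, embed_dim)
        self.w_v = nn.Linear(embed_dim, embed_dim)
        self.w_o = nn.Linear(embed_dim, embed_dim)  # merges the heads back together

    def _split(self, t: torch.Tensor) -> torch.Tensor:
        # (batch, seq, embed) -> (batch, heads, seq, head_dim)
        b, s, _ = t.shape
        return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, query, key, value):
        q = self._split(self.w_q(query))
        k = self._split(self.w_k(key))
        v = self._split(self.w_v(value))
        # Every head runs the same attention calculation in parallel.
        heads = F.scaled_dot_product_attention(q, k, v)
        b, _, s, _ = heads.shape
        return self.w_o(heads.transpose(1, 2).reshape(b, s, -1))

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])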


Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

(Formula for the Attention Score; d_k is the dimension of the Key vectors.)

The crucial point here is that the Q, K, and V values each carry an encoded representation of every word in the sequence. The attention calculation then combines each word with every other word in the sequence, producing an Attention Score matrix that holds a score for every pair of words.
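The formula translates almost line for line into code; a minimal single-head sketch, without masking (shapes are illustrative):

import math
import torch

def attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V"""
    d_k = q.size(-1)
    # One score for each (query word, key word) pair, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v                       # weighted sum of the Value vectors

q = k = v = torch.randn(2, 10, 64)  # self-attention: one source tensor for Q, K, V
print(attention(q, k, v).shape)     # torch.Size([2, 10, 64])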

The Mask also appears in the attention diagrams. In the Decoder's self-attention, it hides the positions that come after the current word in the target sequence, so each position can attend only to the words before it and the Decoder cannot peek at the very words it is being trained to predict.
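A sketch of that mask, using the standard convention of setting forbidden scores to -inf before the softmax so their weights become exactly zero (sizes are illustrative):

import math
import torch

seq_len, d_k = 5, 64
q = k = v = torch.randn(1, seq_len, d_k)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)

# Causal mask: position i may attend only to positions 0..i.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # blank out future positions

weights = torch.softmax(scores, dim=-1)  # -inf scores become weights of 0
print(weights[0])                        # a lower-triangular weight matrix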

That essentially covers the concept of multi-headed self-attention. It involves quite a few matrices, but let's compile them into one visual for a comprehensive view. [3]


multi-headed self-attention

Conclusion

As its authors summarized, the Transformer is the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. [1]


References

[1] Vaswani, A., et al. "Attention Is All You Need." arXiv:1706.03762 (arxiv.org)

[2] Ketan Doshi. "Transformers Explained Visually (Part 1): Overview of Functionality." Towards Data Science.

[3] Jay Alammar. "The Illustrated Transformer." (jalammar.github.io)
