Bahdanau Attention Mechanism
In my last NLP post on NMT (Neural Machine Translation), I walked through its architecture in an intuitive manner, covering the Encoder, the Context Vector, and the Decoder.
At the end of that article, I said we could add an Attention Mechanism to our decoder. So, let's continue from there.
Instead of relying only on the encoder's last state, attention lets the decoder look at the complete sequence of the encoder's state outputs. At every prediction step, the decoder forms a weighted average of all those state outputs, with the weights depending on what it needs to produce at that step.
For example, in the English-to-German translation "I went to the shop" -> "Ich ging zum Laden", when predicting the word "ging" the decoder will pay more attention to the earlier part of the English sentence than to the later part.
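To make "a weighted average of all the state outputs" concrete, here is a minimal NumPy sketch. The encoder outputs and the attention weights are made-up illustrative values, not learned ones.

```python
import numpy as np

# Hypothetical encoder state outputs for "I went to the shop"
# (5 tokens, hidden size 4); the values are made up for illustration.
encoder_outputs = np.random.rand(5, 4)

# Made-up attention weights for the step that predicts "ging":
# most of the weight sits on the early tokens ("I", "went").
weights_for_ging = np.array([0.35, 0.45, 0.10, 0.05, 0.05])

# The context used at this decoder step is a weighted average of ALL
# encoder state outputs, not just the last one.
context_for_ging = weights_for_ging @ encoder_outputs  # shape: (4,)
print(context_for_ging.shape)
```

At the next decoder step, a different set of weights would be computed, so the context changes from step to step.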
The context/thought vector is a performance bottleneck
As we saw in the encoder-decoder architecture of NMT, the encoder produces a summarized representation of the source-language sentence, the 'context/thought vector'. This vector is the link between the encoder and the decoder, and it is what the decoder later uses to translate the sentence. Consider the sentence:
I went to the flower market to buy some flowers
This translates to the following:
Ich ging zum Blumenmarkt, um Blumen zu kaufen
To produce this translation, the single context vector must carry, among other things:
1. Information about the subject (I)
2. Information about the verbs (buy and went)
3. Information about the objects (flowers and flower market)
4. Interaction of the subjects, verbs, and objects with each other in the sentence
All of this must be packed into one fixed-length vector, and the packing only gets harder as sentences grow longer and more complex; that is why the context/thought vector becomes a performance bottleneck. A rough sketch of this fixed-size squeeze follows below.
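As an illustration of the squeeze, here is a toy NumPy sketch of a vanilla RNN encoder (dimensions and weights are made up): whether the input has 4 tokens or 40, everything must fit into the same fixed-size hidden vector.

```python
import numpy as np

hidden_size, embed_size = 8, 6
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(embed_size, hidden_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1

def encode(token_embeddings):
    """Run a vanilla RNN over the sentence and return only its last state,
    i.e. the fixed-length context/thought vector."""
    h = np.zeros(hidden_size)
    for x in token_embeddings:
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

short_sentence = rng.normal(size=(4, embed_size))   # 4 tokens
long_sentence = rng.normal(size=(40, embed_size))   # 40 tokens

# Both sentences collapse to the same fixed-size representation.
print(encode(short_sentence).shape, encode(long_sentence).shape)  # (8,) (8,)
```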
How does Attention deal with this issue?
Attention sidesteps this issue: instead of squeezing the whole sentence into one fixed-length vector, the decoder is allowed to look back at all of the encoder's state outputs at every decoding step and weigh them according to what it needs to produce next.
Conceptual Breakdown of the Attention Mechanism:
1. At each decoding step, score every encoder state output against the decoder's previous state.
2. Normalize the scores with a softmax so they become attention weights that sum to 1.
3. Form the context for this step as the weighted average of the encoder state outputs using these weights.
4. Feed this per-step context into the decoder to help it predict the next target word.
The Bahdanau Attention Mechanism
(Also called Additive Attention)
The Bahdanau attention mechanism was introduced in the paper Neural Machine Translation by Jointly Learning to Align and Translate, by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
The attention mechanism was introduced to address the bottleneck problem that arises with the use of a fixed-length encoding vector, where the decoder would have limited access to the information provided by the input. This is thought to become especially problematic for long and/or complex sequences, where the dimensionality of their representation would be forced to be the same as for shorter or simpler sequences.
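As a concrete sketch (illustrative dimensions, not the paper's), the additive scoring step can be written as e_j = v_a · tanh(W_a s_prev + U_a h_j), followed by a softmax and a weighted sum; in a trained model W_a, U_a, and v_a are learned jointly with the rest of the network. Here is a minimal NumPy version:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def bahdanau_attention(decoder_state, encoder_outputs, W_a, U_a, v_a):
    """Additive attention: score every encoder output against the previous
    decoder state, normalize the scores, and return the weighted context."""
    # e_j = v_a . tanh(W_a @ s_prev + U_a @ h_j) for each encoder output h_j
    scores = np.array([
        v_a @ np.tanh(W_a @ decoder_state + U_a @ h_j)
        for h_j in encoder_outputs
    ])
    weights = softmax(scores)            # alpha_j, sums to 1
    context = weights @ encoder_outputs  # weighted average of the h_j
    return context, weights

# Toy dimensions (made up): decoder state 6, five encoder outputs of size 8,
# attention dimension 10.
dec_dim, enc_dim, att_dim, src_len = 6, 8, 10, 5
rng = np.random.default_rng(0)
s_prev = rng.normal(size=dec_dim)
H = rng.normal(size=(src_len, enc_dim))
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
v_a = rng.normal(size=att_dim)

context, alphas = bahdanau_attention(s_prev, H, W_a, U_a, v_a)
print(context.shape, alphas.round(2))  # (8,) and 5 weights summing to 1
```

Because the context is recomputed at every decoding step, long sentences no longer have to be crammed into a single vector up front.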
Simple & Intuitive Explanation of Bahdanau Attention:
Here's a simple and intuitive way to think about it: Bahdanau Attention is like a helpful spotlight that guides you through understanding and translating a sentence by emphasizing the important words. It's a way for machines to focus on the relevant parts of the information when processing sequences of data, making them more efficient and accurate in tasks like translation, summarization, and more.
Other resources to learn about Bahdanau Attention:
BTW, if you are interested in learning more about this, here is my very in-depth notebook on this topic, explaining the concepts and code implementation in great detail.
GitHub Link: Seq2Seq Learning - Implementing NMT System