"Attention" for Neural Machine Translation (NMT) without pain
Ibrahim Sobh - PhD
"Without translation, we would be living in provinces bordering on silence" - George Steiner
Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).
Machine Translation research began in the early 1950s. Systems were mostly rule-based, using bilingual dictionaries to map Russian words to their English counterparts, for instance.
1990s-2010s: Statistical Machine Translation (SMT)
Core idea: Learn a probabilistic model from a large amount of parallel data (e.g., pairs of human-translated French/English sentences)
Translation: Given a French sentence x, we want to find the best English sentence y.
Use Bayes' Rule to break this down into two components that are learned separately: a translation model and a language model.
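Written out, this is the standard SMT decomposition (the formula itself is not in the original post, but it follows directly from the description above):

```latex
\hat{y} = \arg\max_{y} P(y \mid x)
        = \arg\max_{y} \underbrace{P(x \mid y)}_{\text{translation model}} \, \underbrace{P(y)}_{\text{language model}}
```

The translation model is learned from parallel data, while the language model can be learned from monolingual text in the target language.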
- SMT was a huge research field
- The best systems were extremely complex: Lots of feature engineering, maintaining extra resources like tables of equivalent phrases.
Neural Machine Translation (NMT)
Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network
One basic and well known neural network architecture for NMT is called sequence-to-sequence (seq2seq) and it involves two RNNs.
- Encoder: an RNN that encodes the input sequence into a single vector (the sentence encoding)
- Decoder: an RNN that generates the output sequence conditioned on the encoder's output (a conditioned language model)
For more details and code, check this article: Anatomy of sequence-to-sequence for Machine Translation (Simple RNN, GRU, LSTM)
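Here is a minimal sketch of that encoder-decoder setup in PyTorch. It is only meant to illustrate the two components; the vocabulary and hidden sizes (SRC_VOCAB, TGT_VOCAB, EMB, HIDDEN) are made-up placeholders, not values from the article or the linked code.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (placeholders, not from the article)
SRC_VOCAB, TGT_VOCAB, EMB, HIDDEN = 8000, 8000, 256, 512

class Encoder(nn.Module):
    """Encodes the source sentence into a single vector (its final hidden state)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HIDDEN, batch_first=True)

    def forward(self, src_ids):                    # src_ids: (batch, src_len)
        _, h = self.rnn(self.embed(src_ids))       # h: (1, batch, HIDDEN)
        return h                                   # the "sentence encoding"

class Decoder(nn.Module):
    """A conditioned language model: predicts target tokens given the encoding."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)

    def forward(self, tgt_ids, h):                 # tgt_ids: (batch, tgt_len)
        out, h = self.rnn(self.embed(tgt_ids), h)  # conditioned on the encoder state
        return self.out(out), h                    # logits over the target vocabulary
```

The single vector h returned by the encoder is exactly the bottleneck that attention will address later in this article.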
"The translator is a privileged writer who has the opportunity to rewrite masterpieces in their own language." - Javier Marías
Compared to SMT, NMT has many advantages:
- A single neural network to be optimized end-to-end
- No subcomponents to be individually optimized
- Requires much less human engineering effort
- Better use of context
- Better performance
Compared to SMT, NMT has some disadvantages:
- NMT is less interpretable (Hard to debug)
- NMT is difficult to control (can’t easily specify rules or guidelines for translation)
How do we evaluate Machine Translation?
BLEU (Bilingual Evaluation Understudy) compares the machine-written translation to one or several human-written translations, and computes a similarity score based on n-gram precision (for 1-, 2-, 3- and 4-grams).
- BLEU is useful but imperfect: there are many valid ways to translate a sentence, so a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation.
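To make the n-gram precision idea concrete, here is a rough, simplified BLEU for a single sentence and a single reference (clipped n-gram precisions for n = 1..4, geometric mean, brevity penalty). Real implementations work at the corpus level and handle smoothing more carefully, so treat this only as a sketch.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: clipped n-gram precisions + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # tiny floor avoids log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

cand = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(round(simple_bleu(cand, ref), 3))
```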
SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a handful of engineers in a few months!
Attention
The problem with the vanilla seq2seq model is the information bottleneck: the encoding of the source sentence has to capture all the information about it in a single vector, which is then passed to the decoder. As noted in the well-known paper "Neural Machine Translation by Jointly Learning to Align and Translate":
"A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus."
Attention provides a solution to the bottleneck problem.
Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
Attention is basically a technique to compute a weighted sum of the values (in the encoder), dependent on another value (in the decoder).
Step by step:
- We have encoder hidden states h1, ... hN
- On timestep t, we have the decoder hidden state st
- We get the attention scores by taking the dot product of st with each of h1, ... hN (a dot product can be thought of as a measure of how much two vectors point in the same direction)
- We take a softmax to convert the scores into a probability distribution, the attention distribution
- We use this attention distribution to take a weighted sum of the encoder hidden states to get the attention output.
- Finally, we concatenate the attention output with the decoder hidden state and proceed as in the non-attention seq2seq model
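Putting these steps together, here is a rough numpy sketch of one decoder step with dot-product attention; the shapes and variable names are illustrative, not taken from any particular implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_t, H):
    """One decoder step of dot-product attention.

    s_t : (d,)   decoder hidden state at timestep t
    H   : (N, d) encoder hidden states h_1 ... h_N
    """
    scores = H @ s_t                   # attention scores: dot product of s_t with each h_i
    alpha = softmax(scores)            # attention distribution (sums to 1)
    a_t = alpha @ H                    # attention output: weighted sum of encoder states
    return np.concatenate([a_t, s_t]), alpha   # [a_t; s_t] is used to predict the next word

# Toy example: 5 encoder states, hidden size 4
H = np.random.randn(5, 4)
s_t = np.random.randn(4)
context, alpha = attention_step(s_t, H)
print(alpha.sum(), context.shape)      # 1.0 and (8,)
```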
Why is attention great?
- Improves NMT performance: attention allows the decoder to focus on (attend to) the relevant parts of the source, which solves the bottleneck problem.
- Attention helps with the vanishing gradient problem by providing a shortcut to faraway states (think of skip connections in ResNet)
- Attention provides some interpretability: by inspecting attention distribution, we can see what the decoder was focusing on while producing an output.
Attention visualization – an example of the alignments between source and target sentences. (Bahdanau et al., 2015).
Query and Values
We sometimes say that the query attends to the values.
In the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values)
The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
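In formula form (with dot-product scoring, as in the example above; other scoring functions such as multiplicative or additive attention are also used):

```latex
\alpha_i = \frac{\exp(q^\top v_i)}{\sum_{j=1}^{N} \exp(q^\top v_j)},
\qquad
\mathrm{Attention}(q, \{v_1, \dots, v_N\}) = \sum_{i=1}^{N} \alpha_i \, v_i
```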
References:
- CS224n: Natural Language Processing with Deep Learning, Stanford, Winter 2019
- Natural Language Processing with Deep Learning (Winter 2017)