"Attention"? for Neural Machine Translation (NMT) without pain

"Attention" for Neural Machine Translation (NMT) without pain

"Without translation, we would be living in provinces bordering on silence" - George Steiner

Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).

Machine Translation research began in the early 1950s. Early systems were mostly rule-based, using bilingual dictionaries to map, for instance, Russian words to their English counterparts.

1990s-2010s: Statistical Machine Translation (SMT)

Core idea: Learn a probabilistic model from a large amount of parallel data (e.g. pairs of human-translated French/English sentences).

Translation: Given a French sentence x, we want to find the best English sentence y:

argmax_y P(y | x)

Use Bayes Rule to break this down into two components to be learned separately:

argmax_y P(x | y) P(y)
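Spelled out as a worked equation (the two component names below are standard SMT terminology added for clarity; they do not appear in the original article):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Bayes rule: P(y|x) = P(x|y) P(y) / P(x). The denominator P(x) does not
% depend on y, so it can be dropped inside the arg max.
\[
\hat{y} \;=\; \operatorname*{arg\,max}_{y} \, P(y \mid x)
        \;=\; \operatorname*{arg\,max}_{y} \;
              \underbrace{P(x \mid y)}_{\text{translation model}} \;
              \underbrace{P(y)}_{\text{language model}}
\]
\end{document}
```

Here P(x | y) plays the role of a translation model (how words and phrases are translated, learned from parallel data) and P(y) the role of a language model (what fluent English looks like, which can be learned from monolingual text).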
  • SMT was a huge research field
  • The best systems were extremely complex: lots of feature engineering, plus extra resources to build and maintain, such as tables of equivalent phrases.

Neural Machine Translation (NMT)

Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network

One basic and well-known neural network architecture for NMT is called sequence-to-sequence (seq2seq), and it involves two RNNs (a minimal code sketch follows the list below):

  • Encoder: an RNN that encodes the input sequence into a single vector (the sentence encoding)
  • Decoder: an RNN that generates the output sequence conditioned on the encoder's output (a conditional language model)
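To make the encoder/decoder split concrete, here is a minimal PyTorch-style sketch (the class names, the choice of GRU, and the dimensions are illustrative assumptions, not the exact model from the article linked below):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        emb = self.embed(src)                    # (batch, src_len, emb_dim)
        outputs, hidden = self.rnn(emb)          # hidden: (1, batch, hidden_dim)
        return outputs, hidden                   # hidden is the "sentence encoding"

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):              # hidden comes from the encoder
        emb = self.embed(tgt)                    # (batch, tgt_len, emb_dim)
        outputs, hidden = self.rnn(emb, hidden)  # conditioned on the encoder's final state
        return self.out(outputs), hidden         # logits over the target vocabulary
```

Note that in this vanilla setup the only thing the decoder sees from the source sentence is the encoder's final hidden state, which is exactly the bottleneck that attention addresses later in this article.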

For more details and code, check this article: Anatomy of sequence-to-sequence for Machine Translation (Simple RNN, GRU, LSTM)

"The translator is a privileged writer who has the opportunity to rewrite masterpieces in their own language." - Javier Marías 

Compared to SMT, NMT has many advantages:

  • A single neural network to be optimized end-to-end
  • No subcomponents to be individually optimized
  • Requires much less human engineering effort
  • Better use of context
  • Better performance

Disadvantages of NMT compared to SMT:

  • NMT is less interpretable (Hard to debug)
  • NMT is difficult to control (can’t easily specify rules or guidelines for translation)

How do we evaluate Machine Translation?

BLEU (Bilingual Evaluation Understudy) compares the machine-written translation to one or several human-written reference translations and computes a similarity score based on n-gram precision (for 1-, 2-, 3- and 4-grams), plus a penalty for system translations that are too short.

  • BLEU is useful but imperfect: there are many valid ways to translate a sentence, so a good translation can get a poor BLEU score simply because it has low n-gram overlap with the human reference translation(s).
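To make the n-gram precision idea concrete, here is a toy sentence-level sketch (real BLEU is computed at corpus level, handles multiple references, and is available in libraries such as NLTK or sacreBLEU; the function below is only an illustration):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference, max_n=4):
    """Toy BLEU: geometric mean of clipped 1..4-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # clipped matches: a candidate n-gram counts at most as often as it appears in the reference
        overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing in this toy version
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(toy_bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))  # ~0.54
```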


SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a handful of engineers in a few months!


Attention

The problem with vanilla seq2seq is the information bottleneck: the encoding of the source sentence has to capture all of its information in a single vector that is passed to the decoder. As the well-known paper "Neural Machine Translation by Jointly Learning to Align and Translate" puts it:

"A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus."


Attention provides a solution to the bottleneck problem.


Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.

Attention is basically a technique to compute a weighted sum of some values (the encoder hidden states), dependent on a query (the decoder hidden state).


Step by step, the mechanism works as follows (a code sketch follows the list):

  1. We have the encoder hidden states h1, ..., hN.
  2. On timestep t, we have the decoder hidden state st.
  3. We compute the attention scores as the dot products of st with each of h1, ..., hN (a dot product can be thought of as a measure of how much two vectors point in the same direction).
  4. We apply a softmax to the scores to get the attention distribution, i.e. to convert the scores into probabilities.
  5. We use this attention distribution to take a weighted sum of the encoder hidden states; this is the attention output.
  6. Finally, we concatenate the attention output with the decoder hidden state and proceed as in the non-attention seq2seq model.
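Here is a tiny NumPy sketch of these six steps for a single decoder timestep (the sizes and random values are made up purely for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes: N source positions, hidden dimension d (illustrative values)
N, d = 5, 8
rng = np.random.default_rng(0)

h = rng.normal(size=(N, d))    # step 1: encoder hidden states h1 ... hN
s_t = rng.normal(size=(d,))    # step 2: decoder hidden state at timestep t

scores = h @ s_t               # step 3: dot-product attention scores, one per source position
alpha = softmax(scores)        # step 4: attention distribution (sums to 1)
a_t = alpha @ h                # step 5: attention output = weighted sum of encoder states

combined = np.concatenate([a_t, s_t])   # step 6: concatenate and proceed as in vanilla seq2seq
print(alpha.round(3), combined.shape)   # probabilities over the source, plus a vector of size 2*d
```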


Why is attention great?

  • Attention improves NMT performance: it allows the decoder to focus on (attend to) the relevant parts of the source, which solves the bottleneck problem.
  • Attention helps with the vanishing gradient problem by providing a shortcut to faraway states (think of the skip connections in ResNet).
  • Attention provides some interpretability: by inspecting the attention distribution, we can see what the decoder was focusing on while producing each output.

Attention visualization – an example of the alignments between source and target sentences. (Bahdanau et al., 2015).


Query and Values

We sometimes say that the query attends to the values.


In the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values).


The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.

Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
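The same idea, written as a reusable function (a minimal sketch; in the seq2seq + attention model the query is the decoder state and the values are the encoder states):

```python
import numpy as np

def attend(query, values):
    """Return a fixed-size summary of `values` (shape (N, d)), weighted by how
    well each value matches `query` (shape (d,)), using dot-product scores."""
    scores = values @ query                  # one score per value
    weights = np.exp(scores - scores.max())  # softmax, shifted for numerical stability
    weights /= weights.sum()                 # attention distribution
    return weights @ values                  # weighted sum: the selective summary

summary = attend(np.ones(4), np.eye(4))      # toy usage: 4 one-hot values, uniform weights
```

Because the output is a weighted sum, its size depends only on the dimensionality of the values, not on how many of them there are.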


References:

Bahdanau, Cho and Bengio (2015), Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015

CS224n: Natural Language Processing with Deep Learning, Stanford, Winter 2019

Natural Language Processing with Deep Learning, Winter 2017

Neural Machine Translation (seq2seq) Tutorial
