Transformers without pain
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
"We've suffered losses, But we've not lost the war." - Optimus Prime
Contents:
- What is wrong with RNNs and CNNs
- A High-Level Look
- Machine Translation Task
- Architecture main components
- What is attention? Where is attention?
- How to represent the order of words without RNNs?
- Generating words
- The big picture
- The future is here
- How to start?
1) What is wrong with RNNs and CNNs
Learning representations of variable-length data is a basic building block of sequence-to-sequence learning for neural machine translation, summarization, etc.
- Recurrent Neural Networks are a natural fit for variable-length sentences and sequences of pixels, but sequential computation inhibits parallelization, and there is no explicit modeling of long- and short-range dependencies.
- Convolutional Neural Networks are trivial to parallelize (per layer) and exploit local dependencies. However, long-distance dependencies require many layers.
Attention between encoder and decoder is crucial in NMT. Why not use attention for representations?
The Transformer was proposed in the paper Attention is All You Need.
"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."
RIP RNNs and CNNs!
"Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train."
2) A High-Level Look
I know, this looks scary! However, the idea is very simple. Let's abstract and simplify everything for a better understanding, without pain (hopefully).
3) Machine Translation Task
Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).
- Input: a sequence of source tokens (words/word pieces)
- Output: a sequence of target tokens (words/word pieces)
4) Architecture main components
Encoder component and Decoder component (No RNN!)
- The encoding component is a stack of 6 encoders, each composed of a number of sub-layers.
- The decoding component is a stack of 6 decoders, each composed of a number of sub-layers.
So far, this should look familiar: just an encoder and a decoder, where the decoder has access to some output of the encoder. But where is the magic? There are no RNNs; learning the sequences for machine translation is based entirely on the attention mechanism!
5) What is attention? Where is attention?
Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
Attention is basically a technique to compute a weighted sum of values (in the encoder), dependent on another vector (the query, in the decoder).
An example of using attention for machine translation is discussed here: "Attention" for Neural Machine Translation (NMT) without pain.
Self-Attention in the Encoder: the word "self" is used here to indicate that each input word to the encoder (a vector of real numbers) is projected (or transformed) via learnable parameters of the network into another vector that is actually some sort of weighted sum of the other words in the input. Accordingly, each word in the input will attend to different words in the same input; this is self-attention. Remember, we have 6 encoders, where each input word is transformed along its way until the last encoder.
Self-attention: the model learns what parts of a sequence are important when a particular word in the same sequence is considered.
Self-attention basic steps:
- Assume the input word is represented as a vector (embedding) with a dimensionality of 512.
- Each word is projected into three other vectors, each with a dimensionality of 64 (imagine we have fully connected layers with 512 inputs and 64 outputs).
- These vectors are called query (q), key (k), and value (v). (These terms are related to information retrieval). This is a generalization of the same idea of seq2seq machine translation with attention.
- As shown in the figure, the query vector is used to compute a weighted sum of the values through the keys. Specifically: take the dot product of q with all the keys, scale by the square root of the key dimension, apply a softmax to get the weights, and finally use these weights to compute a weighted sum of the values (see the sketch below).
The weighted sum of values (v) is a selective summary of the information contained in them, where the query (q) determines which values to focus on at each timestep.
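A minimal NumPy sketch of these steps, under the 512/64 dimensions mentioned above (the function name and toy data are illustrative, not taken from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, K, V):
    """q: (d_k,) query; K: (n, d_k) keys; V: (n, d_v) values."""
    scores = K @ q / np.sqrt(K.shape[-1])   # one score per input word, scaled by sqrt(d_k)
    weights = softmax(scores)               # softmax -> attention weights that sum to 1
    return weights @ V                      # weighted sum of the values

# Toy example: 5 input words, d_k = d_v = 64
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(5, 64)), rng.normal(size=(5, 64))
out = scaled_dot_product_attention(q, K, V)  # shape (64,)
```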
Multi-head attention: the intuition is similar to having multiple filters in CNNs. Here we can have multi-head attention to give the network more capacity and the ability to learn different attention patterns.
By having multiple different layers that generate (or project) the vectors of queries, keys and values, we can learn multiple representations of these queries, keys and values.
For example, a word w may attend to word x in head 1 (representing some semantic connection), while the same word w attends to another word b in another head. In practice we have 8 attention heads, and accordingly the outputs of the 8 heads are concatenated to obtain vectors of 8*64 = 512 dimensions (the same as the input word embeddings).
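A rough NumPy sketch of the multi-head idea, reusing scaled_dot_product_attention from the sketch above (the random matrices below are stand-ins for the learned per-head projections):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_model, d_head = 8, 512, 64
x = rng.normal(size=(5, d_model))                # 5 input word embeddings

head_outputs = []
for _ in range(n_heads):
    # Random stand-ins for this head's learned projection matrices (W_q, W_k, W_v)
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    q = x[0] @ Wq                                # query for the first word
    head_outputs.append(scaled_dot_product_attention(q, x @ Wk, x @ Wv))

multi_head = np.concatenate(head_outputs)        # 8 * 64 = 512 dims, same as the embedding size
```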
Self-Attention in the Decoder: This is the same as in the encoder, except that the encoder has access to the whole input sequence (past, present, and future words), while the decoder has access only to the past and present words and cannot see the future words, because they are not generated yet. This is why it is called masked self-attention: the future words are masked out.
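A hedged sketch of how this masking is commonly implemented, reusing the softmax helper from the earlier sketch: scores for future positions are set to negative infinity before the softmax, so their attention weights become zero.

```python
import numpy as np

def causal_mask_softmax(scores):
    """scores: (n, n) matrix where row i holds token i's scores against all n tokens."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future tokens
    masked = np.where(future, -np.inf, scores)          # future positions get -inf
    return softmax(masked, axis=-1)                     # -> zero attention weight on the future
```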
Encoder-decoder attention: This is not self-attention; it is the attention used by the decoder to attend to (or focus on) values from the encoder. The decoder uses the query vector, which is applied to the key and value vectors provided by the encoder.
As shown, we have 3 connected arrows in the first self-attention layers of both the encoder and the decoder, indicating the q, k, and v vectors. In the encoder-decoder attention layer in the decoder, we have only one arrow from the decoder itself, representing the single q vector for the current token, and 2 other arrows from the encoder passing the k and v vectors (for the whole input sequence).
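To make the arrows concrete, cross-attention can be sketched as follows, again reusing the scaled_dot_product_attention helper from the earlier sketch (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(7, 64))   # k and v come from the encoder (7 source tokens)
decoder_state   = rng.normal(size=64)        # q comes from the decoder's current token
context = scaled_dot_product_attention(decoder_state, encoder_outputs, encoder_outputs)
```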
Enough with the attention! We now know that attention is just a learnable weighted sum of vectors (values) based on another vector (the query).
6) How to represent the order of words without RNNs?
The relative position of tokens is naturally available in traditional RNN models, but it is now lost given the parallelism.
Positional Encoding is used to represent the order of the sequence. The transformer adds a vector to each input embedding. These vectors follow a specific pattern (generated offline) that the model learns, which helps it determine the position of words and the distance between different words in a sequence. Positional Encodings are added only to the input of both the encoder and the decoder components.
Positional encodings slightly change the original input word embedding vectors by adding positional information.
Think of the positional encoders as a simple lookup table of vectors (based on sin and cos functions) added to the original word embeddings, based on the word position in the sequence.
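A small standalone sketch of that sin/cos lookup table; the formula follows the original paper, while the variable names and toy embeddings are mine:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) lookup table of sin/cos positional encodings."""
    pos = np.arange(max_len)[:, None]        # token positions 0..max_len-1
    i = np.arange(d_model)[None, :]          # embedding dimensions 0..d_model-1
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions use cosine
    return pe

# Add the positional information to some word embeddings (toy values)
embeddings = np.random.default_rng(0).normal(size=(10, 512))
embeddings_with_position = embeddings + positional_encoding(10, 512)
```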
Additional note: In these architectures, we can see that each sub-layer (self-attention, fully connected) has a residual connection around it (think of ResNet), and is followed by a layer-normalization step. Simply, the output from the Multi-head attention block or the Feed-forward block is merged with the residual (addition), and the result is layer normalized.
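In code, this "Add & Norm" step around each sub-layer looks roughly like the following (a simplified layer norm without the learned gain and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each vector to zero mean and unit variance along the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)
```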
7) Generating words
The last layer of the decoder is a softmax layer that predicts a probability distribution over the vocabulary words at time t+1, based on the predicted words from 1 to t (input to the decoder), and also based on the information the encoder provides about the input sequence. Think of the decoder as an autoregressive conditional language model that generates words one by one.
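Conceptually, generation is a loop like the following sketch, where decoder_step is a hypothetical function standing in for the full decoder plus softmax:

```python
import numpy as np

def greedy_decode(decoder_step, encoder_outputs, bos_id, eos_id, max_len=50):
    """Generate target token ids one at a time, taking the argmax at each step."""
    tokens = [bos_id]
    for _ in range(max_len):
        probs = decoder_step(tokens, encoder_outputs)  # distribution over the vocabulary
        next_id = int(np.argmax(probs))                # greedy choice for time t+1
        tokens.append(next_id)
        if next_id == eos_id:                          # stop when end-of-sequence is produced
            break
    return tokens
```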
8) The big picture
- Attention is all you need, literally!
- Transformers are used for seq2seq machine translation tasks, without RNN, without CNN.
- Attention is applied within the same sequence (self-attention) and between the encoder and decoder.
- Sequence order is maintained via positional encoders.
- For translation tasks, Transformers are trained on bilingual datasets (e.g., WMT 2014 English-to-German).
This attention-only architecture enables parallelization, boosting the training process on massive datasets.
9) The future is here
Based on this architecture (the vanilla Transformers!), encoder or decoder components are used alone to enable massive pre-trained generic models that can be used and fine-tuned for downstream tasks such as text classification, translation, summarization, question answering, etc.
For example, BERT ("Pre-training of Deep Bidirectional Transformers for Language Understanding") is mainly based on the encoder architecture, trained on massive text datasets to predict randomly masked words and to perform a next-sentence classification task. GPT, on the other hand, is an auto-regressive generative model (unlike BERT, GPT can generate sequences) that is mainly based on the decoder architecture (with masked self-attention and without the encoder-decoder attention).
These models, BERT and GPT for instance, are considered NLP's ImageNet moment.
Here is a nice source code that implements a Transformer block as a Keras layer and uses it for text classification.
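The linked example is roughly along these lines; a condensed sketch of such a layer (assuming TensorFlow 2.x, where tf.keras.layers.MultiHeadAttention is available) might look like:

```python
import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    """One encoder block: multi-head self-attention + feed-forward, each with Add & Norm."""
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs):
        attn_output = self.att(inputs, inputs)        # self-attention over the sequence
        out1 = self.norm1(inputs + attn_output)       # Add & Norm
        ffn_output = self.ffn(out1)                   # position-wise feed-forward
        return self.norm2(out1 + ffn_output)          # Add & Norm
```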
Transformers have become prominent architectures since 2017. Moreover, Transformers have recently been used for computer vision tasks such as classification and object detection.
10) How to start?
Hugging Face Transformers provides thousands of pretrained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
Hugging Face Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with seamless integration between them, allowing you to train your models with one and then load them for inference with the other.
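A quick way to try it, assuming the transformers package is installed (a default pretrained model is downloaded on first use):

```python
from transformers import pipeline

# Sentiment analysis with a default pretrained model
classifier = pipeline("sentiment-analysis")
result = classifier("Transformers without pain is a great read!")
print(result)  # a list of dicts with 'label' and 'score' keys
```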
"Sometimes even the wisest of Man or Machine can make an error." - Optimus Prime
References:
CS224n: Natural Language Processing with Deep Learning Stanford / Winter 2019