FNet: Do we need the attention layer at all? [Explained with code]
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
"In this work, we investigate whether simpler token mixing mechanisms can wholly replace the relatively complicated self-attention layers in Transformer encoder architectures."
Giving up the attention mechanism is an interesting direction of research happening right now. Attention is undeniably powerful; however, it requires a lot of memory and compute.
"Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that “mix” input tokens."
Replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves comparable accuracy while training 70% to 80% faster.
FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
Efficient Transformers
"Most efforts to improve attention efficiency are based on sparsifying the attention matrix"
Fourier Transform
Jean Baptiste Fourier (1768-1830) showed that any signal or waveform can be built up simply by adding together a series of pure tones (sine waves) with appropriate amplitudes and phases.
Discrete Fourier Transforms (DFTs), and in particular the Fast Fourier Transform (FFT), have long been used to tackle signal processing problems. Moreover, because ordinary multiplication in the frequency domain corresponds to convolution in the time domain, FFTs have been deployed in Convolutional Neural Networks (CNNs) to speed up computations. DFTs have also been used indirectly in several Transformer works; the Performer, for example, linearizes the complexity of the Transformer self-attention mechanism by leveraging random Fourier features.
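To make the convolution theorem concrete, here is a minimal NumPy sketch (my own illustration, not from the paper) checking that multiplying two spectra element-wise and transforming back gives the circular convolution of the original signals:

```python
import numpy as np

# Two short random signals of length N.
rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N)
h = rng.standard_normal(N)

# Circular convolution computed directly in the time domain.
direct = np.array([sum(x[m] * h[(n - m) % N] for m in range(N)) for n in range(N)])

# Same result via the frequency domain: FFT -> pointwise product -> inverse FFT.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

assert np.allclose(direct, via_fft)
print("circular convolution matches the FFT-based result")
```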
FNet architecture
The main idea is to route information between tokens. The exact routing might not be as important as the fact that information is flowing, and that's what the Fourier transform actually does.
FNet is an attention-free Transformer architecture (Transformer-like architecture), wherein each layer consists of a Fourier mixing sublayer followed by a feed-forward sublayer.
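To make the layer structure concrete, here is a hedged PyTorch sketch of a single FNet encoder block based on the paper's description (the official implementation is in JAX/Flax; the dimension sizes and dropout value here are illustrative assumptions). The mixing sublayer applies a 2D FFT, one transform along the hidden dimension and one along the sequence dimension, keeping only the real part, so it contains no learnable parameters.

```python
import torch
import torch.nn as nn

class FNetBlock(nn.Module):
    """Sketch of one FNet encoder layer: a parameter-free Fourier mixing
    sublayer followed by a feed-forward sublayer, each wrapped with a
    residual connection and layer normalization."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.mixing_norm = nn.LayerNorm(d_model)
        self.ff_norm = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def fourier_mix(self, x: torch.Tensor) -> torch.Tensor:
        # 2D DFT: one FFT along the hidden dimension, one along the sequence
        # dimension; only the real part is kept, so this sublayer has no parameters.
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = self.mixing_norm(x + self.fourier_mix(x))
        x = self.ff_norm(x + self.feed_forward(x))
        return x

# Quick shape check.
block = FNetBlock()
tokens = torch.randn(2, 128, 512)
print(block(tokens).shape)  # torch.Size([2, 128, 512])
```

Because the Fourier sublayer is parameter-free, all of the layer's learnable weights live in the feed-forward sublayer and the layer norms.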
"Fourier Transform, despite having no parameters at all, achieves nearly the same performance as dense linear mixing and scales very efficiently to long inputs"
Results
Replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths.
"FNet offers an excellent compromise between speed, memory footprint, and accuracy"
"we demonstrated that, for a fixed speed and accuracy budget, small FNet encoders outperform small Transformer models"
"because of its favorable scaling properties, FNet is very competitive with the “efficient” Transformers"
Conclusions
"Adding only a few self-attention sublayers to FNet offers a simple way to trade off speed for accuracy"
A very intuitive, unofficial implementation can be found here.
The official implementation is also available.
Regards