FNet: Do we need the attention layer at all? [Explained with code]

FNet: Mixing Tokens with Fourier Transforms

"In this work, we investigate whether simpler token mixing mechanisms can wholly replace the relatively complicated self-attention layers in Transformer encoder architectures."

Giving up the attention mechanism entirely is an interesting direction of current research. The attention mechanism is certainly powerful, but it requires a lot of memory and compute.

"Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that “mix” input tokens."


Replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves comparable accuracy while training 70-80% faster.


FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.


Efficient Transformers

"Most efforts to improve attention efficiency are based on sparsifying the attention matrix"

Examples of such sparse-attention models include:

  • Longformer
  • Big Bird


Fourier Transform

Jean Baptiste Fourier (1768-1830) showed that any signal or waveform can be built up by adding together a series of pure tones (sine waves) with appropriate amplitudes and phases.


Discrete Fourier Transforms (DFT), and in particular the Fast Fourier Transform (FFT), were used to tackle signal processing problems. Moreover, because ordinary multiplication in the frequency domain corresponds to a convolution in the time domain, FFTs have been deployed in Convolutional Neural Networks (CNNs) to speed up computations. DFTs have been used indirectly in several Transformer works. The Performer linearizes the complexity of the Transformer self-attention mechanism by leveraging random Fourier features.
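As a small illustration of the convolution property mentioned above, here is a minimal NumPy sketch (with made-up toy signals) checking that pointwise multiplication of spectra matches a direct circular convolution:

```python
import numpy as np

# Toy signals (made up for illustration) demonstrating the convolution theorem:
# pointwise multiplication in the frequency domain corresponds to a
# circular convolution in the time domain.
x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.25, 0.0, 0.0])

# Multiply the spectra, then transform back to the time domain.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

# Direct circular convolution for comparison.
n = len(x)
direct = np.array([sum(x[m] * k[(i - m) % n] for m in range(n)) for i in range(n)])

print(np.allclose(via_fft, direct))  # True
```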


FNet architecture

The main idea is to route information between tokens. The exact routing may matter less than the fact that information flows at all, and that mixing is exactly what the Fourier transform provides.

FNet is an attention-free, Transformer-like architecture in which each layer consists of a Fourier mixing sublayer followed by a feed-forward sublayer. The mixing sublayer applies a discrete Fourier transform along the sequence dimension and another along the hidden dimension, keeping only the real part of the result.

"Fourier Transform, despite having no parameters at all, achieves nearly the same performance as dense linear mixing and scales very efficiently to long inputs"


Results

Replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths.

Settings:

  • FNet encoder: we replace every self-attention sublayer with a Fourier sublayer
  • Linear encoder: we replace each self-attention sublayer with two learnable, dense, linear sublayers, one applied to the hidden dimension and one applied to the sequence dimension (similar to MLP-Mixer; see the sketch after this list)
  • Random encoder: we replace each self-attention sublayer with two constant random matrices, one applied to the hidden dimension and one applied to the sequence dimension.
  • Feed Forward-only (FF-only) encoder: we remove the self-attention sublayer from the Transformer layers; this model has no token mixing.
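For intuition, here is a rough sketch (with hypothetical sizes) of how the Linear and Random baselines mix tokens, in contrast to the parameter-free Fourier sublayer:

```python
import torch.nn as nn

class LinearMixing(nn.Module):
    """Sketch of the Linear-encoder baseline: two learnable dense matrices,
    one mixing along the hidden dimension and one along the sequence
    dimension (hypothetical sizes, not the authors' exact configuration)."""
    def __init__(self, seq_len=128, d_model=256):
        super().__init__()
        self.mix_hidden = nn.Linear(d_model, d_model, bias=False)
        self.mix_seq = nn.Linear(seq_len, seq_len, bias=False)

    def forward(self, x):                                    # x: (batch, seq_len, d_model)
        x = self.mix_hidden(x)                               # mix the hidden dimension
        x = self.mix_seq(x.transpose(1, 2)).transpose(1, 2)  # mix the sequence dimension
        return x

# The Random encoder is the same computation with frozen random matrices
# instead of learned ones; the FF-only encoder drops token mixing entirely.
```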



"FNet offers an excellent compromise between speed, memory footprint, and accuracy"


"we demonstrated that, for a fixed speed and accuracy budget, small FNet encoders outperform small Transformer models"


"because of its favorable scaling properties, FNet is very competitive with the “efficient” Transformers"


Conclusions

  • In some situations we do not need full (or even partial) attention mechanisms; it can be enough to simply mix information between tokens somehow.
  • The Fourier transform is an attractive option for this mixing because it has no parameters, is fast, and makes sense at a conceptual level.

"Adding only a few self-attention sublayers to FNet offers a simple way to trade off speed for accuracy"


A very intuitive implementation can be found here



The official implementation is also available.


Regards
