FNet: Do we need the attention layer at all? [Explained with code]

FNet: Mixing Tokens with Fourier Transforms

"In this work, we investigate whether simpler token mixing mechanisms can wholly replace the relatively complicated self-attention layers in Transformer encoder architectures."

Giving up the attention mechanism entirely is an interesting direction of current research. The attention mechanism is certainly powerful, but it requires a lot of memory and compute.

"Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that “mix” input tokens."


Replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves comparable accuracy while training 70-80% faster.


FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.


Efficient Transformers

"Most efforts to improve attention efficiency are based on sparsifying the attention matrix"

Examples of such sparse-attention models include:

  • Longformer
  • Big Bird


Fourier Transform

Jean Baptiste Fourier (1768-1830) showed that any signal or waveform can be built up by adding together a series of pure tones (sine waves) with appropriate amplitudes and phases.


Discrete Fourier Transforms (DFT), and in particular the Fast Fourier Transform (FFT), were used to tackle signal processing problems. Moreover, because ordinary multiplication in the frequency domain corresponds to a convolution in the time domain, FFTs have been deployed in Convolutional Neural Networks (CNNs) to speed up computations. DFTs have been used indirectly in several Transformer works. The Performer linearizes the complexity of the Transformer self-attention mechanism by leveraging random Fourier features.
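As a small illustration of the convolution property mentioned above, here is a minimal NumPy sketch (with made-up toy signals) checking that pointwise multiplication of spectra matches a direct circular convolution:

```python
import numpy as np

# Toy signals (made up for illustration) demonstrating the convolution theorem:
# pointwise multiplication in the frequency domain corresponds to a
# circular convolution in the time domain.
x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.25, 0.0, 0.0])

# Multiply the spectra, then transform back to the time domain.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

# Direct circular convolution for comparison.
n = len(x)
direct = np.array([sum(x[m] * k[(i - m) % n] for m in range(n)) for i in range(n)])

print(np.allclose(via_fft, direct))  # True
```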


FNet architecture

The main idea is to route information between tokens. The exact routing may matter less than the fact that information flows at all, and that mixing is exactly what the Fourier transform provides.

FNet is an attention-free, Transformer-like architecture in which each layer consists of a Fourier mixing sublayer followed by a feed-forward sublayer. The mixing sublayer applies a discrete Fourier transform along the sequence dimension and another along the hidden dimension, keeping only the real part of the result.

"Fourier Transform, despite having no parameters at all, achieves nearly the same performance as dense linear mixing and scales very efficiently to long inputs"


Results

Replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths.

Settings:

  • FNet encoder: we replace every self-attention sublayer with a Fourier sublayer
  • Linear encoder: we replace each self-attention sublayer with two learnable, dense, linear sublayers, one applied to the hidden dimension and one applied to the sequence dimension (similar to MLP-Mixer; see the sketch after this list)
  • Random encoder: we replace each self-attention sublayer with two constant random matrices, one applied to the hidden dimension and one applied to the sequence dimension.
  • Feed Forward-only (FF-only) encoder: we remove the self-attention sublayer from the Transformer layers; this model has no token mixing.
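For intuition, here is a rough sketch (with hypothetical sizes) of how the Linear and Random baselines mix tokens, in contrast to the parameter-free Fourier sublayer:

```python
import torch.nn as nn

class LinearMixing(nn.Module):
    """Sketch of the Linear-encoder baseline: two learnable dense matrices,
    one mixing along the hidden dimension and one along the sequence
    dimension (hypothetical sizes, not the authors' exact configuration)."""
    def __init__(self, seq_len=128, d_model=256):
        super().__init__()
        self.mix_hidden = nn.Linear(d_model, d_model, bias=False)
        self.mix_seq = nn.Linear(seq_len, seq_len, bias=False)

    def forward(self, x):                                    # x: (batch, seq_len, d_model)
        x = self.mix_hidden(x)                               # mix the hidden dimension
        x = self.mix_seq(x.transpose(1, 2)).transpose(1, 2)  # mix the sequence dimension
        return x

# The Random encoder is the same computation with frozen random matrices
# instead of learned ones; the FF-only encoder drops token mixing entirely.
```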



"FNet offers an excellent compromise between speed, memory footprint, and accuracy"


"we demonstrated that, for a fixed speed and accuracy budget, small FNet encoders outperform small Transformer models"


"because of its favorable scaling properties, FNet is very competitive with the “efficient” Transformers"


Conclusions

  • In some situations we do not need full (or even partial) attention mechanisms; it can be enough to simply mix information between tokens somehow.
  • The Fourier transform is an attractive option for this mixing because it has no parameters, is fast, and makes sense at a conceptual level.

"Adding only a few self-attention sublayers to FNet offers a simple way to trade off speed for accuracy"


A very intuitive implementation can be found here



The official implementation is also available.


Regards
