MLP is all you need! [with code]

From Google: MLP-Mixer: An all-MLP Architecture for Vision

Main idea:

"While convolutions and attention are both sufficient for good performance, neither of them is necessary!"


"Mixer is a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention."

The MLP-Mixer is an architecture based exclusively on multi-layer perceptrons (MLPs). It contains two types of MLP layers:

  1. One applied independently to image patches, which mixes the per-location features.
  2. The other is applied across patches (for each channel independently), which mixes spatial information.

The idea behind the Mixer architecture is to clearly separate the per-location (channel-mixing) operations from the cross-location (token-mixing) operations. Both operations are implemented with MLPs.


Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.
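
To make that claim concrete, here is a minimal NumPy sketch (not the paper's code) of one token-mixing step and one channel-mixing step on a table of S patch tokens with C channels each; layer normalization is omitted for brevity, and all sizes and weights are illustrative:

```python
import numpy as np

def gelu(x):
    # Scalar nonlinearity (tanh approximation of GELU).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Illustrative sizes: S patches (tokens), C channels, MLP hidden widths Ds and Dc.
S, C, Ds, Dc = 196, 512, 256, 2048
X = np.random.randn(S, C)                    # the "patches x channels" table

# Token mixing: transpose so patches lie along the last axis, then two matmuls.
W1, W2 = np.random.randn(S, Ds), np.random.randn(Ds, S)
X = X + (gelu(X.T @ W1) @ W2).T              # mixes information across patches

# Channel mixing: each row (token) of the table is transformed independently.
W3, W4 = np.random.randn(C, Dc), np.random.randn(Dc, C)
X = X + gelu(X @ W3) @ W4                    # mixes information across channels

print(X.shape)                               # dimensionality is preserved: (196, 512)
```

Everything above is matrix multiplication, a transposition, and an element-wise nonlinearity, plus skip connections.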


"A convolution is more complex than the plain matrix multiplication"


How it works

Mixer accepts as input a sequence of linearly projected image patches (tokens), shaped as a “patches × channels” table, and maintains this dimensionality. Mixer makes use of two types of MLP layers:

  • Channel-mixing MLPs: allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs.
  • Token-mixing MLPs: allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs.

These two types of layers are interleaved to enable interaction of both input dimensions.
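
In Keras, one Mixer block could be sketched roughly as follows. This is only a sketch of the description above, with layer normalization and a skip connection around each MLP as in the paper; names such as `tokens_mlp_dim` and `channels_mlp_dim` are illustrative, not taken from any official implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_dim, out_dim):
    # Two dense layers with a GELU nonlinearity in between.
    x = layers.Dense(hidden_dim, activation="gelu")(x)
    return layers.Dense(out_dim)(x)

def mixer_block(x, num_patches, channels, tokens_mlp_dim, channels_mlp_dim):
    # x has shape (batch, patches, channels), i.e. the "patches x channels" table.
    # Token mixing: transpose so the MLP acts along the patch (column) axis.
    y = layers.LayerNormalization()(x)
    y = layers.Permute((2, 1))(y)              # (batch, channels, patches)
    y = mlp(y, tokens_mlp_dim, num_patches)
    y = layers.Permute((2, 1))(y)              # back to (batch, patches, channels)
    x = layers.Add()([x, y])                   # skip connection

    # Channel mixing: the MLP acts along the channel (row) axis of each token.
    y = layers.LayerNormalization()(x)
    y = mlp(y, channels_mlp_dim, channels)
    return layers.Add()([x, y])                # skip connection
```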


"Despite its simplicity, Mixer attains competitive results. When pre-trained on large datasets it reaches near state-of-the-art performance, previously claimed by CNNs and Transformers"


"The computational complexity of the network is linear in the number of input patches, unlike ViT whose complexity is quadratic"


"Each layer in Mixer takes an input of the same size. This “isotropic” design is most similar to Transformers"


Unlike ViTs, Mixer does not use position embeddings because the token-mixing MLPs are sensitive to the order of the input tokens.

"As expected, Mixer is invariant to the order of patches and pixels within the patches"


"Mixer uses a standard classification head with the global average pooling layer followed by a linear classifier"

Code

A simple and intuitive Keras implementation is available here.
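
To give a feel for how the pieces fit together, here is a minimal sketch of assembling the full model (not the linked implementation), reusing the `mixer_block` function from the sketch above. The strided `Conv2D` implements the per-patch linear projection, and all hyperparameter values are illustrative rather than the paper's configurations:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mixer(image_size=32, patch_size=4, num_blocks=4, channels=128,
                tokens_mlp_dim=64, channels_mlp_dim=512, num_classes=10):
    num_patches = (image_size // patch_size) ** 2
    inputs = layers.Input(shape=(image_size, image_size, 3))

    # Per-patch linear projection: a strided convolution is equivalent to
    # cutting the image into non-overlapping patches and applying a shared
    # Dense layer to each one.
    x = layers.Conv2D(channels, kernel_size=patch_size, strides=patch_size)(inputs)
    x = layers.Reshape((num_patches, channels))(x)   # the "patches x channels" table

    # Interleaved token-mixing / channel-mixing blocks (see mixer_block above).
    for _ in range(num_blocks):
        x = mixer_block(x, num_patches, channels, tokens_mlp_dim, channels_mlp_dim)

    # Standard classification head: global average pooling + linear classifier.
    x = layers.LayerNormalization()(x)               # final layer norm before pooling
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)

model = build_mixer()    # CIFAR10-sized inputs by default in this sketch
model.summary()
```

The paper's actual model scales (e.g. Mixer-B/16) use considerably larger widths and depths.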


Results

A simple MLP-based model is competitive with today’s best convolutional and attention-based models.

Visualization

It is commonly observed that the first layers of CNNs tend to learn detectors that act on pixels in local regions of the image. In contrast, Mixer allows for global information exchange in the token-mixing MLPs.

The figure shows hidden units of the first three token-mixing MLPs of Mixer trained on JFT-300M.

Recall that the token-mixing MLPs allow global communication between different spatial locations.

  • Some of the learned features operate on the entire image, while others operate on smaller regions.
  • Deeper layers appear to have no clearly identifiable structure. Similar to CNNs, we observe many pairs of feature detectors with opposite phases.
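
A hedged sketch of how such pictures can be produced: each hidden unit of the first dense layer in a token-mixing MLP has one incoming weight per patch, so its weight vector can be reshaped onto the patch grid and displayed as a small image. The layer lookup at the bottom is a hypothetical example for the `build_mixer` sketch above, not part of the linked code:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_token_mixing_units(weights, grid_size, num_units=16):
    """Show hidden units of a token-mixing MLP's first dense layer.

    weights: array of shape (num_patches, hidden_dim), one incoming weight
             per patch for each hidden unit.
    grid_size: side length of the patch grid (num_patches == grid_size ** 2).
    """
    fig, axes = plt.subplots(4, num_units // 4, figsize=(8, 8))
    for unit, ax in enumerate(axes.flat):
        # Reshape the unit's incoming weights onto the spatial patch grid.
        ax.imshow(weights[:, unit].reshape(grid_size, grid_size), cmap="coolwarm")
        ax.set_axis_off()
    plt.tight_layout()
    plt.show()

# Smoke test with random weights (a trained model would show structured patterns):
plot_token_mixing_units(np.random.randn(64, 32), grid_size=8)

# Hypothetical lookup for the build_mixer sketch above: the first Dense layer
# inside the first token-mixing MLP has a kernel of shape (num_patches, tokens_mlp_dim).
# w = model.layers[IDX].get_weights()[0]
# plot_token_mixing_units(w, grid_size=8)
```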

Conclusions

  • Very simple architecture for vision.
  • As good as existing state-of-the-art methods in terms of the trade-off between accuracy and the computational resources required for training and inference.
  • We hope that our results spark further research, beyond the realms of established models based on convolutions and self-attention.

Educational code applied to CIFAR10, with visualization, is available.



This figure shows the hidden units of the four token-mixing MLPs of Mixer trained on CIFAR10.

Regards
