MLP is all you need! [with code]
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
From Google: MLP-Mixer: An all-MLP Architecture for Vision
Main idea:
"While convolutions and attention are both sufficient for good performance, neither of them is necessary!"
"Mixer is a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention"
The MLP-Mixer is an architecture based exclusively on multi-layer perceptrons (MLPs). It contains two types of MLP layers: one applied independently to image patches (mixing the per-location features), and one applied across patches (mixing spatial information).
The idea behind the Mixer architecture is to clearly separate the per-location (channel-mixing) operations from the cross-location (token-mixing) operations. Both operations are implemented with MLPs.
Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.
"A convolution is more complex than the plain matrix multiplication"
How it works
Mixer accepts a sequence of linearly projected image patches (tokens), shaped as a “patches × channels” table, as input and maintains this dimensionality. Mixer makes use of two types of MLP layers: channel-mixing MLPs, which operate on each token independently and mix information across channels, and token-mixing MLPs, which operate on each channel independently and mix information across spatial locations.
These two types of layers are interleaved to enable interaction of both input dimensions.
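Below is a minimal Keras sketch of one Mixer block interleaving the two MLP types; the hidden sizes (token_mlp_dim, channel_mlp_dim) and function names are illustrative assumptions, not the paper's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_dim, output_dim):
    # Two dense layers with a GELU nonlinearity, applied along the last axis.
    x = layers.Dense(hidden_dim, activation="gelu")(x)
    return layers.Dense(output_dim)(x)

def mixer_block(x, num_patches, channels, token_mlp_dim=64, channel_mlp_dim=128):
    # Token-mixing MLP: normalize, transpose to (batch, channels, patches),
    # mix across the patch dimension, transpose back, add a skip connection.
    y = layers.LayerNormalization()(x)
    y = layers.Permute((2, 1))(y)
    y = mlp(y, token_mlp_dim, num_patches)
    y = layers.Permute((2, 1))(y)
    x = layers.Add()([x, y])

    # Channel-mixing MLP: normalize, mix across channels, add a skip connection.
    y = layers.LayerNormalization()(x)
    y = mlp(y, channel_mlp_dim, channels)
    return layers.Add()([x, y])
```

The only real difference between the two MLPs is the transpose: the same kind of dense layers act either along the channel dimension of each patch or along the patch dimension of each channel.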
"Despite its simplicity, Mixer attains competitive results. When pre-trained on large datasets it reaches near state-of-the-art performance, previously claimed by CNNs and Transformers"
"The computational complexity of the network is linear in the number of input patches, unlike ViT whose complexity is quadratic"
"Each layer in Mixer takes an input of the same size. This “isotropic” design is most similar to Transformers"
Unlike ViTs, Mixer does not use position embeddings because the token-mixing MLPs are sensitive to the order of the input tokens.
"As expected, Mixer is invariant to the order of patches and pixels within the patches"
"Mixer uses a standard classification head with the global average pooling layer followed by a linear classifier"
Code
A simple and intuitive Keras implementation is available here; a minimal end-to-end sketch is also shown below.
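For the end-to-end picture, here is a compact, self-contained Keras sketch of a tiny Mixer; all hyperparameters (patch size, width, depth, number of classes) are placeholders chosen for illustration, not the published configurations:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_dim, output_dim):
    # Dense -> GELU -> Dense, applied along the last axis.
    x = layers.Dense(hidden_dim, activation="gelu")(x)
    return layers.Dense(output_dim)(x)

def build_tiny_mixer(image_size=32, patch_size=4, channels=128,
                     num_blocks=4, num_classes=10):
    num_patches = (image_size // patch_size) ** 2

    inputs = layers.Input(shape=(image_size, image_size, 3))
    # Patch embedding: a strided convolution linearly projects each
    # non-overlapping patch, then the grid is flattened to (patches, channels).
    x = layers.Conv2D(channels, kernel_size=patch_size, strides=patch_size)(inputs)
    x = layers.Reshape((num_patches, channels))(x)

    for _ in range(num_blocks):
        # Token-mixing MLP (across patches) with a skip connection.
        y = layers.LayerNormalization()(x)
        y = layers.Permute((2, 1))(y)
        y = mlp(y, 64, num_patches)
        y = layers.Permute((2, 1))(y)
        x = layers.Add()([x, y])
        # Channel-mixing MLP (across channels) with a skip connection.
        y = layers.LayerNormalization()(x)
        y = mlp(y, 256, channels)
        x = layers.Add()([x, y])

    # Classification head: global average pooling + linear classifier.
    x = layers.LayerNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)

model = build_tiny_mixer()
model.summary()
```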
Results
A simple MLP-based model is competitive with today’s best convolutional and attention-based models.
Visualization
It is commonly observed that the first layers of CNNs tend to learn detectors that act on pixels in local regions of the image. In contrast, Mixer allows for global information exchange in the token-mixing MLPs.
The figure shows hidden units of the first three token-mixing MLPs of Mixer trained on JFT-300M.
Recall that the token-mixing MLPs allow global communication between different spatial locations.
Conclusions
This figure shows the hidden units of the four token-mixing MLPs of Mixer trained on CIFAR10.
Regards