MLP is all you need! [with code]

From Google: MLP-Mixer: An all-MLP Architecture for Vision

Main idea:

"While convolutions and attention are both sufficient for good performance, neither of them is necessary!"


"Mixer is a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention."

The MLP-Mixer is an architecture based exclusively on multi-layer perceptrons (MLPs). It contains two types of MLP layers:

  1. One applied independently to image patches, which mixes the per-location features.
  2. The other is applied across patches (for each channel independently), which mixes spatial information.

The idea behind the Mixer architecture is to clearly separate the per-location (channel-mixing) operations from the cross-location (token-mixing) operations. Both operations are implemented with MLPs.


Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.
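
To make that claim concrete, here is a minimal NumPy sketch (not the paper's code) of one token-mixing step and one channel-mixing step on a table of S patch tokens with C channels each; layer normalization is omitted for brevity, and all sizes and weights are illustrative:

```python
import numpy as np

def gelu(x):
    # Scalar nonlinearity (tanh approximation of GELU).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Illustrative sizes: S patches (tokens), C channels, MLP hidden widths Ds and Dc.
S, C, Ds, Dc = 196, 512, 256, 2048
X = np.random.randn(S, C)                    # the "patches x channels" table

# Token mixing: transpose so patches lie along the last axis, then two matmuls.
W1, W2 = np.random.randn(S, Ds), np.random.randn(Ds, S)
X = X + (gelu(X.T @ W1) @ W2).T              # mixes information across patches

# Channel mixing: each row (token) of the table is transformed independently.
W3, W4 = np.random.randn(C, Dc), np.random.randn(Dc, C)
X = X + gelu(X @ W3) @ W4                    # mixes information across channels

print(X.shape)                               # dimensionality is preserved: (196, 512)
```

Everything above is matrix multiplication, a transposition, and an element-wise nonlinearity, plus skip connections.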


"A convolution is more complex than the plain matrix multiplication"


How it works

Mixer accepts as input a sequence of linearly projected image patches (tokens), shaped as a “patches × channels” table, and maintains this dimensionality. Mixer makes use of two types of MLP layers:

  • Channel-mixing MLPs: allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs.
  • Token-mixing MLPs: allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs.

These two types of layers are interleaved to enable interaction of both input dimensions.
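
In Keras, one Mixer block could be sketched roughly as follows. This is only a sketch of the description above, with layer normalization and a skip connection around each MLP as in the paper; names such as `tokens_mlp_dim` and `channels_mlp_dim` are illustrative, not taken from any official implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_dim, out_dim):
    # Two dense layers with a GELU nonlinearity in between.
    x = layers.Dense(hidden_dim, activation="gelu")(x)
    return layers.Dense(out_dim)(x)

def mixer_block(x, num_patches, channels, tokens_mlp_dim, channels_mlp_dim):
    # x has shape (batch, patches, channels), i.e. the "patches x channels" table.
    # Token mixing: transpose so the MLP acts along the patch (column) axis.
    y = layers.LayerNormalization()(x)
    y = layers.Permute((2, 1))(y)              # (batch, channels, patches)
    y = mlp(y, tokens_mlp_dim, num_patches)
    y = layers.Permute((2, 1))(y)              # back to (batch, patches, channels)
    x = layers.Add()([x, y])                   # skip connection

    # Channel mixing: the MLP acts along the channel (row) axis of each token.
    y = layers.LayerNormalization()(x)
    y = mlp(y, channels_mlp_dim, channels)
    return layers.Add()([x, y])                # skip connection
```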


"Despite its simplicity, Mixer attains competitive results. When pre-trained on large datasets it reaches near state-of-the-art performance, previously claimed by CNNs and Transformers"


"The computational complexity of the network is linear in the number of input patches, unlike ViT whose complexity is quadratic"


"Each layer in Mixer takes an input of the same size. This “isotropic” design is most similar to Transformers"


Unlike ViTs, Mixer does not use position embeddings because the token-mixing MLPs are sensitive to the order of the input tokens.

"As expected, Mixer is invariant to the order of patches and pixels within the patches"


"Mixer uses a standard classification head with the global average pooling layer followed by a linear classifier"

Code

A simple and intuitive Keras implementation is available here.
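
To give a feel for how the pieces fit together, here is a minimal sketch of assembling the full model (not the linked implementation), reusing the `mixer_block` function from the sketch above. The strided `Conv2D` implements the per-patch linear projection, and all hyperparameter values are illustrative rather than the paper's configurations:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mixer(image_size=32, patch_size=4, num_blocks=4, channels=128,
                tokens_mlp_dim=64, channels_mlp_dim=512, num_classes=10):
    num_patches = (image_size // patch_size) ** 2
    inputs = layers.Input(shape=(image_size, image_size, 3))

    # Per-patch linear projection: a strided convolution is equivalent to
    # cutting the image into non-overlapping patches and applying a shared
    # Dense layer to each one.
    x = layers.Conv2D(channels, kernel_size=patch_size, strides=patch_size)(inputs)
    x = layers.Reshape((num_patches, channels))(x)   # the "patches x channels" table

    # Interleaved token-mixing / channel-mixing blocks (see mixer_block above).
    for _ in range(num_blocks):
        x = mixer_block(x, num_patches, channels, tokens_mlp_dim, channels_mlp_dim)

    # Standard classification head: global average pooling + linear classifier.
    x = layers.LayerNormalization()(x)               # final layer norm before pooling
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)

model = build_mixer()    # CIFAR10-sized inputs by default in this sketch
model.summary()
```

The paper's actual model scales (e.g. Mixer-B/16) use considerably larger widths and depths.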


Results

A simple MLP-based model is competitive with today’s best convolutional and attention-based models.

Visualization

It is commonly observed that the first layers of CNNs tend to learn detectors that act on pixels in local regions of the image. In contrast, Mixer allows for global information exchange in the token-mixing MLPs.

The figure shows hidden units of the first three token-mixing MLPs of Mixer trained on JFT-300M.

Recall that the token-mixing MLPs allow global communication between different spatial locations.

  • Some of the learned features operate on the entire image, while others operate on smaller regions.
  • Deeper layers appear to have no clearly identifiable structure. Similar to CNNs, we observe many pairs of feature detectors with opposite phases.
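
A hedged sketch of how such pictures can be produced: each hidden unit of the first dense layer in a token-mixing MLP has one incoming weight per patch, so its weight vector can be reshaped onto the patch grid and displayed as a small image. The layer lookup at the bottom is a hypothetical example for the `build_mixer` sketch above, not part of the linked code:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_token_mixing_units(weights, grid_size, num_units=16):
    """Show hidden units of a token-mixing MLP's first dense layer.

    weights: array of shape (num_patches, hidden_dim), one incoming weight
             per patch for each hidden unit.
    grid_size: side length of the patch grid (num_patches == grid_size ** 2).
    """
    fig, axes = plt.subplots(4, num_units // 4, figsize=(8, 8))
    for unit, ax in enumerate(axes.flat):
        # Reshape the unit's incoming weights onto the spatial patch grid.
        ax.imshow(weights[:, unit].reshape(grid_size, grid_size), cmap="coolwarm")
        ax.set_axis_off()
    plt.tight_layout()
    plt.show()

# Smoke test with random weights (a trained model would show structured patterns):
plot_token_mixing_units(np.random.randn(64, 32), grid_size=8)

# Hypothetical lookup for the build_mixer sketch above: the first Dense layer
# inside the first token-mixing MLP has a kernel of shape (num_patches, tokens_mlp_dim).
# w = model.layers[IDX].get_weights()[0]
# plot_token_mixing_units(w, grid_size=8)
```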

Conclusions

  • Very simple architecture for vision.
  • As good as existing state-of-the-art methods in terms of the trade-off between accuracy and the computational resources required for training and inference.
  • We hope that our results spark further research, beyond the realms of established models based on convolutions and self-attention.

Educational code applied to CIFAR10, with visualization, is available.



This figure shows the hidden units of the four token-mixing MLPs of Mixer trained on CIFAR10.

Regards
