Patches Are All You Need! [with code]
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
"It is only a matter of time before Transformers become the dominant architecture for vision domains, just as they have for language processing"
Introduction
ConvMixer is an extremely simple model that is similar in many respects to the ViT and the even-more-basic MLP-Mixer:
Unlike the Vision Transformer and MLP-Mixer, ConvMixer uses only standard convolutions to achieve the mixing steps.
Despite its simplicity, ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.
Transformers and images
Because the computational cost of the self-attention layers used in Transformers would scale quadratically with the number of pixels per image if applied naively at the per-pixel level, the compromise was to first split the image into multiple “patches”, linearly embed them, and then apply the transformer directly to this collection of patches.
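To make that scaling concrete, here is a quick back-of-the-envelope check. The numbers below (224×224 images, 16×16 patches) are the typical ViT setting, used purely for illustration:

```python
# Rough cost comparison: self-attention scales with the square of the sequence length.
image_size = 224          # typical ImageNet resolution (illustrative)
patch_size = 16           # typical ViT patch size (illustrative)

pixels = image_size ** 2                       # 50,176 tokens if we attended per pixel
patches = (image_size // patch_size) ** 2      # 196 tokens with 16x16 patches

print(f"per-pixel attention pairs: {pixels ** 2:,}")    # ~2.5 billion
print(f"per-patch attention pairs: {patches ** 2:,}")   # ~38 thousand
```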
Q: Is the performance of ViTs due to the inherently more powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
The patch representation itself may be the most critical component to the “superior” performance of newer architectures like Vision Transformers!
ConvMixer
ConvMixer consists of a patch embedding layer followed by repeated applications of a simple fully-convolutional block, in which the spatial structure of the patch embeddings is maintained.
Nice idea: patch embeddings with patch size p and embedding dimension h can be implemented as a convolution with c_in input channels, h output channels, kernel size p, and stride p.
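A minimal sketch of this idea in PyTorch (the values of c_in, h, and p below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

c_in, h, p = 3, 512, 7   # illustrative values: RGB input, width 512, patch size 7

# Patch embedding: a convolution with kernel size p and stride p splits the image
# into non-overlapping p x p patches and projects each one to h channels.
patch_embed = nn.Conv2d(c_in, h, kernel_size=p, stride=p)

x = torch.randn(1, c_in, 224, 224)
print(patch_embed(x).shape)   # torch.Size([1, 512, 32, 32]) -> a 32x32 grid of patch embeddings
```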
While self-attention and MLPs are theoretically more flexible, allowing for large receptive fields and content-aware behavior, the inductive bias of convolution is well-suited to vision tasks and leads to high data efficiency.
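Putting the pieces together, here is a hedged sketch of the ConvMixer block as described above: a depthwise convolution for spatial mixing wrapped in a residual connection, followed by a pointwise 1x1 convolution for channel mixing, each followed by GELU and BatchNorm. The kernel size is illustrative.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Adds a skip connection around any module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def convmixer_block(h, kernel_size=9):
    # Spatial mixing: a depthwise convolution (one filter per channel) wrapped in a
    # residual connection, then channel mixing: a pointwise 1x1 convolution.
    # Each convolution is followed by GELU and BatchNorm.
    return nn.Sequential(
        Residual(nn.Sequential(
            nn.Conv2d(h, h, kernel_size, groups=h, padding=kernel_size // 2),
            nn.GELU(),
            nn.BatchNorm2d(h),
        )),
        nn.Conv2d(h, h, kernel_size=1),
        nn.GELU(),
        nn.BatchNorm2d(h),
    )
```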
Experiments
Models are denoted ConvMixer-h/d, where h is the width (hidden dimension) and d is the depth (the number of repetitions of the ConvMixer layer).
ConvMixers achieve competitive accuracies for a given parameter budget.
ConvMixers are substantially slower at inference than the competitors, likely due to their smaller patch size; hyperparameter tuning and further optimizations could narrow this gap.
Deeper networks take longer to converge, while wider networks converge faster.
Increasing the width or the depth is an effective way to increase accuracy; doubling the depth incurs less compute than doubling the width.
Conclusions
Our title, while an exaggeration, points out that attention isn't the only export from language processing into computer vision: tokenizing inputs, i.e., using patch embeddings, is also a powerful and important takeaway.
Code
An implementation of the ConvMixer model in exactly 280 characters!
A more readable implementation is available here.
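For reference, here is a readable PyTorch sketch of the full model, reusing the Residual and convmixer_block helpers sketched earlier; the hyperparameters are illustrative and this is not the authors' exact code or their 280-character version:

```python
import torch
import torch.nn as nn

def ConvMixer(h, d, kernel_size=9, patch_size=7, n_classes=1000):
    # Patch embedding (strided conv + GELU + BatchNorm), d ConvMixer blocks
    # (see convmixer_block above), global average pooling, and a linear head.
    return nn.Sequential(
        nn.Conv2d(3, h, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(h),
        *[convmixer_block(h, kernel_size) for _ in range(d)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(h, n_classes),
    )

model = ConvMixer(h=256, d=8)                # small illustrative configuration
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)                          # torch.Size([1, 1000])
```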
Regards