Patches Are All You Need! [with code]

"It is only a matter of time before Transformers become the dominant architecture for vision domains, just as they have for language processing"

Introduction

The paper introduces ConvMixer, an extremely simple model that is similar in many respects to ViT and the even-more-basic MLP-Mixer:

  1. ConvMixer operates directly on patches as input and separates the mixing of the spatial and channel dimensions.
  2. ConvMixer maintains equal size and resolution throughout the network (isotropic).

Unlike the Vision Transformer and MLP-Mixer, ConvMixer uses only standard convolutions to achieve the mixing steps.

Despite its simplicity, ConvMixer outperforms ViT, MLP-Mixer, and some of their variants for similar parameter counts and dataset sizes, while also outperforming classical vision models such as ResNet.


Transformers and images

Because the computational cost of the self-attention layers used in Transformers would scale quadratically with the number of pixels per image if applied naively at the per-pixel level, the compromise was to first split the image into multiple “patches”, linearly embed them, and then apply the transformer directly to this collection of patches.
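
To make that scaling concrete, here is a back-of-the-envelope comparison (the 224×224 resolution and 16×16 patch size are the common ViT defaults, used here for illustration rather than taken from the article):

```python
# Self-attention cost grows with the square of the sequence length.
pixels = 224 * 224            # 50,176 tokens when attending per pixel
patches = (224 // 16) ** 2    # 196 tokens with 16x16 patches

print(f"{pixels ** 2:,}")   # 2,517,630,976 pairwise attention scores per layer
print(f"{patches ** 2:,}")  # 38,416 -- about 65,000x fewer
```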

Q: Is the performance of ViTs due to the inherently more powerful Transformer architecture, or is it at least partly due to using patches as the input representation?


The patch representation itself may be the most critical component to the “superior” performance of newer architectures like Vision Transformers!


ConvMixer


ConvMixer consists of a patch embedding layer followed by repeated applications of a simple fully-convolutional block, with the spatial structure of the patch embeddings maintained throughout.

Nice idea: patch embedding with patch size p and embedding dimension h can be implemented as a convolution with c_in input channels, h output channels, kernel size p, and stride p.
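
A minimal PyTorch sketch of this idea (the specific values of h and p below are examples, not a claim about the paper's exact configuration):

```python
import torch
import torch.nn as nn

c_in, h, p = 3, 768, 7  # input channels, embedding dim, patch size (example values)

# One strided convolution implements the patch embedding: each non-overlapping
# p x p patch is mapped to an h-dimensional vector, and the output keeps a
# 2D spatial layout (unlike ViT's flattened 1D token sequence).
patch_embed = nn.Conv2d(c_in, h, kernel_size=p, stride=p)

x = torch.randn(1, c_in, 224, 224)
print(patch_embed(x).shape)  # torch.Size([1, 768, 32, 32])
```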

  • The ConvMixer block consists of a depthwise convolution (to mix spatial locations) followed by a pointwise convolution (to mix channels); see the sketch after this list.
  • Convolutions with an unusually large kernel size are used to mix distant spatial locations.
  • After several applications of this block, global pooling produces a feature vector of size h, which is passed to a softmax classifier.
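
A minimal sketch of one such block in PyTorch, following the structure described above (residual connection around the depthwise convolution, with a GELU activation and BatchNorm after each convolution):

```python
import torch.nn as nn

class Residual(nn.Module):
    """Skip connection around the depthwise (spatial-mixing) convolution."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer_block(h, k):
    # groups=h makes the first convolution depthwise: each of the h channels
    # is convolved independently, so only spatial locations are mixed.
    # The 1x1 (pointwise) convolution then mixes information across channels.
    return nn.Sequential(
        Residual(nn.Sequential(
            nn.Conv2d(h, h, kernel_size=k, groups=h, padding=k // 2),  # "same" padding for odd k
            nn.GELU(),
            nn.BatchNorm2d(h),
        )),
        nn.Conv2d(h, h, kernel_size=1),
        nn.GELU(),
        nn.BatchNorm2d(h),
    )
```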

While self-attention and MLPs are theoretically more flexible, allowing for large receptive fields and content-aware behavior, the inductive bias of convolution is well-suited to vision tasks and leads to high data efficiency.


Experiments

Models are named ConvMixer-h/d, where h is the width (hidden dimension) and d is the depth (the number of repetitions of the ConvMixer block).

  • CIFAR-10: ConvMixers achieve over 96% accuracy with as few as 0.7M parameters, demonstrating their data efficiency.
  • ImageNet-1k classification without any pretraining or additional data: ConvMixer-1536/20, with 52M parameters, achieves 81.4% top-1 accuracy, and ConvMixer-768/32, with 21M parameters, achieves 80.2%.
  • Wider ConvMixers seem to converge in fewer epochs but are memory- and compute-hungry. They also work best with large kernel sizes.

ConvMixers achieve competitive accuracies for a given parameter budget:

  • ConvMixer-1536/20 outperforms both ResNet-152 and ResMLP-B24 despite having substantially fewer parameters.
  • ConvMixer-1536/20 is competitive with DeiT-B, and ConvMixer-768/32 uses just a third of the parameters of ResNet-152 while being similarly accurate.


ConvMixers are substantially slower at inference than these competitors, likely due to their smaller patch size; hyperparameter tuning and low-level optimizations could narrow this gap.


Deeper networks take longer to converge while wider networks converge faster.


Increasing either the width or the depth is an effective way to increase accuracy, but doubling the depth incurs less compute than doubling the width: depth scales the cost roughly linearly, while widening grows the cost of each pointwise convolution roughly quadratically (both its input and output channels double).


Conclusions

  • ConvMixers are an extremely simple class of models that independently mix the spatial and channel locations of patch embeddings using only standard convolutions.
  • ConvMixers outperform the Vision Transformer and MLP-Mixer, and are competitive with ResNets, DeiTs, and ResMLPs.

The paper's title, while an exaggeration, points out that attention isn't the only useful export from language processing to computer vision: tokenizing inputs, i.e., using patch embeddings, is also a powerful and important takeaway.

  • A deeper ConvMixer with larger patches could reach a desirable tradeoff between accuracy, parameters, and throughput after longer training and more regularization and hyperparameter tuning.
  • Low-level optimization of large-kernel depthwise convolution could substantially increase throughput.
  • Similarly, small enhancements to the architecture, like the addition of bottlenecks or a more expressive classifier, could trade simplicity for performance.

Code

An implementation of the ConvMixer model in exactly 280 characters!

A more readable implementation is available here.
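
For reference, here is a readable PyTorch sketch of the full model, assembling the patch embedding and block structure shown earlier (the Residual helper is repeated for self-containment, and the default kernel size, patch size, and class count are illustrative):

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(h, d, kernel_size=9, patch_size=7, n_classes=1000):
    """ConvMixer-h/d: patch embedding, d mixer blocks, global pooling, classifier."""
    return nn.Sequential(
        # Patch embedding: one h-dim vector per patch_size x patch_size patch
        nn.Conv2d(3, h, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(h),
        # d repetitions of the ConvMixer block
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(h, h, kernel_size, groups=h, padding=kernel_size // 2),
                nn.GELU(),
                nn.BatchNorm2d(h),
            )),
            nn.Conv2d(h, h, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(h),
        ) for _ in range(d)],
        # Global average pooling to a length-h feature vector, then a linear classifier
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(h, n_classes),
    )

model = ConvMixer(h=1536, d=20)
print(sum(p.numel() for p in model.parameters()))  # roughly 52M, consistent with the numbers above
```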

Regards
