Patches Are All You Need! [with code]

"It is only a matter of time before Transformers become the dominant architecture for vision domains, just as they have for language processing"

Introduction

The paper introduces ConvMixer, an extremely simple model that is similar in many respects to ViT and the even-more-basic MLP-Mixer:

  1. ConvMixer operates directly on patches as input and separates the mixing of the spatial and channel dimensions.
  2. ConvMixer maintains equal size and resolution throughout the network (isotropic).

Unlike the Vision Transformer and MLP-Mixer, ConvMixer uses only standard convolutions to achieve the mixing steps.

Despite its simplicity, ConvMixer outperforms ViT, MLP-Mixer, and some of their variants for similar parameter counts and dataset sizes, while also outperforming classical vision models such as ResNet.


Transformers and images

Because the computational cost of the self-attention layers used in Transformers would scale quadratically with the number of pixels per image if applied naively at the per-pixel level, the compromise was to first split the image into multiple “patches”, linearly embed them, and then apply the transformer directly to this collection of patches.
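
To make that scaling concrete, here is a back-of-the-envelope comparison (the 224×224 resolution and 16×16 patch size are the common ViT defaults, used here for illustration rather than taken from the article):

```python
# Self-attention cost grows with the square of the sequence length.
pixels = 224 * 224            # 50,176 tokens when attending per pixel
patches = (224 // 16) ** 2    # 196 tokens with 16x16 patches

print(f"{pixels ** 2:,}")   # 2,517,630,976 pairwise attention scores per layer
print(f"{patches ** 2:,}")  # 38,416 -- about 65,000x fewer
```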

Q: Is the performance of ViTs due to the inherently more powerful Transformer architecture, or is it at least partly due to using patches as the input representation?


The patch representation itself may be the most critical component to the “superior” performance of newer architectures like Vision Transformers!


ConvMixer


ConvMixer consists of a patch embedding layer followed by repeated applications of a simple fully-convolutional block, with the spatial structure of the patch embeddings maintained throughout.

Nice idea: patch embedding with patch size p and embedding dimension h can be implemented as a convolution with c_in input channels, h output channels, kernel size p, and stride p.
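
A minimal PyTorch sketch of this idea (the specific values of h and p below are examples, not a claim about the paper's exact configuration):

```python
import torch
import torch.nn as nn

c_in, h, p = 3, 768, 7  # input channels, embedding dim, patch size (example values)

# One strided convolution implements the patch embedding: each non-overlapping
# p x p patch is mapped to an h-dimensional vector, and the output keeps a
# 2D spatial layout (unlike ViT's flattened 1D token sequence).
patch_embed = nn.Conv2d(c_in, h, kernel_size=p, stride=p)

x = torch.randn(1, c_in, 224, 224)
print(patch_embed(x).shape)  # torch.Size([1, 768, 32, 32])
```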

  • The ConvMixer block consists of a depthwise convolution (to mix spatial locations) followed by a pointwise convolution (to mix channels); see the sketch after this list.
  • Convolutions with an unusually large kernel size are used to mix distant spatial locations.
  • After several applications of this block, global pooling produces a feature vector of size h, which is passed to a softmax classifier.
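
A minimal sketch of one such block in PyTorch, following the structure described above (residual connection around the depthwise convolution, with a GELU activation and BatchNorm after each convolution):

```python
import torch.nn as nn

class Residual(nn.Module):
    """Skip connection around the depthwise (spatial-mixing) convolution."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer_block(h, k):
    # groups=h makes the first convolution depthwise: each of the h channels
    # is convolved independently, so only spatial locations are mixed.
    # The 1x1 (pointwise) convolution then mixes information across channels.
    return nn.Sequential(
        Residual(nn.Sequential(
            nn.Conv2d(h, h, kernel_size=k, groups=h, padding=k // 2),  # "same" padding for odd k
            nn.GELU(),
            nn.BatchNorm2d(h),
        )),
        nn.Conv2d(h, h, kernel_size=1),
        nn.GELU(),
        nn.BatchNorm2d(h),
    )
```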

While self-attention and MLPs are theoretically more flexible, allowing for large receptive fields and content-aware behavior, the inductive bias of convolution is well-suited to vision tasks and leads to high data efficiency.


Experiments

Models are named ConvMixer-h/d, where h is the width (hidden dimension) and d is the depth (the number of repetitions of the ConvMixer block).

  • CIFAR-10: ConvMixers achieve over 96% accuracy with as few as 0.7M parameters, demonstrating their data efficiency.
  • ImageNet-1k classification without any pretraining or additional data: ConvMixer-1536/20, with 52M parameters, achieves 81.4% top-1 accuracy, and ConvMixer-768/32, with 21M parameters, achieves 80.2%.
  • Wider ConvMixers seem to converge in fewer epochs but are memory- and compute-hungry. They also work best with large kernel sizes.

ConvMixers achieve competitive accuracies for a given parameter budget:

  • ConvMixer-1536/20 outperforms both ResNet-152 and ResMLP-B24 despite having substantially fewer parameters.
  • ConvMixer-1536/20 is competitive with DeiT-B, and ConvMixer-768/32 uses just a third of the parameters of ResNet-152 while being similarly accurate.


ConvMixers are substantially slower at inference than these competitors, likely due to their smaller patch size; hyperparameter tuning and low-level optimizations could narrow this gap.


Deeper networks take longer to converge while wider networks converge faster.


Increasing either the width or the depth is an effective way to increase accuracy, but doubling the depth incurs less compute than doubling the width: depth scales the cost roughly linearly, while widening grows the cost of each pointwise convolution roughly quadratically (both its input and output channels double).


Conclusions

  • ConvMixers are an extremely simple class of models that independently mix the spatial and channel locations of patch embeddings using only standard convolutions.
  • ConvMixers outperform the Vision Transformer and MLP-Mixer, and are competitive with ResNets, DeiTs, and ResMLPs.

The paper's title, while an exaggeration, points out that attention isn't the only useful export from language processing to computer vision: tokenizing inputs, i.e., using patch embeddings, is also a powerful and important takeaway.

  • A deeper ConvMixer with larger patches could reach a desirable tradeoff between accuracy, parameters, and throughput after longer training and more regularization and hyperparameter tuning.
  • Low-level optimization of large-kernel depthwise convolution could substantially increase throughput.
  • Similarly, small enhancements to the architecture, like the addition of bottlenecks or a more expressive classifier, could trade simplicity for performance.

Code

An implementation of the ConvMixer model in exactly 280 characters!

A more readable implementation is available here.
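
For reference, here is a readable PyTorch sketch of the full model, assembling the patch embedding and block structure shown earlier (the Residual helper is repeated for self-containment, and the default kernel size, patch size, and class count are illustrative):

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(h, d, kernel_size=9, patch_size=7, n_classes=1000):
    """ConvMixer-h/d: patch embedding, d mixer blocks, global pooling, classifier."""
    return nn.Sequential(
        # Patch embedding: one h-dim vector per patch_size x patch_size patch
        nn.Conv2d(3, h, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(h),
        # d repetitions of the ConvMixer block
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(h, h, kernel_size, groups=h, padding=kernel_size // 2),
                nn.GELU(),
                nn.BatchNorm2d(h),
            )),
            nn.Conv2d(h, h, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(h),
        ) for _ in range(d)],
        # Global average pooling to a length-h feature vector, then a linear classifier
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(h, n_classes),
    )

model = ConvMixer(h=1536, d=20)
print(sum(p.numel() for p in model.parameters()))  # roughly 52M, consistent with the numbers above
```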

Regards
