What is the Vision Transformer?

I find the Vision Transformer to be quite an interesting model! The self-attention mechanism and the Transformer architecture were designed to fix some of the flaws of earlier natural language processing models. With the Vision Transformer, researchers at Google realized they could feed that same architecture images instead of text and use it as a computer vision model. The resulting model achieved state-of-the-art results on image classification and other computer vision tasks! Let me show you how it works!


The Vision Transformer was developed at Google and is a good example of how easily the Transformer architecture can be adapted to other data types.

The idea is to break the input image into small patches and transform each patch into an input vector for the Transformer model. A simple linear transformation is enough to obtain vectors in the right format.

If we implement that linear transformation as a convolutional layer, we can process the whole image in one shot: a convolution whose kernel size and stride both equal the patch size slides over the image without overlap, projecting each patch into a vector of the right dimension.
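Here is a minimal sketch of that patch-embedding step in PyTorch. The 16×16 patch size and 768-dimensional embedding are the standard ViT-Base values, chosen here for illustration; any values work as long as the image size is divisible by the patch size.

```python
import torch
import torch.nn as nn

patch_size = 16
embed_dim = 768

# A convolution with kernel_size == stride == patch_size visits each patch
# exactly once, so every output position is one patch projected linearly
# into embed_dim channels.
patch_embed = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                        kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patches = patch_embed(image)                 # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): one vector per patch
```

The flatten-and-transpose at the end turns the 14×14 grid of patch vectors into a sequence of 196 tokens, which is exactly the input format a Transformer expects.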

Once we have transformed the image into vectors, the process is very similar to a typical encoder-only Transformer. We add a position embedding to the vectors coming from the convolutional layer to capture the position of each patch, and it is typical to prepend an additional learnable vector, the classification token, if we want to perform a classification task.
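These two additions can be sketched as follows, continuing with the illustrative sizes from above (196 patch tokens of dimension 768). Both the classification token and the position embeddings are learnable parameters.

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768

# Learnable [CLS] token and position embeddings, one per token position
# (the +1 accounts for the prepended classification token).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

tokens = torch.randn(2, num_patches, embed_dim)  # a batch of 2 images
cls = cls_token.expand(2, -1, -1)                # one [CLS] copy per image
x = torch.cat([cls, tokens], dim=1) + pos_embed  # (2, 197, 768)
```

After this step every image is a sequence of 197 vectors, ready for the encoder.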

After that, we can feed those resulting vectors to the first encoder block.

We stack as many encoder blocks as we need, and the last block produces the encoder output.

A linear layer serves as the prediction head, projecting from the hidden-state dimension to the prediction dimension (the number of classes, for classification). In the case of classification, we only use the first output vector, the one corresponding to the classification token we prepended at the beginning of the model.
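The encoder stack and the prediction head can be sketched with PyTorch's built-in encoder block. Note this is an approximation: `nn.TransformerEncoderLayer` is a standard attention + MLP block, and a real ViT differs in details such as pre-norm layer placement; depth is set to 2 here just to keep the sketch fast (ViT-Base uses 12).

```python
import torch
import torch.nn as nn

embed_dim, num_classes, depth = 768, 1000, 2

# Stack of standard Transformer encoder blocks.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12,
                               dim_feedforward=3072, batch_first=True),
    num_layers=depth)

# Prediction head: hidden-state dimension -> number of classes.
head = nn.Linear(embed_dim, num_classes)

x = torch.randn(2, 197, embed_dim)  # [CLS] + 196 patch tokens, batch of 2
hidden = encoder(x)                 # same shape as the input: (2, 197, 768)
logits = head(hidden[:, 0])         # classify from the [CLS] token only
```

The `hidden[:, 0]` indexing is the point of the classification token: the whole image's prediction is read out from that single first vector.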

Watch the video for more information!
