What is the Vision Transformer?

I find the Vision Transformer to be quite an interesting model! The self-attention mechanism and the Transformer architecture were designed to fix some of the flaws of earlier natural language processing models. With the Vision Transformer, researchers at Google realized they could feed that same architecture images instead of text and use it as a computer vision model. The resulting model achieved state-of-the-art results on image classification and other computer vision tasks! Let me show you how it works!


The Vision Transformer was developed at Google and is a good example of how easily the Transformer architecture can be adapted to other data types.

The idea is to break the input image into small patches and transform each patch into an input vector for the Transformer model. A simple linear transformation is enough to obtain vectors in the right format.

If we implement that linear transformation as a convolutional layer, we can process the whole image in one shot: a convolution whose kernel size and stride both equal the patch size slides over the image without overlap, projecting each patch into a vector of the right dimension.
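Here is a minimal sketch of that patch-embedding step in PyTorch. The 16×16 patch size and 768-dimensional embedding are the standard ViT-Base values, chosen here for illustration; any values work as long as the image size is divisible by the patch size.

```python
import torch
import torch.nn as nn

patch_size = 16
embed_dim = 768

# A convolution with kernel_size == stride == patch_size visits each patch
# exactly once, so every output position is one patch projected linearly
# into embed_dim channels.
patch_embed = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                        kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patches = patch_embed(image)                 # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): one vector per patch
```

The flatten-and-transpose at the end turns the 14×14 grid of patch vectors into a sequence of 196 tokens, which is exactly the input format a Transformer expects.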

Once we have transformed the image into vectors, the process is very similar to a typical encoder-only Transformer. We add a position embedding to the vectors coming from the convolutional layer to capture the position of each patch, and it is typical to prepend an additional learnable vector, the classification token, if we want to perform a classification task.
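These two additions can be sketched as follows, continuing with the illustrative sizes from above (196 patch tokens of dimension 768). Both the classification token and the position embeddings are learnable parameters.

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768

# Learnable [CLS] token and position embeddings, one per token position
# (the +1 accounts for the prepended classification token).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

tokens = torch.randn(2, num_patches, embed_dim)  # a batch of 2 images
cls = cls_token.expand(2, -1, -1)                # one [CLS] copy per image
x = torch.cat([cls, tokens], dim=1) + pos_embed  # (2, 197, 768)
```

After this step every image is a sequence of 197 vectors, ready for the encoder.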

After that, we can feed those resulting vectors to the first encoder block.

We stack as many encoder blocks as we need, and the last block produces the encoder output.

A linear layer serves as the prediction head, projecting from the hidden-state dimension to the prediction dimension (the number of classes, for classification). In the case of classification, we only use the first output vector, the one corresponding to the classification token we prepended at the beginning of the model.
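The encoder stack and the prediction head can be sketched with PyTorch's built-in encoder block. Note this is an approximation: `nn.TransformerEncoderLayer` is a standard attention + MLP block, and a real ViT differs in details such as pre-norm layer placement; depth is set to 2 here just to keep the sketch fast (ViT-Base uses 12).

```python
import torch
import torch.nn as nn

embed_dim, num_classes, depth = 768, 1000, 2

# Stack of standard Transformer encoder blocks.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12,
                               dim_feedforward=3072, batch_first=True),
    num_layers=depth)

# Prediction head: hidden-state dimension -> number of classes.
head = nn.Linear(embed_dim, num_classes)

x = torch.randn(2, 197, embed_dim)  # [CLS] + 196 patch tokens, batch of 2
hidden = encoder(x)                 # same shape as the input: (2, 197, 768)
logits = head(hidden[:, 0])         # classify from the [CLS] token only
```

The `hidden[:, 0]` indexing is the point of the classification token: the whole image's prediction is read out from that single first vector.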

Watch the video for more information!
