Transformers in Computer Vision

Transformers are a type of deep learning model architecture that has revolutionized natural language processing (NLP) and extended its influence to other domains such as computer vision. They were introduced in the 2017 paper "Attention is All You Need" by Vaswani et al.

Here are the key aspects of transformer models:

1. Self-Attention Mechanism:

The core innovation of the transformer model is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence relative to each other. This helps capture the context and dependencies between words more effectively than previous models like RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory networks).
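
To make the idea concrete, below is a minimal sketch of scaled dot-product self-attention in PyTorch. The tensor sizes and random projection matrices are invented purely for illustration; real models learn these projections and add masking, multiple heads, and so on.

import torch
import torch.nn.functional as F

# Toy self-attention over a sequence of 4 token embeddings of size 8.
# Q, K and V normally come from learned linear projections of the input.
x = torch.rand(4, 8)                                 # (sequence_length, embedding_dim)
W_q, W_k, W_v = (torch.rand(8, 8) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / (K.shape[-1] ** 0.5)              # similarity between every pair of tokens
weights = F.softmax(scores, dim=-1)                  # each row sums to 1: how much a token attends to the others
output = weights @ V                                 # weighted mix of the value vectors
print(output.shape)                                  # torch.Size([4, 8])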

2. Encoder-Decoder Structure:

  • Encoder: Processes the input sequence and produces a set of continuous representations.
  • Decoder: Uses the encoder's output and the target sequence (shifted right) to generate the output sequence.
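
For reference, PyTorch ships this encoder-decoder stack as torch.nn.Transformer. Below is a minimal usage sketch with random tensors; the shapes and hyperparameters are just the library defaults, not values tied to any real task.

import torch
import torch.nn as nn

# Encoder-decoder transformer with the default (sequence, batch, embedding) layout.
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6)
src = torch.rand(10, 32, 512)   # source sequence: 10 tokens, batch of 32
tgt = torch.rand(20, 32, 512)   # target sequence (shifted right during training)
out = model(src, tgt)
print(out.shape)                # torch.Size([20, 32, 512])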

3. Multi-Head Attention:

Transformers use multiple attention mechanisms in parallel, known as multi-head attention. This allows the model to focus on different parts of the sentence simultaneously, capturing various aspects of the context.
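
PyTorch also exposes this building block directly as torch.nn.MultiheadAttention. A quick sketch of self-attention with 8 heads, with shapes chosen arbitrarily for illustration:

import torch
import torch.nn as nn

# 8 attention heads over 512-dimensional embeddings, batch-first layout.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.rand(2, 10, 512)                  # (batch, sequence, embedding)
attn_output, attn_weights = mha(x, x, x)    # self-attention: query = key = value = x
print(attn_output.shape)                    # torch.Size([2, 10, 512])
print(attn_weights.shape)                   # torch.Size([2, 10, 10]), averaged over the heads by default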

4. Positional Encoding:

Since transformers do not inherently understand the order of words (unlike RNNs), positional encoding is added to the input embeddings to give the model information about the position of each word in the sequence.
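
Here is a small sketch of the sinusoidal positional encoding from the original paper, assuming a sequence of 10 tokens with 512-dimensional embeddings (ViT, by contrast, learns its position embeddings, but the purpose is the same):

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sin/cos positional encodings as defined in 'Attention is All You Need'."""
    position = torch.arange(seq_len).unsqueeze(1)                # (seq_len, 1)
    div_term = 10000 ** (torch.arange(0, d_model, 2) / d_model)  # one frequency per pair of dimensions
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

embeddings = torch.rand(10, 512)                                    # 10 token embeddings
embeddings = embeddings + sinusoidal_positional_encoding(10, 512)   # inject order information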

5. Feed-Forward Networks:

Each position's representation from the attention mechanism is passed through a feed-forward neural network that is applied identically at every position; within a layer the weights are shared across positions, but each layer has its own parameters.
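
As a sketch, this block is just two linear layers with a non-linearity in between; the dimensions below are the defaults from the original paper (512 model size, 2048 hidden units):

import torch
import torch.nn as nn

# Position-wise feed-forward block, applied to every token independently.
d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                 # the original paper uses ReLU; ViT uses GELU here
    nn.Linear(d_ff, d_model),
)
tokens = torch.rand(10, d_model)
print(ffn(tokens).shape)       # torch.Size([10, 512]): same shape in, same shape out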

Now let's move on to the icing on the cake. Everything above already shows how important transformers are in the context of AI, but the part that really caught my interest is Vision Transformers (ViT). Anyone who works with computer vision knows how time-consuming and expensive training models can be, and ViT brings a new scenario to the field that I believe will deliver a lot of benefits.

Using Hugging Face, I put together a small implementation of this method. I admit I was a little skeptical at first, but I found it very interesting, and I hope to bring you more about it in the coming days. Let's get to the code.

I used Google Colab and an image taken from Google Images (I hope I don't have to pay royalties, haha). I chose Colab because it is simple, everyone has access to it, and it is easier than setting up a virtual environment.

The first step is to install the transformers library in the Colab environment.

!pip install transformers

# Libraries for the project
from transformers import ViTImageProcessor, ViTForImageClassification  # ViT model and its image processor
from PIL import Image                                                   # open and resize the image
from google.colab import files                                          # upload files to the Colab session

To load the test image into the Colab environment, we use the helper below, which uploads a file from our computer to Colab's storage.

files.upload()        

Once this is done, it's time to prepare the image for classification. The model was trained on 224x224 images, so we resize our image to match. (The ViTImageProcessor used below also resizes and normalizes images on its own, but resizing here lets us preview exactly what the model will see.)

img = Image.open('images.jpg')   # file name from the upload step
img = img.resize((224, 224))     # match the model's expected input size
img                              # display the image in the notebook

Feel free to use other images and do your own tests.

# Image processor that matches the checkpoint
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
# Pre-trained ViT model (fine-tuned on ImageNet-1k)
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
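
A side note: the processor itself handles the resizing and normalization, turning an image of any size into the tensor the model expects. A quick check, assuming the processor object created above:

from PIL import Image

# The processor resizes and normalizes any RGB image to the model's 224x224 input.
dummy = Image.new('RGB', (640, 480))                  # arbitrary size, just for this check
batch = processor(images=dummy, return_tensors="pt")
print(batch["pixel_values"].shape)                    # torch.Size([1, 3, 224, 224])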

After instantiating these objects, we are ready to use the model. I will explain each line below, and at the end of the newsletter there are links for anyone who wants to go beyond what they read here.

  • inputs = processor(images=img, return_tensors="pt")

In this line we generate the model input using the instantiated ViTImageProcessor object. It receives two arguments: the image, which it converts into an array of pixel values and normalizes, and return_tensors, which sets the format of the returned batch; passing 'pt' returns PyTorch tensors (torch.Tensor).

  • outputs = model(**inputs)

In this line, the model receives the processed image and returns an ImageClassifierOutput object. Its logits field holds one raw, unnormalized score for each of the 1,000 classes in ImageNet-1k; the loss field is only populated when labels are passed to the model.

  • logits = outputs.logits

Since what matters to us right now is which class the image most likely belongs to, we extract the logits tensor (applying a softmax to it would turn these scores into probabilities).

  • predicted_class_idx = logits.argmax(-1).item()

Returns the index of the highest score, which corresponds to the predicted class.

  • print("Predicted class:", model.config.id2label[predicted_class_idx])

Using the returned index, we look up the human-readable class name in the model's id2label mapping.

inputs = processor(images=img, return_tensors="pt")     # preprocess the image into tensors
outputs = model(**inputs)                               # forward pass through the ViT
logits = outputs.logits                                 # raw scores for the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()          # index of the highest score
print("Predicted class:", model.config.id2label[predicted_class_idx])

Output

Predicted class: Rottweiler
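
If you also want actual probabilities rather than just the winning class, you can apply a softmax to the logits. A small sketch that continues from the code above, assuming logits and model are still in scope:

import torch

probs = torch.softmax(logits, dim=-1)      # turn the raw scores into probabilities
top_probs, top_idxs = probs[0].topk(5)     # five most likely ImageNet classes
for p, i in zip(top_probs, top_idxs):
    print(f"{model.config.id2label[i.item()]}: {p.item():.3f}")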

Before testing with your own images, it is worth knowing which classes make up the ImageNet-1k dataset, since the model can only predict labels from that set.
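
You can peek at those classes straight from the model configuration; a short sketch, again assuming the model object from the code above:

# id2label maps each of the 1,000 class indices to a human-readable label.
print(len(model.config.id2label))          # 1000
for idx in list(model.config.id2label)[:5]:
    print(idx, model.config.id2label[idx])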

References

https://huggingface.co/docs/transformers/model_doc/vit

https://huggingface.co/models?pipeline_tag=image-classification&sort=downloads

https://viso.ai/deep-learning/vision-transformer-vit/

Complete code

!pip install transformers

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
from google.colab import files

files.upload()

img = Image.open('images.jpg')
img = img.resize((224, 224))
img

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])        
