Transformers in Computer Vision

Transformers are a type of deep learning model architecture that has revolutionized natural language processing (NLP) and extended its influence to other domains such as computer vision. They were introduced in the 2017 paper "Attention is All You Need" by Vaswani et al.

Here are the key aspects of transformer models:

1. Self-Attention Mechanism:

The core innovation of the transformer model is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence relative to each other. This helps capture the context and dependencies between words more effectively than previous models like RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory networks).
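
To make the idea concrete, below is a minimal sketch of scaled dot-product self-attention in PyTorch. The tensor sizes and random projection matrices are invented purely for illustration; real models learn these projections and add masking, multiple heads, and so on.

import torch
import torch.nn.functional as F

# Toy self-attention over a sequence of 4 token embeddings of size 8.
# Q, K and V normally come from learned linear projections of the input.
x = torch.rand(4, 8)                                 # (sequence_length, embedding_dim)
W_q, W_k, W_v = (torch.rand(8, 8) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / (K.shape[-1] ** 0.5)              # similarity between every pair of tokens
weights = F.softmax(scores, dim=-1)                  # each row sums to 1: how much a token attends to the others
output = weights @ V                                 # weighted mix of the value vectors
print(output.shape)                                  # torch.Size([4, 8])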

2. Encoder-Decoder Structure:

  • Encoder: Processes the input sequence and produces a set of continuous representations.
  • Decoder: Uses the encoder's output and the target sequence (shifted right) to generate the output sequence.
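
For reference, PyTorch ships this encoder-decoder stack as torch.nn.Transformer. Below is a minimal usage sketch with random tensors; the shapes and hyperparameters are just the library defaults, not values tied to any real task.

import torch
import torch.nn as nn

# Encoder-decoder transformer with the default (sequence, batch, embedding) layout.
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6)
src = torch.rand(10, 32, 512)   # source sequence: 10 tokens, batch of 32
tgt = torch.rand(20, 32, 512)   # target sequence (shifted right during training)
out = model(src, tgt)
print(out.shape)                # torch.Size([20, 32, 512])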

3. Multi-Head Attention:

Transformers use multiple attention mechanisms in parallel, known as multi-head attention. This allows the model to focus on different parts of the sentence simultaneously, capturing various aspects of the context.
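
PyTorch also exposes this building block directly as torch.nn.MultiheadAttention. A quick sketch of self-attention with 8 heads, with shapes chosen arbitrarily for illustration:

import torch
import torch.nn as nn

# 8 attention heads over 512-dimensional embeddings, batch-first layout.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.rand(2, 10, 512)                  # (batch, sequence, embedding)
attn_output, attn_weights = mha(x, x, x)    # self-attention: query = key = value = x
print(attn_output.shape)                    # torch.Size([2, 10, 512])
print(attn_weights.shape)                   # torch.Size([2, 10, 10]), averaged over the heads by default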

4. Positional Encoding:

Since transformers do not inherently understand the order of words (unlike RNNs), positional encoding is added to the input embeddings to give the model information about the position of each word in the sequence.
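
Here is a small sketch of the sinusoidal positional encoding from the original paper, assuming a sequence of 10 tokens with 512-dimensional embeddings (ViT, by contrast, learns its position embeddings, but the purpose is the same):

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sin/cos positional encodings as defined in 'Attention is All You Need'."""
    position = torch.arange(seq_len).unsqueeze(1)                # (seq_len, 1)
    div_term = 10000 ** (torch.arange(0, d_model, 2) / d_model)  # one frequency per pair of dimensions
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

embeddings = torch.rand(10, 512)                                    # 10 token embeddings
embeddings = embeddings + sinusoidal_positional_encoding(10, 512)   # inject order information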

5. Feed-Forward Networks:

Each position's representation from the attention mechanism is passed through a feed-forward neural network that is applied identically at every position; within a layer the weights are shared across positions, but each layer has its own parameters.
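
As a sketch, this block is just two linear layers with a non-linearity in between; the dimensions below are the defaults from the original paper (512 model size, 2048 hidden units):

import torch
import torch.nn as nn

# Position-wise feed-forward block, applied to every token independently.
d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                 # the original paper uses ReLU; ViT uses GELU here
    nn.Linear(d_ff, d_model),
)
tokens = torch.rand(10, d_model)
print(ffn(tokens).shape)       # torch.Size([10, 512]): same shape in, same shape out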

Now let's move on to the icing on the cake. Everything above already shows how important transformers are in the context of AI, but the part that really caught my interest is Vision Transformers (ViT). Anyone who works with computer vision knows how time-consuming and expensive training models can be, and ViT brings a new scenario to the field that I believe will deliver a lot of benefits.

Using Hugging Face, I put together a small implementation of this method. I admit I was a little skeptical at first, but I found it very interesting, and I hope to bring you more about it in the coming days. Let's get to the code.

I used Google Colab and an image taken from Google Images (I hope I don't have to pay royalties, haha). I chose Colab because it is simple, everyone has access to it, and it is easier than setting up a virtual environment.

The first step is to install the transformers library in the Colab environment.

!pip install transformers

# Libraries for the project
from transformers import ViTImageProcessor, ViTForImageClassification  # ViT model and its image processor
from PIL import Image                                                   # open and resize the image
from google.colab import files                                          # upload files to the Colab session

To load the test image into the Colab environment, we use the helper below, which uploads a file from our computer to Colab's storage.

files.upload()        

Once this is done, it's time to prepare the image for classification. The model was trained on 224x224 images, so we resize our image to match. (The ViTImageProcessor used below also resizes and normalizes images on its own, but resizing here lets us preview exactly what the model will see.)

img = Image.open('images.jpg')   # file name from the upload step
img = img.resize((224, 224))     # match the model's expected input size
img                              # display the image in the notebook

Feel free to use other images and do your own tests.

# Image processor that matches the checkpoint
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
# Pre-trained ViT model (fine-tuned on ImageNet-1k)
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
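
A side note: the processor itself handles the resizing and normalization, turning an image of any size into the tensor the model expects. A quick check, assuming the processor object created above:

from PIL import Image

# The processor resizes and normalizes any RGB image to the model's 224x224 input.
dummy = Image.new('RGB', (640, 480))                  # arbitrary size, just for this check
batch = processor(images=dummy, return_tensors="pt")
print(batch["pixel_values"].shape)                    # torch.Size([1, 3, 224, 224])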

After instantiating these objects, we are ready to use the model. I will explain each line below, and at the end of the newsletter there are links for anyone who wants to go beyond what they read here.

  • inputs = processor(images=img, return_tensors="pt")

In this line we generate the model input using the instantiated ViTImageProcessor object. It receives two arguments: the image, which it converts into an array of pixel values and normalizes, and return_tensors, which sets the format of the returned batch; passing 'pt' returns PyTorch tensors (torch.Tensor).

  • outputs = model(**inputs)

In this line, the model receives the processed image and returns an ImageClassifierOutput object. Its logits field holds one raw, unnormalized score for each of the 1,000 classes in ImageNet-1k; the loss field is only populated when labels are passed to the model.

  • logits = outputs.logits

Since what matters to us right now is which class the image most likely belongs to, we extract the logits tensor (applying a softmax to it would turn these scores into probabilities).

  • predicted_class_idx = logits.argmax(-1).item()

Returns the index of the highest score, which corresponds to the predicted class.

  • print("Predicted class:", model.config.id2label[predicted_class_idx])

Using the returned index, we look up the human-readable class name in the model's id2label mapping.

inputs = processor(images=img, return_tensors="pt")     # preprocess the image into tensors
outputs = model(**inputs)                               # forward pass through the ViT
logits = outputs.logits                                 # raw scores for the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()          # index of the highest score
print("Predicted class:", model.config.id2label[predicted_class_idx])

Output

Predicted class: Rottweiler
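
If you also want actual probabilities rather than just the winning class, you can apply a softmax to the logits. A small sketch that continues from the code above, assuming logits and model are still in scope:

import torch

probs = torch.softmax(logits, dim=-1)      # turn the raw scores into probabilities
top_probs, top_idxs = probs[0].topk(5)     # five most likely ImageNet classes
for p, i in zip(top_probs, top_idxs):
    print(f"{model.config.id2label[i.item()]}: {p.item():.3f}")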

Before testing with your own images, it is worth knowing which classes make up the ImageNet-1k dataset, since the model can only predict labels from that set.
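
You can peek at those classes straight from the model configuration; a short sketch, again assuming the model object from the code above:

# id2label maps each of the 1,000 class indices to a human-readable label.
print(len(model.config.id2label))          # 1000
for idx in list(model.config.id2label)[:5]:
    print(idx, model.config.id2label[idx])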

References

https://huggingface.co/docs/transformers/model_doc/vit

https://huggingface.co/models?pipeline_tag=image-classification&sort=downloads

https://viso.ai/deep-learning/vision-transformer-vit/

Complete code

!pip install transformers

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
from google.colab import files

files.upload()

img = Image.open('images.jpg')
img = img.resize((224, 224))
img

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])        
