Vision Transformers in Computer Vision
Axelera AI
Creating a powerful, efficient and competitive AI-native hardware & software platform for edge computing
by?Bert Moons?–?System Architect?at AXELERA AI
Summary: Convolutional Neural Networks (CNN) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViT) with a higher learning capacity. The fastest ViTs are essentially a CNN/Transformer hybrid, combining the best of both worlds: (A) CNN-inspired hierarchical and pyramidal feature maps, where embedding dimensions increase and spatial dimensions decrease throughout the network are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been asserted. We illustrate and conclude that SotA CNNs are still on-par, or better, than ViTs in ImageNet validation, especially when (1) trained from scratch without distillation, (2) in the lower-accuracy <80% regime, and (3) for lower network complexities optimized for Edge devices.
Convolutional Neural Networks
Convolutional Neural Networks (CNN) have been the dominant Neural Network architectures in Computer Vision for almost a decade, after the breakthrough performance of AlexNet[1]on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections such as ResNet[3],?RegNet[4] or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as Mobilenet[5] or EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images on the ImageNet data set. After this pretraining, they can be used for applications outside of image classification such as object detection, panoptic vision, semantic segmentation, or other specialized tasks. This can be done by using them as a backbone in an end-to-end application-specific Neural Network and finetuning the resulting network to the appropriate data set and application....continue the reading on our website