Vision Transformers in Computer Vision

Axelera AI

Creating a powerful, efficient and competitive AI-native hardware & software platform for edge computing

Published: January 20, 2022

by Bert Moons, System Architect at AXELERA AI

Summary: Convolutional Neural Networks (CNNs) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViTs) with a higher learning capacity. The fastest ViTs are essentially CNN/Transformer hybrids that combine the best of both worlds: (A) CNN-inspired hierarchical, pyramidal feature maps, in which embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracy. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been established. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs on ImageNet validation, especially (1) when trained from scratch without distillation, (2) in the lower-accuracy (<80%) regime, and (3) at the lower network complexities optimized for Edge devices.
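To make points (A) and (B) concrete, here is a minimal Python sketch of how a pyramidal layout shrinks the spatial grid while widening the embedding, and of how restricting self-attention to local windows cuts the number of token pairs that must be compared. The stage configuration (downsampling factors 4, 2, 2, 2, embedding dimensions 96 to 768, and a 7x7 window) is an illustrative assumption and is not taken from the article. The helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    downsample: int  # spatial reduction factor applied when entering the stage
    embed_dim: int   # embedding (channel) dimension used inside the stage


# Illustrative, assumed configuration of a 4-stage hierarchical network.
STAGES = [Stage(4, 96), Stage(2, 192), Stage(2, 384), Stage(2, 768)]


def pyramid_shapes(image_size, stages):
    """Return (height, width, embed_dim) of the feature map after each stage."""
    size = image_size
    shapes = []
    for stage in stages:
        size //= stage.downsample  # spatial dimensions decrease
        shapes.append((size, size, stage.embed_dim))  # embedding dimension grows
    return shapes


def global_attention_pairs(height, width):
    """Token pairs compared when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def windowed_attention_pairs(height, width, window):
    """Token pairs compared when each token attends only within its own window."""
    return height * width * window * window


if __name__ == "__main__":
    for h, w, dim in pyramid_shapes(224, STAGES):
        print(f"{h:>3} x {w:<3} tokens, dim {dim:>3}: "
              f"global {global_attention_pairs(h, w):>9} pairs, "
              f"7x7 windows {windowed_attention_pairs(h, w, 7):>7} pairs")
```

Under these assumed settings, a 224x224 input yields a 56x56 grid in the first stage, where global self-attention compares 64 times more token pairs than 7x7 windowed attention. In the last stage the grid is 7x7, so the two counts are equal. This is only a pair count, not a full FLOP model.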

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have been the dominant Neural Network architectures in Computer Vision for almost a decade, since the breakthrough performance of AlexNet[1] on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved either into variations of bottlenecked architectures with residual connections, such as ResNet[3] and RegNet[4], or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet[5] and EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images from the ImageNet data set. After this pretraining, they can be used for applications outside of image classification, such as object detection, panoptic vision, semantic segmentation, or other specialized tasks. This is done by using them as a backbone in an end-to-end, application-specific Neural Network and finetuning the resulting network on the appropriate data set and application.

Continue reading on our website.
