Anatomy of a High-Performance MBConv Block
Abstract
Neural network efficiency is paramount for those deploying the most advanced computer vision technology on edge devices. Whether constrained by cost or power, edge devices have a limited computational budget. While server-side applications can throw unlimited computational resources at a neural network, edge applications perform latency-critical tasks on small, cost-effective low-power devices. Therefore we must ensure that inference engine software runs as efficiently as possible on the edge. Yet benchmarks show that some of the most advanced neural network blocks run inefficiently, especially on the smaller, narrower models used on edge devices. We examine one such block, MBConv, and by implementing a memory-efficient algorithm, produce an NVIDIA GPU kernel that runs MBConv up to 4 times faster than TensorRT. This efficient algorithm has immediate implications for edge applications. It also suggests the far-reaching conclusion that network architecture search has been looking in the wrong place for efficient models, as the inefficient algorithms used by today’s inference engine software distort the search landscape.
Introduction
The introduction of the MobileNetV2 convolutional neural network signaled a breakthrough in model efficiency, achieving greater accuracy per operation and parameter than its predecessors [1]. The MobileNetV2 paper also presented an algorithm for efficient computation. Curiously, we find this algorithm has yet to be implemented.
The MobileNetV2 inverted residual block (MBConv) lives on as a central component of highly efficient models like EfficientNetV2 [4], yet it has earned a reputation for being slow. This limitation has caused some researchers to avoid MBConv altogether.
We find this conclusion erroneous and believe that MBConv is slow in practice only because of the naive algorithm that inference engine software uses to compute it. We adapt the memory-efficient MBConv algorithm to run in parallel and supplement it with a Squeeze-and-Excitation (SE) layer while maintaining efficiency. Our implementation of MBConv+SE runs up to 5.6 times faster than PyTorch Inductor and up to 4 times faster than NVIDIA TensorRT.
We conclude that the widespread use of inefficient algorithms in deep learning has distorted the network architecture search landscape. Thus inefficient software leads to suboptimal models and over-allocation of hardware resources.
MBConv
MBConv is an inverted residual block: it maps a relatively narrow input layer to a wider inner layer, which in turn maps onto a narrow output layer. The output layer is added to the input, forming a residual shortcut. The original residual blocks found in the ResNet model used wider input and output layers and narrower inner layers; thus MBConv "inverts" the bottleneck relative to the original design.
The individual layers of MBConv are: an expansion layer (a point-wise 1x1 convolution that widens the narrow input by the expansion factor), a depth-wise 3x3 convolution over the widened inner layer, and a projection layer (a point-wise 1x1 convolution that maps back down to the narrow output width).
Batch normalization follows each layer, and the expansion and convolution layers have activation functions. The projection layer has no activation function, as the MobileNetV2 paper found that the nonlinearities decrease accuracy when used on the bottleneck layers.
Successors to MobileNetV2 [3] often add a Squeeze-and-Excitation (SE) [9] layer to the MBConv block after the depth-wise convolution layer. The SE layer significantly increases model accuracy but decreases computational efficiency.
We also modify the MBConv block by replacing the depth-wise convolution with a grouped convolution, using group-width equal to eight, so that the structure of the computation matches the shape of NVIDIA tensor cores. This change causes a slight decrease in model efficiency but significantly increases computational efficiency.
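To make the block structure concrete, below is a minimal PyTorch sketch of the modified MBConv block as described above, executed layer by layer the way a stock framework would run it. The module and parameter names, and the choice of SiLU activation, are illustrative assumptions rather than the exact configuration behind our kernel.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, hidden_ch: int, bottleneck_ch: int):
        super().__init__()
        squeezed = bottleneck_ch // 4                    # 1/4 of the bottleneck width
        self.squeeze = nn.Linear(hidden_ch, squeezed)
        self.excite = nn.Linear(squeezed, hidden_ch)

    def forward(self, x):                                # x: (N, C, H, W)
        pooled = x.mean(dim=(2, 3))                      # global average pool per channel
        e = torch.sigmoid(self.excite(torch.relu(self.squeeze(pooled))))
        return x * e[:, :, None, None]                   # channel-wise scaling ("attention")

class MBConv(nn.Module):
    def __init__(self, ch: int, expand: int = 4, group_width: int = 8, use_se: bool = True):
        super().__init__()
        hidden = ch * expand
        self.expand = nn.Sequential(                     # point-wise expansion + BN + activation
            nn.Conv2d(ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU())
        self.conv = nn.Sequential(                       # grouped 3x3 convolution, group width 8
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden // group_width, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU())
        self.se = SqueezeExcite(hidden, ch) if use_se else nn.Identity()
        self.project = nn.Sequential(                    # point-wise projection + BN, no activation
            nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch))

    def forward(self, x):                                # residual shortcut around the whole block
        return x + self.project(self.se(self.conv(self.expand(x))))

For example, MBConv(128)(torch.randn(1, 128, 16, 16)) exercises the 128-channel bottleneck configuration discussed below.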
Memory-Efficient Algorithm
We implemented MBConv as a single GPU kernel that computes all block layers simultaneously. Inspired by the "memory efficient inference" section of the MobileNetV2 paper, we partition the inner layer into channel groups and process each in its own thread-block. Likewise, each thread-block computes the terms of the projection point-wise convolution corresponding to its hidden-layer channel group.
The thread-blocks add their outputs together using a global memory reduction onto the input tensor. Thus the action of all thread-blocks produces the MBConv block with a residual shortcut.
A typical kernel configuration partitions the inner layer into groups of 64 channels each. Therefore, MBConv with 128 input channels and an expansion factor of four produces an inner layer with 512 channels, which we partition into eight groups. With 256 input channels and 1024 inner channels, we partition into 16 groups. We assign each channel group to a different thread-block.
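As a sketch, the partitioning arithmetic amounts to the following (a hypothetical helper for illustration, not our actual launch code):

def channel_groups(bottleneck_ch: int, expand: int = 4, group_size: int = 64) -> int:
    # Number of 64-channel hidden-layer groups, i.e. thread-blocks per image.
    hidden_ch = bottleneck_ch * expand
    assert hidden_ch % group_size == 0
    return hidden_ch // group_size

print(channel_groups(128))   # 512 hidden channels  ->  8 thread-blocks per image
print(channel_groups(256))   # 1024 hidden channels -> 16 thread-blocks per image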
We focus on the feature map sizes found in the last two stages of a convolutional neural network, where the feature maps have low resolution, typically 1/16 and 1/32 the input resolution. These scales agree with how EfficientNetV2 uses the MBConv block, for example.
A typical ImageNet network might have an input resolution of 256 x 256; the last two stages would then have resolutions of 16x16 and 8x8. Thus the inner feature map in our kernel is no larger than 16h x 16w x 64c, which equals 16K elements. With float16 precision, the hidden-layer buffer requires 32KB of memory, which fits comfortably in the local shared memory of a thread-block. Thus our kernel writes the expansion layer outputs to a shared local memory buffer, reads and writes the convolution layer inputs and outputs using the same buffer, and finally reads the projection layer inputs from there. The same computation typically requires two round trips to the global memory system in the naive, layer-wise algorithm.
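The shared-memory budget quoted above works out as follows, assuming float16 (2-byte) activations:

h, w, group_channels, bytes_per_elem = 16, 16, 64, 2
elements = h * w * group_channels              # 16,384 elements per channel group
buffer_bytes = elements * bytes_per_elem       # 32,768 bytes = 32 KB per thread-block
print(elements, buffer_bytes)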
Having completed the computation for a single MBConv block, our kernel proceeds to the next block in the network, using the same global memory tensor as the input and output until it computes all residual blocks in the network stage. Thus a single kernel launch processes several layers of the convolutional neural network.
A bottleneck layer with size 16h x 16w x 128c [float16] uses just 64 KB of memory and distributes across eight thread-blocks. The 64 SMs of the NVIDIA Ampere A5000 GPU give our kernel 128 active thread-blocks, or 16 images in flight at a time, thus requiring only 1 MB of L2 cache to keep all bottleneck activations in on-chip memory.
The A5000 has 6 MB of L2 cache, so there is plenty of excess on-chip memory to cache the weights for each layer. When using a large batch size, our GPU kernel executes several waves of thread-blocks: the layer weights are loaded from DRAM during the first wave, and the remaining global memory accesses hit the L2 cache.
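A back-of-the-envelope version of this working-set estimate, assuming float16 activations and two resident thread-blocks per SM (consistent with the 128 active thread-blocks above):

bottleneck_bytes = 16 * 16 * 128 * 2                    # 64 KB of bottleneck activations per image
blocks_per_image = 512 // 64                            # 8 thread-blocks per image
active_blocks = 64 * 2                                  # 64 SMs, assumed two resident blocks each
images_in_flight = active_blocks // blocks_per_image    # 16 images processed concurrently
print(images_in_flight * bottleneck_bytes // 1024)      # 1024 KB = 1 MB of activations in L2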
Optionally, we add the Squeeze-and-Excitation (SE) layer after the grouped-convolution layer. SE is defined by a global average pool in each channel, producing a vector of values, one element per hidden-layer channel. This vector is then “squeezed” to a small number of channels (1/4th the bottleneck width in our experiments), passed through a ReLU activation, and then expanded again through a second linear transformation. The expanded vector is passed through a sigmoid activation, producing “excitations” that multiply the activations. We can think of SE as a channel-wise attention mechanism.
We compute the global average pool in each hidden-layer channel separately in each thread-block, and squeeze the result using the corresponding columns of the SE layer squeeze weights matrix. We then reduce the squeezed vectors through a global memory workspace, adding the contributions from each hidden-layer group’s thread-block. After synchronization, each thread-block loads the squeeze vector, applies the ReLU activation, and computes the segment of the excitation vector corresponding to its channel group using the corresponding rows of the excitation weights matrix and sigmoid activation. Each thread-block completes the SE layer by multiplying its excitations by its convolution layer channels, which are retained in shared memory.
Thus the SE layer is computed in parallel across multiple thread-blocks, with the cost of one inter-block synchronization and a small global memory workspace for the squeezed-vector. This fused algorithm contrasts with the naive algorithm, which uses one load from global memory for all hidden layer activations to compute the global average pool and another round-trip to multiply the excitations.
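The reason the per-group partial squeezes can simply be summed is that the squeeze is a matrix-vector product over the pooled channel vector: splitting that vector by channel group and summing the partial products gives the same result. The NumPy sketch below, with illustrative shapes (512 hidden channels, 64-channel groups, 32 squeezed units), checks the equivalence:

import numpy as np

hidden, group, squeezed = 512, 64, 32
pooled = np.random.rand(hidden).astype(np.float32)            # per-channel global average pool
W_squeeze = np.random.rand(squeezed, hidden).astype(np.float32)

full = W_squeeze @ pooled                                     # what a single kernel would compute
partial = sum(W_squeeze[:, g:g + group] @ pooled[g:g + group] # one term per thread-block
              for g in range(0, hidden, group))
assert np.allclose(full, partial, rtol=1e-4)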
Operational Intensity
Operational intensity is the number of operations (FLOPs) performed per byte of DRAM traffic [5]. Algorithms with low operational intensity are memory bound, limited by the speed of the DRAM. If operational intensity exceeds the memory bandwidth ratio of the system, measured as peak arithmetic performance divided by DRAM bandwidth (again FLOP/byte), then the algorithm is compute bound.
We assume the batch size is large and the weights for each layer fit in the processor cache. Therefore only the activation memory contributes to the operational intensity.
We estimate the operational intensity for the individual layers of the MBConv block and our fused kernel. Our fused kernel’s operational intensity is about six times greater than the expansion and projection layers and about 20 times greater than the grouped convolution layer. Even with a relatively small number of channels, the fused kernel far exceeds the memory bandwidth ratio of the GPU. Therefore a naive implementation of the MBConv block will be memory bound, especially when the number of channels is small, while the fused MBConv algorithm will be compute bound regardless.
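The sketch below reproduces these estimates under simplifying assumptions: float16 (2-byte) activations, weights resident in cache, only activation DRAM traffic counted, SE and residual traffic ignored, and a 16x16 feature map with a 128-channel bottleneck, expansion factor four, and group width eight. The exact ratios shift with these assumptions, but the gap remains large; comparing each figure against the GPU's peak-FLOP-to-DRAM-bandwidth ratio determines whether that layer is compute or memory bound.

H = W = 16
C, E, GW = 128, 4, 8                      # bottleneck channels, expansion factor, group width
hidden, pix = C * E, H * W

def oi(flops, read_elems, write_elems, bytes_per_elem=2):
    # Operational intensity: FLOPs per byte of activation traffic.
    return flops / ((read_elems + write_elems) * bytes_per_elem)

oi_expand  = oi(2 * pix * C * hidden,          pix * C,      pix * hidden)
oi_grouped = oi(2 * pix * hidden * 3 * 3 * GW, pix * hidden, pix * hidden)
oi_project = oi(2 * pix * hidden * C,          pix * hidden, pix * C)
fused_flops = 2 * pix * C * hidden + 2 * pix * hidden * 3 * 3 * GW + 2 * pix * hidden * C
oi_fused   = oi(fused_flops, pix * C, pix * C)  # only the bottleneck tensor touches DRAM

print(round(oi_expand), round(oi_grouped), round(oi_project), round(oi_fused))
# roughly 102, 36, 102, and 656 FLOP/byte: the fused kernel is about 6x the
# point-wise layers and nearly 20x the grouped convolution.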
The fused MBConv algorithm has an operational intensity advantage because it partitions the inner layer activations tensor into small chunks stored in shared local memory, thus significantly reducing the DRAM traffic. Also, it computes the residual shortcut while the input tensor is still in the processor cache. Therefore the refactoring of the MBConv algorithm to compute the entire block for a small chunk of inner layer activations creates temporal locality that dramatically reduces the DRAM traffic for activation tensors.
Experiments
We use an NVIDIA Ampere A5000 GPU with 64 SMs [6]. We set the GPU clock to the base frequency of 1.170 GHz and the memory clock to 2.0 GHz to obtain consistent measurements across experiments. We use float16 arithmetic with float32 accumulation for all computations, except for the bottleneck global memory reductions, which use float16 addition. These settings yield a device peak performance of 76.7 TFLOP/s. Similar to other reports on large discrete GPU performance, we use a large batch size of 128 for all experiments [8]. A smaller batch size would be relevant on a smaller GPU, such as the 16-SM GPU used in the NVIDIA Orin SoC for edge applications.
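As a quick consistency check on these settings, the quoted peak works out to roughly 1024 half-precision FLOPs, i.e. 512 multiply-adds, per SM per clock:

sms, clock_hz, peak_flops = 64, 1.170e9, 76.7e12
print(peak_flops / (sms * clock_hz))      # about 1024 FLOP per SM per clock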
We compute a sequence of eight MBConv blocks, representing one deep neural network stage. Our kernel processes the stage with a single kernel launch. We use input sizes of 16x16 and 8x8, typical of the last two stages of an ImageNet classification model. We vary the bottleneck width from 128 to 256 channels. We run one set of experiments with Squeeze-and-Excitation layers and the other without.
We report performance in TFLOP/s and as a percentage of device peak performance. We use the traditional definition of 2 FLOP = 1 MULTIPLY + 1 ADD, consistent with the peak performance units reported by hardware manufacturers.
We compare the performance of our kernels with the PyTorch v2.0.0 Inductor compiler and TensorRT 8.5.3.
Results
Our kernel is 3.0 to 5.6 times faster than PyTorch Inductor and 2.1 to 4.0 times faster than TensorRT. The speedups versus TensorRT are the greatest among the MBConv blocks with SE layers.
For both PyTorch and TensorRT, performance gradually increases as the number of channels increases. This trend is what one would expect of a naive algorithm that computes each layer of the MBConv block as a separate kernel launch, storing inner-layer activations in the global memory system with a round trip for each layer. These kernels are memory bound, and their operational intensity increases with the number of input channels. The same observation has led some researchers to conclude that wider models are more efficient than narrow ones and that model latency varies directly with the number of activations.
Our Phantom kernel has no difficulty with a smaller number of channels. In fact, our kernels have more difficulty when the number of hidden-layer channel groups does not evenly divide the number of processor cores (SMs). This mismatch causes some processor cores to sit idle. We see the problem decrease if we increase the batch size to 512. Conversely, channel counts of 128 or 256 run efficiently with batch sizes as small as 32. Because our algorithm writes no hidden layer activations to the global memory system, our kernel is not memory bound. Our performance contradicts the belief that model latency varies directly with the number of activations.
Future Directions: NVIDIA Hopper
The next-generation NVIDIA Hopper GPU architecture extends the concept of a thread-block by adding a higher level of abstraction called the thread-block cluster [7]. A thread-block cluster is a group of thread-blocks all running on the same graphics processing cluster (GPC). Furthermore, Hopper adds distributed shared memory, allowing all threads in a cluster to access each other’s shared memory directly. This allows inter-block communication within a cluster that is much faster than accessing the global memory system.
Our memory-efficient MBConv algorithm maps quite naturally to the NVIDIA Hopper architecture. The multiple thread-blocks we use to compute an MBConv block would become a thread-block cluster. The bottleneck activations and squeeze workspace would use distributed shared memory instead of global memory. Thus the costly inter-block synchronization and global memory reductions we employ on the Ampere architecture become cluster synchronization and distributed shared memory reductions on Hopper.
Additionally, the speed of distributed shared memory would still yield a compute-bound algorithm when using smaller hidden-layer partitions, including spatial divisions in the height and width of the feature map. This partitioning would enable our algorithm to compute MBConv blocks with high-resolution feature maps.
The fact that the memory-efficient MBConv algorithm maps directly onto Hopper's thread-block clusters and distributed shared memory is no coincidence. Instead, it suggests a trend in parallel algorithms and computer architecture, stressing the importance of small memory workspaces and local communication for efficient computation.
Conclusion
By applying memory-efficient algorithms to the computation of the MBConv block, we implemented NVIDIA GPU kernels that run up to 4 times faster than the state-of-the-art TensorRT software. The magnitude of the speedup has significant implications for the feasibility of high-accuracy neural network inference on resource-constrained edge devices. It also questions the assumptions made by model designers, as suboptimal algorithms used by popular inference engine software distort the network search landscape.
References
[1] Sandler, Mark, et al. "Mobilenetv2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[2] Goto, Kazushige, and Robert A. van de Geijn. "Anatomy of high-performance matrix multiplication." ACM Transactions on Mathematical Software (TOMS) 34.3 (2008): 1-25.
[3] Koonce, Brett, and Brett Koonce. "EfficientNet." Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization (2021): 109-123.
[4] Tan, Mingxing, and Quoc Le. "Efficientnetv2: Smaller models and faster training." International conference on machine learning. PMLR, 2021.
[5] Williams, Samuel, Andrew Waterman, and David Patterson. "Roofline: an insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.
[6] NVIDIA AMPERE GA102 GPU ARCHITECTURE https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
[7] NVIDIA H100 Tensor Core GPU Architecture: EXCEPTIONAL PERFORMANCE, SCALABILITY, AND SECURITY FOR THE DATA CENTER https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper
[8] Wightman, Ross, Hugo Touvron, and Hervé Jégou. "Resnet strikes back: An improved training procedure in timm." arXiv preprint arXiv:2110.00476 (2021).
[9] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.