Techniques to make deep learning efficient: Pruning and Leverage Sparse Tensor Cores of A100


Deep learning has revolutionized computer vision, natural language processing, generative AI, and more. However, this progress has produced models with ever more parameters, higher latency, and larger computational resource requirements. Neural network pruning can reduce the parameter count of a network by more than 90%, decreasing storage requirements and improving the computational efficiency of inference.

A data practitioner may face the following challenges when trying to deploy a model for inference:

  • Running inference at scale for long periods can drive up cost, since it consumes more server-side CPU, GPU, RAM, and other resources.
  • Some deep learning models need to run on edge devices such as IoT and smart devices. These devices are resource-constrained, and model optimization is a must in such cases.

What is Efficient Inferencing?

Some pertinent questions to ask before deploying a model are:

  • Is the model small?
  • Is it fast?
  • How many parameters does the model have?
  • What is the RAM consumption during inference?
  • What is the inference latency?

How to achieve efficient inferencing?

Compression Techniques: These techniques compress the layers of a model. Two of them are:

  1. Pruning
  2. Quantization

What is Pruning?

In simple words, pruning makes neural networks smaller by removing synapses and neurons.

Pruning in Human Brain

Pruning happens in the human brain. A newborn has nearly 2,500 synapses per neuron; this count surges during the first few years of childhood, but after about four years it starts to decrease. This is intuitive: the brain optimizes its neural networks by removing some of the connections, or synapses.


Given a neural network f(x, W), where x is the input and W is the set of parameters (or weights), pruning is a technique for coming up with a minimal subset W′ such that the rest of the parameters of W are pruned (or set to 0), while ensuring that the quality of the model remains above the desired threshold. After pruning, we say the network has been made sparse, where the sparsity is the ratio of the number of parameters that were pruned to the number of parameters in the original network: s = 1 − |W′| / |W|. The higher the sparsity, the fewer non-zero parameters remain in the pruned network. (Source: https://arxiv.org/abs/2106.08962)
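As a quick illustration, sparsity can be measured directly from a weight tensor by counting zeros. The toy tensor below is a minimal PyTorch sketch made up for the example, not taken from any real model.

```python
import torch

# Toy weight tensor standing in for a layer's parameters (|W| = 12 values).
W = torch.tensor([[0.5, 0.0, -1.2,  0.0],
                  [0.0, 0.3,  0.0,  0.0],
                  [2.1, 0.0,  0.0, -0.7]])

# Sparsity s = 1 - |W'| / |W|: the fraction of parameters that are zero.
num_total = W.numel()                   # 12 parameters in the original network
num_nonzero = W.count_nonzero().item()  # 5 parameters survive pruning (W')
sparsity = 1 - num_nonzero / num_total
print(f"sparsity = {sparsity:.2f}")     # 0.58
```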

A typical workflow to construct a pruned network has the following three steps (a minimal sketch follows the list):

  1. Train a dense network until convergence
  2. Prune the network to remove unwanted structure
  3. Retrain the network
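
Below is a minimal sketch of this train → prune → retrain loop in PyTorch. The model, synthetic data, pruning amount, and step counts are placeholders chosen for illustration; torch.nn.utils.prune is used here simply as one convenient way to apply magnitude-based pruning masks.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder model and synthetic data; swap in your own network and loader.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train(steps):
    for _ in range(steps):
        x = torch.randn(32, 784)          # stand-in for real inputs
        y = torch.randint(0, 10, (32,))   # stand-in for real labels
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

# 1. Train the dense network (until convergence on real data).
train(steps=100)

# 2. Prune: zero out the 50% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# 3. Retrain: the pruning masks keep pruned weights at zero during fine-tuning.
train(steps=100)
```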

Lottery Ticket Hypothesis: The idea that a sparse structure exists within a dense model is inspired by the lottery ticket hypothesis, which states that:

“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations“


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]


Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]

How to Prune?

What synapses and neurons should we prune?

When removing parameters from a neural network, the less important the removed parameters are, the better the pruned network performs.

If only some weights have to be removed, which ones? And why?

Magnitude-based Pruning: Magnitude-based pruning considers weights with larger absolute values to be more important than other weights. For element-wise pruning, Importance = |W|.
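
The sketch below illustrates element-wise magnitude pruning on a toy weight matrix: importance is taken as |W|, and the least important half of the weights is zeroed out. The matrix size and the 50% pruning level are arbitrary choices for the example.

```python
import torch

# Toy weight matrix; the importance of each element is its magnitude |W|.
W = torch.randn(8, 8)
importance = W.abs()

# Keep the top 50% most important weights and zero out the rest.
k = W.numel() // 2
threshold = importance.flatten().kthvalue(k).values  # k-th smallest |W|
mask = (importance > threshold).float()
W_pruned = W * mask

print(f"fraction of weights kept: {mask.mean().item():.2f}")  # ~0.50
```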


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]

N:M sparsity in A100 via pruning

The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores. Sparse Tensor Cores accelerate a 2:4 sparsity pattern: in each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50%, which is fine-grained. There are no vector or block structures pruned together, so the pattern is easy to compress and has low metadata overhead.
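
To make the 2:4 pattern concrete, the following sketch prunes a weight matrix so that every contiguous group of four values keeps only its two largest-magnitude entries. This is a hypothetical helper written for illustration, not NVIDIA's actual tooling.

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude values in every contiguous group
    of four along the last dimension (2:4 fine-grained sparsity)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for 2:4 sparsity"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the two largest magnitudes within each group of four.
    keep_idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0)
    return (groups * mask).reshape(rows, cols)

W = torch.randn(16, 16)
W_sparse = prune_2_to_4(W)
# Every block of four now holds exactly two zeros -> 50% sparsity overall.
print((W_sparse == 0).float().mean().item())  # 0.5
```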


The routine for training a pruned network that follows an N:M structured sparsity pattern is:

  1. Start with a dense network
  2. On the dense network, prune the weights to satisfy the 2:4 structured sparsity criteria. Out of every four elements, remove just two.
  3. Repeat the original training procedure.

Turning half of a network's weights to zero can affect the accuracy of the network, as you might expect. Step 3 recovers that accuracy with enough weight update steps to let the weights converge and a high enough learning rate to let the weights move around sufficiently.
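
A minimal end-to-end sketch of this routine is given below. The linear layer, synthetic data, loss, and step counts are placeholders for illustration; in practice, NVIDIA provides tooling (for example, the ASP utilities in the Apex library) that automates the pruning and mask handling.

```python
import torch
import torch.nn as nn

# Placeholder dense layer standing in for a full network.
model = nn.Linear(64, 64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train(steps):
    for _ in range(steps):
        x = torch.randn(32, 64)
        loss = model(x).pow(2).mean()  # stand-in loss for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# 1. Start with a dense network and train it.
train(steps=100)

# 2. Prune to the 2:4 pattern: in each group of four weights along the
#    input dimension, keep the two with the largest magnitude.
with torch.no_grad():
    w = model.weight                                  # shape (64, 64)
    groups = w.reshape(64, 16, 4)
    keep_idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0).reshape(64, 64)
    w.mul_(mask)

# 3. Repeat the original training procedure, reapplying the mask after each
#    update so pruned weights stay at zero while the rest converge again.
for _ in range(100):
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        model.weight.mul_(mask)
```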

Performance in TensorRT 8.0


Along with using techniques for efficient deep learning, you can optimize cost by choosing an appropriate cloud GPU platform. E2E Cloud provides a range of GPUs for all kinds of deep learning and graphics workloads at the most affordable prices in the market. Try our platform and Cloud GPUs with a free trial. To get your free credits, contact: [email protected]

Sign up for a free trial - bit.ly/3HAhxcJ
