Techniques to make deep learning efficient: Pruning and Leverage Sparse Tensor Cores of A100


Deep learning has revolutionized computer vision, natural language processing, generative AI, and more. However, this progress has produced models with ever more parameters, higher latency, and larger computational resource requirements. Neural network pruning can reduce the parameter count of a network by more than 90%, decreasing storage requirements and improving the computational efficiency of inference.

A data practitioner may face the following challenges when trying to deploy a model for inference:

  • Running inference at scale for long periods can drive up cost, since it consumes more server-side CPU, GPU, RAM, and other resources.
  • Some deep learning models need to run on edge devices such as IoT and smart devices. These devices are resource-constrained, and model optimization is a must in such cases.

What is Efficient Inferencing?

Some pertinent questions to ask before deploying a model are:

  • Is the model small?
  • Is it fast?
  • How many parameters does the model have?
  • What is the RAM consumption during inference?
  • What is the inference latency?

How to achieve efficient inferencing?

Compression Techniques: These techniques compress the layers of a model. Two of them are:

  1. Pruning
  2. Quantization

What is Pruning?

In simple words, pruning makes neural networks smaller by removing synapses and neurons.

Pruning in Human Brain

Pruning happens in the human brain. A newborn has nearly 2,500 synapses per neuron; this count surges during the first few years of childhood, but after about four years it starts to decrease. This is intuitive: the brain optimizes its neural networks by removing some of the connections, or synapses.


Given a neural network f(x, W), where x is the input and W is the set of parameters (or weights), pruning is a technique for coming up with a minimal subset W′ such that the rest of the parameters of W are pruned (or set to 0), while ensuring that the quality of the model remains above the desired threshold. After pruning, we say the network has been made sparse, where the sparsity is the ratio of the number of parameters that were pruned to the number of parameters in the original network: s = 1 − |W′| / |W|. The higher the sparsity, the fewer non-zero parameters remain in the pruned network. (Source: https://arxiv.org/abs/2106.08962)
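As a quick illustration, sparsity can be measured directly from a weight tensor by counting zeros. The toy tensor below is a minimal PyTorch sketch made up for the example, not taken from any real model.

```python
import torch

# Toy weight tensor standing in for a layer's parameters (|W| = 12 values).
W = torch.tensor([[0.5, 0.0, -1.2,  0.0],
                  [0.0, 0.3,  0.0,  0.0],
                  [2.1, 0.0,  0.0, -0.7]])

# Sparsity s = 1 - |W'| / |W|: the fraction of parameters that are zero.
num_total = W.numel()                   # 12 parameters in the original network
num_nonzero = W.count_nonzero().item()  # 5 parameters survive pruning (W')
sparsity = 1 - num_nonzero / num_total
print(f"sparsity = {sparsity:.2f}")     # 0.58
```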

A typical workflow to construct a pruned network has the following three steps (a minimal sketch follows the list):

  1. Train a dense network until convergence
  2. Prune the network to remove unwanted structure
  3. Retrain the network
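
Below is a minimal sketch of this train → prune → retrain loop in PyTorch. The model, synthetic data, pruning amount, and step counts are placeholders chosen for illustration; torch.nn.utils.prune is used here simply as one convenient way to apply magnitude-based pruning masks.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder model and synthetic data; swap in your own network and loader.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train(steps):
    for _ in range(steps):
        x = torch.randn(32, 784)          # stand-in for real inputs
        y = torch.randint(0, 10, (32,))   # stand-in for real labels
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

# 1. Train the dense network (until convergence on real data).
train(steps=100)

# 2. Prune: zero out the 50% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# 3. Retrain: the pruning masks keep pruned weights at zero during fine-tuning.
train(steps=100)
```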

Lottery Ticket Hypothesis: The idea that a sparse structure exists within a dense model is inspired by the lottery ticket hypothesis, which states that:

“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations“


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]


Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]

How to Prune?

What synapses and neurons should we prune?

When removing parameters from a neural network, the less important the removed parameters are, the better the pruned network performs.

If only some weights have to be removed, which ones? And why?

Magnitude-based Pruning: Magnitude-based pruning considers weights with larger absolute values to be more important than other weights. For element-wise pruning, Importance = |W|.
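
The sketch below illustrates element-wise magnitude pruning on a toy weight matrix: importance is taken as |W|, and the least important half of the weights is zeroed out. The matrix size and the 50% pruning level are arbitrary choices for the example.

```python
import torch

# Toy weight matrix; the importance of each element is its magnitude |W|.
W = torch.randn(8, 8)
importance = W.abs()

# Keep the top 50% most important weights and zero out the rest.
k = W.numel() // 2
threshold = importance.flatten().kthvalue(k).values  # k-th smallest |W|
mask = (importance > threshold).float()
W_pruned = W * mask

print(f"fraction of weights kept: {mask.mean().item():.2f}")  # ~0.50
```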


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]

N:M sparsity in A100 via pruning

The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores. Sparse Tensor Cores accelerate a 2:4 sparsity pattern: in each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50%, which is fine-grained. There are no vector or block structures pruned together, so the pattern is easy to compress and has low metadata overhead.
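
To make the 2:4 pattern concrete, the following sketch prunes a weight matrix so that every contiguous group of four values keeps only its two largest-magnitude entries. This is a hypothetical helper written for illustration, not NVIDIA's actual tooling.

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude values in every contiguous group
    of four along the last dimension (2:4 fine-grained sparsity)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for 2:4 sparsity"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the two largest magnitudes within each group of four.
    keep_idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0)
    return (groups * mask).reshape(rows, cols)

W = torch.randn(16, 16)
W_sparse = prune_2_to_4(W)
# Every block of four now holds exactly two zeros -> 50% sparsity overall.
print((W_sparse == 0).float().mean().item())  # 0.5
```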


The routine for training a pruned network that follows an N:M structured sparsity pattern is:

  1. Start with a dense network
  2. On the dense network, prune the weights to satisfy the 2:4 structured sparsity criteria. Out of every four elements, remove just two.
  3. Repeat the original training procedure.

Turning half of a network's weights to zero can affect the accuracy of the network, as you might expect. Step 3 recovers that accuracy with enough weight update steps to let the weights converge and a high enough learning rate to let the weights move around sufficiently.
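
A minimal end-to-end sketch of this routine is given below. The linear layer, synthetic data, loss, and step counts are placeholders for illustration; in practice, NVIDIA provides tooling (for example, the ASP utilities in the Apex library) that automates the pruning and mask handling.

```python
import torch
import torch.nn as nn

# Placeholder dense layer standing in for a full network.
model = nn.Linear(64, 64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train(steps):
    for _ in range(steps):
        x = torch.randn(32, 64)
        loss = model(x).pow(2).mean()  # stand-in loss for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# 1. Start with a dense network and train it.
train(steps=100)

# 2. Prune to the 2:4 pattern: in each group of four weights along the
#    input dimension, keep the two with the largest magnitude.
with torch.no_grad():
    w = model.weight                                  # shape (64, 64)
    groups = w.reshape(64, 16, 4)
    keep_idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0).reshape(64, 64)
    w.mul_(mask)

# 3. Repeat the original training procedure, reapplying the mask after each
#    update so pruned weights stay at zero while the rest converge again.
for _ in range(100):
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        model.weight.mul_(mask)
```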

Performance in TensorRT 8.0


Along with using techniques for efficient deep learning, you can optimize cost by choosing an appropriate cloud GPU platform. E2E Cloud provides a range of GPUs for all kinds of deep learning and graphics workloads at the most affordable prices in the market. Try our platform and Cloud GPUs with a free trial. To get your free credits, contact: [email protected]

Sign up for a free trial - bit.ly/3HAhxcJ
