A very short summary of SLIDE: Sub-LInear Deep learning Engine
Sharing this amazing piece of research (1). The principles look so simple (in hindsight, of course), yet the results are incredible: 3.5x faster training on a 44-core CPU than on an NVIDIA GPU. It doesn't use TensorFlow or any other existing framework - this is basic research and innovation. It looks like this has the potential to make Deep Learning even more accessible. Here are some salient points:
- Diminishing returns in Matrix Multiplication algorithms: Existing Deep Learning/Neural Nets are based on matrix multiplication. Matrix operations have been heavily optimized over the years, but the gains have saturated, and further improvements now rely on hardware acceleration.
- Exploiting Adaptive Sparsity: The idea of dropout in Deep Learning is one case of exploiting sparsity in the activated neurons. Earlier work showed the computational advantage of Locality Sensitive Hashing (LSH) for identifying the small set of active neurons during training, but did not demonstrate an implementation that could beat matrix multiplication on specialized hardware once the overhead of LSH is factored in. This paper presents an implementation that does (a minimal sketch of the LSH lookup appears after this list).
- Batch Gradient Descent: Input data is still divided into batches, each batch consisting of hundreds of samples.
- Sequential Feedforward and Backpropagation for each input sample: In the feedforward phase, the active neurons of each layer are first retrieved by hashing that layer's input. Activations are then computed only for those active neurons (this is how sparsity is exploited). In the backpropagation phase, the error gradient is propagated back to the same active neurons at each layer using a classical message-passing implementation rather than vector multiplication (see the second sketch after this list).
- Exploiting Parallelism: Parallelism is achieved across the input data: each sample in the batch runs feedforward and backpropagation in its own thread and updates weights in parallel across all layers. To keep these per-sample computations from conflicting, each neuron maintains two arrays indexed by sample - one holding its activation for each input, the other its error gradient for each input (this and the next point are illustrated in the last sketch after this list).
- Accumulation of weight updates is done randomly and asynchronously: The key insight is that convergence does not suffer if weight updates are accumulated in a random order. Each thread therefore updates the shared weights asynchronously, which avoids synchronization during batch accumulation. This design choice is said to be directly responsible for the near-perfect scaling with increasing core counts. (Comparatively, TensorFlow shows poor scaling on CPUs beyond 16 cores.)
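
Below are a few minimal Python sketches of these ideas; every name, size, and hyperparameter in them is illustrative and not taken from the paper's actual (C++) implementation. First, the LSH lookup behind adaptive sparsity, using signed random projections (SimHash), a standard LSH family for angular/inner-product similarity:

```python
import numpy as np

class SimHashTable:
    """Toy LSH table (signed random projections): neurons are bucketed by the
    sign pattern of random projections of their weight vectors; querying with
    a layer's input returns the bucket of likely-to-be-active neurons."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.projections = rng.standard_normal((num_bits, dim))
        self.buckets = {}  # hash code -> set of neuron ids

    def _code(self, vec):
        bits = (self.projections @ vec) > 0   # sign pattern of the projections
        return np.packbits(bits).tobytes()    # packed into a hashable key

    def insert(self, neuron_id, weight_vec):
        self.buckets.setdefault(self._code(weight_vec), set()).add(neuron_id)

    def query(self, input_vec):
        # Neurons whose weight vectors share the input's bucket point in a
        # similar direction, i.e. are likely to have a large activation.
        return self.buckets.get(self._code(input_vec), set())

# Usage: index one layer's neurons, then retrieve the active set for an input.
dim, num_neurons = 64, 1000
rng = np.random.default_rng(1)
weights = rng.standard_normal((num_neurons, dim))

table = SimHashTable(dim)
for nid in range(num_neurons):
    table.insert(nid, weights[nid])

x = rng.standard_normal(dim)
active = table.query(x)  # typically a small fraction of the layer
print(f"{len(active)} of {num_neurons} neurons retrieved as active")
```

In the paper, several such tables are kept per layer, the retrieved buckets are combined, and the tables are rebuilt periodically as the weights drift.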
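
Next, the per-sample sparse feedforward and backpropagation, sketched for a single layer. The point is the loop over active neurons in place of a dense matrix-vector product; the ReLU, the learning rate, and the dummy upstream gradients are placeholders, not the paper's exact update rule.

```python
import numpy as np

def sparse_forward(x, W, b, active):
    """Compute activations only for the active neurons of one layer."""
    acts = {}
    for j in active:                          # inactive neurons are skipped entirely
        acts[j] = max(0.0, W[j] @ x + b[j])   # ReLU, for illustration
    return acts

def sparse_backward(x, W, b, acts, grad_out, lr=0.01):
    """Propagate error gradients back through the active neurons only, updating
    just their weights ('message passing' instead of a dense vector-matrix
    multiply). Returns the gradient with respect to the layer's input."""
    grad_x = np.zeros_like(x)
    for j, g in grad_out.items():             # grad_out is keyed by active neuron id
        if acts[j] <= 0.0:                    # ReLU gate
            continue
        grad_x += g * W[j]                    # message passed to the previous layer
        W[j] -= lr * g * x                    # update only this neuron's weights
        b[j] -= lr * g
    return grad_x

# Usage with a toy layer and a hypothetical active set (from the LSH lookup).
rng = np.random.default_rng(0)
dim, num_neurons = 32, 200
W = rng.standard_normal((num_neurons, dim))
b = np.zeros(num_neurons)
x = rng.standard_normal(dim)

active = {3, 17, 58, 121}
acts = sparse_forward(x, W, b, active)
grad_out = {j: acts[j] - 1.0 for j in active}   # dummy upstream gradients
grad_x = sparse_backward(x, W, b, acts, grad_out)
```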
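
Finally, the last two points together: one thread per sample in the batch, per-(neuron, sample) activation and gradient slots so threads never write to the same entry, and weight updates applied to the shared matrix without locks, in the spirit of HOGWILD-style asynchronous SGD. This is only a data-layout sketch (Python's GIL limits real parallelism here, and the paper's engine is multi-threaded C++), with made-up sizes and a toy loss.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
dim, num_neurons, batch_size, lr = 32, 200, 64, 0.01

W = rng.standard_normal((num_neurons, dim))       # shared weights
b = np.zeros(num_neurons)
X = rng.standard_normal((batch_size, dim))        # one batch of inputs
targets = rng.standard_normal((batch_size, num_neurons))

# One column per sample: threads write to disjoint columns, so the per-sample
# activations and gradients never conflict.
activations = np.zeros((num_neurons, batch_size))
gradients = np.zeros((num_neurons, batch_size))

def process_sample(i, active):
    """Feedforward + backpropagation for sample i over its active neurons, then
    apply this sample's weight updates directly to the shared W and b with no
    lock: other threads may be updating W at the same time."""
    x = X[i]
    for j in active:
        activations[j, i] = max(0.0, W[j] @ x + b[j])        # forward (ReLU)
        gradients[j, i] = activations[j, i] - targets[i, j]  # toy loss gradient
        if activations[j, i] > 0.0:
            W[j] -= lr * gradients[j, i] * x                 # asynchronous update
            b[j] -= lr * gradients[j, i]

# Hypothetical active sets per sample; in SLIDE these come from the LSH lookup.
active_sets = [set(rng.choice(num_neurons, size=10, replace=False))
               for _ in range(batch_size)]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process_sample, range(batch_size), active_sets))
```

The random, unsynchronized order in which these updates land on W is the "random accumulation" the paper argues is harmless for convergence while buying near-linear scaling with cores.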
(1) See the full research paper here: https://www.cs.rice.edu/~as143/Papers/SLIDE_MLSys.pdf