A very short summary of SLIDE: Sub-LInear Deep learning Engine
Sharing this amazing piece of research (1). The principles look so simple (in hindsight, of course), yet the results are incredible: 3.5x faster training on a 44-core CPU than on an NVIDIA GPU. It doesn't use TensorFlow or any other existing framework - this is basic research and innovation. It looks like this has the potential to make Deep Learning even more accessible. Here are some salient points:
- Diminishing returns in Matrix Multiplication algorithms: Existing Deep Learning/Neural Nets are based on matrix multiplication. Matrix operations have been heavily optimized over the years, but the gains have saturated, and further improvements now rely on hardware acceleration.
- Exploiting Adaptive Sparsity: The idea of dropout in Deep Learning is one case of exploiting sparsity in the activated neurons. Earlier work showed the computational advantage of Locality Sensitive Hashing (LSH) for identifying the small set of active neurons during training, but did not demonstrate an implementation that could beat matrix multiplication on specialized hardware once the overhead of LSH is factored in. This paper presents an implementation that does (a minimal sketch of the LSH lookup appears after this list).
- Batch Gradient Descent: Input data is still divided into batches, each batch consisting of hundreds of samples.
- Sequential Feedforward and Backpropagation for each input sample: In the feedforward phase, the active neurons of each layer are first retrieved by hashing that layer's input. Activations are then computed only for those active neurons (this is how sparsity is exploited). In the backpropagation phase, the error gradient is propagated back to the same active neurons at each layer using a classical message-passing implementation rather than vector multiplication (see the second sketch after this list).
- Exploiting Parallelism: Parallelism is achieved across the input data: each sample in the batch runs feedforward and backpropagation in its own thread and updates weights in parallel across all layers. To keep these per-sample computations from conflicting, each neuron maintains two arrays indexed by sample - one holding its activation for each input, the other its error gradient for each input (this and the next point are illustrated in the last sketch after this list).
- Accumulation of weight updates is done randomly and asynchronously: The key insight is that convergence does not suffer if weight updates are accumulated in a random order. Each thread therefore updates the shared weights asynchronously, which avoids synchronization during batch accumulation. This design choice is said to be directly responsible for the near-perfect scaling with increasing core counts. (Comparatively, TensorFlow shows poor scaling on CPUs beyond 16 cores.)
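
Below are a few minimal Python sketches of these ideas; every name, size, and hyperparameter in them is illustrative and not taken from the paper's actual (C++) implementation. First, the LSH lookup behind adaptive sparsity, using signed random projections (SimHash), a standard LSH family for angular/inner-product similarity:

```python
import numpy as np

class SimHashTable:
    """Toy LSH table (signed random projections): neurons are bucketed by the
    sign pattern of random projections of their weight vectors; querying with
    a layer's input returns the bucket of likely-to-be-active neurons."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.projections = rng.standard_normal((num_bits, dim))
        self.buckets = {}  # hash code -> set of neuron ids

    def _code(self, vec):
        bits = (self.projections @ vec) > 0   # sign pattern of the projections
        return np.packbits(bits).tobytes()    # packed into a hashable key

    def insert(self, neuron_id, weight_vec):
        self.buckets.setdefault(self._code(weight_vec), set()).add(neuron_id)

    def query(self, input_vec):
        # Neurons whose weight vectors share the input's bucket point in a
        # similar direction, i.e. are likely to have a large activation.
        return self.buckets.get(self._code(input_vec), set())

# Usage: index one layer's neurons, then retrieve the active set for an input.
dim, num_neurons = 64, 1000
rng = np.random.default_rng(1)
weights = rng.standard_normal((num_neurons, dim))

table = SimHashTable(dim)
for nid in range(num_neurons):
    table.insert(nid, weights[nid])

x = rng.standard_normal(dim)
active = table.query(x)  # typically a small fraction of the layer
print(f"{len(active)} of {num_neurons} neurons retrieved as active")
```

In the paper, several such tables are kept per layer, the retrieved buckets are combined, and the tables are rebuilt periodically as the weights drift.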
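
Next, the per-sample sparse feedforward and backpropagation, sketched for a single layer. The point is the loop over active neurons in place of a dense matrix-vector product; the ReLU, the learning rate, and the dummy upstream gradients are placeholders, not the paper's exact update rule.

```python
import numpy as np

def sparse_forward(x, W, b, active):
    """Compute activations only for the active neurons of one layer."""
    acts = {}
    for j in active:                          # inactive neurons are skipped entirely
        acts[j] = max(0.0, W[j] @ x + b[j])   # ReLU, for illustration
    return acts

def sparse_backward(x, W, b, acts, grad_out, lr=0.01):
    """Propagate error gradients back through the active neurons only, updating
    just their weights ('message passing' instead of a dense vector-matrix
    multiply). Returns the gradient with respect to the layer's input."""
    grad_x = np.zeros_like(x)
    for j, g in grad_out.items():             # grad_out is keyed by active neuron id
        if acts[j] <= 0.0:                    # ReLU gate
            continue
        grad_x += g * W[j]                    # message passed to the previous layer
        W[j] -= lr * g * x                    # update only this neuron's weights
        b[j] -= lr * g
    return grad_x

# Usage with a toy layer and a hypothetical active set (from the LSH lookup).
rng = np.random.default_rng(0)
dim, num_neurons = 32, 200
W = rng.standard_normal((num_neurons, dim))
b = np.zeros(num_neurons)
x = rng.standard_normal(dim)

active = {3, 17, 58, 121}
acts = sparse_forward(x, W, b, active)
grad_out = {j: acts[j] - 1.0 for j in active}   # dummy upstream gradients
grad_x = sparse_backward(x, W, b, acts, grad_out)
```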
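
Finally, the last two points together: one thread per sample in the batch, per-(neuron, sample) activation and gradient slots so threads never write to the same entry, and weight updates applied to the shared matrix without locks, in the spirit of HOGWILD-style asynchronous SGD. This is only a data-layout sketch (Python's GIL limits real parallelism here, and the paper's engine is multi-threaded C++), with made-up sizes and a toy loss.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
dim, num_neurons, batch_size, lr = 32, 200, 64, 0.01

W = rng.standard_normal((num_neurons, dim))       # shared weights
b = np.zeros(num_neurons)
X = rng.standard_normal((batch_size, dim))        # one batch of inputs
targets = rng.standard_normal((batch_size, num_neurons))

# One column per sample: threads write to disjoint columns, so the per-sample
# activations and gradients never conflict.
activations = np.zeros((num_neurons, batch_size))
gradients = np.zeros((num_neurons, batch_size))

def process_sample(i, active):
    """Feedforward + backpropagation for sample i over its active neurons, then
    apply this sample's weight updates directly to the shared W and b with no
    lock: other threads may be updating W at the same time."""
    x = X[i]
    for j in active:
        activations[j, i] = max(0.0, W[j] @ x + b[j])        # forward (ReLU)
        gradients[j, i] = activations[j, i] - targets[i, j]  # toy loss gradient
        if activations[j, i] > 0.0:
            W[j] -= lr * gradients[j, i] * x                 # asynchronous update
            b[j] -= lr * gradients[j, i]

# Hypothetical active sets per sample; in SLIDE these come from the LSH lookup.
active_sets = [set(rng.choice(num_neurons, size=10, replace=False))
               for _ in range(batch_size)]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process_sample, range(batch_size), active_sets))
```

The random, unsynchronized order in which these updates land on W is the "random accumulation" the paper argues is harmless for convergence while buying near-linear scaling with cores.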
(1) See the full research paper here: https://www.cs.rice.edu/~as143/Papers/SLIDE_MLSys.pdf