The Deep Learning Hardware Battle
Originally published on Tractica Blogs
There is an ongoing race among semiconductor companies, established market heavyweights and startups alike, to define the hardware platform that will run compute-intensive deep learning algorithms quickly and efficiently. Until now, NVIDIA has dominated the deep learning market with its graphics processing unit (GPU) chips, which bring massive parallelization; however, field programmable gate arrays (FPGAs) and digital signal processors (DSPs) are starting to catch up. Deep learning is largely characterized by deep neural networks (DNNs) and convolutional neural networks (CNNs), which can become massively complex: Google’s cat recognition neural network back in 2012 had 1 billion connections and ran on 16,000 processors. GPUs achieve the best speed and throughput, roughly 100x that of an FPGA, while FPGAs offer better power efficiency, roughly 50x that of a GPU. This illustrates the tradeoff between GPUs and FPGAs when running high-intensity deep learning algorithms. Microsoft has already extended its use of FPGAs for deep learning algorithms, making up for the performance gap with GPUs through scale by bundling multiple FPGAs together.

A more detailed look at FPGAs versus GPUs for deep learning was covered in an earlier Tractica blog post. Since then, we have seen FPGAs gain more traction, with startups like DeePhi banking on the fact that deep learning workloads change depending on the type of neural network. Another startup, KnuPath, is building a custom DSP chip for deep learning and machine learning applications, with plans to integrate FPGAs on its roadmap. KnuPath is targeting a specific problem area in high-performance computing: sparse matrix-based problems, where the data is mostly zeros and memory access is irregular.
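To make the sparse matrix point concrete, here is a minimal Python sketch (using NumPy and SciPy, not any of KnuPath's actual tooling) contrasting a dense matrix-vector product, which maps cleanly onto GPU-style parallelism, with a sparse one, whose indirect indexing rewards architectures that can be tailored to the access pattern:

```python
import numpy as np
from scipy import sparse

n = 5_000
x = np.random.rand(n)

# Dense case: every element is stored and touched, with regular,
# predictable memory access -- ideal for a GPU's wide parallelism.
dense = np.random.rand(n, n)
y_dense = dense @ x

# Sparse case: 99.9% zeros, stored in compressed (CSR) form. The
# multiply walks sp.indices / sp.data indirectly, an irregular access
# pattern that dense-optimized hardware handles poorly.
sp = sparse.random(n, n, density=0.001, format="csr")
y_sparse = sp @ x
```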
The majority of the hardware focus for deep learning has been on cloud server computing, both for training and execution of deep learning models. Intel has been trying to build up its deep learning capabilities with its recent Nervana acquisition and its Xeon Phi Knights Landing announcement, in an effort to compete with NVIDIA. Google has also entered the market with its Tensor Processing Unit (TPU) hardware platform, which was used for DeepMind’s AlphaGo triumph. In addition, a UK-based startup called Graphcore has recently come out of stealth mode and is building a neural network accelerator called the Intelligence Processing Unit (IPU).
At the same time, the algorithm space is evolving rapidly, with algorithms moving toward more temporal (and memory-intensive) architectures, such as recurrent neural networks (RNNs). Deep learning algorithms that account for context, such as the placement of an object in an image or of a word in a sentence, are expected to see increasing use because of their greater effectiveness. We have also seen a lot of excitement around generative adversarial networks (GANs) and one-shot learning, all of which could require different hardware processing architectures, or a combination of architectures, compared to the currently dominant deep learning method, CNNs. Therefore, the hardware market for deep learning might end up with multiple “bespoke” architectures rather than “one size fits all.”
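Why do temporal architectures stress hardware differently? A minimal vanilla-RNN forward pass (an illustrative NumPy sketch, not any vendor's implementation) shows the issue: the hidden state is carried from step to step, so the time loop is strictly sequential and cannot be parallelized the way a CNN's independent convolutions can:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b):
    """Vanilla RNN: each step depends on the previous hidden state."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:                            # strictly sequential loop
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)   # h feeds back into itself
        states.append(h)
    return np.stack(states)

# Toy usage with hypothetical sizes: 20 time steps, 8 inputs, 16 hidden units.
T, d_in, d_h = 20, 8, 16
x_seq = np.random.randn(T, d_in)
W_xh = np.random.randn(d_h, d_in) * 0.1
W_hh = np.random.randn(d_h, d_h) * 0.1
states = rnn_forward(x_seq, W_xh, W_hh, np.zeros(d_h))  # shape (T, d_h)
```

Keeping that recurrent state close to the compute, rather than streaming independent work through a wide parallel pipeline, is exactly the kind of constraint that could favor different silicon than CNNs do.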
In conjunction with the rapidly evolving algorithmic space, another shift that is beginning to happen is deep learning execution (while training still remains in the cloud) starting to move to the device, where power consumption comes at a premium. Apple’s recent announcement of neural network libraries for iOS and Samsung’s M1 architecture, with its neural network-based branch predictor, illustrate the growing shift toward neural network processing on devices. The on-device approach has major applications in drones, robots, autonomous vehicles, and smartphones. The biggest drivers for neural networks on devices are image recognition and natural language processing (NLP) performed on the fly.

Qualcomm has shown its Zeroth platform performing on-device deep learning using existing Snapdragon processors. Qualcomm’s focus right now is to wring optimizations out of existing hardware and software. In the long run, however, Qualcomm and other device makers are likely to move toward a dedicated neural network unit (NNU) that runs and accelerates on-device deep learning alongside existing central processing units (CPUs), GPUs, and DSPs. The evolving nature of deep learning algorithms and workloads could also have a role to play in how the processing is distributed between the cloud and the device, and ultimately in which architecture is best suited.
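One reason execution can move on-device while training stays in the cloud is that inference tolerates reduced numerical precision. The sketch below illustrates the general idea with simple 8-bit weight quantization in NumPy; the quantize/dequantize helpers are hypothetical names for illustration, not part of Apple's, Samsung's, or Qualcomm's SDKs:

```python
import numpy as np

def quantize(w):
    """Map float32 weights onto int8 with a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# A trained layer's weights: 4x smaller in int8, and integer arithmetic
# draws far less power than float32 -- the resource at a premium on-device.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize(w)
error = np.abs(w - dequantize(q, scale)).max()   # small reconstruction error
```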
While hardware vendors are keen to get ahead of the game in speeding up and powering down server-based deep learning, the ultimate prize lies in defining, and establishing themselves in, the high-volume market for on-device deep learning processors.