Upping the game – the imminent arrival of Deep Learning Exaops
#MachineLearning #AI #IamIntel #DeepLearning #Exaops #Exascale #HPC

I expect to see the first deep learning (DL) systems capable of exaops (10^18 operations per second) within the coming year, and, more importantly, broader availability of exaops systems by 2020. That capability will bring enormous capacity to deep learning systems, enabling solutions to problems from astronomy to microbiology at a scale that cannot be tackled today, as well as accelerating other solutions to allow real-time handling of highly complex tasks.

Currently, we measure most deep learning solutions in teraops (TOPs), that is, 10^12 operations per second. The higher-end HPC systems being deployed today perform in the petaops (10^15) range. For example, in 2017 NERSC's Cray XC40 supercomputer, Cori, sustained 15 petaflops on a climate pattern discovery deep learning workload. The expected ~50x gain in scale over the coming few years will allow for full exaops DL solutions.
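
As a rough sense of the magnitudes involved, here is a quick back-of-the-envelope check; the "current system" size below is an illustrative assumption, not a measured figure.

```python
# Back-of-the-envelope scale comparison; the current-system size is an
# illustrative assumption, not a measurement of any deployed machine.
TERA, PETA, EXA = 1e12, 1e15, 1e18

current_dl_system = 20 * PETA    # assume a ~20 petaops DL-capable system today
target = 1 * EXA                 # one exaops

print(f"gain needed: ~{target / current_dl_system:.0f}x")   # -> ~50x
```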

The primary force behind this is the drive of HPC systems to achieve exascale. This has been a long journey, with periods of high expectations and others with setbacks related to power efficiency, as well as to parallelization and effective distribution of tasks so that tens of thousands of CPUs can be used concurrently. There is an intense global race to reach exascale, with the US, China, Japan and the EU all making significant investments. Projections from leading programs such as the Exascale Computing Project indicate that it might be achieved in HPC systems as soon as 2021, and no later than 2023.

It is safe to say that when HPC hits exascale, DL running on HPC will be capable of DL exaops. Exascale is measured in LINPACK operations: 64-bit floating-point operations that are highly concurrent but still not as computationally dense as DL ops.

Moreover, as DL software continues to improve in how it utilizes the underlying hardware, it is very likely that DL exaops will be achievable on systems still at hundreds of petaflops, before they hit full exascale on more traditional HPC workloads. The reason is two-fold. First, the tensor operations used in DL are denser than LINPACK operations, and that increased data-level parallelism translates directly to higher compute density: DL contains a massive amount of arithmetic that can be executed in parallel and back-to-back, without sequential code in between. Second, the operations measured in exaops (multiply and add/accumulate, each adding to the count) are very likely at lower precision in DL. In particular, deep learning inference tasks map effectively to 8-bit integer operations. The power requirement of an 8-bit integer multiplication, for example, is an order of magnitude lower than that of a 32-bit operation – think of the ratio of 8x8 to roughly 32x32.
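
To make the argument concrete, here is a rough, hypothetical estimate of how lower precision moves the same cluster from hundreds of petaops to exaops. The node count, per-node throughput, and int8-to-fp32 ratio are assumptions for illustration, not figures for any real system.

```python
# Rough estimate of aggregate DL ops/s for a large cluster. All figures are
# hypothetical assumptions, purely to illustrate the precision argument.
NODES = 10_000                       # hypothetical node count
FP32_PER_NODE = 2e13                 # assumed ~20 TFLOPS fp32 per node
INT8_PER_NODE = 8 * FP32_PER_NODE    # assume ~8x higher throughput at 8-bit integer

def cluster_ops(nodes, ops_per_node_per_s):
    """Aggregate operations per second across the whole cluster."""
    return nodes * ops_per_node_per_s

fp32_total = cluster_ops(NODES, FP32_PER_NODE)   # 2e17 -> ~0.2 exaops
int8_total = cluster_ops(NODES, INT8_PER_NODE)   # 1.6e18 -> above one exaops

print(f"fp32 aggregate: {fp32_total:.1e} ops/s")
print(f"int8 aggregate: {int8_total:.1e} ops/s")
# The same hardware that is 'only' hundreds of petaflops at 32-bit can cross
# the exaops line when the workload maps to dense low-precision tensor ops.
```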

Software is a crucial differentiator. Most top-scale DL solutions today are, and will continue to be, based on CPU-centric HPC systems – in particular Intel Xeon processors. The software running DL on large HPC systems has improved substantially over the last couple of years, and there is still high potential for improving the utilization of large CPU-count systems for deep learning. The execution of DL tasks on a single CPU was shown to improve by more than 100x over 2016/17. It was also shown that single-node DL execution scales very well across multiple nodes, thanks to the inherent concurrency of DL tasks and the scalability of NUMA-based Intel Xeon systems. Add to that the hardware enhancements in successive CPU generations – native vector and matrix operations – and you can expect significant progress in DL scale and efficiency on Intel Xeon processor-based HPC systems.
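
As a minimal sketch of the kind of multi-node scaling described above – assuming an MPI-style launch and using mpi4py and NumPy as stand-ins, not Intel's actual software stack – synchronous data-parallel training sums gradients across nodes at every step:

```python
# Minimal sketch of synchronous data-parallel training across nodes, assuming
# an MPI launch such as `mpirun -n 16 python train.py`. This is only an
# illustration of why DL concurrency maps well onto many CPUs.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

weights = np.zeros(1_000_000, dtype=np.float32)   # toy model parameters
lr = 0.01

def local_gradient(w, node_rank):
    # Placeholder for the real forward/backward pass on this node's data shard.
    rng = np.random.default_rng(node_rank)
    return rng.standard_normal(w.shape).astype(np.float32)

for step in range(100):
    grad = local_gradient(weights, rank)
    # Sum gradients from all nodes; every node ends up with the same global gradient.
    global_grad = np.empty_like(grad)
    comm.Allreduce(grad, global_grad, op=MPI.SUM)
    weights -= lr * (global_grad / world)          # identical update on every node
```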

Another avenue for reaching exaops is through more targeted accelerators such as the upcoming Intel Nervana™ Neural Network Processor (NNP) DL accelerator, GPUs, FPGAs, and dedicated solutions like the TPU. Targeted solutions show strong results per individual ASIC or in smaller clusters. However, when looking at an infrastructure of many thousands of compute cores managed effectively across very large memory and throughput, most deployed systems will be CPU-based, due to their high flexibility across applications, their large developer pool, and their prevalence in HPC centers.

I would also suggest that DL is not just about doing the same computations with higher efficiency; for many tasks, it creates an abstracted approximation that is substantially more efficient than the explicit, formula-based mathematical approach. It performs many of the same tasks as those implemented with formula-heavy computations, but achieves them through a DL mapping with comparable or even superior results.

It is rather straightforward to demonstrate this with tasks that the human brain does very well – tasks that proved very difficult to solve with traditional compute prior to DL. Visual recognition, speech recognition and natural language processing are immediate examples of tasks that were ineffective and inefficient before the introduction of machine learning in general and DL in particular. The difference between mathematical modelling and neural networks can also be illustrated by the way a human hand catches a ball thrown in an arc. A 10-year-old child will catch the ball using a tight brain-muscle loop that visually approximates the ball's travel in her neural networks, without solving a single differential equation. A robotic arm today is guided by complex mathematical computations to achieve a similar, or more likely inferior, success rate at catching balls.
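
As a toy illustration of this learned-approximation idea – a contrived example, unrelated to any actual robotics system – a small network can absorb the projectile physics from examples and then answer without solving the equations at run time:

```python
# Toy illustration of 'abstracted approximation': learn where a ball lands from
# examples instead of evaluating the projectile formula for each query.
import numpy as np
from sklearn.neural_network import MLPRegressor

g = 9.81
rng = np.random.default_rng(0)
v = rng.uniform(5, 25, 5000)                 # launch speed (m/s)
theta = rng.uniform(0.2, 1.3, 5000)          # launch angle (rad)
distance = v**2 * np.sin(2 * theta) / g      # closed-form landing distance

X = np.column_stack([v, theta])
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X, distance)                        # the network absorbs the physics

test = np.array([[15.0, 0.7]])
print("learned:", model.predict(test)[0])
print("formula:", (15.0**2 * np.sin(1.4)) / g)
```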

There are some initial proof points that this 'abstracted approximation' performed by DL can be surprisingly effective at replacing or augmenting very detailed mathematical models. A case in point is work done at the University of Florida in collaboration with researchers from the University of North Carolina at Chapel Hill. They created a DL model to approximate the results of highly accurate, compute-intensive Kohn-Sham density functional theory (DFT) calculations to predict the behavior of organic molecules. DFT was run on a database of 20 million conformations, and a DL system was trained on its results. The resulting DL model was shown to be chemically accurate relative to reference DFT calculations on much larger molecular systems, with extremely low root-mean-square errors. Shifting from the exact DFT solver to the highly correlated results of the DL model yielded an exceptional speedup of more than 10^5, while reducing power consumption to roughly 6x10^-4 of the original.
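
The general surrogate-model pattern behind this result can be sketched as follows; `expensive_solver`, the descriptors and the timings below are stand-ins for illustration, not the actual DFT workflow from the study.

```python
# Sketch of the surrogate-model pattern: replace an expensive solver with a
# network trained on its outputs. `expensive_solver` is a stand-in for a DFT
# calculation; the descriptors and timings are purely illustrative.
import time
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_solver(x):
    time.sleep(0.01)                     # pretend this is a costly DFT run
    return np.sum(np.sin(x)) + 0.1 * np.sum(x**2)

rng = np.random.default_rng(1)
X_train = rng.uniform(-2, 2, size=(500, 8))        # toy molecular descriptors
y_train = np.array([expensive_solver(x) for x in X_train])

surrogate = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=3000, random_state=1)
surrogate.fit(X_train, y_train)                     # learn the solver's behavior

X_new = rng.uniform(-2, 2, size=(1000, 8))
t0 = time.perf_counter(); _ = surrogate.predict(X_new); t_dl = time.perf_counter() - t0
t0 = time.perf_counter(); _ = [expensive_solver(x) for x in X_new]; t_ref = time.perf_counter() - t0
print(f"surrogate is ~{t_ref / t_dl:.0f}x faster on this toy problem")
```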

The Laser Interferometer Gravitational-Wave Observatory (LIGO), which continuously searches the skies for signs of gravitational waves, is another good example. Its broad-coverage detectors require multiple days of processing to filter out the noise and identify a likely direction the wave came from. A timely identification would allow astronomers to point highly focused telescopes in the right direction for follow-up observation of gravitational-wave events such as binary black hole mergers. Eliu Huerta's team at NCSA developed a deep neural network model to enable what is called real-time multimessenger detection, achieving results similar to the detailed numerical simulations with several orders of magnitude speedup, allowing real-time processing of raw big data with minimal resources.
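
A minimal sketch of the idea (not the NCSA model): a small 1D convolutional network classifies fixed-length strain segments, so each segment costs a single cheap forward pass rather than an expensive template or simulation comparison. The architecture and segment length here are arbitrary choices for illustration.

```python
# Minimal sketch of real-time signal detection over strain-like time series.
# Not the NCSA architecture; sizes are arbitrary illustrative choices.
import torch
import torch.nn as nn

class WaveDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=16, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classify = nn.Linear(32, 2)           # logits: [noise only, signal present]

    def forward(self, x):                          # x: (batch, 1, segment_len)
        h = self.features(x).squeeze(-1)
        return self.classify(h)

detector = WaveDetector()
segments = torch.randn(8, 1, 4096)                 # 8 toy strain segments
logits = detector(segments)
print(logits.shape)                                # torch.Size([8, 2])
```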

These initial results – orders-of-magnitude gains in efficiency from DL model approximation – are an indication of a major shift toward deploying DL solutions alongside more traditional HPC approaches.

To summarize, I expect deep learning exaops to ride the HPC journey to exascale, and even to reach that milestone somewhat ahead of LINPACK exascale thanks to its dense parallelism and the lower precision required for inference. DL exaops scale will be achieved in some instances with targeted acceleration, but mostly by large-scale CPU systems built on flexible, scaled infrastructure. DL exaops will make a marked impact on the scale of problems being addressed, as well as on the latency of results, contributing significantly to the advancement of science and discovery and to solving the next big human problems that require vast amounts of data and computation. It will do so in no small part by introducing an 'abstracted approximation' that delivers comparable outcomes with orders of magnitude less computation. This substantial change is expected to materialize as multiple DL exaops systems come online around the world by 2020.

Thanks for your interest,

Gadi
