Upping the game – the imminent arrival of Deep Learning Exaops
#MachineLearning #AI #IamIntel #DeepLearning #Exaops #Exascale #HPC

I expect to see the first deep learning (DL) systems capable of exaops (10^18 operations per second) within the coming year, and, more importantly, broader availability of exaops systems by 2020. That capability will bring enormous capacity to deep learning systems, enabling solutions to problems from astronomy to microbiology at a scale that cannot be tackled today, as well as accelerating other solutions to allow real-time handling of highly complex tasks.

Currently, we measure most deep learning solutions in teraops (TOPs), that is, 10^12 operations per second. The higher-end HPC systems being deployed today perform in the petaops (10^15) range. For example, in 2017 NERSC's Cray XC40 supercomputer, Cori, sustained 15 petaflops on a climate pattern discovery deep learning workload. The expected ~50x gain in scale over the coming few years will allow for full exaops DL solutions.
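
As a rough sense of the magnitudes involved, here is a quick back-of-the-envelope check; the "current system" size below is an illustrative assumption, not a measured figure.

```python
# Back-of-the-envelope scale comparison; the current-system size is an
# illustrative assumption, not a measurement of any deployed machine.
TERA, PETA, EXA = 1e12, 1e15, 1e18

current_dl_system = 20 * PETA    # assume a ~20 petaops DL-capable system today
target = 1 * EXA                 # one exaops

print(f"gain needed: ~{target / current_dl_system:.0f}x")   # -> ~50x
```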

The primary force behind this is the drive of HPC systems to achieve exascale. This has been a long journey, with periods of high expectations and others with setbacks related to power efficiency, as well as to parallelization and effective distribution of tasks so that tens of thousands of CPUs can be used concurrently. There is an intense global race to reach exascale, with the US, China, Japan and the EU all making significant investments. Projections from leading programs such as the Exascale Computing Project indicate that it might be achieved in HPC systems as soon as 2021, and no later than 2023.

It is safe to say that when HPC hits exascale, DL running on HPC will be capable of DL exaops. Exascale is measured in LINPACK operations: 64-bit floating-point operations that are highly concurrent but still not as computationally dense as DL ops.

Moreover, as DL software continues to improve in how it utilizes the underlying hardware, it is very likely that DL exaops will be achievable on systems still at hundreds of petaflops, before they hit full exascale on more traditional HPC workloads. The reason is two-fold. First, the tensor operations used in DL are denser than LINPACK operations, and that increased data-level parallelism translates directly to higher compute density: DL contains a massive amount of arithmetic that can be executed in parallel and back-to-back, without sequential code in between. Second, the operations measured in exaops (multiply and add/accumulate, each adding to the count) are very likely at lower precision in DL. In particular, deep learning inference tasks map effectively to 8-bit integer operations. The power requirement of an 8-bit integer multiplication, for example, is an order of magnitude lower than that of a 32-bit operation – think of the ratio of 8x8 to roughly 32x32.
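
To make the argument concrete, here is a rough, hypothetical estimate of how lower precision moves the same cluster from hundreds of petaops to exaops. The node count, per-node throughput, and int8-to-fp32 ratio are assumptions for illustration, not figures for any real system.

```python
# Rough estimate of aggregate DL ops/s for a large cluster. All figures are
# hypothetical assumptions, purely to illustrate the precision argument.
NODES = 10_000                       # hypothetical node count
FP32_PER_NODE = 2e13                 # assumed ~20 TFLOPS fp32 per node
INT8_PER_NODE = 8 * FP32_PER_NODE    # assume ~8x higher throughput at 8-bit integer

def cluster_ops(nodes, ops_per_node_per_s):
    """Aggregate operations per second across the whole cluster."""
    return nodes * ops_per_node_per_s

fp32_total = cluster_ops(NODES, FP32_PER_NODE)   # 2e17 -> ~0.2 exaops
int8_total = cluster_ops(NODES, INT8_PER_NODE)   # 1.6e18 -> above one exaops

print(f"fp32 aggregate: {fp32_total:.1e} ops/s")
print(f"int8 aggregate: {int8_total:.1e} ops/s")
# The same hardware that is 'only' hundreds of petaflops at 32-bit can cross
# the exaops line when the workload maps to dense low-precision tensor ops.
```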

Software is a crucial differentiator. Most top-scale DL solutions today are, and will continue to be, based on CPU-centric HPC systems – in particular Intel Xeon processors. The software running DL on large HPC systems has improved substantially over the last couple of years, and there is still high potential for improving the utilization of large CPU-count systems for deep learning. The execution of DL tasks on a single CPU was shown to improve by more than 100x over 2016/17. It was also shown that single-node DL execution scales very well across multiple nodes, thanks to the inherent concurrency of DL tasks and the scalability of NUMA-based Intel Xeon systems. Add to that the hardware enhancements in successive CPU generations – native vector and matrix operations – and you can expect significant progress in DL scale and efficiency on Intel Xeon processor-based HPC systems.
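
As a minimal sketch of the kind of multi-node scaling described above – assuming an MPI-style launch and using mpi4py and NumPy as stand-ins, not Intel's actual software stack – synchronous data-parallel training sums gradients across nodes at every step:

```python
# Minimal sketch of synchronous data-parallel training across nodes, assuming
# an MPI launch such as `mpirun -n 16 python train.py`. This is only an
# illustration of why DL concurrency maps well onto many CPUs.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

weights = np.zeros(1_000_000, dtype=np.float32)   # toy model parameters
lr = 0.01

def local_gradient(w, node_rank):
    # Placeholder for the real forward/backward pass on this node's data shard.
    rng = np.random.default_rng(node_rank)
    return rng.standard_normal(w.shape).astype(np.float32)

for step in range(100):
    grad = local_gradient(weights, rank)
    # Sum gradients from all nodes; every node ends up with the same global gradient.
    global_grad = np.empty_like(grad)
    comm.Allreduce(grad, global_grad, op=MPI.SUM)
    weights -= lr * (global_grad / world)          # identical update on every node
```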

Another avenue for reaching exaops is through more targeted accelerators such as the upcoming Intel Nervana™ Neural Network Processor (NNP) DL accelerator, GPUs, FPGAs, and dedicated solutions like the TPU. Targeted solutions show strong results per individual ASIC or in smaller clusters. However, when looking at an infrastructure of many thousands of compute cores managed effectively across very large memory and throughput, most deployed systems will be CPU-based, due to their high flexibility across applications, their large developer pool, and their prevalence in HPC centers.

I would also suggest that DL is not just about doing the same computations with higher efficiency; for many tasks, it creates an abstracted approximation that is substantially more efficient than the explicit, formula-based mathematical approach. It performs many of the same tasks as those implemented with formula-heavy computations, but achieves them through a DL mapping with comparable or even superior results.

It is rather straightforward to demonstrate this with tasks that the human brain does very well – tasks that proved very difficult to solve with traditional compute prior to DL. Visual recognition, speech recognition and natural language processing are immediate examples of tasks that were ineffective and inefficient before the introduction of machine learning in general and DL in particular. The difference between mathematical modelling and neural networks can also be illustrated by the way a human hand catches a ball thrown in an arc. A 10-year-old child will catch the ball using a tight brain-muscle loop that visually approximates the ball's travel in her neural networks, without solving a single differential equation. A robotic arm today is guided by complex mathematical computations to achieve a similar, or more likely inferior, success rate at catching balls.
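
As a toy illustration of this learned-approximation idea – a contrived example, unrelated to any actual robotics system – a small network can absorb the projectile physics from examples and then answer without solving the equations at run time:

```python
# Toy illustration of 'abstracted approximation': learn where a ball lands from
# examples instead of evaluating the projectile formula for each query.
import numpy as np
from sklearn.neural_network import MLPRegressor

g = 9.81
rng = np.random.default_rng(0)
v = rng.uniform(5, 25, 5000)                 # launch speed (m/s)
theta = rng.uniform(0.2, 1.3, 5000)          # launch angle (rad)
distance = v**2 * np.sin(2 * theta) / g      # closed-form landing distance

X = np.column_stack([v, theta])
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X, distance)                        # the network absorbs the physics

test = np.array([[15.0, 0.7]])
print("learned:", model.predict(test)[0])
print("formula:", (15.0**2 * np.sin(1.4)) / g)
```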

There are some initial proof points that this 'abstracted approximation' performed by DL can be surprisingly effective at replacing or augmenting very detailed mathematical models. A case in point is work done at the University of Florida in collaboration with researchers from the University of North Carolina at Chapel Hill. They created a DL model to approximate the results of highly accurate, compute-intensive Kohn-Sham density functional theory (DFT) calculations to predict the behavior of organic molecules. DFT was run on a database of 20 million conformations, and a DL system was trained on its results. The resulting DL model was shown to be chemically accurate relative to reference DFT calculations on much larger molecular systems, with extremely low root-mean-square errors. Shifting from the exact DFT solver to the highly correlated results of the DL model yielded an exceptional speedup of more than 10^5, while reducing power consumption to roughly 6x10^-4 of the original.
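
The general surrogate-model pattern behind this result can be sketched as follows; `expensive_solver`, the descriptors and the timings below are stand-ins for illustration, not the actual DFT workflow from the study.

```python
# Sketch of the surrogate-model pattern: replace an expensive solver with a
# network trained on its outputs. `expensive_solver` is a stand-in for a DFT
# calculation; the descriptors and timings are purely illustrative.
import time
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_solver(x):
    time.sleep(0.01)                     # pretend this is a costly DFT run
    return np.sum(np.sin(x)) + 0.1 * np.sum(x**2)

rng = np.random.default_rng(1)
X_train = rng.uniform(-2, 2, size=(500, 8))        # toy molecular descriptors
y_train = np.array([expensive_solver(x) for x in X_train])

surrogate = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=3000, random_state=1)
surrogate.fit(X_train, y_train)                     # learn the solver's behavior

X_new = rng.uniform(-2, 2, size=(1000, 8))
t0 = time.perf_counter(); _ = surrogate.predict(X_new); t_dl = time.perf_counter() - t0
t0 = time.perf_counter(); _ = [expensive_solver(x) for x in X_new]; t_ref = time.perf_counter() - t0
print(f"surrogate is ~{t_ref / t_dl:.0f}x faster on this toy problem")
```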

The Laser Interferometer Gravitational-Wave Observatory (LIGO), which continuously searches the skies for signs of gravitational waves, is another good example. Its broad-coverage detectors require multiple days of processing to filter out the noise and identify a likely direction the wave came from. A timely identification would allow astronomers to point highly focused telescopes in the right direction for follow-up observation of gravitational-wave events such as binary black hole mergers. Eliu Huerta's team at NCSA developed a deep neural network model to enable what is called real-time multimessenger detection, achieving results similar to the detailed numerical simulations with several orders of magnitude speedup, allowing real-time processing of raw big data with minimal resources.
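
A minimal sketch of the idea (not the NCSA model): a small 1D convolutional network classifies fixed-length strain segments, so each segment costs a single cheap forward pass rather than an expensive template or simulation comparison. The architecture and segment length here are arbitrary choices for illustration.

```python
# Minimal sketch of real-time signal detection over strain-like time series.
# Not the NCSA architecture; sizes are arbitrary illustrative choices.
import torch
import torch.nn as nn

class WaveDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=16, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classify = nn.Linear(32, 2)           # logits: [noise only, signal present]

    def forward(self, x):                          # x: (batch, 1, segment_len)
        h = self.features(x).squeeze(-1)
        return self.classify(h)

detector = WaveDetector()
segments = torch.randn(8, 1, 4096)                 # 8 toy strain segments
logits = detector(segments)
print(logits.shape)                                # torch.Size([8, 2])
```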

These initial results – orders-of-magnitude gains in efficiency from DL model approximation – are an indication of a major shift toward deploying DL solutions alongside more traditional HPC approaches.

To summarize, I expect deep learning exaops to ride the HPC journey to exascale, and even to reach that milestone somewhat ahead of LINPACK exascale thanks to its dense parallelism and the lower precision required for inference. DL exaops scale will be achieved in some instances with targeted acceleration, but mostly by large-scale CPU systems built on flexible, scaled infrastructure. DL exaops will make a marked impact on the scale of problems being addressed, as well as on the latency of results, contributing significantly to the advancement of science and discovery and to solving the next big human problems that require vast amounts of data and computation. It will do so in no small part by introducing an 'abstracted approximation' that delivers comparable outcomes with orders of magnitude less computation. This substantial change is expected to materialize as multiple DL exaops systems come online around the world by 2020.

Thanks for your interest,

Gadi
