Deep Learning Can’t Progress With IEEE-754 Floating Point. Here’s Why Google, Microsoft, And Intel Are Leaving It Behind
Theodore Omtzigt
Accelerating innovation: solving problems with high-performance compute
Today, terms such as Artificial Intelligence and Machine Learning get thrown around as casually as the ubiquitous umbrella term “the Internet.” AI technologies have been around for decades and are now integrated into all aspects of digital life in 2019. Artificial Neural Networks (ANNs), for instance, have been with us for roughly 60 years.
But while ANNs are not new, a renaissance is taking place today. The acceleration began roughly a decade ago, when researchers realized that it was not the quality of the model but the size of the training set that made ANNs better. Companies such as Google and Microsoft, whose business models depend on categorizing vast amounts of information, were in a prime position to monetize AI and to fund the R&D required.
Here’s how it works:
Deep learning is a process that takes billions of labeled examples and constructs a classification system that can be more accurate than a human being.
The problem is that the data sets are enormous, and the convergence of the algorithm is very slow. It is not uncommon for state-of-the-art deep learning classification systems to take several weeks to train on a large cluster of machines. And that’s clearly not scalable when the business lines need continuously improving services.
And the most expensive step in building a deep learning system is the training phase.
Conceptually, this phase is a big optimization problem: the training process learns a function that takes input samples, such as an image, and produces a category, such as a cat. Training continuously adjusts the weights of millions of “synapses” that control how much each input contributes to the final categorization, and that process requires a floating-point representation with enough dynamic range at a given precision to find good solutions. If the number system doesn’t have enough of either, training can fail to find any solution, or settle on an unacceptably bad one.
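To make that weight-adjustment loop concrete, here is a minimal C++ sketch (my illustration, not code from any production framework) that fits a single weight by gradient descent so that w times the input approximates the labels. A real network repeats exactly this nudge over millions of weights and billions of examples.

#include <cstddef>
#include <iostream>

int main() {
    const double x[] = {1.0, 2.0, 3.0, 4.0};     // input samples
    const double y[] = {2.1, 3.9, 6.2, 7.8};     // labels, roughly 2 * x
    double w = 0.0;                              // the "synapse" weight being learned
    const double learning_rate = 0.01;

    for (int epoch = 0; epoch < 200; ++epoch) {
        double grad = 0.0;
        for (std::size_t i = 0; i < 4; ++i) {
            double error = w * x[i] - y[i];      // prediction error on one sample
            grad += 2.0 * error * x[i];          // derivative of the squared error w.r.t. w
        }
        w -= learning_rate * grad / 4.0;         // nudge the weight against the gradient
    }
    std::cout << "learned weight: " << w << '\n';  // converges near 2.0
}

The numerical demands the paragraph describes show up in the accumulations above: the gradients and updates span widely different magnitudes, which is exactly where dynamic range and precision matter.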
The de facto standard for floating point is IEEE-754. It’s available in all processors sold by Intel, AMD, IBM, and NVIDIA. But as the deep learning renaissance blossomed, researchers quickly realized that IEEE-754 would be a major constraint limiting the progress they could make. IEEE floating point was designed 30 years ago, when processing was expensive and memory access was cheap. The current technology stack is reversed: memory access is expensive, and processing is cheap.
And deep learning is memory bound.
For commercial companies competing in the marketplace of services enabled or augmented by AI, delivering these services at scale requires efficiency.
That’s simply not something IEEE-754 can provide. Google developed the first version of its deep learning accelerator in 2014, and it delivered two orders of magnitude more performance than the NVIDIA processors used before it, simply by abandoning IEEE-754. Subsequent versions have incorporated a new floating-point format optimized for deep learning, called bfloat16, to extend that lead.
Now, even Intel is abandoning IEEE-754 floating point for deep learning. Its Cooper Lake Xeon processor, for example, offers Google’s bfloat16 format for deep learning acceleration. Thus, it comes as no surprise that competitors in the AI race are all following suit and replacing IEEE-754 floating point with their own custom number systems. And researchers are demonstrating that other number systems, such as posits and Facebook’s DeepFloat, can even improve on Google’s bfloat16.
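For readers who want the bit-level picture: bfloat16 keeps the sign bit and the full 8-bit exponent of an IEEE-754 single but only 7 fraction bits, so it preserves dynamic range while giving up precision. The short C++ sketch below (my simplification, which truncates where real converters round) makes the trade-off visible.

#include <cstdint>
#include <cstring>
#include <iostream>

// Keep only the top 16 bits of an IEEE-754 single: sign, 8-bit exponent, 7 fraction bits.
float to_bfloat16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

int main() {
    std::cout << to_bfloat16(3.0e38f) << '\n';   // huge values survive: same exponent range as fp32
    std::cout << to_bfloat16(1.001f)  << '\n';   // collapses to 1: only 7 fraction bits near 1.0
    std::cout << to_bfloat16(1.0e-4f) << '\n';   // small values keep their scale but lose fine detail
}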
Cloud vendors are searching for more efficient solutions than IEEE-754 floating point.
Large-scale cloud-native applications have put a spotlight on the inefficiencies of the IEEE floating-point format. Amazon, Google, and Microsoft have massive and incredibly successful cloud service businesses that rely on endless optimization in order to continuously improve their highly competitive services. To gain better performance over their competitors, these cloud vendors are abandoning IEEE floating point for something more efficient and effective per watt in key application categories, such as deep learning, media processing, security, big data, and analytics.
But these cloud vendors are not the only players letting go of the old IEEE floating-point format in search of a competitive advantage.
Telecommunications giant Huawei has been using custom number systems in its base station silicon for more than a decade. Motivated by improving the performance per watt of its processors, Huawei developed custom arithmetic that gave it an edge in the market and helped make it the largest telecommunications business in the world.
The shift to 5G, with significantly higher compute requirements than previous generations, and new AI-based optimization techniques to improve cellular communications—such as Cognitive Radio—will accelerate the adoption of better, more efficient representations of real numbers.
Similarly, IoT, autonomous vehicle, and smart city applications that are sensor-rich and need advanced algorithms to deliver real-time collective intelligence on very limited embedded power budgets are sure to transition away from IEEE floating point as well. That’s because, when performance and power efficiency are differentiating attributes, the complexity of IEEE floating point simply can’t compete with number systems tailored to the application’s specific needs. Unfortunately, Google’s bfloat16 does not work for computational science and engineering applications, and, in general, applications that combine AI with high-performance models and analytics need a more capable number system than bfloat16.
Now, it’s supercomputing to the rescue.
The supercomputing community has historically been more motivated by accuracy and absolute performance than efficiency—but they have run into efficiency limits as well.
This community, accustomed to building million-core computers, encounters the weaknesses of a technology first. Many of the largest parallel programs are memory bound, and the only way to improve their performance is to increase computational efficiency. This group of computer designers has been exploring more efficient number systems for as long as it has been designing supercomputers.
One number system that has been proposed by the community to replace IEEE floating point is called posits. Posits are a tapered floating-point format designed to provide a more robust computational arithmetic for the real numbers. Posits are, at this point, the only number system that is equally at home in deep learning as it is in computational science and business intelligence.
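“Tapered” means the bit fields are not fixed: a run-length-encoded regime field spends bits on scale only when the magnitude demands it, leaving more fraction bits for values near 1. The C++ decoder below, for an 8-bit posit with one exponent bit, is an illustrative sketch of those mechanics, not production code.

#include <cstdint>
#include <cmath>
#include <iostream>

// Decode an 8-bit posit with es = 1 into a double.
double decode_posit8_es1(std::uint8_t p) {
    if (p == 0x00) return 0.0;                       // unique zero
    if (p == 0x80) return std::nan("");              // NaR, "not a real"

    bool negative = (p & 0x80) != 0;
    std::uint8_t bits = negative ? static_cast<std::uint8_t>(-p) : p;  // 2's complement for negatives

    // Regime: a run of identical bits after the sign bit, terminated by the opposite bit.
    bool run_bit = (bits & 0x40) != 0;
    int run = 0, pos = 6;
    while (pos >= 0 && (((bits >> pos) & 1) != 0) == run_bit) { ++run; --pos; }
    --pos;                                           // skip the terminating bit
    int k = run_bit ? run - 1 : -run;                // regime value

    // Exponent: up to es = 1 bit, if any bits remain.
    int exponent = 0;
    if (pos >= 0) { exponent = (bits >> pos) & 1; --pos; }

    // Fraction: whatever bits remain, with a hidden leading 1.
    double fraction = 1.0;
    int frac_bits = pos + 1;
    if (frac_bits > 0)
        fraction += static_cast<double>(bits & ((1u << frac_bits) - 1)) / (1u << frac_bits);

    const double useed = 4.0;                        // useed = 2^(2^es), here 2^2
    double value = std::pow(useed, k) * std::pow(2.0, exponent) * fraction;
    return negative ? -value : value;
}

int main() {
    std::cout << decode_posit8_es1(0x40) << '\n';    // 1: a short regime leaves 4 fraction bits
    std::cout << decode_posit8_es1(0x50) << '\n';    // 2
    std::cout << decode_posit8_es1(0x7F) << '\n';    // 4096, the largest positive 8-bit posit with es = 1
}

Values near 1 get the most fraction bits, while very large and very small values trade fraction bits for a longer regime. That taper is what lets a small posit cover both the precision and the dynamic range that deep learning and scientific computing need.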
For example:
Researchers at the Rochester Institute of Technology and the National University of Singapore have demonstrated that small posits deliver better learning results than Google’s bfloat16. Meanwhile, at Lawrence Livermore National Laboratory, researchers have been working to quantify the benefits of different number systems in computational fluid dynamics applications. They’ve found that IEEE floating point and related types do quite poorly in relation to posits and other tapered-precision numerical types, most evidently in tough computational problems, where 64-bit posits outperform 64-bit IEEE double precision by nearly three orders of magnitude.
That’s why organizations such as the Atmospheric, Oceanic and Planetary Physics group at Oxford University and the European Centre for Medium-Range Weather Forecasts are applying posits to climate modeling and weather forecasting. And the pièce de résistance: next-generation computational engineering approaches that combine AI and scientific computing to automatically design and optimize complex structures for maximum strength or minimum weight, known in the literature as Isogeometric Analysis (IGA), have demonstrated that the mathematics doesn’t even work with IEEE floating point, but requires new approaches, such as posits, to take advantage of the new methodology.
Here at Stillwater, since we see tremendous value in empowering developers with a ready-to-use arithmetic library to incorporate this new number system into applications, we’ve created an open-source library available on GitHub. We’re currently working with research groups throughout the world to enable their applications with posits, including the University of Washington’s TVM/VTA project, where an end-to-end deep learning stack is being developed that was recently promoted to an Apache Incubator project. And at Delft University of Technology, an Isogeometric Analysis package called G+SMO has been augmented with posits to demonstrate the benefits of very high-order elements in improving computational efficiency. The goal of G+SMO is to realize the seamless integration of Finite Element Analysis (FEA) and Computer-Aided Design (CAD) with open-source code from and for the isogeometric analysis community.
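As a taste of what drop-in use looks like, here is a minimal sketch; the header path and the sw::universal namespace assume a recent release of our library on GitHub, and older releases used different paths.

#include <universal/number/posit/posit.hpp>
#include <cmath>
#include <iostream>

int main() {
    using Posit = sw::universal::posit<32, 2>;   // 32-bit posit with 2 exponent bits

    double tiny = std::ldexp(1.0, -26);          // 2^-26, smaller than half a float ulp at 1.0

    float f = 1.0f + static_cast<float>(tiny);   // IEEE-754 single: rounds back to exactly 1.0
    Posit p = Posit(1.0) + Posit(tiny);          // posit<32,2>: extra fraction bits near 1.0 keep it

    std::cout << "float kept the update: " << (f != 1.0f) << '\n';        // prints 0
    std::cout << "posit kept the update: " << (p != Posit(1.0)) << '\n';  // prints 1
}

The point of the sketch is that the type swap is local: the arithmetic operators carry the rest of the application along.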
The shift towards cloud computing has opened a broad set of opportunities to innovate.
In the cloud, efficiency is a business differentiator. The cloud makes it possible to deliver AI services at scale, and deep learning has shown the benefits of leaving the old behind and venturing into new areas of innovation. Internet of Things and 5G services have the same economics as the cloud, so these applications are expected to follow suit and adopt new computational approaches.
Central to these innovations will be the number system, and posits are raking in the accolades.