The future of AI compute servers? The answer is blowin' in the wind.
Eric Schnatterly
Global Vice President - helping clients and teams optimize multi-cloud, data protection, data management, and AI investments
Commodity compute server buyers, beware the winds of change!
Read about server trends today and you will find that public cloud is on the rise, and that hyperscalers and manufacturers of "white box" commodity servers - like Supermicro, Huawei, and Inspur - seem to be gaining traction.
The uptick in white box commodity servers follows the trend towards "good enough". It has long been acknowledged that x86-based commodity servers lack the scale, performance, reliability, and security of mainframes and RISC-based systems, but for most workloads the belief has been that they are "good enough".
Cue the winds of change
A new trend is upon us. Researchers are now pointing out that servers used by hyperscalers, and x86-based servers in general, may not have the compute architecture needed to support AI workloads. This is especially the case for deep neural networks and workloads such as image processing and natural language processing, which require high-bandwidth interconnects between CPU and GPU memory as terabytes of data move through training and inferencing of AI models.
Gartner’s ‘Market Guide for Compute Platforms 2018’ specifically recommends that IT leaders build on-premises infrastructures that support business-critical artificial intelligence and in-memory applications, by including servers that can exploit technologies such as accelerators and persistent memory. Another Gartner research document, ‘Market Guide for Machine Learning Infrastructure 2018’, states that due to factors including total cost of ownership (TCO), data gravity, ease of use, and a shortage of data scientists, a majority of organizations leverage on-premises ecosystems for building machine learning models.
Read my blog, ‘Looking under the hood of your AI infrastructure’, for more details. It describes how x86 servers and the IBM Power AC922 were used for model training and evaluation, and highlights the significant difference in performance between them.
Jim McGregor, Principal Analyst at TIRIAS Research, stated in a Forbes article entitled "The Winds Are Changing In Servers -- AI Leads To Opportunity for IBM & Power" that "IBM appears the best positioned to benefit from the tremendous interest in AI." With various stumbles by Intel, there is "...a renewed opportunity for IBM with its Power architecture."
The article states that IBM Power is the only processor architecture to integrate NVIDIA’s NVLink interface directly into the processor itself, which significantly improves performance. As Jim McGregor, Gartner, and other researchers have emphasized, GPUs and FPGAs provide the raw compute power required for AI workloads, but their effectiveness is severely limited unless there are high-bandwidth interconnects between these powerful processors and memory.
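To see why the interconnect matters so much, a back-of-the-envelope calculation helps. The bandwidth figures below are approximate public numbers used for illustration, not measurements: PCIe 3.0 x16 peaks near 16 GB/s, while NVLink 2.0 on POWER9 offers roughly 150 GB/s per GPU.

```python
# Back-of-the-envelope comparison of CPU-to-GPU transfer times.
# Bandwidths are rough public figures (assumptions for illustration):
# PCIe 3.0 x16 ~= 16 GB/s; NVLink 2.0 on POWER9 ~= 150 GB/s per GPU.

def transfer_seconds(data_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized time to move `data_gb` gigabytes at `bandwidth_gb_s` GB/s."""
    return data_gb / bandwidth_gb_s

data_gb = 1000.0  # one terabyte of training data staged to GPU memory

pcie = transfer_seconds(data_gb, 16.0)     # ~62.5 seconds
nvlink = transfer_seconds(data_gb, 150.0)  # ~6.7 seconds

print(f"PCIe 3.0 x16: {pcie:.1f} s, NVLink 2.0: {nvlink:.1f} s "
      f"({pcie / nvlink:.1f}x faster)")
```

Since training repeatedly streams data between CPU and GPU memory, that roughly ninefold gap compounds over every epoch.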
It goes beyond just hardware
Among those who study machine learning and infrastructure, it is commonly understood that while popular deep learning frameworks, including TensorFlow, Caffe, Torch, and Chainer, can efficiently leverage multiple GPUs in a single system, scaling to multiple clustered servers with GPUs is difficult at best.
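The core reason scaling out is hard is communication: in data-parallel training, each worker computes gradients on its own data shard, and then all workers must average them (an "allreduce") every iteration, moving the full gradient across the interconnect each time. The sketch below is a toy, pure-Python illustration of that averaging step, not any vendor's actual library.

```python
# Toy data-parallel gradient averaging ("allreduce" in spirit).
# Real clusters perform this exchange over the network every iteration,
# which is why interconnect bandwidth dominates multi-node scaling.

from typing import List

def allreduce_mean(per_worker_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across all workers."""
    n_workers = len(per_worker_grads)
    return [sum(vals) / n_workers for vals in zip(*per_worker_grads)]

# Four "workers", each holding gradients from its own mini-batch shard:
grads = [
    [0.2, -0.4, 0.1],
    [0.4, -0.2, 0.3],
    [0.0, -0.6, 0.1],
    [0.2, -0.4, 0.1],
]

avg = allreduce_mean(grads)
print([round(g, 4) for g in avg])  # → [0.2, -0.4, 0.15]
```

In a real model this vector has hundreds of millions of elements, and the exchange happens thousands of times per epoch.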
Let’s take an example. A powerful convolutional neural network, called ResNet-101, was trained on IBM Power servers for image classification on a dataset of 7.5 million images. Not only was the IBM-trained model more accurate, but training took just 7 hours, versus the 10 days published by Microsoft for the same task on x86 servers. Much of the efficiency came from software called Distributed Deep Learning (DDL), which scaled training across 256 GPUs with 95% efficiency!
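That "95% efficiency on 256 GPUs" claim is worth unpacking with simple arithmetic: it means the cluster delivers nearly the full 256x speedup that a perfectly linear system would, whereas poor scaling quickly erodes the payoff of adding GPUs.

```python
# Effective speedup of a GPU cluster at a given scaling efficiency.
# Uses only the figures quoted above (256 GPUs, 95% efficiency); the
# 50% case is a hypothetical contrast, not a published number.

def effective_speedup(n_gpus: int, efficiency: float) -> float:
    """Speedup over a single GPU at the given scaling efficiency."""
    return n_gpus * efficiency

print(effective_speedup(256, 0.95))  # 243.2x of a possible 256x
print(effective_speedup(256, 0.50))  # 128.0x - weak scaling halves the payoff
```

At 95% efficiency you effectively get about 243 GPUs' worth of work out of 256; at 50%, half the hardware budget is spent waiting on communication.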
The Forbes article mentions not just DDL, but Machine Learning libraries like Snap ML, which further optimize the training of neural networks. Some of these libraries were developed specifically for image and video recognition tasks, key components in the AI field of computer vision.
And speaking of software:
IBM PowerAI Enterprise: IBM’s enterprise software distribution, which combines popular open source deep learning frameworks, efficient AI development tools, and accelerated IBM Power Systems. The popular deep learning frameworks mentioned above deploy swiftly from pre-built binaries, instead of data scientists spending time and effort downloading them and their dependencies (no easy task).
IBM PowerAI Vision: IBM’s computer vision technology which can be used to train object detection and image classification models, as well as perform video analytics, without deep learning or coding expertise. The software is so simple to use that we actually gave it to school children to learn how to train models!
A final note
Did you know that Google deployed the IBM POWER9 chip (the ‘Zaius’ platform) in its datacenter for production workloads? Among the reasons given were more cores and threads for Google Search and more memory bandwidth for RNN machine learning execution. And of course, you will have heard about Summit, the world’s most powerful supercomputer, which is also based on the IBM POWER9 AC922.
As businesses race to drive digital transformation and improved client experience with technologies like AI and big data analytics, commodity servers will not be sufficient. The Google story should be ample proof: they chose Power servers for their hyperscale datacenters rather than x86-based systems.
“Cheap and good” is not good enough anymore. Beware, commodity compute makers: the winds of change are here!
Please share your experiences with accelerated compute and AI models.
Be Happy,
Eric