Looking under the hood of your AI infrastructure
Eric Schnatterly
Global Vice President - helping clients and teams optimize multi-cloud, data protection, data management, and AI investments
At a recent gathering of machine learning (ML) experts, which I had the temerity to attend, many spoke enthusiastically about the latest models they were experimenting with. Popular discussion topics included advances in object detection and image recognition, as well as the ImageNet challenge, complete with boastful predictions of winning the contest, as if they were professional boxers before a championship fight.
Yes, data science is like a sport to some. But not all the banter was about exciting new innovations or competitive challenges. The ML developers had plenty about which to gripe. Most of the sniping revolved around hassles and time wasted with data preparation and model training.
Eavesdropping on the chatter, I overheard statements like:
“Took me 4 minutes to upload just one image on that cloud-based computer vision solution… 4 minutes for ONE image!”
“Wish they would invent some fast stuff – takes me days and even weeks sometimes to get a decent model going.”
“Create the model, train it for days, then find out I need to tweak it…. train it for days again!”
I suspect that this illustrious gathering of data scientists and ML engineers was quite familiar with the software configuration and coding tricks that speed up training and inference: tuning hyperparameters, choosing optimization algorithms that make learning converge faster, and all the other nifty tools that AI experts know best.
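If you are curious what those software-level knobs look like in practice, here is a minimal, illustrative sketch (my own, not something from the meetup) of sweeping a couple of optimizers and learning rates on a toy PyTorch model. The model, data, and values are placeholders for illustration only.

```python
# Illustrative sketch: compare optimizers and learning rates on a tiny model
# to see which combination converges fastest. Toy data, toy model.
import torch
import torch.nn as nn

def train_once(optimizer_name: str, lr: float, epochs: int = 5) -> float:
    torch.manual_seed(0)
    # Toy regression problem: y = 3x + noise
    x = torch.randn(256, 1)
    y = 3 * x + 0.1 * torch.randn(256, 1)

    model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizers = {
        "sgd": torch.optim.SGD(model.parameters(), lr=lr),
        "adam": torch.optim.Adam(model.parameters(), lr=lr),
    }
    opt = optimizers[optimizer_name]
    loss_fn = nn.MSELoss()

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Sweep a few hyperparameter combinations and report the final loss of each.
for name in ("sgd", "adam"):
    for lr in (1e-3, 1e-2, 1e-1):
        print(f"{name:>4} lr={lr:<6} final loss={train_once(name, lr):.4f}")
```

Useful, but as the rest of this article argues, these knobs only go so far when the hardware underneath is the bottleneck.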
But what about looking beyond the model?
It is no secret that the rise of deep learning in the last few years is due largely to advances in the computing power that has become available, such as GPUs and FPGAs. Vendors are obliging, delivering cutting-edge on-premises and as-a-service solutions running on powerful hardware. But consider this: IDC states in its paper ‘Hitting the wall with server infrastructure’ that an astounding 90 percent of the organizations that took part in its survey experienced infrastructure bottlenecks when they started running their AI applications in the cloud. The numbers for on-premises are slightly better but still far from reassuring: 77 percent of respondents faced limitations with their on-premises infrastructure for AI workloads!
The most common action taken, according to IDC, is moving to a system with greater processing power and accelerators. But is this enough?
Have you thought about what’s going on under the hood of these powerful servers?
Let’s take a look
You may have raw, blazing speed, with Ferrari-like GPUs in your server (or your cloud provider's infrastructure), yet still suffer bottlenecks. Why? Because compute power alone is not enough. CPU memory and GPU memory both have to play their part in delivering the needed performance, and you will need the fastest possible interconnects between CPUs and GPUs to prevent bottlenecks.
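You can peek at this yourself. The rough sketch below (assuming a CUDA-capable machine with PyTorch installed; not a rigorous benchmark) times host-to-GPU copies to estimate how much interconnect bandwidth you actually get, and shows why details like pinned host memory matter.

```python
# Rough sketch: time host-to-GPU copies to estimate interconnect bandwidth.
import torch

def copy_bandwidth_gb_s(size_mb: int = 512, pinned: bool = True, repeats: int = 10) -> float:
    assert torch.cuda.is_available(), "This sketch needs a CUDA GPU"
    n_bytes = size_mb * 1024 * 1024
    # Pinned (page-locked) host memory usually transfers faster over PCIe/NVLink.
    host = torch.empty(n_bytes, dtype=torch.uint8, pin_memory=pinned)
    device = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        device.copy_(host, non_blocking=True)   # host -> device transfer
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0  # elapsed_time returns milliseconds
    return (n_bytes * repeats) / seconds / 1e9

if __name__ == "__main__":
    print(f"pageable host memory: {copy_bandwidth_gb_s(pinned=False):.1f} GB/s")
    print(f"pinned host memory:   {copy_bandwidth_gb_s(pinned=True):.1f} GB/s")
```

Run it on a PCIe-attached GPU and on an NVLink-attached one and the gap this article talks about becomes very concrete.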
To better understand the interplay between CPUs, GPUs, memory, interconnects, and clustered servers, I recommend you have a look at this blog about a 3D image segmentation study of tumours, with a 3D U-Net model analyzing MRI images (don't worry, the article is well composed and easy to follow, even for an AI newbie). The performance analysis explained in the article uses servers equipped with accelerated computing features. Specifically, x86 servers and the IBM Power AC922 are used for model training and evaluation. The AC922 is equipped with NVIDIA Tesla V100 GPUs and POWER9 CPUs, with NVLink 2.0 interconnects between them, allowing 150 GB/s of bandwidth in each direction between CPU and GPU. For comparison, the model was also trained on an x86 server with Intel CPUs and NVIDIA Tesla V100 GPUs, using the fastest available x86 PCIe interconnects between CPUs and GPUs.
In short:
· Faster training: Using TensorFlow Large Model Support (TFLMS, a TensorFlow module that enables training of larger neural networks by swapping tensors between GPU memory and CPU memory), training times for one epoch were 2.5x longer on the x86 servers than on the POWER9 servers. The memory copies between CPU and GPU for tensor swapping (transferring large amounts of numerical data back and forth as needed) took considerably longer on the x86 machines and left the GPUs idle.
· Scaling beyond a single GPU: When two GPUs sharing the same PCIe bus were used with TFLMS, memory copy throughput dropped by 30% due to contention for the shared bus. The AC922 does not have this contention issue, since the NVLink 2.0 connections between CPU and GPU are dedicated per GPU, so GPUs do not have to compete for the available bandwidth.
· Using the IBM PowerAI Distributed Deep Learning library to scale across multiple servers: Last year, IBM already demonstrated greater efficiency and accuracy by using distributed deep learning for image recognition. Using the same technology in this study, training time per epoch dropped considerably, from 590 seconds on a single server and GPU to 40 seconds on 16 GPUs across 4 servers (a generic sketch of the distributed data-parallel idea follows this list).
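IBM's DDL library itself is not shown here, but as a rough, illustrative stand-in for the same data-parallel idea, here is a sketch using PyTorch's built-in DistributedDataParallel: each GPU trains on its own slice of data and gradients are averaged across all of them. The file name and launch command below are my assumptions, not from the study.

```python
# Illustrative data-parallel training sketch (one process per GPU).
# Launch with e.g.: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # torchrun supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
    model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across GPUs
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):
        # In a real job, each rank would read a different shard of the dataset.
        x = torch.randn(64, 32, device="cuda")
        y = torch.randn(64, 1, device="cuda")
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                            # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

How well this scales across servers depends, again, on the interconnects: the all-reduce traffic has to go somewhere.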
2.5x longer training times may not seem like a lot to a layman, but they are a really sore point for a machine learning engineer. The training data were high-resolution medical images, which are notoriously bulky to transfer to the compute engine where your neural networks do their analysis. And what would be the point of investing in expensive servers with multiple GPUs if they are not used efficiently and the GPUs lie idle?
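To make the tensor-swapping idea concrete: TFLMS automates this inside TensorFlow, but the underlying move, parking large tensors in host memory and pulling them back to the GPU only when needed, can be hand-rolled. The sketch below is a PyTorch illustration of that idea (my own, not the TFLMS API), and every one of these copies crosses the CPU-GPU interconnect, which is exactly why its bandwidth decides whether your GPUs keep busy or sit idle.

```python
# Hand-rolled illustration of tensor swapping between GPU and host memory.
import torch

assert torch.cuda.is_available(), "This sketch needs a CUDA GPU"

# A large intermediate result we cannot afford to keep resident on the GPU.
activation_gpu = torch.randn(4096, 4096, device="cuda")

# Swap out: copy to pinned host memory and release the GPU copy.
activation_cpu = activation_gpu.to("cpu", non_blocking=True).pin_memory()
del activation_gpu
torch.cuda.empty_cache()  # hand the freed block back to the CUDA allocator

# ... other GPU work happens here, using the memory we just freed ...

# Swap back in when the tensor is needed again (e.g., for the backward pass).
activation_gpu = activation_cpu.to("cuda", non_blocking=True)
print("restored on:", activation_gpu.device)
```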
The figure below captures the all-important differences in the interconnects between CPUs and GPUs and the corresponding changes in bandwidths.
Connecting what’s under the hood to real world results
Speaking of federated and distributed learning: click here to watch a video showing how a model trained with IBM's PowerAI Vision was embedded in cars to recognise objects in poor visibility and adjust the vehicle's headlights accordingly. This puts the power of federated learning in the hands of the end user, analogous to how your mobile voice assistant is trained by Apple, Google or Samsung in their own compute environments but can still learn on your phone by interacting with you, with far fewer compute resources.
Now think about the potential use cases of IBM PowerAI Vision, IBM's computer vision software, running on the AC922. Imagine scenarios like:
- Using computer vision to determine which trees in public spaces are unhealthy and in danger of falling (an actual use case being worked on by environment ministries in some countries)
- Visual inspection for quality in manufacturing. Click here to access a demo of how PowerAI Vision was used in our own IBM manufacturing plants to reduce inspection time from 15 minutes to 1 minute per product.
- In retail, identifying which products are in short supply on shelves and managing inventory accordingly.
- At airports, security cameras being able to identify suspicious objects, such as concealed weapons, without human intervention.
- Even in entertainment: IBM China Research labs worked with a leading entertainment broadcaster (HunanTV) to create a one-minute highlights clip extracted from a month's worth of their reality show I AM THE FUTURE. Click here to read a translated article.
The next time I attend a machine learning meetup and engineers complain about training and inference times, I will ask them to take a look under the hood of what is powering their algorithms, or talk to their infrastructure guys. I hope you will do the same!
Click here to explore the IBM AC922: the best server for enterprise AI.
Click here to explore IBM PowerAI Vision.
I wish you all the best in your AI journey.
You can connect with me on LinkedIn here and Twitter too. Reach out anytime.
I think it's reasonable to start with public-cloud-based AI offerings to experiment, but once you have real-world business problems that need AI to resolve, the best solution is often to locate the AI infrastructure where your data resides. The POWER AC922 with PowerAI makes great sense for building your on-premises AI capability.
OEM Director | NVIDIA | Artificial intelligence (AI) | Enterprise AI | Generative AI | Omniverse / Metaverse | Virtual Reality / Digital Twin / Quantum Computing / Blockchain
This is great!!!!!
Great article. BTW, the link that is supposed to show how AI Vision helps drivers see in low visibility links to the manufacturing one instead.
Strategic Advisor & Speaker | Top Leadership Voice | Amazon #1 Author | 50+ Awards - Innovation Leader, Asia Woman Leader | Ex-C-Suite IBM MTV Asia | Top Executive Coaching Company with Training & ICF Coach Certification
Traditionally, folks start with a small system and then upgrade as the project expands. What folks don't realise is that with AI, the infrastructure needs to be right from the start. If one has to tweak and re-tweak and then upgrade, it means insufficient data is being analysed, which totally undermines AI's capability. The beauty of AI is that it takes in ALL the data and learns. If the infrastructure does not provide that capacity and speed, then the game is already lost from the start. Leaders know that this is the age where size and speed do matter!
Director Asia Pacific - High Performance Computing, CSP and Artificial Intelligence at Lenovo
Eric, excellent article. Speed to insight is a goal for all those entering the ML domain. But a goal needs to be attainable; many vendors state they can deliver it but fail in the architectural vision, i.e. the balance of memory to CPU, GPU-to-CPU and GPU-to-GPU communication, along with the ability to distribute the problem across multiple nodes. The POWER9 AC922 and the associated PowerAI stack are really the only platform solution that can achieve the ultimate AI goal.