Looking under the hood of your AI infrastructure
Eric Schnatterly
Global Vice President - helping clients and teams optimize multi-cloud, data protection, data management, and AI investments
At a recent gathering of machine learning (ML) experts, which I had the temerity to attend, many spoke enthusiastically about the latest models they were experimenting with. Popular discussion topics included advances in object detection and image recognition, as well as the ImageNet challenge, complete with boastful predictions of winning the contest, as if they were professional boxers before a championship fight.
Yes, data science is like a sport to some. But not all the banter was about exciting new innovations or competitive challenges. The ML developers had plenty about which to gripe. Most of the sniping revolved around hassles and time wasted with data preparation and model training.
Eavesdropping on the chatter, I overheard statements like:
“Took me 4 minutes to upload just one image on that cloud-based computer vision solution… 4 minutes for ONE image!”
“Wish they would invent some fast stuff – takes me days and even weeks sometimes to get a decent model going.”
“Create the model, train it for days, then find out I need to tweak it…. train it for days again!”
I suspect that this illustrious gathering of data scientists and ML engineers was quite familiar with the software configuration and coding tricks that speed up training and inference: tuning hyperparameters, choosing optimization algorithms that make learning converge faster, and all the other nifty tools that AI experts know best.
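If you are curious what those software-level knobs look like in practice, here is a minimal, illustrative sketch (my own, not something from the meetup) of sweeping a couple of optimizers and learning rates on a toy PyTorch model. The model, data, and values are placeholders for illustration only.

```python
# Illustrative sketch: compare optimizers and learning rates on a tiny model
# to see which combination converges fastest. Toy data, toy model.
import torch
import torch.nn as nn

def train_once(optimizer_name: str, lr: float, epochs: int = 5) -> float:
    torch.manual_seed(0)
    # Toy regression problem: y = 3x + noise
    x = torch.randn(256, 1)
    y = 3 * x + 0.1 * torch.randn(256, 1)

    model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizers = {
        "sgd": torch.optim.SGD(model.parameters(), lr=lr),
        "adam": torch.optim.Adam(model.parameters(), lr=lr),
    }
    opt = optimizers[optimizer_name]
    loss_fn = nn.MSELoss()

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Sweep a few hyperparameter combinations and report the final loss of each.
for name in ("sgd", "adam"):
    for lr in (1e-3, 1e-2, 1e-1):
        print(f"{name:>4} lr={lr:<6} final loss={train_once(name, lr):.4f}")
```

Useful, but as the rest of this article argues, these knobs only go so far when the hardware underneath is the bottleneck.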
But what about looking beyond the model?
It is no secret that the rise of deep learning in the last few years is due largely to advances in the computing power that has become available, such as GPUs and FPGAs. Vendors are obliging, delivering cutting-edge on-premises and as-a-service solutions running on powerful hardware. But consider this: IDC states in its paper ‘Hitting the wall with server infrastructure’ that an astounding 90 percent of the organizations that took part in its survey experienced infrastructure bottlenecks when they started running their AI applications in the cloud. The numbers for on-premises are slightly better but still far from reassuring: 77 percent of respondents faced limitations with their on-premises infrastructure for AI workloads!
The most common action taken, according to IDC, is moving to a system with greater processing power and accelerators. But is this enough?
Have you thought about what’s going on under the hood of these powerful servers?
Let’s take a look
You may have raw, blazing speed, with Ferrari-like GPUs in your server (or your cloud provider's infrastructure), yet still suffer bottlenecks. Why? Because compute power alone is not enough. CPU memory and GPU memory both have to play their part in delivering the needed performance, and you will need the fastest possible interconnects between CPUs and GPUs to prevent bottlenecks.
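You can peek at this yourself. The rough sketch below (assuming a CUDA-capable machine with PyTorch installed; not a rigorous benchmark) times host-to-GPU copies to estimate how much interconnect bandwidth you actually get, and shows why details like pinned host memory matter.

```python
# Rough sketch: time host-to-GPU copies to estimate interconnect bandwidth.
import torch

def copy_bandwidth_gb_s(size_mb: int = 512, pinned: bool = True, repeats: int = 10) -> float:
    assert torch.cuda.is_available(), "This sketch needs a CUDA GPU"
    n_bytes = size_mb * 1024 * 1024
    # Pinned (page-locked) host memory usually transfers faster over PCIe/NVLink.
    host = torch.empty(n_bytes, dtype=torch.uint8, pin_memory=pinned)
    device = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        device.copy_(host, non_blocking=True)   # host -> device transfer
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0  # elapsed_time returns milliseconds
    return (n_bytes * repeats) / seconds / 1e9

if __name__ == "__main__":
    print(f"pageable host memory: {copy_bandwidth_gb_s(pinned=False):.1f} GB/s")
    print(f"pinned host memory:   {copy_bandwidth_gb_s(pinned=True):.1f} GB/s")
```

Run it on a PCIe-attached GPU and on an NVLink-attached one and the gap this article talks about becomes very concrete.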
To better understand the interplay between CPUs, GPUs, memory, interconnects, and clustered servers, I recommend you have a look at this blog about a 3D image segmentation study of tumours, with a 3D U-Net model analyzing MRI images (don't worry, the article is well composed and easy to follow, even for an AI newbie). The performance analysis explained in the article uses servers equipped with accelerated computing features. Specifically, x86 servers and the IBM Power AC922 are used for model training and evaluation. The AC922 is equipped with NVIDIA Tesla V100 GPUs and POWER9 CPUs, with NVLink 2.0 interconnects between them, allowing 150 GB/s of bandwidth in each direction between CPU and GPU. For comparison, the model was also trained on an x86 server with Intel CPUs and NVIDIA Tesla V100 GPUs, using the fastest available x86 PCIe interconnects between CPUs and GPUs.
In short:
· Faster training: Using TensorFlow Large Model Support (TFLMS, a TensorFlow module that enables training of larger neural networks by swapping tensors between GPU memory and CPU memory), training times for one epoch were 2.5x longer on the x86 servers than on the POWER9 servers. The memory copies between CPU and GPU for tensor swapping (transferring large amounts of numerical data back and forth as needed) took considerably longer on the x86 machines and left the GPUs idle.
· Scaling beyond a single GPU: When two GPUs sharing the same PCIe bus were used with TFLMS, memory copy throughput dropped by 30% due to contention for the shared bus. The AC922 does not have this contention issue, since the NVLink 2.0 connections between CPU and GPU are dedicated per GPU, so GPUs do not have to compete for the available bandwidth.
· Using the IBM PowerAI Distributed Deep Learning library to scale across multiple servers: Last year, IBM already demonstrated greater efficiency and accuracy by using distributed deep learning for image recognition. Using the same technology in this study, training time per epoch dropped considerably, from 590 seconds on a single server and GPU to 40 seconds on 16 GPUs across 4 servers (a generic sketch of the distributed data-parallel idea follows this list).
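IBM's DDL library itself is not shown here, but as a rough, illustrative stand-in for the same data-parallel idea, here is a sketch using PyTorch's built-in DistributedDataParallel: each GPU trains on its own slice of data and gradients are averaged across all of them. The file name and launch command below are my assumptions, not from the study.

```python
# Illustrative data-parallel training sketch (one process per GPU).
# Launch with e.g.: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # torchrun supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
    model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across GPUs
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):
        # In a real job, each rank would read a different shard of the dataset.
        x = torch.randn(64, 32, device="cuda")
        y = torch.randn(64, 1, device="cuda")
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                            # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

How well this scales across servers depends, again, on the interconnects: the all-reduce traffic has to go somewhere.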
2.5x longer training times may not seem like a lot to a layman, but they are a really sore point for a machine learning engineer. The training data were high-resolution medical images, which are notoriously bulky to transfer to the compute engine where your neural networks do their analysis. And what would be the point of investing in expensive servers with multiple GPUs if they are not used efficiently and the GPUs lie idle?
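To make the tensor-swapping idea concrete: TFLMS automates this inside TensorFlow, but the underlying move, parking large tensors in host memory and pulling them back to the GPU only when needed, can be hand-rolled. The sketch below is a PyTorch illustration of that idea (my own, not the TFLMS API), and every one of these copies crosses the CPU-GPU interconnect, which is exactly why its bandwidth decides whether your GPUs keep busy or sit idle.

```python
# Hand-rolled illustration of tensor swapping between GPU and host memory.
import torch

assert torch.cuda.is_available(), "This sketch needs a CUDA GPU"

# A large intermediate result we cannot afford to keep resident on the GPU.
activation_gpu = torch.randn(4096, 4096, device="cuda")

# Swap out: copy to pinned host memory and release the GPU copy.
activation_cpu = activation_gpu.to("cpu", non_blocking=True).pin_memory()
del activation_gpu
torch.cuda.empty_cache()  # hand the freed block back to the CUDA allocator

# ... other GPU work happens here, using the memory we just freed ...

# Swap back in when the tensor is needed again (e.g., for the backward pass).
activation_gpu = activation_cpu.to("cuda", non_blocking=True)
print("restored on:", activation_gpu.device)
```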
The figure below captures the all-important differences in the interconnects between CPUs and GPUs and the corresponding changes in bandwidths.
Connecting what’s under the hood to real world results
Speaking of federated and distributed learning: click here to watch a video showing how a model trained with IBM's PowerAI Vision was embedded in cars to recognise objects in poor visibility and adjust the vehicle's headlights accordingly. This puts the power of federated learning in the hands of the end user, analogous to how your mobile voice assistant is trained by Apple, Google or Samsung in their own compute environments but can still learn on your phone by interacting with you, with far fewer compute resources.
Now think about the potential use cases of IBM PowerAI Vision, IBM's computer vision software, running on the AC922. Imagine scenarios like:
- Using computer vision to determine which trees in public spaces are unhealthy and in danger of falling (an actual use case being worked on by environment ministries in some countries)
- Visual inspection for quality in manufacturing. Click here to access a demo of how PowerAI Vision was used in our own IBM manufacturing plants to reduce inspection time from 15 minutes to 1 minute per product.
- In retail, identifying which products are in short supply on shelves and managing inventory accordingly.
- At airports, security cameras being able to identify suspicious objects, such as concealed weapons, without human intervention.
- Even in entertainment: IBM China Research labs worked with a leading entertainment broadcaster (HunanTV) to create a one-minute highlights clip extracted from a month's worth of their reality show I AM THE FUTURE. Click here to read a translated article.
The next time I attend a machine learning meetup and engineers complain about training and inference times, I will ask them to take a look under the hood of what is powering their algorithms, or talk to their infrastructure guys. I hope you will do the same!
Click here to explore the IBM AC922: the best server for enterprise AI.
Click here to explore IBM PowerAI Vision.
I wish you all the best in your AI journey.
You can connect with me on LinkedIn here and Twitter too. Reach out anytime.
I think it's reasonable to start with public-cloud-based AI offerings to experiment, but once you have real-world business problems that need AI to resolve, the best solution is often to locate the AI infrastructure where your data resides. The POWER AC922 with PowerAI makes great sense for building your on-premises AI capability.
OEM Director | NVIDIA | Artificial intelligence (AI) | Enterprise AI | Generative AI | Omniverse / Metaverse | Virtual Reality / Digital Twin / Quantum Computing / Blockchain
This is great!!!!!
Great article. BTW, the link that is supposed to show how AI Vision helps drivers see in low visibility links to the manufacturing one instead.
Strategic Advisor & Speaker | Top Leadership Voice | Amazon #1 Author | 50+ Awards - Innovation Leader, Asia Woman Leader | Ex-C-Suite IBM MTV Asia | Top Executive Coaching Company with Training & ICF Coach Certification
Traditionally, folks start with a small system and then upgrade as the project expands. What folks don't realise is that with AI, the infrastructure needs to be right from the start. If one has to tweak and re-tweak and then upgrade, it means insufficient data is being analysed, which totally undermines AI's capability. The beauty of AI is that it takes in ALL the data and learns. If the infrastructure does not provide that capacity and speed, then the game is already lost from the start. Leaders know that this is the age where size and speed do matter!
Director Asia Pacific - High Performance Computing, CSP and Artificial Intelligence at Lenovo
Eric, excellent article. Speed to insight is a goal for all those entering the ML domain. But a goal needs to be attainable; many vendors state they can deliver it but fail in the architectural vision, i.e. the balance of memory to CPU, GPU-to-CPU and GPU-to-GPU communication, along with the ability to distribute the problem across multiple nodes. The POWER9 AC922 and the associated PowerAI stack are really the only platform solution that can achieve the ultimate AI goal.