High Performance Computing Options - Part 1
Nilotpal Das
Information Technology leader, TOGAF & Zachman Certified Enterprise Architect
One of the areas where I really need to develop myself is High Performance Computing. I read about it a lot and sometimes also write about it, but it's really difficult to remember all of it, primarily because the best way of learning something is to work on it regularly. And considering my area of work doesn't involve a lot of HPC, it's difficult to retain. Yeah, I am kind of stupid that way. Now I understand the basics of GPUs, TPUs, DPUs, etc., so I am not entirely unschooled, but there was a very important point that Kapil, a colleague I have come to respect a lot, brought out in the year-end workshop that we did this week.
First, Raghu, another close friend and colleague, asked a very important question: “When it comes to HPC, what’s the value that we are driving?” At the moment the business just tells us what they need and we just deliver. But then Kapil said something profound that stuck with me. He said we must speak their language. When they ask for specific hardware, we should be able to ask them what they really need it for and, based on a deeper understanding, provide them with the appropriate options. For this we definitely need to be informed about the best options for them.
I admit, at the moment this hasn’t happened with me a lot, considering I manage the solution requirements coming from functions that are not doing a lot of HPC-relevant work. There is HPC work, just not in my domain. But I am also running a cloud acceleration program that cuts across all domains. As part of this program, we are working on a PoC in the Accelerated Drug Development space. We are currently in the PoC phase, so no HPC hardware is needed at the moment. However, it is only a matter of time before requests for HPC start coming to us, and we don’t want to be caught with our pants down when they do.
So I am taking Kapil’s suggestion very seriously: not only am I going to start studying HPC myself, I am also going to expect my team to be better prepared for the “not so far away future” of being an AI Enabled Company. So I did some research, and here is some basic foundational material from which I will start my study. I guess this is going to be a series, as everything cannot be covered in one article, lest it become too long. So this is Part 1.
Latest HPC Hardware.
Quantum Processors: Yup, let’s start with Quantum Computing. IBM’s Quantum Heron processors are among the most advanced, capable of running complex quantum circuits with up to 5,000 two-qubit gate operations. These are ideal for scientific research in fields like materials science, chemistry and high-energy physics.
To give you a general idea of how fast quantum processors can be: in 2019, Google demonstrated that its quantum computer, Sycamore, could solve a specific problem in 200 seconds. Google has since estimated that a comparable problem would take the world’s fastest supercomputer, Frontier, approximately 47 years to solve!
IBM offers access to its quantum processors through the IBM Quantum Experience. One can sign up for the Open Plan, which provides free access to utility-scale quantum computers for up to 10 minutes of runtime per month; Pay-As-You-Go and Premium Plans are available for more extensive access and technical support. (A minimal Python sketch of submitting a small circuit this way follows after this list.)

NVIDIA H200 Tensor Core GPUs: These are optimized for demanding HPC and AI workloads, providing significant performance improvements for tasks like training large language models and running complex simulations. NVIDIA also offers an enterprise software suite, NVIDIA AI Enterprise, which includes NVIDIA NIM and CUDA-X microservices (more about this later); these provide an optimized runtime and easy-to-use building blocks for generative AI development. It comes with enterprise-grade security and support, and offers deployment flexibility: because it is standards-based and containerized, it can run in the cloud, in data centres and on workstations. (A rough sketch of calling a NIM endpoint follows after this list.)

Amazon EC2 P5: Powered by NVIDIA H100 Tensor Core GPUs, these instances provide up to 8 H100 GPUs with 640 GB of high-bandwidth GPU memory. The best use cases for P5 are generative AI applications, training and deploying LLMs and diffusion models for tasks like question answering, code generation, video and image generation and speech recognition. It is also suitable for demanding HPC tasks like pharmaceutical discovery, financial modelling, etc. (A minimal launch sketch follows after this list.)

NVIDIA BlueField DPU Series: DPUs are designed to offload data processing tasks from the CPU, managing data movement, security and network operations. They provide specialized acceleration for data-centric tasks and ensure efficient data flow. I have chosen this among many others because it is a bit of a wild card that could make things interesting.

Google Cloud A3 Ultra VMs: Powered by NVIDIA H200 GPUs and optimized for AI, these scale easily to handle large datasets and complex models. Technically, this is the second HPC option I have chosen that is powered by H200 GPUs. That’s intentional; I picked this one at the last moment for a very specific reason, which we will talk about in Part 2. This is another wild card.
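Since the Open Plan above is free, the easiest way to get a feel for it is to submit a tiny circuit yourself. Here is a minimal sketch using the qiskit and qiskit-ibm-runtime Python packages; the token placeholder, the Bell-pair circuit and the choice of the least-busy backend are my own assumptions for illustration, and the exact API may shift slightly between Qiskit Runtime versions.

```python
# Minimal sketch: run a 2-qubit Bell circuit on an IBM Quantum backend (Open Plan).
# Assumes `pip install qiskit qiskit-ibm-runtime` and a (placeholder) API token.
from qiskit import QuantumCircuit, transpile
from qiskit_ibm_runtime import QiskitRuntimeService, SamplerV2 as Sampler

# Build a Bell pair: H on qubit 0, CNOT 0 -> 1, then measure both qubits.
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])

# "YOUR_IBM_QUANTUM_TOKEN" is a placeholder; channel name per current IBM docs.
service = QiskitRuntimeService(channel="ibm_quantum", token="YOUR_IBM_QUANTUM_TOKEN")
backend = service.least_busy(operational=True, simulator=False)

# Transpile for the selected hardware and submit via the Sampler primitive.
isa_circuit = transpile(qc, backend=backend)
job = Sampler(backend).run([isa_circuit])

# "c" is the default classical register name created by QuantumCircuit(2, 2).
counts = job.result()[0].data.c.get_counts()
print(backend.name, counts)  # expect mostly '00' and '11' outcomes
```

Even this toy job counts against the 10 minutes of monthly runtime, so it is worth testing circuits on a local simulator first.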
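On the NVIDIA AI Enterprise / NIM point, my understanding is that a NIM microservice is essentially a container that exposes an OpenAI-compatible HTTP API, so application code stays simple. Below is a rough sketch of what calling a locally deployed NIM might look like; the endpoint URL, port and model name (meta/llama-3.1-8b-instruct) are illustrative assumptions and would depend on which NIM you actually deploy.

```python
# Rough sketch: query a locally running NVIDIA NIM microservice.
# Assumes a NIM container is already up and serving an OpenAI-compatible API
# on http://localhost:8000 (URL, port and model name are illustrative guesses).
import json
import urllib.request

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Summarize why DPUs matter for HPC in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# OpenAI-compatible responses put the generated text under choices[0].message.content
print(body["choices"][0]["message"]["content"])
```

The appeal, as I understand it, is that the same client code works whether the container runs on a workstation, in a data centre or in the cloud.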
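And for the Amazon EC2 P5 instances, provisioning is just a normal EC2 API call once you have the quota; the instance type is p5.48xlarge. A minimal sketch with boto3 is below; the region, AMI ID and key-pair name are placeholders I made up, and in practice you would likely launch into a capacity reservation or placement group rather than on-demand like this.

```python
# Minimal sketch: launch a single EC2 P5 instance (8x NVIDIA H100) with boto3.
# Assumes AWS credentials are configured; AMI ID, key pair and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: pick a Deep Learning AMI in your region
    InstanceType="p5.48xlarge",        # 8x H100 GPUs, 640 GB high-bandwidth GPU memory
    KeyName="my-keypair",              # placeholder key pair name
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "llm-training-poc"}],
    }],
)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched:", instance_id)
```

At P5 prices, remembering to terminate the instance afterwards matters as much as knowing how to launch it.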
Now, obviously, there are many more varieties of HPC hardware and I cannot cover all of them, but I have taken what I thought would be relevant and also an interesting mental exercise. In the next article, I will do a comparative analysis of these, thinking about each one from different perspectives (effectiveness, cost, ROI, etc.) to build a deeper understanding of what can be used where, and perhaps what we should invest more in depending on our use cases.
I am just studying these for my own understanding and sharing what I find fascinating. Another way of remembering all these model numbers, and each model’s capacity and best use cases, is to talk about them, so I am talking about them through my articles. But please let me know if you find this useful and whether I should continue to Parts 2, 3 and beyond. Or is this too basic? Because there is no way I can know.
It could be the Dunning-Kruger effect, a cognitive bias in which people wrongly overestimate their knowledge or ability in a specific area.
Basically, what I am trying to say is that stupid people never know that they are stupid, because they are too stupid to know it. So if I am stupid, please tell me, otherwise I will never know. :-)