Use GPUs as Processors, not Co-Processors
Let me start with a plug for this presentation at next week's annual Ray Summit where our CTO Philipp Moritz will talk about accelerating LLM inference at Anyscale - which includes an LLM inference engine that we built from the ground up.
Here I share a short summary of my experiences from building this engine.
While I feel there are useful takeaways to be shared, I must state up front that I am very new to GPU programming. Even today I am learning new things about GPUs from articles published years ago. So apologies in advance for anything I say that is too obvious or even wrong.
I started this journey about a year ago by watching From Zero to Hero by Andrej Karpathy - an excellent introduction to neural nets and LLMs. After spending time watching videos and reading related material, I felt the need to build an LLM inference engine myself from the ground up to learn the nitty-gritty details of how it actually works.
Not wanting to learn too many new things all at once, I decided to build the engine to run on CPUs. A small group of engineers joined me in this effort and by February we had an inference engine running on Intel's Sapphire Rapids CPU. It ran well enough that we decided to take the next big step of learning how to program GPUs. And to make a long story short, we now have the engine running remarkably well on GPUs.
The entire inference engine runs as a single kernel. This is non-standard, but there are lessons to be learned from the approach - and I do hope that someday it becomes more standard.
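For readers who already know CUDA, here is a rough sketch of what "a single kernel" means in practice. This is my own illustration of the general pattern, not our actual code; all function and variable names are hypothetical.

```cuda
// Sketch of the single-kernel pattern (illustration only; names are hypothetical).
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Device-callable building blocks that would normally be separate kernels.
__device__ void run_layer(float *acts, const float *w, int n) { /* ... */ }
__device__ void sample_token(const float *acts, int *next, int n) { /* ... */ }

__global__ void inference_engine(float *acts, const float *weights, int *out_tokens,
                                 int n_layers, size_t layer_size, int n, int max_tokens) {
    cg::grid_group grid = cg::this_grid();
    // The entire generation loop runs on the GPU; the CPU launches this once.
    for (int t = 0; t < max_tokens; ++t) {
        for (int l = 0; l < n_layers; ++l) {
            run_layer(acts, weights + l * layer_size, n);
            grid.sync();  // plays the role a kernel-launch boundary plays in the usual model
        }
        sample_token(acts, &out_tokens[t], n);
        grid.sync();
    }
}
// The host launches this with cudaLaunchCooperativeKernel() so that grid.sync()
// is valid, then simply waits for generation to finish.
```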
For people unfamiliar with GPUs - the usual approach is to write programs that run on a CPU that is "attached" to the GPU. The CPU repeatedly sends tasks to the GPU that the GPU executes (as kernels) one by one until all the tasks are completed. The GPU is essentially a co-processor that runs small homogeneous tasks one after the other.
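In code, that conventional model looks roughly like the sketch below - again with hypothetical kernel names, not anyone's production code: the CPU owns the control flow and issues one small kernel per step.

```cuda
// Sketch of the conventional co-processor model (kernel names are hypothetical):
// the CPU owns the control flow and launches one small, homogeneous kernel per step.
#include <cuda_runtime.h>

__global__ void embed_tokens(const int *tokens, float *acts, int n)    { /* ... */ }
__global__ void transformer_layer(float *acts, const float *w, int n)  { /* ... */ }
__global__ void sample_next_token(const float *acts, int *next, int n) { /* ... */ }

void generate_one_token(const int *d_tokens, float *d_acts, const float *d_weights,
                        int *d_next, int n_layers, size_t layer_size, int n) {
    dim3 grid(256), block(256);
    embed_tokens<<<grid, block>>>(d_tokens, d_acts, n);
    for (int l = 0; l < n_layers; ++l)  // one launch per layer, issued from the CPU
        transformer_layer<<<grid, block>>>(d_acts, d_weights + l * layer_size, n);
    sample_next_token<<<grid, block>>>(d_acts, d_next, n);
    cudaDeviceSynchronize();            // the CPU waits for the GPU to finish its tasks
}
```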
So why did we take this approach? I think it is partly because we had a nice system running on CPUs that we wanted to generalize and migrate to GPUs. It may also be because our first exposure to GPUs was through reading about the H100 - everything we read made us feel that this was a powerful machine that could stand on its own and naturally support this approach.
Here are some takeaways:
It works. The proof is in the pudding. We have a system that runs remarkably well - some of our performance numbers are very good. We also see other efforts, such as Nanoflow, where GPUs are given larger and more heterogeneous tasks.
Programs are more portable. Our system runs on both CPUs and GPUs (obviously with specialized plugins for each). The portability has allowed us to find many bugs quickly by first running the code on a CPU - given that CPUs have more mature testing and debugging tools.
Requires discipline. Given the absence of first-class support for this approach, we had to come up with our own best practices - code organization, unit testing, etc. - and follow them as a team.
Most off-the-shelf libraries are kernels. They are provided as host-launched kernels (implementations of individual tasks), so they cannot be integrated into our single kernel. As a consequence, we have had to develop most functionality ourselves (see the sketch after this list).
I hope Nvidia (and other vendors) add first-class support for heterogeneous tasks. I liken the co-processor approach to a manager with a junior engineer - the engineer (GPU) is given simple (homogeneous) tasks one after the other by the manager (CPU). But now the junior engineer has grown to be a Very Distinguished Engineer capable of performing multiple heterogeneous projects at a much larger scale without the need for intervention. The engineer needs to be given the time and space to perform at their very best!
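To make the library point concrete, here is a hedged illustration: a host-launched routine such as cuBLAS's cublasSgemv() is itself a kernel issued from the CPU, so it cannot be called from inside device code. Within a single resident kernel, the building blocks have to be device-callable functions, like the hand-written matrix-vector product below. The names and layout here are my own, for illustration only.

```cuda
// Why host-launched libraries do not compose with a single resident kernel:
// something like cublasSgemv() is launched from the CPU and cannot be called
// from device code, so device-callable primitives are written by hand instead.
#include <cuda_runtime.h>

// y = W * x, with W stored row-major as rows x cols.
// Each thread accumulates one or more output rows (grid-stride loop).
__device__ void matvec(const float *W, const float *x, float *y, int rows, int cols) {
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < rows;
         r += gridDim.x * blockDim.x) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c)
            acc += W[(size_t)r * cols + c] * x[c];
        y[r] = acc;
    }
}
// Inside the single kernel this is just a function call, e.g.
//   matvec(layer_weights, activations, next_acts, d_model, d_model);
// followed by a grid-wide sync before the result is consumed.
```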
I've finally written up the promised followup article: https://www.dhirubhai.net/pulse/followup-my-gpus-processors-article-sriram-sankar-zisoc
Nice writeup! My experience working with traditional co-processors has been that, over time, co-processor functionality in many cases gets integrated into the main CPU and the co-processor disappears. At Sun, we eliminated the encryption/decryption PCI card by adding a few instructions to our Niagara 2 multi-core UltraSPARC processor and achieved 80% of the performance. We eliminated the TCP/IP offload co-processor engine by using the multi-core threading of the CPUs. You are doing the reverse - sucking the CPU into the co-processor, the GPU - a very clever approach. It works for LLMs, but may not for general-purpose programming.
Thanks for sharing, it is really cool. I guess using the GPU as the core reduces data movement from device to host? What would be the other performance benefits of a big kernel? Looking forward to talking to you at Ray Summit.
Thank you for all of the responses. It looks like it might be best for me to post another note in a few days to respond in detail to some of the comments. But very quickly... Teammates - thanks for the kind words, but I too learned a lot along with all of you - we did this together. Aditya Bhagwat - we use a combination of statically allocated memory as well as heap memory, but we keep allocation to a minimum (primarily to grow the KV cache). Patrick Coppock - you are spot on, but there are many other issues we need to deal with as well. At the end of the day this works and performs really well (at least for this application). I'll share these additional issues in my next post. Jiao Dong and others - let's talk. I need to learn from all of you too. If you are at the Ray Summit, please reach out. Renu Raman - actually, I think OS kernels are far more complex than CUDA kernels. I feel pretty capable developing CUDA kernels, but working on Linux kernels is for the superstars. At least in this context I feel the word "kernel" is a misuse, and maybe many shy away from developing CUDA kernels as a consequence.