Benchmarking LLMs: A Deep Dive into Local Deployment and Performance Optimization
I just love the idea of running an LLM locally. It has huge implications for data security and the ability to use AI on private datasets. Get your company’s DevOps teams some real GPU servers as soon as possible.
Benchmarking LLM performance has been a blast, and I’ve found it a lot like benchmarking SSD performance because there are so many variables. In SSD benchmarking, you have to worry about workload inputs like block size, queue depth, access pattern, and the state of the drive. With LLMs, the inputs include model architecture, model size, parameters, quantization, the number of concurrent requests, the size of each request in tokens in and tokens out, and even the prompts themselves can make a huge difference.
I’m sure people have spent a lot more time doing this than I have, but I have some observations. The main metrics you want to optimize are latency (time to first token and time to completion) and total throughput in tokens per second. For something a user is reading along with, even 10-20 tokens per second is plenty fast, but for coding you may want something much faster so you can iterate quickly.
“Best practices in deploying an LLM for a chatbot involves a balance of low latency, good reading speed and optimal GPU use to reduce costs.” - NVIDIA
OpenAI currently charges for their API endpoints by the number of tokens in and tokens out. Current GPT-4o pricing is around $5 per 1M input tokens and $15 per 1M output tokens.
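For reference, that token-based billing is visible directly in the API response: a minimal request (this sketch assumes gpt-4o as the model and an OPENAI_API_KEY in the environment) comes back with a usage block listing prompt_tokens and completion_tokens, which is what you are billed on.
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello in five words."}]}'
# The response includes "usage": {"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}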
Open-source endpoints like ollama are a great place to start, since they are extremely easy to run. You can spin one up in Docker, and if you want a nice front end to go with it you can use open-webui. I don’t need a front end, since I’m targeting the LLM endpoint through its OpenAI-compatible API. I’m using llmperf to do the benchmarking.
python token_benchmark_ray.py \
--model "meta/llama3-8b-instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 1024 \
--stddev-output-tokens 10 \
--max-num-completed-requests 300 \
--timeout 600 \
--num-concurrent-requests 100 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
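For the ollama runs, the setup was just a container plus two environment variables. Here is a rough sketch, assuming the NVIDIA Container Toolkit is installed so Docker can see the GPUs, and using ollama’s default port and OpenAI-compatible path; llmperf’s openai mode reads OPENAI_API_BASE and OPENAI_API_KEY from the environment, and the key is just a placeholder for a local endpoint.
# Stand up ollama and pull a model (sketch; requires the NVIDIA Container Toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama3

# Point llmperf at the local OpenAI-compatible endpoint instead of api.openai.com
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="placeholder"
# With ollama, --model would be the local tag, e.g. "llama3", rather than the NIM name above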
Testing ollama (which I suspect uses the CUDA backend), I oddly see it use only one of the cards when I run 10 concurrent requests, and throughput is similar to a single request but with much higher latency.
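Depending on the ollama release, concurrency and multi-GPU scheduling are controlled by environment variables; this is a hedged sketch using variable names from the ollama docs, and the behavior may differ on your version.
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=10 \
  -e OLLAMA_SCHED_SPREAD=1 \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# OLLAMA_NUM_PARALLEL enables parallel request handling; OLLAMA_SCHED_SPREAD asks the scheduler to spread a model across all GPUs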
The start of this project was me getting excited about NVIDIA NIMs from the keynotes at GTC and Computex this year. I wish NVIDIA had made these optimized microservices with a completely open-source backend, but they can at least be run anywhere (on your local compute) with their API key authentication. They are extremely easy to run: you can spin up a completely optimized and tuned LLM endpoint in just a few minutes, with most of that time spent downloading the model.
NVIDIA claims in their blog post that an H200 can get up to 3,000 tokens/s… I wanted to put this to the test and also compare the inference performance against some high-end consumer cards like the RTX 4090.
I grabbed the NIM from here and started it in Docker on the same system.
This is a system I’m messing around with that doesn’t have the GPU at full PCIe bandwidth, and I accidentally discovered that inference doesn’t appear to be very PCIe-bandwidth-intensive, though I haven’t tried a model big enough to say for sure.
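If you want to sanity-check that yourself, nvidia-smi can report the negotiated PCIe link and live PCIe throughput while a benchmark is running; the queries below should work on recent driver versions.
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current --format=csv
nvidia-smi dmon -s t
# dmon -s t prints per-GPU PCIe RX/TX throughput once per second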
A 4090 system is currently renting for $0.54/hr on Runpod community cloud, so by my math, in one hour, I could service 3.79M tokens on these two cards for a total of $1.08. Not bad at all! The only problem with this setup is that I can’t run the higher-end models because I don’t have enough GPU VRAM. To run the llama3-70b at 16-bit precision, I would need 141GB total, or 75GB at 8-bit.
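The back-of-the-envelope math behind both of those numbers is below (weights only for the VRAM figure; KV cache, activations, and runtime overhead are why the real footprint lands a bit above the raw weight size).
# llama3-70b is ~70.6B parameters: 2 bytes each at FP16, 1 byte at INT8 (weights only)
python3 -c "print(f'{70.6e9*2/1e9:.0f} GB at FP16, {70.6e9/1e9:.0f} GB at INT8')"
# Two 4090s at $0.54/hr each, serving ~3.79M tokens in that hour
python3 -c "print(f'{2*0.54/3.79:.2f} USD per 1M tokens')"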
There are many quantized variants of each model available (https://ollama.com/library/llama3), because people want to run these on different systems with different amounts of GPU VRAM, or in system DRAM for CPU inference (which is much slower).
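Each quantization is just a different tag on the same model page, so switching precision is a one-line pull; the tags below are examples, so check the library page for what is currently published.
ollama pull llama3:8b-instruct-fp16
ollama pull llama3:70b-instruct-q4_0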
I don’t happen to have any H100s, but thankfully I can easily rent one for this test on Runpod for $3.39/hr.
For anyone who wants to try this at home, you will need an NVIDIA API key to be able to download and run the images anywhere, but it looks like this.
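(A rough sketch based on NVIDIA’s published NIM getting-started flow; the image tag, cache path, and flags may differ for your model and NIM version, so treat this as an outline rather than a copy-paste recipe.)
export NGC_API_KEY=<your NGC / build.nvidia.com key>
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
export LOCAL_NIM_CACHE=~/.cache/nim && mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
# The NIM then serves an OpenAI-compatible API at http://localhost:8000/v1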
NVIDIA’s claim of 3000 tokens per second with H200 was probably with 8-bit precision, but let's see how close we can get!
Similar to how queue depth works on SSDs, increasing the number of concurrent requests increases the latency by 4-5x but can increase the total throughput in tokens/s by 18x (see the sweep sketch below). It seems to me that if we can get 1,500 tokens per second at 16-bit, then we are spot on with NVIDIA’s optimized performance claim! Now I want to test some higher-end models like llama3-70b that can actually use the extra VRAM the H100 has over a consumer GPU. I also want to see the impact of changing the precision and quantization of the model on performance. Unfortunately, NVIDIA doesn’t give us an easy way in the NIM to swap out the model precision, but I’m sure that is coming; since Blackwell has optimizations for FP4, they will likely add this in the future.
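The sweep itself is just the same llmperf command in a loop over --num-concurrent-requests; there is nothing NIM-specific about it.
for c in 1 5 10 25 50 100; do
  python token_benchmark_ray.py \
    --model "meta/llama3-8b-instruct" \
    --mean-input-tokens 550 --stddev-input-tokens 150 \
    --mean-output-tokens 1024 --stddev-output-tokens 10 \
    --max-num-completed-requests 300 --timeout 600 \
    --num-concurrent-requests "$c" \
    --results-dir "result_outputs/concurrency_$c" \
    --llm-api openai --additional-sampling-params '{}'
done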
This was a fun weekend experiment, and I would love for folks who optimize LLMs to chime in. It seems like NVIDIA has knocked it out of the park here with NIMs… anyone can spin one up privately and do RAG and AI work on their own infrastructure, or easily test against the NVIDIA-hosted endpoints first.