Benchmarking LLMs: A Deep Dive into Local Deployment and Performance Optimization
I just love the idea of running an LLM locally. It has huge implications for data security and the ability to use AI on private datasets. Get your company’s DevOps teams some real GPU servers as soon as possible.
Benchmarking LLM performance has been a blast, and I’ve found it a lot like benchmarking SSD performance because there are so many variables. In SSD benchmarking, you have to worry about workload inputs like block size, queue depth, access pattern, and the state of the drive. With LLMs, the inputs include model architecture, model size, parameters, quantization, the number of concurrent requests, the size of each request in tokens in and tokens out, and even the prompts themselves can make a huge difference.
I’m sure people have spent a lot more time doing this than I have, but I have some observations. The main metrics you want to optimize are latency (time to first token and time to completion) and total throughput in tokens per second. For something a user is reading along with, even 10-20 tokens per second is plenty fast, but for coding you may want something much faster so you can iterate quickly.
“Best practices in deploying an LLM for a chatbot involves a balance of low latency, good reading speed and optimal GPU use to reduce costs.” - NVIDIA
OpenAI currently charges for their API endpoints by the number of tokens in and tokens out. Current GPT-4o pricing is around $5 per 1M input tokens and $15 per 1M output tokens.
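For reference, that token-based billing is visible directly in the API response: a minimal request (this sketch assumes gpt-4o as the model and an OPENAI_API_KEY in the environment) comes back with a usage block listing prompt_tokens and completion_tokens, which is what you are billed on.
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello in five words."}]}'
# The response includes "usage": {"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}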
Open-source endpoints like ollama are a great place to start, since they are extremely easy to run. You can spin one up in Docker, and if you want a nice front end to go with it you can use open-webui. I don’t need a front end, since I’m targeting the LLM endpoint through its OpenAI-compatible API. I’m using llmperf to do the benchmarking.
python token_benchmark_ray.py \
--model "meta/llama3-8b-instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 1024 \
--stddev-output-tokens 10 \
--max-num-completed-requests 300 \
--timeout 600 \
--num-concurrent-requests 100 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
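For the ollama runs, the setup was just a container plus two environment variables. Here is a rough sketch, assuming the NVIDIA Container Toolkit is installed so Docker can see the GPUs, and using ollama’s default port and OpenAI-compatible path; llmperf’s openai mode reads OPENAI_API_BASE and OPENAI_API_KEY from the environment, and the key is just a placeholder for a local endpoint.
# Stand up ollama and pull a model (sketch; requires the NVIDIA Container Toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama3

# Point llmperf at the local OpenAI-compatible endpoint instead of api.openai.com
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="placeholder"
# With ollama, --model would be the local tag, e.g. "llama3", rather than the NIM name above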
Testing ollama (which I suspect uses the CUDA backend), I oddly see it use only one of the cards when I run 10 concurrent requests, and throughput is similar to a single request but with much higher latency.
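Depending on the ollama release, concurrency and multi-GPU scheduling are controlled by environment variables; this is a hedged sketch using variable names from the ollama docs, and the behavior may differ on your version.
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=10 \
  -e OLLAMA_SCHED_SPREAD=1 \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# OLLAMA_NUM_PARALLEL enables parallel request handling; OLLAMA_SCHED_SPREAD asks the scheduler to spread a model across all GPUs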
The start of this project was me getting excited about NVIDIA NIMs from the keynotes at GTC and Computex this year. I wish NVIDIA had made these optimized microservices with a completely open-source backend, but they can at least be run anywhere (on your local compute) with their API key authentication. They are extremely easy to run: you can spin up a completely optimized and tuned LLM endpoint in just a few minutes, with most of that time spent downloading the model.
NVIDIA claims in their blog post that an H200 can get up to 3,000 tokens/s… I wanted to put this to the test and also compare the inference performance against some high-end consumer cards like the RTX 4090.
I grabbed the NIM from here and started it in Docker on the same system.
This is a system I’m messing around with that doesn’t have the GPU at full PCIe bandwidth, and I accidentally discovered that inference doesn’t appear to be very PCIe-bandwidth-intensive, though I haven’t tried a model big enough to say for sure.
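If you want to sanity-check that yourself, nvidia-smi can report the negotiated PCIe link and live PCIe throughput while a benchmark is running; the queries below should work on recent driver versions.
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current --format=csv
nvidia-smi dmon -s t
# dmon -s t prints per-GPU PCIe RX/TX throughput once per second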
A 4090 system is currently renting for $0.54/hr on Runpod community cloud, so by my math, in one hour, I could service 3.79M tokens on these two cards for a total of $1.08. Not bad at all! The only problem with this setup is that I can’t run the higher-end models because I don’t have enough GPU VRAM. To run the llama3-70b at 16-bit precision, I would need 141GB total, or 75GB at 8-bit.
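The back-of-the-envelope math behind both of those numbers is below (weights only for the VRAM figure; KV cache, activations, and runtime overhead are why the real footprint lands a bit above the raw weight size).
# llama3-70b is ~70.6B parameters: 2 bytes each at FP16, 1 byte at INT8 (weights only)
python3 -c "print(f'{70.6e9*2/1e9:.0f} GB at FP16, {70.6e9/1e9:.0f} GB at INT8')"
# Two 4090s at $0.54/hr each, serving ~3.79M tokens in that hour
python3 -c "print(f'{2*0.54/3.79:.2f} USD per 1M tokens')"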
There are many quantized variants of each model available (https://ollama.com/library/llama3), because people want to run these on different systems with different amounts of GPU VRAM, or in system DRAM for CPU inference (which is much slower).
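Each quantization is just a different tag on the same model page, so switching precision is a one-line pull; the tags below are examples, so check the library page for what is currently published.
ollama pull llama3:8b-instruct-fp16
ollama pull llama3:70b-instruct-q4_0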
I don’t happen to have any H100s, but thankfully I can easily rent one for this test on Runpod for $3.39/hr.
For anyone who wants to try this at home, you will need an NVIDIA API key to be able to download and run the images anywhere, but it looks like this.
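(A rough sketch based on NVIDIA’s published NIM getting-started flow; the image tag, cache path, and flags may differ for your model and NIM version, so treat this as an outline rather than a copy-paste recipe.)
export NGC_API_KEY=<your NGC / build.nvidia.com key>
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
export LOCAL_NIM_CACHE=~/.cache/nim && mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
# The NIM then serves an OpenAI-compatible API at http://localhost:8000/v1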
NVIDIA’s claim of 3000 tokens per second with H200 was probably with 8-bit precision, but let's see how close we can get!
Similar to how queue depth works on SSDs, increasing the number of concurrent requests increases the latency by 4-5x but can increase the total throughput in tokens/s by 18x (see the sweep sketch below). It seems to me that if we can get 1,500 tokens per second at 16-bit, then we are spot on with NVIDIA’s optimized performance claim! Now I want to test some higher-end models like llama3-70b that can actually use the extra VRAM the H100 has over a consumer GPU. I also want to see the impact of changing the precision and quantization of the model on performance. Unfortunately, NVIDIA doesn’t give us an easy way in the NIM to swap out the model precision, but I’m sure that is coming; since Blackwell has optimizations for FP4, they will likely add this in the future.
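The sweep itself is just the same llmperf command in a loop over --num-concurrent-requests; there is nothing NIM-specific about it.
for c in 1 5 10 25 50 100; do
  python token_benchmark_ray.py \
    --model "meta/llama3-8b-instruct" \
    --mean-input-tokens 550 --stddev-input-tokens 150 \
    --mean-output-tokens 1024 --stddev-output-tokens 10 \
    --max-num-completed-requests 300 --timeout 600 \
    --num-concurrent-requests "$c" \
    --results-dir "result_outputs/concurrency_$c" \
    --llm-api openai --additional-sampling-params '{}'
done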
This was a fun weekend experiment, and I would love for folks who optimize LLMs to chime in. It seems like NVIDIA has knocked it out of the park here with NIMs… anyone can spin one up privately and do RAG and AI work on their own infrastructure, or easily test against the NVIDIA-hosted endpoints first.