登录查看更多内容

$300 AI computer for the GPU-poor

Zheng "Bruce" Li

Co-Founder @ NKN: Web 3 + AI

发布日期: 2024年3月22日

Intro

Running open source AI models locally on our own computer gives us privacy, endless possibility of tinkering, and freedom from large corporations. It is almost a matter of free speech.

For us GPU-poor, however, having our own AI computer seems to be a pricey dream. Macbook M3 Max ? $3200, ouch! Nvidia 4090 ? $1850, that hurts even if you can get one. Microsoft Surface Laptop 6 ? Starting at $1200, still too much.?

What if I tell you that you can get a useful AI computer for $300? Interested? You do need to supply your own monitor, keyboard and mouse. And you need a bit of tinkering around Linux operating system, drivers, middleware and configurations.

To clarify, we are NOT talking about “training” or “fine-tuning” large generative AI models. We will focus on how to run open source LLM (large language models such as LLama 2 7B) locally, as well as generating images using Stable Diffusion .

Now let’s continue.

What makes a good (and cheap) AI computer?

Let’s assume one of the main use cases for a home AI computer is running large language models , or LLM inference. This task actually does not need a GPU at all, since it can all be done in CPU. llama.cpp is an open source software that enables very fast LLM inference using normal CPU. It was originally designed for Macbook with Apple M-series CPU, but it does work on Intel/AMD CPU as well.

However, you do need the following for a faster inference speed. Otherwise you will be like watching hair grow on your palm while the LLM spits out one token at a time.

Fast CPU to memory bandwidth
Faster DRAM (at least DDR4, DDR5 will be even better)
A lot of memory (like 16GB minimal), especially if you want to run larger models (beyond 7B)

For image generation with Stable Diffusion, you do need GPU power. However, you don’t have to have a very fancy GPU for that. You can leverage the integrated GPU already in the your home computers:

All Macs with M1/M2/M3 CPU, which integrates CPU, GPU and high speed memory (they are really good, but due to price are excluded from this particular article)
AMD APU (e.g. Ryzen 7 5700U), which integrates CPU and GPU for budget friendly mini-PCs. This will be the focus of this article
Intel CPU (e.g. Core i5-1135G7), which also integrates CPU and GPU. They are slightly above the $300 budget for the entire mini-PC, but readers are welcome to explore them further on their own.

And the $300 AI computer is?

An AMD-based Mini PC with the following spec usually sells for less than $300. I don’t want to endorse any particular brand, so you can search yourself:

AMD Ryzen 7 5800H (8C/16T, up to 4.4GHz)
16GB RAM DDR4 (32GB recommended)
512GB NVME M.2 SSD

I splurged a bit, and opted for the $400 model with 32GB RAM and 1TB SSD (everything else equal). The main reason is that I do research on open source LLMs and would like to run bigger models, in addition to running Stable Difusion. But you should be able to do almost everything in this article with the $300 computer.

Prep 1: Allocate enough iGPU memory

For AMD APUs like the Ryzen 7 5800H , memory is shared between CPU and iGPU (integrated GPU). In my case, I have 32GB RAM total, but the default allocation for iGPU was only 3GB! This varies from computer to computer and is configured in BIOS during manufacturing.

You need to change that depending on your main use case:

If you only need to run LLM inference, you can skip this entire prep step. Since LLM inference will only need to use CPU, and you should save most RAM for the CPU so you can run larger LLM models.
If you need to run Stable Diffusion , especially SDXL (1024x1024), you need to allocate as much RAM for the iGPU as the system allows (typically half of total RAM)

In my case I want to run both Stable Diffusion XL and LLM inference on the same mini PC. Therefore I would like to allocate 16GB (out of 32GB total) for the GPU.

You can achieve this by changing the settings in BIOS. Typically there is an upper limit, and the default setting might be much lower than the upper limit. On my computer the upper limit was 16GB, or half of the total RAM available.?

Good BIOS

If your computer’s BIOS supports such settings, go ahead and change to your desired number. My BIOS has no such setting.

Poor BIOS: use Universal AMD tool

If your BIOS does not have this setting, then please follow the nice instruction “Unlocking GPU Memory Allocation on AMD Ryzen? APU?”? by Winston Ma. I tried it and it worked well, so now I have 16GB VRAM.

https://winstonhyypia.medium.com/amd-apu-how-to-modify-the-dedicated-gpu-memory-e27b75905056 ?

Prep 2: Install drivers & middleware

Align the stars

AMD’s ROCm (Radeon Open Compute platform), comparable to Nvidia’s CUDA , is a suite of drivers and middleware to enable developers to utilize the power of ADM’s GPUs. And typically AI applications need ROCm to get GPU acceleration.

In order to install and make AMD’s ROCm work, you have to make sure that the versions of GPU hardware, Linux distro, kernel, python, HIP driver, ROCm library, and pytorch are compatible. If you want the least pain and max possibility of first time success, stick with the recommended and verified combinations.

Prerequisite

Please check the following link to get the compatible Linux OS and kernel versions, and install them. Initially I made the mistake of just installing my favorite Linux OS and default Linux kernel, and it was a big pain to walk backwards to resolve compatibility issues. You can avoid this pain by just using the officially supported combinations.

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html ?

ROCm installation

https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html ?

If the entire installation finishes well, you can type in rocminfo and something like this will show (I only snipped the most relevant parts in highlighted yellow):

ROCk module is loaded

=====================????

HSA System Attributes????

=====================????

Runtime Version: ? ? ? ? 1.1

System Timestamp Freq.:? 1000.000000MHz

Sig. Max Wait Duration:? 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)

Machine Model: ? ? ? ? ? LARGE??????????????????????????????

System Endianness: ? ? ? LITTLE?????????????????????????????

Mwaitx:? ? ? ? ? ? ? ? ? DISABLED

DMAbuf Support:? ? ? ? ? YES

==========???????????????

HSA Agents???????????????

==========???????????????

*******??????????????????

Agent 1??????????????????

*******??????????????????

??Name:? ? ? ? ? ? ? ? ? ? AMD Ryzen 7 5800H with Radeon Graphics

??Uuid:? ? ? ? ? ? ? ? ? ? CPU-XX?????????????????????????????

??Marketing Name:? ? ? ? ? AMD Ryzen 7 5800H with Radeon Graphics

??Vendor Name: ? ? ? ? ? ? CPU????????????????????????????????

Pool Info:???????????????

????Pool 1???????????????????

??????Segment: ? ? ? ? ? ? ? ? GLOBAL; FLAGS: COARSE GRAINED??????

??????Size:? ? ? ? ? ? ? ? ? ? 16777216(0x1000000) KB? ? ? ???????

Python environment

Python dependency can be quite tricky, so it is good practice to set up a proper environment. You can use either conda or venv for this purpose.

source venv/bin/activate

conda activate llm

Pytorch

https://pytorch.org/get-started/locally/

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7

HSA overwrite

The following is specific to APU’s with integrated graphics. Even though they are not officially supported by ROCm, the following proved to work.

export HSA_OVERRIDE_GFX_VERSION=9.0.0

How to verify

Now after all the complicated steps, let’s test if ROCm is working with Torch. And you can see that ROCm is “pretending” to be CUDA for the purpose of Pytorch.

https://rocm.docs.amd.com/en/latest/how_to/pytorch_install/pytorch_install.html#test-the-pytorch-installation ?

python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
Success

python3 -c 'import torch; print(torch.cuda.is_available())'
True

Alex Wang 1 个月前

Whispers of the Future

Tomasz Tunguz 1 年前

AI-Specific Chips: GPUs to Custom ASICs

Ganesh Raju 5 个月前

LLM Inference

Let’s start with something easy for our newly configured $300 AI computer: running a large language model locally. We can choose one of the popular open source modes: LLaMA 2 with 7B parameters that is optimized for chat. In addition, you can also try small LLMs from Mistral , QWen , Zephyr , Vicuna . More good quality LLMs can be found here on the very useful “chatbot arena leaderboard” by UC Berkeley’s LMSYS labs .

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Llama.cpp

We will be using llama.cpp , which is initially optimized for CPU and later supports GPU as well. In my experience, LLM inference works well on CPU and there is little to gain with a modest GPU such as the ones integrated inside the $300 AI machine.

https://github.com/ggerganov/llama.cpp

First you need to install wget and git. And then following the steps to compile and install llama.cpp.

sudo apt-get install build-essential

git clone https://github.com/ggerganov/llama.cpp.git

cd llama.cpp

make

Download model weights

In order to run the LLMs on our inexpensive machine instead of cloud servers with expensive GPUs, we need to use a “compressed” version of the models so they can fit into the RAM space. For a simple example, a LLaMA-2 7B model has 7B parameters, each represented by float16 (2 bytes).

Float 16: 14B bytes or 14GB, which will not fit into our 8GB RAM
Quantized to 4 bit: 3.5B bytes or 3.5GB, which can now fit into our 8GB RAM

Also the file format should be gguf . So in our example, you need to download the weights in this file:

https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_0.gguf ?

Test on AMD mini PC

First we tested on the AMD mini PC, and we achieved about 10 tokens per second. This is actually quite decent, and you can carry on a chat with the LLM without too much waiting.

System config:

AMD Ryzen 5800H
32GB RAM

Command line instruction:

./main -m models/llama-2-7b-chat.Q4_0.gguf --color -ins -n 512 --mlock

llama_print_timings:? ? ? ? load time = ? ? 661.10 ms

llama_print_timings:? ? ? sample time = ? ? 234.73 ms / ? 500 runs ? (? ? 0.47 ms per token,? 2130.14 tokens per second)

llama_print_timings: prompt eval time =? ? 1307.11 ms /? ? 32 tokens ( ? 40.85 ms per token,? ? 24.48 tokens per second)

llama_print_timings:? ? ? ? eval time = ? 50090.22 ms / ? 501 runs ? ( ? 99.98 ms per token,? ? 10.00 tokens per second)

llama_print_timings: ? ? ? total time = ? 64114.27 ms

Test on Intel mini PC

Next we tested on an Intel mini PC, and we achieved about 1.5 tokens per second. This is a bit too slow for a fruitful chat session. It is not a fair comparison, since the Intel N5105 is clearly weaker than AMD 5800H. But that is the only Intel mini PC in my possession. If you use the more powerful Intel CPU (e.g. Core i5-1135G7) you should get comparable results. Please report your findings in the comments below.

System config:

11th Gen 4 Cores N5105 (Up to 2.9Ghz) 4 Cores and 4 Threads
16GB RAM (2GB VRAM for iGPU)

./main -m models/llama-2-7b-chat.Q4_0.gguf -ins --color -n 512 --mlock

llama_print_timings:? ? ? ? load time = 14490.05 ms

llama_print_timings:? ? ? sample time = ? 171.53 ms /? ? 97 runs ? (? ? 1.77 ms per token, ? 565.49 tokens per second)

llama_print_timings: prompt eval time = 21234.29 ms /? ? 33 tokens (? 643.46 ms per token, ? ? 1.55 tokens per second)

llama_print_timings:? ? ? ? eval time = 75754.03 ms /? ? 98 runs ? (? 773.00 ms per token, ? ? 1.29 tokens per second)

Stable Diffusion

Artist drawing dragons with AI by zbruceli

Installation

https://github.com/AUTOMATIC1111/stable-diffusion-webui ?

And pay attention to this page as well, in regards to AMD ROCm

https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs ?

Quick start

export HSA_OVERRIDE_GFX_VERSION=9.0.0

source venv/bin/activate

./webui.sh --upcast-sampling --skip-torch-cuda-test --precision full --no-half

Stable Diffusion 1.5 test

./webui.sh --upcast-sampling --skip-torch-cuda-test --precision full --no-half

Test 1

Prompt: “horse in forest”

Steps: 20, Sampler: DDIM, CFG scale: 7, Seed: 519288240, Size: 512x512, Model hash: 6ce0161689, Model: v1-5-pruned-emaonly, Version: v1.6.0

Time taken: 1 min. 8.3 sec.

Stable Diffusion XL 1.0 test

SDXL (max resolution 1024x1024) recommends at least 12GB VRAM, so you definitely need to get the Prep 1 step done to allocate 16GB VRAM for iGPU. So this task is only possible with the $400 mini PC.

./webui.sh --upcast-sampling

Test 1:

Prompt: “horse in forest”

Steps: 20, Sampler: DDIM, CFG scale: 7, Seed: 1102941451, Size: 1024x768, Model hash: 31e35c80fc, Model: sd_xl_base_1.0, Version: v1.6.0

Time taken: 7 min. 41 sec

Test 2:

Prompt: “young taylor swift in red hoodie riding a horse in forest”

Negative prompt: deformities, deformity, deformed eyes, deformed teeth, deformed fingers, deformed face, deformed hands, deformed

Steps: 20, Sampler: DDIM, CFG scale: 7, Seed: 2960206663, Size: 1024x1024, Model hash: 31e35c80fc, Model: sd_xl_base_1.0, Version: v1.6.0

Time taken: 6 min. 12.3 sec.

Windows 11 and AMD/directml

Although this article is focusing on Linux operating systems, you can get Stable Diffusion working in Windows too. Here is my experiments:

https://github.com/lshqqytiger/stable-diffusion-webui-directml ?

First need to install Python 3.10.6

Add python 3.10.6 director to PATH

Important: python path has to be the top path

https://realpython.com/add-python-to-path/

install git and git clone repo

run webui-user.bat from file explorer

Test 1:

Prompt: “horse in forest”

Settings: DPM++ 2M Karras, 512x512, sampling steps 20

Time taken: 1m19s

Conclusions

So are you having fun running your own generative AI models on your new $300 mini PC? I hope you do.

Open source AI models running on personal devices is one of the most exciting areas for tinkers, since none of us will have the massive GPU pool to actually train a foundational model. This will enable a new generation of apps that are both super smart while still preserving our data privacy.

What next?

Run it on even smaller embedded devices: e.g. Raspberry Pi
Run it on your smartphones (llama.cpp does support iOS and Android)

And happy tinkering with AI, open source and on device!

Marcelo Grebois

? Infrastructure Engineer ? DevOps ? SRE ? MLOps ? AIOps ? Helping companies scale their platforms to an enterprise grade level

8 个月

Sounds like an exciting project! Can’t wait to see the results. ??

Haitham Khalid

Manager Sales | Customer Relations, New Business Development

8 个月

Exciting to see more accessible AI setups becoming available! Zheng "Bruce" Li

1 次回应

查看更多评论

要查看或添加评论，请登录

Zheng "Bruce" Li的更多文章

Elon Musk’s leadership style

2024年3月28日

Elon Musk’s leadership style

Andrej’s revelation Andrej Karpathy, founding member of OpenAI and former Sr. Director of AI at Tesla, speaks with…

1 条评论
The cheapskate’s guide to fine-tuning LLaMA-2 and run on your laptop

2023年9月11日

The cheapskate’s guide to fine-tuning LLaMA-2 and run on your laptop

My mission Everyone is GPU-poor these days, and some of us are poorer than the others. So my mission is to fine-tune a…
A practical 5-step guide to do semantic search on your private data with help of LLMs

2023年5月3日

A practical 5-step guide to do semantic search on your private data with help of LLMs

If you have a lot of private enterprise data, how can you use a chatGPT-like AI system to help you search for relevant…

5 条评论
R.I.P: "Innovation in Mobile" (2015-2023)

2023年4月24日

R.I.P: "Innovation in Mobile" (2015-2023)

It is with heavy hearts that we announce the passing of "Innovation in Mobile", who took its first breath in 2015 with…

1 条评论
Microsoft’s Minotaur Moment

2023年4月20日

Microsoft’s Minotaur Moment

The Minotaur myth The Minotaur from Greek mythology is a perfect example of something that had tremendous power yet…
My experiments with Alpaca/LLaMA 7B - large language model on your laptop

2023年3月21日

My experiments with Alpaca/LLaMA 7B - large language model on your laptop

Background On February 24, 2023 Meta Research released LLaMA: a foundational, 65-billion-parameter large language…

3 条评论
Cash management for individuals to minimize risks

2023年3月12日

Cash management for individuals to minimize risks

In light of the recent Silicon Valley Bank failure, it is again important to have a good cash management solution for…
My Uncle’s typewriter: a world connected, divided, and reimagined

2022年1月17日

My Uncle’s typewriter: a world connected, divided, and reimagined

This morning I was sitting in front of a newly adopted Olivetti Lettera 35 typewriter, admiring its svelte Italian…
How to look and sound better in video conference, from simple to advanced

2020年5月6日

How to look and sound better in video conference, from simple to advanced

Now that we all spend so much time in video conferencing either for work or life, it is critical to make yourself look…

2 条评论
Our 7 bold predictions about NET 2.0: 15 years later

2020年3月31日

Our 7 bold predictions about NET 2.0: 15 years later

By Zheng “Bruce” Li and Jan E. Berglund In this follow-up to the original “NET 2.

3 条评论

See all articles

Intro

What makes a good (and cheap) AI computer?

And the $300 AI computer is?

Prep 1: Allocate enough iGPU memory

Good BIOS

Poor BIOS: use Universal AMD tool

Prep 2: Install drivers & middleware

Align the stars

Prerequisite

ROCm installation

Python environment

Pytorch

HSA overwrite

How to verify

领英推荐

LLM Inference

Llama.cpp

Download model weights

Test on AMD mini PC

Test on Intel mini PC

Stable Diffusion

Installation

Quick start

Stable Diffusion 1.5 test

Stable Diffusion XL 1.0 test

Windows 11 and AMD/directml

Conclusions

Zheng "Bruce" Li的更多文章

Elon Musk’s leadership style

The cheapskate’s guide to fine-tuning LLaMA-2 and run on your laptop

A practical 5-step guide to do semantic search on your private data with help of LLMs

R.I.P: "Innovation in Mobile" (2015-2023)

Microsoft’s Minotaur Moment

My experiments with Alpaca/LLaMA 7B - large language model on your laptop

Cash management for individuals to minimize risks

My Uncle’s typewriter: a world connected, divided, and reimagined

How to look and sound better in video conference, from simple to advanced

Our 7 bold predictions about NET 2.0: 15 years later

社区洞察

其他会员也浏览了

Scalable Processors with Built in AI Accelerators

?? How to Get Lightning-Fast LLMs

Building the Future of MLOps with GPUs: Speed, Scalability and Efficiency

AI Chips: The Powerhouse of Sustainable Computing

AMD's CUDA Challenge

LLM Inference: Hardware Solutions Under the Spotlight, including Nvidia, Intel, and the Rise of AMD

#148 The Pipe Dream of Running Inference on CPUs

Unlocking the Power of GPUs for Efficient AI Model Deployment ??

Offline Inferencing with Ollama and Smaug-72B at 5 tokens/second

Accelerated Computing Series Part 3: Deep Learning Accelerator on FPGA & Linux Drivers