Building LlamaPi Robot - Challenges and Takeaways

Introduction

Recently I created LlamaPi, a project that demonstrates Voice + LLM + Robotics on a low-power device (Raspberry Pi 5). The project won 1st and 3rd prizes at the recent InfiniEdge AI Hackathon. More details can be found in this article or in the project README.

Here I'd like to discuss some of the challenges and what I learned from building LlamaPi.

Local LLM on RPi: Challenges

The biggest challenge is the performance of running an 8B model on a low-power edge device like the Raspberry Pi. With 4-bit quantization, I was able to fit Llama-3.1 8B on the device, but the generation speed was only about 1.8 tokens/second (using llama.cpp + OpenBLAS).
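For reference, a CPU-only build with OpenBLAS looks roughly like the commands below; the exact CMake flag names vary between llama.cpp versions (older ones use LLAMA_BLAS instead of GGML_BLAS), and llama-bench can be used to measure tokens/second:

sudo apt install libopenblas-dev
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j 4
./build/bin/llama-bench -m <model-q4_0.gguf> -t 4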

I used several techniques (or tricks) to mitigate the impact on user experience, for example:

  • Limit the length of the system prompt and the responses, at the expense of not being able to use more sophisticated system prompts.
  • Temporarily disable the conversation history in the local LLM version to reduce the prefill time.
  • Use streaming mode for generation and detect “end of sentence” on the fly. Once a sentence is finished, I call TTS immediately to speak it to the user. This makes the robot sound more responsive than waiting for the entire generation to finish (see the sketch after this list).
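Below is a minimal sketch of the streaming + sentence-splitting trick using llama-cpp-python. The model path, the prompt, and the speak() helper are placeholders, not the actual LlamaPi code:

from llama_cpp import Llama

llm = Llama(model_path="llama-3.1-8b-instruct-q4_0.gguf", n_ctx=512, n_threads=4)

def speak(text):
    # Placeholder for the TTS call; LlamaPi's actual TTS engine is not shown here.
    print(f"[TTS] {text}")

SENTENCE_END = ".!?"
buffer = ""
for chunk in llm("You are a helpful robot arm. User: hello! Assistant:",
                 max_tokens=128, stream=True):
    buffer += chunk["choices"][0]["text"]
    # Detect sentence boundaries on the fly and speak each finished sentence
    # immediately instead of waiting for the whole generation to complete.
    while any(p in buffer for p in SENTENCE_END):
        idx = min(buffer.index(p) for p in SENTENCE_END if p in buffer)
        sentence, buffer = buffer[:idx + 1], buffer[idx + 1:]
        if sentence.strip():
            speak(sentence.strip())
if buffer.strip():
    speak(buffer.strip())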

Still, I need a more fundamental solution to the performance problem of running the LLM locally on the Raspberry Pi. My target is to achieve 10 tokens/second.

Leverage the VideoCore GPU on Raspberry Pi 5?

Raspberry Pi 5 has a VideoCore GPU that supports Vulkan. llama.cpp/ggml also has a Vulkan backend (ggml-vulkan.cpp), making this a (seemingly) viable option.

First, install the Vulkan packages:

sudo apt install libgulkan-0.15-0 libgulkan-dev vulkan-tools libvulkan-dev libvkfft-dev libgulkan-utils glslc

To compile llama.cpp:

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Or use llama-cpp-python binding:

CMAKE_ARGS="-DGGML_VULKAN=ON" pip install 'llama-cpp-python'
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install 'llama-cpp-python[server]'        

The VideoCore GPU on the Raspberry Pi does not have enough memory for the entire model, but I can offload some of the layers to the GPU using the -ngl argument. From my experiments, offloading 20 layers (out of 32) could pass initialization without an OOM error.
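With the llama-cpp-python binding, the same partial offload is controlled by the n_gpu_layers parameter; a minimal sketch with a placeholder model path:

from llama_cpp import Llama

# Offload 20 of the 32 transformer layers to the VideoCore GPU (Vulkan backend);
# the remaining layers run on the CPU.
llm = Llama(model_path="llama-3.1-8b-instruct-q4_0.gguf",
            n_gpu_layers=20, n_ctx=512, n_threads=4)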

Unfortunately, llama.cpp got stuck running the model. After some research, I tried disabling loop unrolling (by setting the V3D_DEBUG environment variable), and that seemed to get it through:

V3D_DEBUG=noloopunroll ./build/bin/llama-cli -m <model.gguf> -p "Hello" -n 50 -ngl 20 -t 4 -c 512        

However, the model generated corrupted output, and it was even slower than the CPU (probably because loop unrolling was disabled). :-(

I did some research, and my hypothesis is that it has something to do with the shaders, which assume a warp/subgroup size of 32 or 64, while the VideoCore GPU on the Raspberry Pi uses 16.
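One way to check the subgroup size reported by the device is to query it with vulkaninfo (from the vulkan-tools package installed above):

vulkaninfo | grep -i subgroupsize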

So far I haven't had time to look further into this. Vulkan is new to me, so debugging this issue will be a bit challenging (and fun too!). Any advice would be appreciated.

Optimize CPU inference with LUT?

This idea is inspired by the recent T-MAC paper, which uses LUTs (look-up tables) to replace arithmetic operations. It can be especially useful for low-bit quantized models. For example, consider a multiplication between an 8-bit and a 4-bit number: if we pre-compute all possible combinations and store the results in a 256x16 table, the multiplication can be replaced by a memory lookup.
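A toy sketch of the idea (not T-MAC itself, which works at the bit-plane level with far more optimized table layouts): pre-compute a 256x16 table of int8 x int4 products, then replace every multiplication in a dot product with a lookup.

import numpy as np

# Pre-compute all int8 (256 values) x int4 (16 values) products once.
# int8 values span -128..127, int4 values span -8..7.
lut = np.empty((256, 16), dtype=np.int32)
for a in range(-128, 128):
    for b in range(-8, 8):
        lut[a & 0xFF, b & 0xF] = a * b

def dot_lut(activations_i8, weights_i4):
    """Dot product where every multiply is replaced by a table lookup."""
    return int(lut[activations_i8 & 0xFF, weights_i4 & 0xF].sum())

# Quick check against a regular dot product.
rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=64, dtype=np.int32)
w = rng.integers(-8, 8, size=64, dtype=np.int32)
assert dot_lut(x, w) == int(np.dot(x, w))

With 4-bit weights the table is tiny (16 KB with int32 entries), so it fits comfortably in L1 cache, which is what makes this approach attractive on CPUs.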

I think this LUT idea makes a lot of sense and might deliver a significant performance boost on the Raspberry Pi. In fact, the T-MAC paper already showed some promising results on Raspberry Pi 5. Adopting this idea in my project is another direction I'd be interested in exploring.

Further quantize the model to 2-bit?

I'd rather not go down this path… If you look at the help page of llama.cpp's quantization tool, you'll see that Q2 adds a lot more perplexity (ppl) than Q4:

 2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
 3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
 8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
 9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
......
10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B        
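For completeness, producing these files from an f16 GGUF looks roughly like the commands below (the binary may be named quantize or llama-quantize depending on the llama.cpp version):

./build/bin/llama-quantize <model-f16.gguf> <model-Q4_0.gguf> Q4_0
./build/bin/llama-quantize <model-f16.gguf> <model-Q2_K.gguf> Q2_K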

Conclusion

The LlamaPi project successfully demonstrated the potential of Voice + LLM + Robotics on a low-power edge device. However, a lot of work still needs to be done to unleash the full performance of the Raspberry Pi. My target is to achieve 10 tokens/second with the 8B model on Raspberry Pi 5. If you have any ideas or suggestions, please let me know!
