Building LlamaPi Robot - Challenges and Takeaways

Introduction

Recently I created LlamaPi, a project that demonstrates Voice + LLM + Robotics on a low-power device (Raspberry Pi 5). The project won 1st and 3rd prizes at the recent InfiniEdge AI Hackathon. More details can be found in this article or in the project README.

Here I'd like to discuss some of the challenges and what I learned from building LlamaPi.

Local LLM on RPi: Challenges

The biggest challenge is the performance of running an 8B model on a low-power edge device like the Raspberry Pi. With 4-bit quantization, I was able to fit Llama-3.1 8B on the device, but the generation speed was only about 1.8 tokens/second (using llama.cpp + OpenBLAS).
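For reference, a CPU-only build with OpenBLAS looks roughly like the commands below; the exact CMake flag names vary between llama.cpp versions (older ones use LLAMA_BLAS instead of GGML_BLAS), and llama-bench can be used to measure tokens/second:

sudo apt install libopenblas-dev
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j 4
./build/bin/llama-bench -m <model-q4_0.gguf> -t 4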

I used several techniques (or tricks) to mitigate the impact on user experience, for example:

  • Limit the length of the system prompt and the responses, at the expense of not being able to use more sophisticated system prompts.
  • Temporarily disable the conversation history in the local LLM version to reduce the prefill time.
  • Use streaming mode for generation and detect “end of sentence” on the fly. Once a sentence is finished, I call TTS immediately to speak it to the user. This makes the robot sound more responsive than waiting for the entire generation to finish (see the sketch after this list).
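Below is a minimal sketch of the streaming + sentence-splitting trick using llama-cpp-python. The model path, the prompt, and the speak() helper are placeholders, not the actual LlamaPi code:

from llama_cpp import Llama

llm = Llama(model_path="llama-3.1-8b-instruct-q4_0.gguf", n_ctx=512, n_threads=4)

def speak(text):
    # Placeholder for the TTS call; LlamaPi's actual TTS engine is not shown here.
    print(f"[TTS] {text}")

SENTENCE_END = ".!?"
buffer = ""
for chunk in llm("You are a helpful robot arm. User: hello! Assistant:",
                 max_tokens=128, stream=True):
    buffer += chunk["choices"][0]["text"]
    # Detect sentence boundaries on the fly and speak each finished sentence
    # immediately instead of waiting for the whole generation to complete.
    while any(p in buffer for p in SENTENCE_END):
        idx = min(buffer.index(p) for p in SENTENCE_END if p in buffer)
        sentence, buffer = buffer[:idx + 1], buffer[idx + 1:]
        if sentence.strip():
            speak(sentence.strip())
if buffer.strip():
    speak(buffer.strip())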

Still, I need a more fundamental solution to the performance problem of running the LLM locally on the Raspberry Pi. My target is to achieve 10 tokens/second.

Leverage the VideoCore GPU on Raspberry Pi 5?

Raspberry Pi 5 has a VideoCore GPU that supports Vulkan. llama.cpp/ggml also has a Vulkan backend (ggml-vulkan.cpp), making this a (seemingly) viable option.

First, install the Vulkan packages:

sudo apt install libgulkan-0.15-0 libgulkan-dev vulkan-tools libvulkan-dev libvkfft-dev libgulkan-utils glslc

To compile llama.cpp:

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Or use llama-cpp-python binding:

CMAKE_ARGS="-DGGML_VULKAN=ON" pip install 'llama-cpp-python'
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install 'llama-cpp-python[server]'        

The VideoCore GPU on the Raspberry Pi does not have enough memory for the entire model, but I can offload some of the layers to the GPU using the -ngl argument. From my experiments, offloading 20 layers (out of 32) could pass initialization without an OOM error.
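With the llama-cpp-python binding, the same partial offload is controlled by the n_gpu_layers parameter; a minimal sketch with a placeholder model path:

from llama_cpp import Llama

# Offload 20 of the 32 transformer layers to the VideoCore GPU (Vulkan backend);
# the remaining layers run on the CPU.
llm = Llama(model_path="llama-3.1-8b-instruct-q4_0.gguf",
            n_gpu_layers=20, n_ctx=512, n_threads=4)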

Unfortunately, llama.cpp got stuck running the model. After some research, I tried disabling loop unrolling (by setting the V3D_DEBUG environment variable), and that seemed to get it through:

V3D_DEBUG=noloopunroll ./build/bin/llama-cli -m <model.gguf> -p "Hello" -n 50 -ngl 20 -t 4 -c 512        

However, the model generated corrupted output, and it was even slower than the CPU (probably because loop unrolling was disabled). :-(

I did some research, and my hypothesis is that it has something to do with the shaders, which assume a warp/subgroup size of 32 or 64, while the VideoCore GPU on the Raspberry Pi uses 16.
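One way to check the subgroup size reported by the device is to query it with vulkaninfo (from the vulkan-tools package installed above):

vulkaninfo | grep -i subgroupsize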

So far I haven't had time to look further into this. Vulkan is new to me, so debugging this issue will be a bit challenging (and fun too!). Any advice would be appreciated.

Optimize CPU inference with LUT?

This idea is inspired by the recent T-MAC paper, which uses LUTs (look-up tables) to replace arithmetic operations. It can be especially useful for low-bit quantized models. For example, consider a multiplication between an 8-bit and a 4-bit number: if we pre-compute all possible combinations and store the results in a 256x16 table, the multiplication can be replaced by a memory lookup.
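A toy sketch of the idea (not T-MAC itself, which works at the bit-plane level with far more optimized table layouts): pre-compute a 256x16 table of int8 x int4 products, then replace every multiplication in a dot product with a lookup.

import numpy as np

# Pre-compute all int8 (256 values) x int4 (16 values) products once.
# int8 values span -128..127, int4 values span -8..7.
lut = np.empty((256, 16), dtype=np.int32)
for a in range(-128, 128):
    for b in range(-8, 8):
        lut[a & 0xFF, b & 0xF] = a * b

def dot_lut(activations_i8, weights_i4):
    """Dot product where every multiply is replaced by a table lookup."""
    return int(lut[activations_i8 & 0xFF, weights_i4 & 0xF].sum())

# Quick check against a regular dot product.
rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=64, dtype=np.int32)
w = rng.integers(-8, 8, size=64, dtype=np.int32)
assert dot_lut(x, w) == int(np.dot(x, w))

With 4-bit weights the table is tiny (16 KB with int32 entries), so it fits comfortably in L1 cache, which is what makes this approach attractive on CPUs.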

I think this LUT idea makes a lot of sense and might deliver a significant performance boost on the Raspberry Pi. In fact, the T-MAC paper already showed some promising results on Raspberry Pi 5. Adopting this idea in my project is another direction I'd be interested in exploring.

Further quantize the model to 2-bit?

I'd rather not go down this path… If you look at the help page of llama.cpp's quantization tool, you'll see that Q2 adds a lot more perplexity (ppl) than Q4:

 2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
 3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
 8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
 9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
......
10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B        
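For completeness, producing these files from an f16 GGUF looks roughly like the commands below (the binary may be named quantize or llama-quantize depending on the llama.cpp version):

./build/bin/llama-quantize <model-f16.gguf> <model-Q4_0.gguf> Q4_0
./build/bin/llama-quantize <model-f16.gguf> <model-Q2_K.gguf> Q2_K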

Conclusion

The LlamaPi project successfully demonstrated the potential of Voice + LLM + Robotics on a low-power edge device. However, a lot of work still needs to be done to unleash the full performance of the Raspberry Pi. My target is to achieve 10 tokens/second with the 8B model on Raspberry Pi 5. If you have any ideas or suggestions, please let me know!
