Harnessing Local LLMs: A Guide to Setting Up Ollama with GPU Passthrough
In the rapidly evolving landscape of AI, the ability to run large language models (LLMs) locally is becoming increasingly desirable for engineering leaders and AI practitioners. This not only ensures data privacy but also provides flexibility and control over computational resources. In this blog post, I'll share my journey of setting up Ollama with GPU passthrough on a homelab machine using Open WebUI.
Initial Setup on TrueNAS Scale
My adventure began by deploying Ollama and Open WebUI on a TrueNAS SCALE server running on a Lenovo M710t. The setup was straightforward thanks to TrueNAS's intuitive app installation process. I made sure Ollama had ample RAM (~24 GiB) available, and configuring Open WebUI was mostly a matter of pointing it at the local Ollama instance's port. Once everything was up, I downloaded and used a range of models, including deepseek-r1, qwen2.5, llama3.2, and gemma2.
Even though the machine runs a 7th-generation Intel processor, performance was commendable, which speaks to how viable local LLM deployment can be even on older hardware.
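Under the hood the wiring is simple: Ollama exposes its API on port 11434 by default, and Open WebUI just needs that host and port in its Ollama connection setting. Models can be pulled from the Open WebUI interface or, assuming shell access to the Ollama app, directly from the CLI; a minimal sketch:
curl http://localhost:11434/api/version   # confirm the Ollama API is reachable on its default port
ollama pull llama3.2                      # download a model; repeat for deepseek-r1, qwen2.5, gemma2, etc.
ollama list                               # verify the downloaded models are available locally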
Enhancing Performance with GPU Passthrough
The next phase of my exploration involved setting up Ollama with GPU passthrough to significantly boost processing power. I turned to a Beelink SER5 Mini PC equipped with an AMD Ryzen 7 5800H processor for this task.
Initial Challenges
I initially followed a guide that walked through creating an Alpine LXC container, configuring GPU passthrough, and running Ollama and Open WebUI via Docker Compose (a sketch of the Compose layout follows). The setup was noticeably faster (roughly 2-4x), but it failed to use the GPU effectively because the integrated GPU had too little VRAM allocated: the allocation was set to 64 MB, which I raised to 16 GB while still leaving ample RAM for everything else.
Even with the larger VRAM allocation, permission issues remained: the device mounts inside the LXC container were owned by the nobody group. Attempts to resolve this through cgroup mappings and by changing the container's privilege settings got nowhere, and at one point temporarily broke Docker as well.
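For reference, the Docker Compose layout looked roughly like the sketch below. It is illustrative rather than my exact file: the ollama/ollama:rocm and ghcr.io/open-webui/open-webui:main images, port 11434, and the OLLAMA_BASE_URL setting follow the projects' documented defaults, while the volume names are placeholders.
services:
  ollama:
    image: ollama/ollama:rocm               # ROCm variant of the image for AMD GPUs
    ports:
      - "11434:11434"                       # Ollama API
    volumes:
      - ollama-data:/root/.ollama           # persist downloaded models
    devices:
      - /dev/kfd:/dev/kfd                   # AMD compute device
      - /dev/dri/renderD128:/dev/dri/renderD128   # render node passed through from the host
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                         # the UI listens on 8080 inside the container
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama-data: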
Successful Configuration
The breakthrough came when I found a setup script from helper-scripts that installs Ollama in a Debian-based LXC whose video and render groups match the host system's. With that in place, GPU passthrough worked without major hitches.
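The detail that makes this work is that the video and render groups inside the container resolve to the same GIDs that own the GPU device nodes on the host, so the passed-through devices no longer end up owned by nobody. A quick sanity check (the ollama user below is an assumption; substitute whatever user the service actually runs as):
# On the host:
ls -ln /dev/kfd /dev/dri/renderD128   # numeric owner/group of the GPU device nodes
getent group video render             # the GIDs behind those group names
# Inside the container:
getent group video render             # the same names should resolve to matching GIDs
id ollama                             # the service user should belong to render/video
usermod -aG render,video ollama       # add it if missing, then restart the service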
I modified the configuration to include:
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 234:0 rwm
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
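On a Proxmox VE host these lines go into the container's config file under /etc/pve/lxc/, and the container needs a restart before the new device mounts show up. Roughly, with 101 standing in for your actual container ID:
# append the lines above to /etc/pve/lxc/101.conf, then restart the container
pct stop 101 && pct start 101
# confirm the devices are now visible inside the container
pct exec 101 -- ls -l /dev/kfd /dev/dri/renderD128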
The Ollama build installed by helper-scripts ran into compatibility issues, so I switched to the ROCm build of Ollama by following these steps:
service ollama stop                                                    # stop the service before touching its files
cd /usr
curl -fsSLO https://ollama.com/download/ollama-linux-amd64-rocm.tgz    # fetch the additional ROCm package
tar -C /usr -xzf ollama-linux-amd64-rocm.tgz                           # unpack into /usr alongside the existing install
rm /usr/ollama-linux-amd64-rocm.tgz                                    # clean up the archive
Because ROCm does not officially support the 5800H's integrated graphics, I set the environment variable HSA_OVERRIDE_GFX_VERSION='9.0.0' in the Ollama service configuration, which makes ROCm treat the iGPU as the supported gfx900 target:
[Service]
Environment=HSA_OVERRIDE_GFX_VERSION='9.0.0'
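On a systemd-based install the cleanest way to apply this is a drop-in override rather than editing the unit file directly; assuming the service is named ollama, something like:
systemctl edit ollama        # opens an override file; add the [Service] / Environment= lines above
systemctl daemon-reload
systemctl restart ollama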
With these adjustments, Ollama utilized the GPU effectively, and model performance improved significantly.
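A few rough ways to confirm the GPU is actually being used (output and wording vary between Ollama versions, so treat these as sanity checks rather than a guaranteed recipe):
ollama run llama3.2 "Say hello"                  # load a model and generate something
ollama ps                                        # the PROCESSOR column should report GPU rather than CPU
journalctl -u ollama | grep -iE 'rocm|gpu'       # service logs mention the detected ROCm device
radeontop                                        # live GPU/VRAM utilization (apt install radeontop if missing)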
Conclusion
This journey underscores the potential of local LLM deployment for AI practitioners seeking control and flexibility. By leveraging tools like Ollama with GPU passthrough, we can harness powerful computational resources efficiently. Whether you're running on older hardware or setting up a dedicated homelab, these insights can help streamline your setup process.
Feel free to reach out if you have questions or need further guidance on this exciting venture into local AI deployment!