Harnessing Local LLMs: A Guide to Setting Up Ollama with GPU Passthrough
In the rapidly evolving landscape of AI, the ability to run large language models (LLMs) locally is becoming increasingly desirable for engineering leaders and AI practitioners. This not only ensures data privacy but also provides flexibility and control over computational resources. In this blog post, I'll share my journey of setting up Ollama with GPU passthrough on a homelab machine using Open WebUI.
Initial Setup on TrueNAS Scale
My adventure began by deploying Ollama and Open WebUI on a TrueNAS SCALE server running on a Lenovo M710t. The setup was straightforward thanks to TrueNAS's intuitive app installation process. I made sure Ollama had ample RAM (~24 GiB) available, and configuring Open WebUI was mostly a matter of pointing it at the local Ollama instance's port. Once everything was up, I downloaded and used a range of models, including deepseek-r1, qwen2.5, llama3.2, and gemma2.
Even though the machine runs a 7th-generation Intel processor, performance was commendable, which speaks to how viable local LLM deployment can be even on older hardware.
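Under the hood the wiring is simple: Ollama exposes its API on port 11434 by default, and Open WebUI just needs that host and port in its Ollama connection setting. Models can be pulled from the Open WebUI interface or, assuming shell access to the Ollama app, directly from the CLI; a minimal sketch:
curl http://localhost:11434/api/version   # confirm the Ollama API is reachable on its default port
ollama pull llama3.2                      # download a model; repeat for deepseek-r1, qwen2.5, gemma2, etc.
ollama list                               # verify the downloaded models are available locally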
Enhancing Performance with GPU Passthrough
The next phase of my exploration involved setting up Ollama with GPU passthrough to significantly boost processing power. I turned to a Beelink SER5 Mini PC equipped with an AMD Ryzen 7 5800H processor for this task.
Initial Challenges
I initially followed a guide that walked through creating an Alpine LXC container, configuring GPU passthrough, and running Ollama and Open WebUI via Docker Compose (a sketch of the Compose layout follows). The setup was noticeably faster (roughly 2-4x), but it failed to use the GPU effectively because the integrated GPU had too little VRAM allocated: the allocation was set to 64 MB, which I raised to 16 GB while still leaving ample RAM for everything else.
Even with the larger VRAM allocation, permission issues remained: the device mounts inside the LXC container were owned by the nobody group. Attempts to resolve this through cgroup mappings and by changing the container's privilege settings got nowhere, and at one point temporarily broke Docker as well.
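For reference, the Docker Compose layout looked roughly like the sketch below. It is illustrative rather than my exact file: the ollama/ollama:rocm and ghcr.io/open-webui/open-webui:main images, port 11434, and the OLLAMA_BASE_URL setting follow the projects' documented defaults, while the volume names are placeholders.
services:
  ollama:
    image: ollama/ollama:rocm               # ROCm variant of the image for AMD GPUs
    ports:
      - "11434:11434"                       # Ollama API
    volumes:
      - ollama-data:/root/.ollama           # persist downloaded models
    devices:
      - /dev/kfd:/dev/kfd                   # AMD compute device
      - /dev/dri/renderD128:/dev/dri/renderD128   # render node passed through from the host
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                         # the UI listens on 8080 inside the container
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama-data: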
Successful Configuration
The breakthrough came when I found a setup script from helper-scripts that installs Ollama in a Debian-based LXC whose video and render groups match the host system's. With that in place, GPU passthrough worked without major hitches.
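The detail that makes this work is that the video and render groups inside the container resolve to the same GIDs that own the GPU device nodes on the host, so the passed-through devices no longer end up owned by nobody. A quick sanity check (the ollama user below is an assumption; substitute whatever user the service actually runs as):
# On the host:
ls -ln /dev/kfd /dev/dri/renderD128   # numeric owner/group of the GPU device nodes
getent group video render             # the GIDs behind those group names
# Inside the container:
getent group video render             # the same names should resolve to matching GIDs
id ollama                             # the service user should belong to render/video
usermod -aG render,video ollama       # add it if missing, then restart the service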
I modified the configuration to include:
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 234:0 rwm
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
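On a Proxmox VE host these lines go into the container's config file under /etc/pve/lxc/, and the container needs a restart before the new device mounts show up. Roughly, with 101 standing in for your actual container ID:
# append the lines above to /etc/pve/lxc/101.conf, then restart the container
pct stop 101 && pct start 101
# confirm the devices are now visible inside the container
pct exec 101 -- ls -l /dev/kfd /dev/dri/renderD128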
The Ollama build installed by helper-scripts ran into compatibility issues, so I switched to the ROCm build of Ollama by following these steps:
service ollama stop                                                    # stop the service before touching its files
cd /usr
curl -fsSLO https://ollama.com/download/ollama-linux-amd64-rocm.tgz    # fetch the additional ROCm package
tar -C /usr -xzf ollama-linux-amd64-rocm.tgz                           # unpack into /usr alongside the existing install
rm /usr/ollama-linux-amd64-rocm.tgz                                    # clean up the archive
Because ROCm does not officially support the 5800H's integrated graphics, I set the environment variable HSA_OVERRIDE_GFX_VERSION='9.0.0' in the Ollama service configuration, which makes ROCm treat the iGPU as the supported gfx900 target:
[Service]
Environment=HSA_OVERRIDE_GFX_VERSION='9.0.0'
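On a systemd-based install the cleanest way to apply this is a drop-in override rather than editing the unit file directly; assuming the service is named ollama, something like:
systemctl edit ollama        # opens an override file; add the [Service] / Environment= lines above
systemctl daemon-reload
systemctl restart ollama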
With these adjustments, Ollama utilized the GPU effectively, and model performance improved significantly.
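A few rough ways to confirm the GPU is actually being used (output and wording vary between Ollama versions, so treat these as sanity checks rather than a guaranteed recipe):
ollama run llama3.2 "Say hello"                  # load a model and generate something
ollama ps                                        # the PROCESSOR column should report GPU rather than CPU
journalctl -u ollama | grep -iE 'rocm|gpu'       # service logs mention the detected ROCm device
radeontop                                        # live GPU/VRAM utilization (apt install radeontop if missing)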
Conclusion
This journey underscores the potential of local LLM deployment for AI practitioners seeking control and flexibility. By leveraging tools like Ollama with GPU passthrough, we can harness powerful computational resources efficiently. Whether you're running on older hardware or setting up a dedicated homelab, these insights can help streamline your setup process.
Feel free to reach out if you have questions or need further guidance on this exciting venture into local AI deployment!