A New Era for Private AI: Multi-GPU Support Using External GPUs, Oculink, and PCIe Cards (DIY)
Massimiliano P.
Introduction
In the evolving landscape of AI and machine learning, leveraging powerful GPUs for model training and inference is crucial. While enterprise setups have built-in GPU support, many home and small business servers lack any form of GPU expandability. This article explores how to add support for external GPUs to your home server by using Oculink PCIe adapters (throughout this article, our examples use a Dell PowerEdge R730xd server running Ubuntu 24), enabling high-performance AI workloads even on hardware that was never designed for GPU usage or AI inference.
Executive Summary (TLDR)
This article provides a comprehensive guide to expanding GPU capabilities for home and business servers that lack native GPU support. By using Oculink PCIe adapters, users can connect external GPUs to home servers and desktops, thus unlocking powerful AI model fine-tuning and inference capabilities while maintaining data privacy.
Key Takeaways:
By leveraging Oculink technology and cost-effective GPU setups, users can run powerful AI models locally while keeping sensitive data private and reducing reliance on expensive cloud-based solutions.
Background: The Importance of Local AI Processing
With growing concerns around data privacy and cloud dependency, the ability to run AI models locally is becoming increasingly important. Processing AI workloads on local hardware ensures that sensitive data remains secure, avoids vendor lock-in, and offers full control over performance and latency.
However, one of the biggest challenges in achieving local AI computing is the lack of GPU support in many existing devices. Servers that you might already have at home, or that you can find online for a few hundred dollars (e.g., 1U or 2U rack-mountable units), were never designed with GPU compatibility in mind, making traditional expansion difficult. Fortunately, Oculink technology provides a seamless and efficient way to add external GPUs to almost any system, regardless of form factor, and improves on previous approaches that relied on USB or Thunderbolt connectivity to the GPU.
By using an Oculink connector, an external GPU enclosure, and a bit of setup time, you can unlock the potential of AI computing on a variety of hardware. This method allows users to upgrade their existing machines, extending their lifespan and making them capable of handling AI workloads, deep learning training, and inference tasks without expensive cloud subscriptions.
Hardware Setup: Connecting External GPUs via Oculink
Expanding your home server (which might not have native GPU support or even the space in the chassis to host a large card) with an external GPU requires a combination of hardware components that enable a seamless connection between your server and a high-performance graphics card.
The primary requirement is an Oculink PCIe adapter, which acts as a bridge between the server’s internal PCIe slot and an externally housed GPU. These adapters are available in various configurations; the most popular at the time of writing are SFF-8612 to SFF-8611 PCI Express 3.0/4.0 adapters, which provide a reliable and high-bandwidth connection, even for older systems.
Once you have the adapter installed in a free PCIe slot (via a PCIe-to-Oculink connector) or M.2 slot (via an M.2-to-Oculink adapter), you’ll need an external GPU enclosure with a PCIe socket to house your GPU. Options for connecting the GPU and the server, such as Oculink SFF-8612 to PCIe 3.0/4.0 M.2 M-Key adapters, ensure that the GPU is properly interfaced with the server.
Powering the GPU is straightforward: a standard ATX power supply (ranging from 600W to 1000W, depending on GPU requirements) can be used to deliver the necessary power to your external graphics card. With the physical connections in place, boot your server and verify that the system recognizes the GPU by running:
lspci | grep VGA
If the external GPU is detected, you are ready to utilize its processing power for AI workloads.
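Note that some datacenter and mining GPUs enumerate as a "3D controller" rather than a VGA device, so a broader search is safer. The following sketch, which assumes an NVIDIA card and Ubuntu's standard driver tooling, also installs the proprietary driver and confirms the card is usable:

lspci -nnk | grep -i -A3 'vga\|3d controller'
sudo ubuntu-drivers install    # install the recommended NVIDIA driver, then reboot
sudo reboot
nvidia-smi                     # after the reboot, the card and driver version should be listed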
Installing Ollama on Ubuntu 24 and Testing Models
Ollama is a powerful AI model deployment framework designed to simplify running large language models (LLMs) on local machines. It provides an easy-to-use interface to download, manage, and execute AI models without requiring cloud-based infrastructure. By leveraging Ollama, users can efficiently run models such as Meta's LLaMA, IBM's Granite, and DeepSeek on their home or business servers, ensuring data privacy, lower latency, and reduced dependency on external services.
One of the major advantages of using Ollama is its streamlined model management system, which allows for quick downloading and switching between different AI models. It also optimizes inference performance by automatically leveraging available GPU resources, making it a great fit for any home server once its external GPU is installed. Below, we will walk through the steps to install Ollama on Ubuntu 24 and begin testing some AI models.
With the external GPU installed, let’s set up Ollama to run AI models efficiently.
Step 1: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
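To confirm that the installation succeeded (on Linux the install script also registers a systemd service, which we enable below in Step 3), you can simply check the reported version:

ollama --version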
Step 2: Pull and Test AI Models
To download and run different AI models, execute:
ollama pull llama3.3
ollama pull granite3.1-dense
ollama pull deepseek-r1
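You can verify that the downloads completed by listing the locally available models:

ollama list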
Then, test a model with:
ollama run llama3.3
If the output confirms execution, your setup is successful (use /bye to exit the prompt).
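You can also pass a one-shot prompt instead of using the interactive session, and then check whether the loaded model is actually running on the GPU (the prompt text here is just an example):

ollama run llama3.3 "Explain what Oculink is in two sentences."
ollama ps    # the loaded model should be reported as running on the GPU, not the CPU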
Step 3: Running Ollama as a Service
To ensure Ollama starts automatically and runs in the background, enable it as a system service:
sudo systemctl enable ollama
sudo systemctl start ollama
This will ensure that Ollama is available after a reboot without requiring manual intervention.
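The service can be checked at any time with the standard systemd tooling; the startup logs normally also report whether the GPU was detected:

systemctl status ollama
sudo journalctl -u ollama | grep -i -E 'gpu|cuda'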
Step 4: Accessing Ollama from Other Machines
If you want to access Ollama from other desktop or client machines on your network, set the OLLAMA_HOST environment variable to the IP address or hostname of your Ollama server. For example:
export OLLAMA_HOST=http://ollamaserver:11434
Once set, running ollama ls on the client machine will list the models installed on the server, allowing shared access without duplicating models on multiple machines.
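Note that, by default, the Ollama service itself listens only on localhost and speaks plain HTTP on port 11434. For other machines to reach it, the server must be configured to bind to all interfaces; a minimal sketch using a systemd override (assuming the default service name and port):

sudo systemctl edit ollama
# in the override editor, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama

If the server is reachable beyond your LAN, remember to restrict access to port 11434 accordingly.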
Fixing Fan Speed Issues
In our example we used a Dell PowerEdge R730xd server acquired second-hand for a very reasonable price (dual CPU, 394GB of RAM, 40TB of disks). However, a known issue with Dell PowerEdge servers is that when a PCIe card without a temperature sensor is attached, the server's thermal management assumes the worst case and pushes the fans to run at maximum speed. This precautionary measure is meant to prevent potential overheating and hardware damage, but it is not a good fit for a home-office use case.
In cases like this, where an Oculink PCIe card is installed but does not generate significant heat within the chassis, we can safely override this behavior to maintain a quieter and more efficient operation.
To fix this, you can use the following procedure.
Step 1: Install ipmitool
sudo apt install ipmitool
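Before sending raw commands, it is worth checking that in-band IPMI access to the iDRAC works; on some installations the IPMI kernel modules need to be loaded first:

sudo modprobe -a ipmi_devintf ipmi_si    # only needed if /dev/ipmi0 is not already present
sudo ipmitool mc info                    # should print the iDRAC/BMC device information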
Step 2: Update the default behavior
To prevent the fans from spinning up unnecessarily, you can disable the default cooling response for third-party cards by running the following command:
sudo ipmitool raw 0x30 0xce 0x00 0x16 \
0x05 0x00 0x00 0x00 0x05 0x00 0x01 0x00 0x00
This tells the iDRAC to stop ramping the fans for the sensorless card, returning fan speed to normal thermal control.
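To confirm the change took effect, you can watch the fan sensor readings settle back to normal levels:

sudo ipmitool sdr type Fan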
Choosing the Right GPU for Fine-Tuning AI Models
Fine-tuning AI models requires significantly less memory than full-scale training but still demands sufficient GPU VRAM, particularly when working with LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) methods. These techniques allow users to fine-tune large models efficiently by modifying only small subsets of weights, drastically reducing VRAM requirements compared to full fine-tuning. However, if full fine-tuning is necessary, a much larger memory pool is required.
Given the high cost of modern GPUs, many AI practitioners and researchers are exploring earlier-generation GPUs, such as the NVIDIA P102-100, which offer a strong return on investment thanks to their significantly lower price compared to newer GPUs. While they may not have the cutting-edge performance of the latest hardware, these older GPUs still provide ample VRAM and compute power for many fine-tuning and inference tasks.
For instance, a dual- or triple-GPU setup using 2x/3x P102-100 10GB cards can efficiently fine-tune a 7B model and significantly speed up inference on 14B or even 32B models. The cost savings of using these GPUs can be substantial, making them an attractive choice for those looking to perform fine-tuning without investing in expensive enterprise-grade GPUs.
If using LoRA (FP16), a 24GB GPU like an RTX 4090 can handle the task efficiently by updating only a small subset of parameters. For QLoRA (NF4), which reduces memory usage further, a 12GB GPU such as an RTX 3080 Ti may suffice. If performing full fine-tuning (FP16), where all model weights are updated, the memory requirement grows to roughly 30GB or more, necessitating GPUs like an A100 40GB or higher.
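As a rough back-of-the-envelope check of these figures (assuming about 2 bytes per parameter for FP16 weights and roughly 0.5 bytes per parameter for NF4-quantized weights, and ignoring activations, gradients, and optimizer state):

7B model, FP16 weights: 7 x 10^9 parameters x 2 bytes ≈ 14 GB
7B model, NF4 weights: 7 x 10^9 parameters x 0.5 bytes ≈ 3.5 GB

The headroom left on the card is what LoRA/QLoRA use for adapter weights, gradients, and activations, which is why a 24GB card is comfortable for FP16 LoRA on a 7B model while a 12GB card can be enough for QLoRA.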
Choosing the right fine-tuning method depends on available hardware and the level of customization needed for your model. However, for full fine-tuning of larger models, enterprise-grade GPUs like the H100, MI300X, or multiple A100s in parallel become necessary.
Final Thoughts and Future Research
This article has outlined the process of expanding GPU capabilities on home and business servers using Oculink PCIe adapters, enabling cost-effective AI model fine-tuning and inference. By leveraging external GPUs, users can maximize performance while minimizing costs, transforming non-GPU-supported hardware into powerful and private AI workstations.
As the technology and tools continue to evolve, it is crucial to explore methods to empower more individuals and organizations with access to large models while keeping data private.
Local AI computing reduces reliance on cloud providers, enhances security, and ensures greater control over data. Researchers, developers, and hardware enthusiasts should continue to innovate and optimize affordable, scalable AI solutions that can democratize machine learning capabilities for everyone.