A New Era for Private AI: Multi-GPU Support Using External GPUs, Oculink, and PCIe Cards (DIY)
Massimiliano P.
Introduction
In the evolving landscape of AI and machine learning, leveraging powerful GPUs for model training and inference is crucial. While enterprise setups have built-in GPU support, many home and small business servers lack any form of GPU expandability. This article explores how to add support for external GPUs to your home server by using Oculink PCIe adapters (throughout this article, our examples use a Dell PowerEdge R730xd server running Ubuntu 24), enabling high-performance AI workloads even on hardware that was never designed for GPU usage or AI inference.
Executive Summary (TLDR)
This article provides a comprehensive guide to expanding GPU capabilities for home and business servers that lack native GPU support. By using Oculink PCIe adapters, users can connect external GPUs to home servers and desktops, thus unlocking powerful AI model fine-tuning and inference capabilities while maintaining data privacy.
Key Takeaways:
By leveraging Oculink technology and cost-effective GPU setups, users can run powerful AI models locally while keeping sensitive data private and reducing reliance on expensive cloud-based solutions.
Background: The Importance of Local AI Processing
With growing concerns around data privacy and cloud dependency, the ability to run AI models locally is becoming increasingly important. Processing AI workloads on local hardware ensures that sensitive data remains secure, avoids vendor lock-in, and offers full control over performance and latency.
However, one of the biggest challenges in achieving local AI computing is the lack of GPU support in many existing devices. Servers that you might already have at home, or that you can find online for a few hundred dollars (e.g., 1U or 2U rack-mountable units), were never designed with GPU compatibility in mind, making traditional expansion difficult. Fortunately, Oculink technology provides a seamless and efficient way to add external GPUs to almost any system, regardless of form factor, and improves on previous approaches that relied on USB or Thunderbolt connectivity to the GPU.
By using an Oculink connector, an external GPU enclosure, and a bit of setup time, you can unlock the potential of AI computing on a variety of hardware. This method allows users to upgrade their existing machines, extending their lifespan and making them capable of handling AI workloads, deep learning training, and inference tasks without expensive cloud subscriptions.
Hardware Setup: Connecting External GPUs via Oculink
Expanding your home server (which might not have native GPU support or even the space in the chassis to host a large card) with an external GPU requires a combination of hardware components that enable a seamless connection between your server and a high-performance graphics card.
The primary requirement is an Oculink PCIe adapter, which acts as a bridge between the server’s internal PCIe slot and an externally housed GPU. These adapters are available in various configurations; the most popular at the time of writing are SFF-8612 to SFF-8611 PCI Express 3.0/4.0 adapters, which provide a reliable and high-bandwidth connection, even for older systems.
Once you have the adapter installed in a free PCIe slot (via a PCIe-to-Oculink connector) or M.2 slot (via an M.2-to-Oculink adapter), you’ll need an external GPU enclosure with a PCIe socket to house your GPU. Options for connecting the GPU and the server, such as Oculink SFF-8612 to PCIe 3.0/4.0 M.2 M-Key adapters, ensure that the GPU is properly interfaced with the server.
Powering the GPU is straightforward: a standard ATX power supply (ranging from 600W to 1000W, depending on GPU requirements) can be used to deliver the necessary power to your external graphics card. With the physical connections in place, boot your server and verify that the system recognizes the GPU by running:
lspci | grep VGA
If the external GPU is detected, you are ready to utilize its processing power for AI workloads.
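Note that some datacenter and mining GPUs enumerate as a "3D controller" rather than a VGA device, so a broader search is safer. The following sketch, which assumes an NVIDIA card and Ubuntu's standard driver tooling, also installs the proprietary driver and confirms the card is usable:

lspci -nnk | grep -i -A3 'vga\|3d controller'
sudo ubuntu-drivers install    # install the recommended NVIDIA driver, then reboot
sudo reboot
nvidia-smi                     # after the reboot, the card and driver version should be listed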
Installing Ollama on Ubuntu 24 and Testing Models
Ollama is a powerful AI model deployment framework designed to simplify running large language models (LLMs) on local machines. It provides an easy-to-use interface to download, manage, and execute AI models without requiring cloud-based infrastructure. By leveraging Ollama, users can efficiently run models such as Meta's LLaMA, IBM's Granite, and DeepSeek on their home or business servers, ensuring data privacy, lower latency, and reduced dependency on external services.
One of the major advantages of using Ollama is its streamlined model management system, which allows for quick downloading and switching between different AI models. It also optimizes inference performance by automatically leveraging available GPU resources, making it a great fit for any home server once its external GPU is installed. Below, we will walk through the steps to install Ollama on Ubuntu 24 and begin testing some AI models.
With the external GPU installed, let’s set up Ollama to run AI models efficiently.
Step 1: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
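To confirm that the installation succeeded (on Linux the install script also registers a systemd service, which we enable below in Step 3), you can simply check the reported version:

ollama --version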
Step 2: Pull and Test AI Models
To download and run different AI models, execute:
ollama pull llama3.3
ollama pull granite3.1-dense
ollama pull deepseek-r1
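You can verify that the downloads completed by listing the locally available models:

ollama list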
Then, test a model with:
ollama run llama3.3
If the output confirms execution, your setup is successful (use /bye to exit the prompt).
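You can also pass a one-shot prompt instead of using the interactive session, and then check whether the loaded model is actually running on the GPU (the prompt text here is just an example):

ollama run llama3.3 "Explain what Oculink is in two sentences."
ollama ps    # the loaded model should be reported as running on the GPU, not the CPU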
Step 3: Running Ollama as a Service
To ensure Ollama starts automatically and runs in the background, enable it as a system service:
sudo systemctl enable ollama
sudo systemctl start ollama
This will ensure that Ollama is available after a reboot without requiring manual intervention.
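The service can be checked at any time with the standard systemd tooling; the startup logs normally also report whether the GPU was detected:

systemctl status ollama
sudo journalctl -u ollama | grep -i -E 'gpu|cuda'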
Step 4: Accessing Ollama from Other Machines
If you want to access Ollama from other desktop or client machines on your network, set the OLLAMA_HOST environment variable to the IP address or hostname of your Ollama server. For example:
export OLLAMA_HOST=http://ollamaserver:11434
Once set, running ollama ls on the client machine will list the models installed on the server, allowing shared access without duplicating models on multiple machines.
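Note that, by default, the Ollama service itself listens only on localhost and speaks plain HTTP on port 11434. For other machines to reach it, the server must be configured to bind to all interfaces; a minimal sketch using a systemd override (assuming the default service name and port):

sudo systemctl edit ollama
# in the override editor, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama

If the server is reachable beyond your LAN, remember to restrict access to port 11434 accordingly.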
Fixing Fan Speed Issues
In our example we used a Dell PowerEdge R730xd server acquired second-hand for a very reasonable price (dual CPU, 394GB of RAM, 40TB of disks). However, a known issue with Dell PowerEdge servers is that when a PCIe card without a temperature sensor is attached, the server's thermal management assumes the worst case and pushes the fans to run at maximum speed. This precautionary measure is meant to prevent potential overheating and hardware damage, but it is not a good fit for a home-office use case.
In cases like this, where an Oculink PCIe card is installed but does not generate significant heat within the chassis, we can safely override this behavior to maintain a quieter and more efficient operation.
To fix this, you can use the following procedure.
Step 1: Install ipmitool
sudo apt install ipmitool
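Before sending raw commands, it is worth checking that in-band IPMI access to the iDRAC works; on some installations the IPMI kernel modules need to be loaded first:

sudo modprobe -a ipmi_devintf ipmi_si    # only needed if /dev/ipmi0 is not already present
sudo ipmitool mc info                    # should print the iDRAC/BMC device information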
Step 2: Update the default behavior
To prevent the fans from spinning up unnecessarily, you can disable the default cooling response for third-party cards by running the following command:
sudo ipmitool raw 0x30 0xce 0x00 0x16 \
0x05 0x00 0x00 0x00 0x05 0x00 0x01 0x00 0x00
This tells the iDRAC to stop ramping the fans for the sensorless card, returning fan speed to normal thermal control.
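To confirm the change took effect, you can watch the fan sensor readings settle back to normal levels:

sudo ipmitool sdr type Fan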
Choosing the Right GPU for Fine-Tuning AI Models
Fine-tuning AI models requires significantly less memory than full-scale training but still demands sufficient GPU VRAM, particularly when working with LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) methods. These techniques allow users to fine-tune large models efficiently by modifying only small subsets of weights, drastically reducing VRAM requirements compared to full fine-tuning. However, if full fine-tuning is necessary, a much larger memory pool is required.
Given the high cost of modern GPUs, many AI practitioners and researchers are exploring earlier-generation GPUs, such as the NVIDIA P102-100, which offer a strong return on investment thanks to their significantly lower price compared to newer GPUs. While they may not have the cutting-edge performance of the latest hardware, these older GPUs still provide ample VRAM and compute power for many fine-tuning and inference tasks.
For instance, a dual- or triple-GPU setup using 2x/3x P102-100 10GB cards can efficiently fine-tune a 7B model and significantly speed up inference on 14B or even 32B models. The cost savings of using these GPUs can be substantial, making them an attractive choice for those looking to perform fine-tuning without investing in expensive enterprise-grade GPUs.
If using LoRA (FP16), a 24GB GPU like an RTX 4090 can handle the task efficiently by updating only a small subset of parameters. For QLoRA (NF4), which reduces memory usage further, a 12GB GPU such as an RTX 3080 Ti may suffice. If performing full fine-tuning (FP16), where all model weights are updated, the memory requirement grows to roughly 30GB or more, necessitating GPUs like an A100 40GB or higher.
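As a rough back-of-the-envelope check of these figures (assuming about 2 bytes per parameter for FP16 weights and roughly 0.5 bytes per parameter for NF4-quantized weights, and ignoring activations, gradients, and optimizer state):

7B model, FP16 weights: 7 x 10^9 parameters x 2 bytes ≈ 14 GB
7B model, NF4 weights: 7 x 10^9 parameters x 0.5 bytes ≈ 3.5 GB

The headroom left on the card is what LoRA/QLoRA use for adapter weights, gradients, and activations, which is why a 24GB card is comfortable for FP16 LoRA on a 7B model while a 12GB card can be enough for QLoRA.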
Choosing the right fine-tuning method depends on available hardware and the level of customization needed for your model. However, for full fine-tuning of larger models, enterprise-grade GPUs like the H100, MI300X, or multiple A100s in parallel become necessary.
Final Thoughts and Future Research
This article has outlined the process of expanding GPU capabilities on home and business servers using Oculink PCIe adapters, enabling cost-effective AI model fine-tuning and inference. By leveraging external GPUs, users can maximize performance while minimizing costs, transforming non-GPU-supported hardware into powerful and private AI workstations.
As the technology and tools continue to evolve, it is crucial to explore methods to empower more individuals and organizations with access to large models while keeping data private.
Local AI computing reduces reliance on cloud providers, enhances security, and ensures greater control over data. Researchers, developers, and hardware enthusiasts should continue to innovate and optimize affordable, scalable AI solutions that can democratize machine learning capabilities for everyone.