AI: Essential Considerations for Hosting Your Own Models

Artificial Intelligence (AI) has become a pervasive buzzword in the industry, evoking a variety of reactions, including an amusing supercut of Sundar Pichai at Google's 2023 I/O...

AI demands high-powered compute, which necessitates effective cooling, with immersion cooling being a popular choice. However, this article will focus on other aspects of AI.

Given the immense hype surrounding AI, it's essential to clarify that AI covers a broad spectrum. It is sometimes used interchangeably with Machine Learning and Visual Computing, which are their own distinct areas. Machine Learning encompasses applications like Amazon Alexa, self-driving cars, and email spam filtering, while Visual Computing includes Metaverse applications and Digital Twins like NVIDIA's Earth 2.


When people mention AI, they are usually referring to Large Language Models (LLMs), with ChatGPT and LLaMA being two prominent examples. I won't delve into their pros and cons or the open/closed source distinction, but will simply use them as reference points.


At a high level, people often seek answers to questions like:

  • How do I host my own PrivateGPT?
  • What's the difference between training and inferencing?
  • Do I need to upgrade my network?
  • Can I use my existing CPUs?

Let's delve into each of these areas.


How do I host my own PrivateGPT?

Many individuals are interested in creating their own ChatGPT with private access and data sovereignty. Let's take the example of the largest GPT-3.5 model, gpt-3.5-turbo, which was trained on approximately 17TB of text data and contains around 175 billion parameters. However, running such a model can be expensive, as estimated by Tom Goldstein in December 2022.

Although model optimizations and faster hardware (such as the H100) have arrived since then, it's essential to consider the cost of running large models. Additionally, newer models like GPT-4, reportedly with roughly 10x more parameters (~1.7 trillion) and more training data, are continuously emerging.



For those building their own AI capabilities, using a pre-trained, open-source model like Meta's LLaMA 2 could be a cost-effective option. Although not an exact comparison, applying Tom's estimation logic, a model like LLaMA 2 at roughly 40% of ChatGPT's size, serving 1% of its daily queries (e.g., 100k), might cost around $400 per day on Azure, assuming the same $3 per GPU-hour rate from December 2022. A rough version of that arithmetic is sketched below.
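
As a minimal back-of-envelope sketch of that estimate (the per-query GPU time and prices here are illustrative assumptions, not measured benchmarks):

```python
# Back-of-envelope daily inference cost, in the spirit of Tom Goldstein's estimate.
# All inputs are illustrative assumptions, not benchmarks.

QUERIES_PER_DAY = 100_000        # ~1% of ChatGPT-scale traffic
GPU_SECONDS_PER_QUERY = 4.8      # assumed GPU time to generate one response
PRICE_PER_GPU_HOUR = 3.00        # Azure A100 rate used in the Dec 2022 estimate

gpu_hours_per_day = QUERIES_PER_DAY * GPU_SECONDS_PER_QUERY / 3600
daily_cost = gpu_hours_per_day * PRICE_PER_GPU_HOUR

print(f"{gpu_hours_per_day:.0f} GPU-hours/day -> ${daily_cost:,.0f}/day")
# -> 133 GPU-hours/day -> $400/day
```

Change any one of those assumptions (tokens generated per query, model size, GPU price) and the daily cost moves with it, which is why these estimates are only ever ballpark figures.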


What's the difference between training and inferencing?

AI comprises two main aspects: training and inferencing.

Training involves using diverse data to create a model, while inferencing entails using the model to generate predictions or responses.

ChatGPT (GPT-3.5) is reported to have used around 10,000 GPUs (A100s) and several months to train. For small companies, using off-the-shelf, pre-trained models and running inferencing is the more cost-effective option. However, medium and large companies may want to train their own models and can consider renting compute resources from hyperscalers or GPU cloud companies.

Training typically requires significant GPU memory, while inferencing can be accomplished on more modest hardware, including CPUs. Larger or more real-time inferencing use cases might require higher-end or specialized hardware like GPUs.
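
To make the distinction concrete, here is a minimal, illustrative PyTorch sketch using a toy model (not an LLM): training runs a forward pass, a backward pass, and a weight update, while inference is a single forward pass with gradients disabled, which is why it needs far less memory.

```python
import torch
import torch.nn as nn

# Toy model standing in for a much larger network.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# --- Training: forward pass, loss, backward pass, weight update ---
x = torch.randn(8, 16)            # a batch of training examples
y = torch.randint(0, 2, (8,))     # labels
model.train()
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                   # gradients need extra memory and compute
optimizer.step()

# --- Inference: forward pass only, no gradients, much lighter ---
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 16)).argmax(dim=1)
print(prediction)
```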


Do I need to upgrade my network?

NVIDIA promotes InfiniBand for networking, especially with A100 and H100 chips, which are designed to work optimally with it. Paired with NVLink/NVSwitch, which pools GPUs within a node, InfiniBand provides the fabric between nodes, but it also has some downsides, such as being proprietary and expensive.

When evaluating networking options, consider your use case and the way GPUs work. For example, NVIDIA GPUs can be bought with PCIe or NVLink connectivity, and NVSwitch allows for interconnecting nodes.

To keep things simple, networking choices can range from:

  • 10-100G for 1-16 GPUs using PCIe
  • 100G/200G+ for 16-60 GPUs using NVLink
  • 200/400G+ for 100+ GPUs requiring the fastest speed and lowest latency.


Note: Be sure to consider the availability of networking gear before deciding on a path for upgrading or augmenting your existing network, as the networking supply chain is still very lumpy for high-demand components.
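
If it helps to keep those rules of thumb in one place, here is a tiny, illustrative helper that maps a planned GPU count onto the tiers above (the thresholds simply encode the ranges listed in this article, not a formal sizing guide):

```python
def suggested_network_speed(gpu_count: int) -> str:
    """Rough rule-of-thumb mapping from GPU count to network tier,
    using the ranges discussed above (not a formal sizing guide)."""
    if gpu_count <= 16:
        return "10-100G (PCIe-connected GPUs)"
    if gpu_count <= 60:
        return "100G/200G+ (NVLink-connected GPUs)"
    return "200/400G+ (large clusters needing the lowest latency)"

for n in (8, 32, 128):
    print(n, "GPUs ->", suggested_network_speed(n))
```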


While InfiniBand has its keen followers and vocal opponents, there are ongoing efforts by companies like Arista, Cisco, Broadcom, AMD, and Intel to compete with InfiniBand through the Ultra Ethernet Consortium (UEC).

Expect products based on UEC standards to be available in 2024.


Can I use my existing CPUs?

While AI infrastructure primarily focuses on specialized hardware like GPUs, FPGAs, and ASICs, don't overlook the role of general-purpose CPUs. Existing CPUs can be utilized to support AI infrastructure, performing tasks such as dataset aggregation, querying, hosting value-added AI services, or running some inferencing.

Although CPUs are less efficient than GPUs, FPGAs, and ASICs in terms of performance per watt, they remain important components in an AI setup.
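
As a rough illustration (assuming the Hugging Face transformers library and a small off-the-shelf model such as GPT-2; larger models would need quantization or more capable hardware), CPU-only inferencing can be as simple as:

```python
from transformers import pipeline

# device=-1 forces CPU; small models like GPT-2 run comfortably without a GPU.
generator = pipeline("text-generation", model="gpt2", device=-1)

result = generator("Hosting your own AI model means", max_new_tokens=30)
print(result[0]["generated_text"])
```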


Summing it up, hosting your own AI models involves careful consideration of factors such as cost, hardware capabilities, networking, and the role of CPUs. New and updated models are released frequently, as are efforts to optimize them for less costly, more readily available hardware. Qualcomm has announced it is working with Meta to optimize LLaMA 2 for on-device AI, which will flow through to Android phones next year, and MLC-LLM can run natively on your iPhone today.

As AI technology progresses, staying informed and optimizing your approach will be crucial for a successful implementation.
