Self-Hosting Open Source LLMs: Empowering CPU-Driven Inference
Addy Bhatia
Founder @ Suno Wellness | Bringing personal support to everyone's pockets.
In the realm of natural language processing, the emergence of large language models (LLMs) has revolutionized the way we approach text-based tasks. Hosted LLMs like GPT-3 have demonstrated remarkable capabilities, but using them typically requires private data to be shared and to leave your organization.
Recently I've been experimenting heavily with running predictions on various models for my app Suno, and I'd like to share those learnings. These are the same technologies I've used to build a consumer-grade chat application while keeping data secure and spending as little money as possible.
I'd like to explore the world of self-hosting open-source LLMs on different types of infrastructure, allowing engineers to harness their power without breaking the bank.
The Cost Factor
For many developers and small teams, the budgetary constraints of GPU-based inference can be daunting. Cloud GPU instances are billed for compute time, with prices for LLM-capable hardware often starting around $4 per hour, which quickly becomes prohibitive for ongoing projects and experimentation. However, there's a silver lining – CPU-only models. They offer a cost-effective alternative, enabling us to strike a balance between performance and affordability.
GPT4ALL
Introducing GPT4ALL, a powerful library that enables CPU inference from a diverse range of models. Leveraging GPT4ALL, engineers can tap into the capabilities of various models while running them exclusively on CPU resources. This approach not only slashes costs but also democratizes access to LLMs for a broader audience. We can now dive into the world of self-hosted models with confidence.
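Before wiring it into a web server, it helps to see GPT4ALL on its own. Here's a minimal sketch, assuming you've downloaded the same quantized MPT-7B chat weights used later in this post into a local models directory:

import gpt4all

# Load a quantized model from a local directory and run it entirely on CPU.
# "ggml-mpt-7b-chat.bin" is the same file used in the server example below;
# adjust model_path to wherever you keep your downloaded weights.
model = gpt4all.GPT4All("ggml-mpt-7b-chat.bin", model_path="./models")

# Generate a completion; max_tokens caps the length of the response.
print(model.generate("Explain self-hosting in one sentence.", max_tokens=128))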
Choosing the Right Host
When it comes to hosting open-source LLMs yourself, the cloud is our playground. Established platforms like Google Cloud VMs and AWS provide versatile options for deploying and managing CPU-driven models. However, the journey isn't without its twists and turns. Initial attempts with serverless solutions like AWS Lambda and Google Cloud Run highlighted a limitation: on every cold boot the model has to be loaded into memory, which can take upwards of 20 seconds – not ideal for a fast chat UX.
Leveraging Fly.io
Enter Fly.io, a hosting platform that marries efficiency with cost savings. One of its standout features is the ability to suspend services during inactivity, preserving resources and minimizing expenses. What makes Fly.io a winner for CPU-based LLMs is that, on each call, only the framework's startup code runs – the model is pre-loaded when the server is created, allowing rapid and responsive inference. Not to mention that Fly also lets you replicate machines around the world to create your own edge-inference network (I'm not sponsored by Fly, I just love using their products).
# Get started with the CLI
brew install flyctl
# Launch any docker container
fly launch
# Run in three different regions
fly scale count 3 --region ams,hkg,sjc
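The suspend-on-idle behaviour comes from Fly's machine auto-start/auto-stop settings. Below is a rough fly.toml sketch; the app name is a placeholder and the key names follow Fly's http_service configuration, so double-check against the current flyctl docs:

# fly.toml – hypothetical app name; port matches the uvicorn command later in this post
app = "llm-cpu-inference"
primary_region = "sjc"

[http_service]
  internal_port = 8080
  auto_stop_machines = true    # suspend machines when traffic dies down
  auto_start_machines = true   # wake them on the next incoming request
  min_machines_running = 0     # allow scale-to-zero during quiet periods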
Implementation
Let's get practical with implementation. Using FastAPI, a modern Python web framework, you can quickly set up an inference endpoint at /predict. What's fascinating here is the use of a global model variable: by pre-loading the GPT4ALL model when the server starts, you get snappy response times with just around 16GB of RAM. Because the model is global, every call reuses the pre-loaded weights instead of reloading them, expediting the inference process.
import gpt4all
from fastapi import FastAPI

app = FastAPI()

# Pre-load the GPT4ALL model once at startup so every request reuses it
model = gpt4all.GPT4All(
    "ggml-mpt-7b-chat.bin",
    model_path="/models")  # requires <8GB of RAM

@app.post("/predict")
async def predict_text(text: str):
    # Generate a completion with the pre-loaded global model
    prediction = model.generate(text, max_tokens=256)
    return {"prediction": prediction}
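With the server running (for example via uvicorn on port 8080), you can sanity-check the endpoint with a quick request. Because text is declared as a plain string parameter, FastAPI reads it from the query string:

# Assumes the app is being served locally on port 8080
curl -X POST "http://localhost:8080/predict?text=Hello%20there"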
Here's an example Dockerfile to host this model:
FROM python:3

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install -r requirements.txt

# Download the quantized model into the image so it can be pre-loaded at startup
RUN mkdir -p /models && \
    curl -o /models/ggml-mpt-7b-chat.bin https://gpt4all.io/models/ggml-mpt-7b-chat.bin

CMD exec uvicorn app.main:app --host 0.0.0.0 --port 8080
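To try the image locally before handing it to fly launch, a standard build-and-run works; the image name is arbitrary, and the first build will be slow while the multi-gigabyte model file downloads:

# Build the image (the model download happens at build time)
docker build -t llm-cpu-inference .

# Run it locally, mapping the port uvicorn listens on
docker run -p 8080:8080 llm-cpu-inference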
Scaling Up
For those seeking more robust options and access to powerful models like LLAMA-2 and Falcon-40b, Hugging Face's Inference Endpoints are a potent solution. These endpoints offer both CPU and GPU options, expanding the horizons for hosting and deployment. HF also lets endpoints shut down automatically when they're inactive, avoiding cost overruns.
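Once an endpoint is deployed, calling it is a plain HTTPS request. Here's a sketch using the requests library; the URL and token are placeholders, and the exact payload shape depends on the task your endpoint is configured for:

import requests

# Placeholder values – substitute your own endpoint URL and Hugging Face access token
ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"
HF_TOKEN = "hf_your_token_here"

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    # Text-generation endpoints typically accept an "inputs" field plus optional parameters
    json={"inputs": "Write a haiku about CPUs.", "parameters": {"max_new_tokens": 64}},
)
print(response.json())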
Despite this, GPT4ALL maintains its allure, especially for models under 13 billion parameters, which can run smoothly on CPU infrastructure. Anything larger than that on a CPU (think the bigger Llama and Falcon variants) and you might as well type the responses out by hand.
Self-hosting open-source LLMs on CPU infrastructure is not just a budget-conscious move – it's a strategy that empowers developers to harness the immense power of language models without surrendering to exorbitant GPU costs. With tools like GPT4ALL and platforms like Fly.io, the door is wide open for engineers to embark on a journey of innovative, accessible, and responsive text-based applications. As the world of LLMs continues to evolve, let's embrace the potential of self-hosting and pave the way for efficient, CPU-driven inferences.