Self-Hosting Open Source LLMs: Empowering CPU-Driven Inference
Addy Bhatia
Founder @ Suno Wellness | Bringing personal support to everyone's pockets.
In the realm of natural language processing, the emergence of large language models (LLMs) has revolutionized the way we approach text-based tasks. Hosted LLMs like GPT-3 have demonstrated remarkable capabilities, but using them typically requires private data to be shared and to leave your organization.
Recently I've been experimenting heavily with running predictions on various models for my app Suno, and I'd like to share those learnings. These are the same technologies I've used to build a consumer-grade chat application while keeping data secure and spending as little money as possible.
I'd like to explore the world of self-hosting open-source LLMs on different types of infrastructure, allowing engineers to harness their power without breaking the bank.
The Cost Factor
For many developers and small teams, the budgetary constraints of GPU-based inference can be daunting. Cloud GPU instances are billed for compute time, with prices for LLM-capable hardware often starting around $4 per hour, which quickly becomes prohibitive for ongoing projects and experimentation. However, there's a silver lining – CPU-only models. They offer a cost-effective alternative, enabling us to strike a balance between performance and affordability.
GPT4ALL
Introducing GPT4ALL, a powerful library that enables CPU inference from a diverse range of models. Leveraging GPT4ALL, engineers can tap into the capabilities of various models while running them exclusively on CPU resources. This approach not only slashes costs but also democratizes access to LLMs for a broader audience. We can now dive into the world of self-hosted models with confidence.
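Before wiring it into a web server, it helps to see GPT4ALL on its own. Here's a minimal sketch, assuming you've downloaded the same quantized MPT-7B chat weights used later in this post into a local models directory:

import gpt4all

# Load a quantized model from a local directory and run it entirely on CPU.
# "ggml-mpt-7b-chat.bin" is the same file used in the server example below;
# adjust model_path to wherever you keep your downloaded weights.
model = gpt4all.GPT4All("ggml-mpt-7b-chat.bin", model_path="./models")

# Generate a completion; max_tokens caps the length of the response.
print(model.generate("Explain self-hosting in one sentence.", max_tokens=128))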
Choosing the Right Host
When it comes to hosting open-source LLMs yourself, the cloud is our playground. Established platforms like Google Cloud VMs and AWS provide versatile options for deploying and managing CPU-driven models. However, the journey isn't without its twists and turns. Initial attempts with serverless solutions like AWS Lambda and Google Cloud Run highlighted a limitation: on every cold boot the model has to be loaded into memory, which can take upwards of 20 seconds – not ideal for a fast chat UX.
Leveraging Fly.io
Enter Fly.io, a hosting platform that marries efficiency with cost savings. One of its standout features is the ability to suspend services during inactivity, preserving resources and minimizing expenses. What makes Fly.io a winner for CPU-based LLMs is that, on each call, only the framework's startup code runs – the model is pre-loaded when the server is created, allowing rapid and responsive inference. Not to mention that Fly also lets you replicate machines around the world to create your own edge-inference network (I'm not sponsored by Fly, I just love using their products).
# Get started with the CLI
brew install flyctl
# Launch any docker container
fly launch
# Run in three different regions
fly scale count 3 --region ams,hkg,sjc
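The suspend-on-idle behaviour comes from Fly's machine auto-start/auto-stop settings. Below is a rough fly.toml sketch; the app name is a placeholder and the key names follow Fly's http_service configuration, so double-check against the current flyctl docs:

# fly.toml – hypothetical app name; port matches the uvicorn command later in this post
app = "llm-cpu-inference"
primary_region = "sjc"

[http_service]
  internal_port = 8080
  auto_stop_machines = true    # suspend machines when traffic dies down
  auto_start_machines = true   # wake them on the next incoming request
  min_machines_running = 0     # allow scale-to-zero during quiet periods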
Implementation
Let's get practical with implementation. Using FastAPI, a modern Python web framework, you can quickly set up an inference endpoint at /predict. What's fascinating here is the use of a global model variable: by pre-loading the GPT4ALL model when the server starts, you get snappy response times with just around 16GB of RAM. Because the model is global, every call reuses the pre-loaded weights instead of reloading them, expediting the inference process.
import gpt4all
from fastapi import FastAPI

app = FastAPI()

# Pre-load the GPT4ALL model once at startup so every request reuses it
model = gpt4all.GPT4All(
    "ggml-mpt-7b-chat.bin",
    model_path="/models")  # requires <8GB of RAM

@app.post("/predict")
async def predict_text(text: str):
    # Generate a completion with the pre-loaded global model
    prediction = model.generate(text, max_tokens=256)
    return {"prediction": prediction}
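With the server running (for example via uvicorn on port 8080), you can sanity-check the endpoint with a quick request. Because text is declared as a plain string parameter, FastAPI reads it from the query string:

# Assumes the app is being served locally on port 8080
curl -X POST "http://localhost:8080/predict?text=Hello%20there"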
Here's an example Dockerfile to host this model:
FROM python:3

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install -r requirements.txt

# Download the quantized model into the image so it can be pre-loaded at startup
RUN mkdir -p /models && \
    curl -o /models/ggml-mpt-7b-chat.bin https://gpt4all.io/models/ggml-mpt-7b-chat.bin

CMD exec uvicorn app.main:app --host 0.0.0.0 --port 8080
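To try the image locally before handing it to fly launch, a standard build-and-run works; the image name is arbitrary, and the first build will be slow while the multi-gigabyte model file downloads:

# Build the image (the model download happens at build time)
docker build -t llm-cpu-inference .

# Run it locally, mapping the port uvicorn listens on
docker run -p 8080:8080 llm-cpu-inference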
Scaling Up
For those seeking more robust options and access to powerful models like LLAMA-2 and Falcon-40b, Hugging Face's Inference Endpoints are a potent solution. These endpoints offer both CPU and GPU options, expanding the horizons for hosting and deployment. HF also lets endpoints shut down automatically when they're inactive, avoiding cost overruns.
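Once an endpoint is deployed, calling it is a plain HTTPS request. Here's a sketch using the requests library; the URL and token are placeholders, and the exact payload shape depends on the task your endpoint is configured for:

import requests

# Placeholder values – substitute your own endpoint URL and Hugging Face access token
ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"
HF_TOKEN = "hf_your_token_here"

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    # Text-generation endpoints typically accept an "inputs" field plus optional parameters
    json={"inputs": "Write a haiku about CPUs.", "parameters": {"max_new_tokens": 64}},
)
print(response.json())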
Despite this, GPT4ALL maintains its allure, especially for models under 13 billion parameters, which can run smoothly on CPU infrastructure. Anything larger than that on a CPU (think the bigger Llama and Falcon variants) and you might as well type the responses out by hand.
Self-hosting open-source LLMs on CPU infrastructure is not just a budget-conscious move – it's a strategy that empowers developers to harness the immense power of language models without surrendering to exorbitant GPU costs. With tools like GPT4ALL and platforms like Fly.io, the door is wide open for engineers to embark on a journey of innovative, accessible, and responsive text-based applications. As the world of LLMs continues to evolve, let's embrace the potential of self-hosting and pave the way for efficient, CPU-driven inferences.