Scale to Zero for AI Workloads

AI workloads are expensive regardless of how they are implemented. Third-party APIs with significant markups. Cloud-prem SaaS companies with hefty management fees. And the total cost of ownership for either renting GPUs from AWS or buying directly.

In the short term, there’s a significant arbitrage with a great DevOps team — namely, how do you scale expensive workloads to zero when they aren’t in use? Or right-size them accordingly as the load increases or decreases? Doing this can flip unprofitable unit economics (or provide more efficiency to the money a startup has raised).

An obvious objection: we already have serverless environments that scale to zero — Google Cloud Run, AWS Lambda, WebAssembly runtimes, and more. The problem is that these runtimes are explicitly tuned for generic workloads and aren’t made for specialized hardware (read: GPUs).

There are two elements to “scale to zero” for GPU-bound workloads.

First, the actual machines. On AWS, this would be autoscaling groups (ASGs). As CPU load increases (or whatever metric you’re measuring), the ASG scales up instances (virtual machines). But ASGs on their own are rarely sufficient to scale to zero. You are also probably bin-packing multiple workloads onto expensive GPU-powered machines — maybe running different models at different times or rolling out new versions of models. For this, you probably want to deploy with a different primitive than raw machine images, something like a container. And for that, you need Kubernetes.
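As a rough sketch of the machine-level half, here’s the kind of desired-capacity calculation an ASG scaling policy automates for you. The function, metric, and thresholds are illustrative assumptions, not AWS APIs — the point is that with a minimum of zero, an idle fleet collapses entirely:

```python
import math

def desired_instances(requests_in_flight: int, capacity_per_instance: int,
                      min_instances: int = 0, max_instances: int = 8) -> int:
    """Compute how many GPU instances the fleet should run for the current load.

    With min_instances=0, the fleet scales all the way down when idle --
    the property a scale-to-zero ASG policy gives you.
    """
    if requests_in_flight == 0:
        return min_instances  # idle: scale to zero
    # Otherwise, right-size: enough instances to cover in-flight requests,
    # clamped to the configured floor and ceiling.
    needed = math.ceil(requests_in_flight / capacity_per_instance)
    return max(min_instances, min(needed, max_instances))

print(desired_instances(0, 4))    # → 0 (idle fleet scales to zero)
print(desired_instances(10, 4))   # → 3 (right-sized to the load)
```

In practice the “requests in flight” signal would come from a metric like queue depth or GPU utilization rather than a function argument.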

The second scale-to-zero mechanism is scaling the actual workload (the pods, deployments, etc.). There’s not really a great way to do this today. Most organizations have built their own hacks. Knative provides the machinery but can be challenging to deploy and manage and comes with its own heavyweight dependencies (like Istio). The high-level workflow is this: queue up the requests and launch a new deployment if an endpoint is unavailable.
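That queue-and-activate workflow can be sketched in a few lines. This is a simplified in-process model — the `scale_up` and `is_ready` callables stand in for whatever your proxy actually does against Kubernetes (e.g., patching a Deployment’s replica count and polling a readiness probe); it is not Knative’s implementation:

```python
from collections import deque

class ScaleToZeroProxy:
    """Buffer requests while the backend is scaled to zero; flush once it's up."""

    def __init__(self, scale_up, is_ready):
        self.scale_up = scale_up  # stand-in: e.g., set Deployment replicas to 1
        self.is_ready = is_ready  # stand-in: e.g., poll the endpoint's readiness
        self.queue = deque()

    def handle(self, request):
        if self.is_ready():
            return f"served:{request}"
        # Endpoint unavailable: queue the request and trigger a scale-up.
        self.queue.append(request)
        self.scale_up()
        return "queued"

    def drain(self):
        """Once the backend reports ready, replay the buffered requests."""
        served = []
        while self.queue and self.is_ready():
            served.append(f"served:{self.queue.popleft()}")
        return served

# Simulated backend state: zero replicas until the proxy scales it up.
state = {"replicas": 0}
proxy = ScaleToZeroProxy(
    scale_up=lambda: state.update(replicas=1),
    is_ready=lambda: state["replicas"] > 0,
)

print(proxy.handle("req-1"))  # → queued (backend was at zero; scale-up fired)
print(proxy.drain())          # → ['served:req-1']
print(proxy.handle("req-2"))  # → served:req-2 (backend now warm)
```

The hard parts this sketch elides — request timeouts while the model cold-starts, deciding when to scale back down, and doing all of this per-revision — are exactly what Knative’s activator and autoscaler handle, at the cost of its heavier footprint.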

Scale-to-zero will probably be necessary in the near term as organizations need to deploy either (1) on-prem models for data security or (2) custom models or infrastructure to serve a particular use case.


Originally posted on https://matt-rickard.com/scale-to-zero-for-ai-workloads
