Building with Serverless GPUs
Banjo Obayomi
Senior Specialist Solutions Architect GenAI at Amazon Web Services (AWS)
Hello, builders! In this edition, we're exploring building applications using serverless GPUs. As GPU resources become increasingly scarce and expensive, we're highlighting alternative access methods that don't hinge on traditional per-hour billing. We'll show how to use products like Modal and Replicate, which offer more flexible and cost-effective ways to harness GPU power. Additionally, we'll examine how services like Amazon Bedrock and Together.ai provide access to Large Language Models (LLMs).
Streamlining GPU Usage with Modal and Replicate
Serverless GPU services like Modal and Replicate offer per-second billing, so you pay only for the GPU time you actually use, making them a cost-effective option for both small-scale experiments and larger projects. Here's an example of running each one:
Building with Modal
With Modal, you can bring your own custom Docker image, attach storage, and define the GPU you want to use. This is ideal for situations where you have a custom workload, such as transcribing audio files with a model like distil-whisper. Here is some sample code:
import time

import modal
from modal import Image, NetworkFileSystem

# Custom Docker image with the libraries the model needs
transcribe_image = (
    Image.from_registry("nvcr.io/nvidia/pytorch:22.04-py3")
    .pip_install(
        "transformers",
        "optimum",
        "accelerate",
        # ....
    )
    .apt_install("ffmpeg", "git", "curl")
    .run_commands("pip install flash-attn --no-build-isolation", gpu="A100")
)

stub = modal.Stub("audible-example")

# Shared storage for the audio files (the volume name here is our own choice)
volume = NetworkFileSystem.persisted("mp3-files")

# Stub function: runs on an 80 GB A100 with the image and storage defined above
@stub.function(
    image=transcribe_image,
    gpu=modal.gpu.A100(memory=80),
    network_file_systems={"/root/mp3_files": volume},
    timeout=3600,
)
def distil_whisper(audio_file):
    # Heavy imports happen inside the function, where the container image
    # (not your local machine) provides the libraries
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    print(f"Running whisper on {audio_file}")
    start_time = time.time()  # Start the timer

    device = "cuda:0"
    torch_dtype = torch.float16
    model_id = "distil-whisper/distil-large-v2"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        attn_implementation="flash_attention_2",
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe(audio_file)
    print(result["text"])
    print(f"Transcription took {time.time() - start_time:.1f} seconds")
    return result["text"]
Being able to customize the compute, storage, and GPU gives builders the flexibility to tailor solutions to their workloads.
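To trigger the function from your machine, you can add a local entrypoint and invoke it with modal run. Here's a minimal sketch, assuming the code above is saved as transcribe.py; the file path below is a hypothetical example of a file on the network file system:

@stub.local_entrypoint()
def main():
    # Hypothetical audio file on the network file system mounted above
    text = distil_whisper.remote("/root/mp3_files/episode.mp3")
    print(text)

Running modal run transcribe.py spins up the A100 container, executes the function, and tears everything down when it finishes, so you're billed only for the seconds it actually runs.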
Building with Replicate
Similarly, Replicate offers an easy-to-use platform for running models that perform text, image, and audio generation, among other tasks. For example, here is how we would run Stable Diffusion XL (SDXL):
import replicate

# Run SDXL with a pinned model version; the output is the generated image(s)
output = replicate.run(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    input={
        "width": 768,
        "height": 768,
        "prompt": "An astronaut riding a rainbow unicorn, cinematic, dramatic",
        "refine": "expert_ensemble_refiner",
        "scheduler": "K_EULER",
    },
)
print(output)
Replicate features many community models, so you can try out what others are building.
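For SDXL, output is typically a list of image URLs (the client reads your REPLICATE_API_TOKEN environment variable for authentication). Here's a small sketch for saving the first image locally; the filename is our own choice:

import requests

# replicate.run for SDXL typically returns a list of image URLs;
# download the first one and save it to disk
image_url = output[0]
response = requests.get(image_url, timeout=60)
response.raise_for_status()
with open("astronaut_unicorn.png", "wb") as f:
    f.write(response.content)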
Building with LLMs using Amazon Bedrock and Together.ai
Services like Amazon Bedrock and Together.ai are reshaping how builders access and use LLMs by providing serverless endpoints for different models. Both also integrate with frameworks such as LangChain. Let's explore how we can build with them:
Building with Amazon Bedrock
Amazon Bedrock offers a serverless endpoint to seamlessly integrate LLMs like Claude into your applications. It also enhances this experience with a suite of services including guardrails, retrieval-augmented generation (RAG), fine-tuning, and more, allowing builders to experiment and innovate efficiently.
Here's how you can use Claude on Amazon Bedrock:
import json

import boto3

# Set up the Bedrock runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)

def claude_prompt_format(prompt: str) -> str:
    # Add the Human/Assistant headers Claude v2 expects
    return "\n\nHuman: " + prompt + "\n\nAssistant:"

# Call the Claude model
def call_claude(prompt):
    prompt_config = {
        "prompt": claude_prompt_format(prompt),
        "max_tokens_to_sample": 4096,
        "temperature": 0.5,
        "top_k": 250,
        "top_p": 0.5,
        "stop_sequences": [],
    }

    body = json.dumps(prompt_config)
    modelId = "anthropic.claude-v2:1"
    accept = "application/json"
    contentType = "application/json"

    response = bedrock_runtime.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())

    results = response_body.get("completion")
    return results

def summarize_text(text):
    """
    Summarize text using a generative AI model.
    """
    prompt = f"Summarize the following text in 50 words or less: {text}"
    result = call_claude(prompt)
    return result
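To see it end to end, here's a small driver with some hypothetical sample text:

if __name__ == "__main__":
    # Hypothetical sample text to exercise the summarizer
    article = (
        "Serverless GPU platforms bill per second of compute, letting builders "
        "run models like distil-whisper or SDXL without provisioning instances."
    )
    print(summarize_text(article))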
Building with Together.ai
Together.ai provides many open-source models to try, along with features like fine-tuning and custom models.
Here's a simple example of how to use CodeLlama:
import requests

endpoint = "https://api.together.xyz/v1/chat/completions"

# Chat completion request against Together.ai's serverless endpoint
res = requests.post(
    endpoint,
    json={
        "model": "codellama/CodeLlama-70b-Instruct-hf",
        "max_tokens": 500,
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50,
        "repetition_penalty": 1,
        "stop": ["<step>"],
        "messages": [
            {
                "content": "python code to sort a list",
                "role": "user",
            }
        ],
    },
    headers={
        "Authorization": "Bearer TOKEN",
    },
)
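The response follows the familiar chat-completions shape, so pulling out the generated code looks like this (assuming the request succeeded):

# Parse the chat-completions response; the shape mirrors the OpenAI format
res.raise_for_status()
data = res.json()
print(data["choices"][0]["message"]["content"])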
Now let's move on to a new product that isn't quite a GPU.
Groq's Language Processing Unit (LPU)
Groq's LPU Inference Engine delivers some of the fastest LLM inference to date, with benchmark results showing up to 18x faster throughput than other providers.
You can try it out at https://groq.com/ to experience the speed for yourself.
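Groq also exposes an OpenAI-compatible API for programmatic access. Here's a hedged sketch of a chat completion request; the endpoint and model name below are assumptions based on Groq's documentation at the time of writing, and GROQ_API_KEY stands in for your own key:

import requests

# Groq's API follows the OpenAI chat-completions format
# (endpoint and model name are assumptions; check Groq's docs)
endpoint = "https://api.groq.com/openai/v1/chat/completions"
res = requests.post(
    endpoint,
    json={
        "model": "mixtral-8x7b-32768",
        "messages": [{"role": "user", "content": "Explain LPUs in one sentence."}],
    },
    headers={"Authorization": "Bearer GROQ_API_KEY"},
)
print(res.json()["choices"][0]["message"]["content"])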
Now let's move on to another exciting upcoming model for video generation.
OpenAI's Sora: A New Era in Video Content Generation
OpenAI's latest model, Sora, is setting a new standard in video content generation. Sora's ability to generate complex scenes with intricate details and realistic motion marks a significant advancement in AI-driven content creation. For instance, it can generate a scene of a vintage SUV driving up a mountain road, capturing the dust kicked up by the tires and the glow of sunlight on the vehicle, all while maintaining a coherent and realistic setting.
Sora's strengths lie in its deep understanding of language and its ability to generate videos with multiple shots, persistent characters, and consistent visual styles. However, it does have limitations, such as challenges in simulating complex physics and spatial details accurately. For example, it might struggle to show a cookie with a bite mark after someone takes a bite.
As Sora continues to evolve, it's poised to play a crucial role in understanding and simulating the real world, which could open up a whole host of new applications, such as video games, for builders to create.
While we wait for the future, let's turn our attention to a community hackathon that lets you focus on building apps now!
AWS PartyRock Hackathon: Build Generative AI apps without code!
The PartyRock Generative AI Hackathon is a unique opportunity to create innovative apps powered by generative AI. Using PartyRock, an Amazon Bedrock Playground, participants will apply prompt engineering and foundation models (FMs) to build functional applications without any code.
Whether you're interested in creating an interactive learning experience, a creative assistant, experimental entertainment, or something entirely unique, this hackathon offers a platform to showcase your creativity and skills.
You can engage with the community by joining the AWS Community Discord to discuss ideas, seek guidance, and interact with the PartyRock team and fellow builders.
With $20k USD and 100 AWS credits up for grabs, the competition promises to be both challenging and rewarding. The contest concludes on March 11, so gear up and start building!
As we forge ahead in this ever-evolving landscape of generative AI, I encourage each of you to try out many different products. Experiment with them, challenge their limits, and most importantly, let your creativity flourish. Until our next edition, keep building.