Building with Serverless GPUs
Banjo Obayomi
Senior Specialist Solutions Architect GenAI at Amazon Web Services (AWS)
Hello, builders! In this edition, we're exploring building applications using serverless GPUs. As GPU resources become increasingly scarce and expensive, we're highlighting alternative access methods that don't hinge on traditional per-hour billing. We'll show how to use products like Modal and Replicate, which offer more flexible and cost-effective ways to harness GPU power. Additionally, we'll examine how services like Amazon Bedrock and Together.ai provide access to Large Language Models (LLMs).
Streamlining GPU Usage with Modal and Replicate
Serverless GPU services like Modal and Replicate offer per-second billing, so you pay only for the GPU time you actually use, making them a cost-effective option for both small-scale experiments and larger projects. Here's an example of running each one:
Building with Modal
With Modal, you can bring your own custom Docker image, attach storage, and define the GPU you want to use. This is ideal for situations where you have a custom workload, such as transcribing audio files with a model like distil-whisper. Here is some sample code:
import time

import modal
from modal import Image, NetworkFileSystem

# Custom Docker image with the libraries the model needs
transcribe_image = (
    Image.from_registry("nvcr.io/nvidia/pytorch:22.04-py3")
    .pip_install(
        "transformers",
        "optimum",
        "accelerate",
        # ....
    )
    .apt_install("ffmpeg", "git", "curl")
    .run_commands("pip install flash-attn --no-build-isolation", gpu="A100")
)

stub = modal.Stub("audible-example")

# Shared storage for the audio files (the volume name here is our own choice)
volume = NetworkFileSystem.persisted("mp3-files")

# Stub function: runs on an 80 GB A100 with the image and storage defined above
@stub.function(
    image=transcribe_image,
    gpu=modal.gpu.A100(memory=80),
    network_file_systems={"/root/mp3_files": volume},
    timeout=3600,
)
def distil_whisper(audio_file):
    # Heavy imports happen inside the function, where the container image
    # (not your local machine) provides the libraries
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    print(f"Running whisper on {audio_file}")
    start_time = time.time()  # Start the timer

    device = "cuda:0"
    torch_dtype = torch.float16
    model_id = "distil-whisper/distil-large-v2"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        attn_implementation="flash_attention_2",
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe(audio_file)
    print(result["text"])
    print(f"Transcription took {time.time() - start_time:.1f} seconds")
    return result["text"]
Being able to customize the compute, storage, and GPU gives builders the flexibility to tailor solutions to their workloads.
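To trigger the function from your machine, you can add a local entrypoint and invoke it with modal run. Here's a minimal sketch, assuming the code above is saved as transcribe.py; the file path below is a hypothetical example of a file on the network file system:

@stub.local_entrypoint()
def main():
    # Hypothetical audio file on the network file system mounted above
    text = distil_whisper.remote("/root/mp3_files/episode.mp3")
    print(text)

Running modal run transcribe.py spins up the A100 container, executes the function, and tears everything down when it finishes, so you're billed only for the seconds it actually runs.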
Building with Replicate
Similarly, Replicate offers an easy-to-use platform for running models that perform text, image, and audio generation, among other tasks. For example, here is how we would run Stable Diffusion XL (SDXL):
import replicate

# Run SDXL with a pinned model version; the output is the generated image(s)
output = replicate.run(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    input={
        "width": 768,
        "height": 768,
        "prompt": "An astronaut riding a rainbow unicorn, cinematic, dramatic",
        "refine": "expert_ensemble_refiner",
        "scheduler": "K_EULER",
    },
)
print(output)
Replicate features many community models, so you can try out what others are building.
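For SDXL, output is typically a list of image URLs (the client reads your REPLICATE_API_TOKEN environment variable for authentication). Here's a small sketch for saving the first image locally; the filename is our own choice:

import requests

# replicate.run for SDXL typically returns a list of image URLs;
# download the first one and save it to disk
image_url = output[0]
response = requests.get(image_url, timeout=60)
response.raise_for_status()
with open("astronaut_unicorn.png", "wb") as f:
    f.write(response.content)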
Building with LLMs using Amazon Bedrock and Together.ai
Services like Amazon Bedrock and Together.ai are reshaping how builders access and use LLMs by providing serverless endpoints for different models. Both also integrate with frameworks such as LangChain. Let's explore how we can build with them:
Building with Amazon Bedrock
Amazon Bedrock offers a serverless endpoint to seamlessly integrate LLMs like Claude into your applications. It also enhances this experience with a suite of services including guardrails, retrieval-augmented generation (RAG), fine-tuning, and more, allowing builders to experiment and innovate efficiently.
Here's how you can use Claude on Amazon Bedrock:
import json

import boto3

# Set up the Bedrock runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)

def claude_prompt_format(prompt: str) -> str:
    # Add the Human/Assistant headers Claude v2 expects
    return "\n\nHuman: " + prompt + "\n\nAssistant:"

# Call the Claude model
def call_claude(prompt):
    prompt_config = {
        "prompt": claude_prompt_format(prompt),
        "max_tokens_to_sample": 4096,
        "temperature": 0.5,
        "top_k": 250,
        "top_p": 0.5,
        "stop_sequences": [],
    }

    body = json.dumps(prompt_config)
    modelId = "anthropic.claude-v2:1"
    accept = "application/json"
    contentType = "application/json"

    response = bedrock_runtime.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())

    results = response_body.get("completion")
    return results

def summarize_text(text):
    """
    Summarize text using a generative AI model.
    """
    prompt = f"Summarize the following text in 50 words or less: {text}"
    result = call_claude(prompt)
    return result
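To see it end to end, here's a small driver with some hypothetical sample text:

if __name__ == "__main__":
    # Hypothetical sample text to exercise the summarizer
    article = (
        "Serverless GPU platforms bill per second of compute, letting builders "
        "run models like distil-whisper or SDXL without provisioning instances."
    )
    print(summarize_text(article))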
Building with Together.ai
Together.ai provides many open-source models to try, along with features like fine-tuning and custom models.
Here's a simple example of how to use CodeLlama:
import requests

endpoint = "https://api.together.xyz/v1/chat/completions"

# Chat completion request against Together.ai's serverless endpoint
res = requests.post(
    endpoint,
    json={
        "model": "codellama/CodeLlama-70b-Instruct-hf",
        "max_tokens": 500,
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50,
        "repetition_penalty": 1,
        "stop": ["<step>"],
        "messages": [
            {
                "content": "python code to sort a list",
                "role": "user",
            }
        ],
    },
    headers={
        "Authorization": "Bearer TOKEN",
    },
)
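The response follows the familiar chat-completions shape, so pulling out the generated code looks like this (assuming the request succeeded):

# Parse the chat-completions response; the shape mirrors the OpenAI format
res.raise_for_status()
data = res.json()
print(data["choices"][0]["message"]["content"])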
Now let's move on to a new product that isn't quite a GPU.
Groq's Language Processing Unit (LPU)
Groq's LPU Inference Engine delivers some of the fastest LLM inference to date, with benchmark results showing up to 18x faster throughput than other providers.
You can try it out at https://groq.com/ to experience the speed for yourself.
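Groq also exposes an OpenAI-compatible API for programmatic access. Here's a hedged sketch of a chat completion request; the endpoint and model name below are assumptions based on Groq's documentation at the time of writing, and GROQ_API_KEY stands in for your own key:

import requests

# Groq's API follows the OpenAI chat-completions format
# (endpoint and model name are assumptions; check Groq's docs)
endpoint = "https://api.groq.com/openai/v1/chat/completions"
res = requests.post(
    endpoint,
    json={
        "model": "mixtral-8x7b-32768",
        "messages": [{"role": "user", "content": "Explain LPUs in one sentence."}],
    },
    headers={"Authorization": "Bearer GROQ_API_KEY"},
)
print(res.json()["choices"][0]["message"]["content"])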
Now let's move on to another exciting upcoming model for video generation.
OpenAI's Sora: A New Era in Video Content Generation
OpenAI's latest model, Sora, is setting a new standard in video content generation. Sora's ability to generate complex scenes with intricate details and realistic motion marks a significant advancement in AI-driven content creation. For instance, it can generate a scene of a vintage SUV driving up a mountain road, capturing the dust kicked up by the tires and the glow of sunlight on the vehicle, all while maintaining a coherent and realistic setting.
Sora's strengths lie in its deep understanding of language and its ability to generate videos with multiple shots, persistent characters, and consistent visual styles. However, it does have limitations, such as challenges in simulating complex physics and spatial details accurately. For example, it might struggle to show a cookie with a bite mark after someone takes a bite.
As Sora continues to evolve, it's poised to play a crucial role in understanding and simulating the real world, which could open up a whole host of new applications, such as video games, for builders to create.
While we wait for the future, let's turn our attention to a community hackathon that lets you focus on building apps now!
AWS PartyRock Hackathon: Build Generative AI apps without code!
The PartyRock Generative AI Hackathon is a unique opportunity to create innovative apps powered by generative AI. Using PartyRock, an Amazon Bedrock Playground, participants will apply prompt engineering and foundation models (FMs) to build functional applications without any code.
Whether you're interested in creating an interactive learning experience, a creative assistant, experimental entertainment, or something entirely unique, this hackathon offers a platform to showcase your creativity and skills.
You can engage with the community by joining the AWS Community Discord to discuss ideas, seek guidance, and interact with the PartyRock team and fellow builders.
With $20k USD and 100 AWS credits up for grabs, the competition promises to be both challenging and rewarding. The contest concludes on March 11, so gear up and start building!
As we forge ahead in this ever-evolving landscape of generative AI, I encourage each of you to try out many different products. Experiment with them, challenge their limits, and most importantly, let your creativity flourish. Until our next edition, keep building.