Building with Serverless GPUs


Hello, builders! In this edition, we're exploring how to build applications using serverless GPUs. As GPU resources become increasingly scarce and expensive, we're highlighting access methods that don't hinge on traditional per-hour billing. We'll show how to use products like Modal and Replicate, which offer more flexible and cost-effective ways to harness GPU power, and we'll also examine how services like Amazon Bedrock and Together.ai provide access to Large Language Models (LLMs).

Streamlining GPU Usage with Modal and Replicate

Serverless GPU services like Modal and Replicate offer per-second billing, so you only pay for the GPU time you actually use. That makes them cost-effective for both small-scale experiments and larger projects. To make the difference concrete, here's a rough back-of-the-envelope comparison (the prices below are illustrative assumptions, not quotes from any provider):
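# Illustrative, assumed prices -- check each provider for current rates
per_second_rate = 0.0011  # $/s for a serverless A100 (assumption)
per_hour_rate = 4.00      # $/h for a traditional on-demand A100 (assumption)

job_seconds = 90  # a short transcription job

serverless_cost = job_seconds * per_second_rate
hourly_cost = per_hour_rate  # per-hour billing charges the full hour

print(f"Serverless: ${serverless_cost:.2f} vs hourly: ${hourly_cost:.2f}")
# Serverless: $0.10 vs hourly: $4.00

With that in mind, here's an example of running each one: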

Building with Modal

With Modal you can bring your own custom Docker image, attach storage, and define the GPU you want to use. This is ideal for situations where you have a custom workload, such as transcribing audio files with a model like distil-whisper. Here is some sample code:


import time

import modal
import torch
from modal import Image, NetworkFileSystem
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Custom Docker image with the libraries the model needs
transcribe_image = (
    Image.from_registry("nvcr.io/nvidia/pytorch:22.04-py3")
    .pip_install(
        "transformers",
        "optimum",
        "accelerate",
        #....
    )
    .apt_install("ffmpeg", "git", "curl")
    .run_commands("pip install flash-attn --no-build-isolation", gpu="A100")
)

stub = modal.Stub("audible-example")

# Persisted storage for the audio files (the volume name is an example)
volume = NetworkFileSystem.persisted("mp3-volume")

# Stub function: runs on an 80GB A100 with the volume mounted
@stub.function(
    image=transcribe_image,
    gpu=modal.gpu.A100(memory=80),
    network_file_systems={"/root/mp3_files": volume},
    timeout=3600,
)
def distil_whisper(audio_file):
    print(f"Running whisper on {audio_file}")
    start_time = time.time()  # Start the timer

    device = "cuda:0"
    torch_dtype = torch.float16
    model_id = "distil-whisper/distil-large-v2"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        attn_implementation="flash_attention_2",
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    # The pipeline needs the tokenizer and feature extractor from the processor
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe(audio_file)
    print(result["text"])
    print(f"Transcription took {time.time() - start_time:.1f}s")

    return result["text"]
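To kick off a transcription, you can add a local entrypoint and launch it with `modal run`. A minimal sketch, assuming the audio file already exists on the mounted volume (the path and filename are placeholders):

@stub.local_entrypoint()
def main():
    # Placeholder path on the mounted network file system
    text = distil_whisper.remote("/root/mp3_files/episode.mp3")
    print(text)

Running `modal run transcribe.py` builds the image, spins up the A100, runs the function, and tears everything down when it completes.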

Being able to customize the compute, storage, and GPU gives builders the flexibility to tailor solutions to their tasks.

Building with Replicate

Similarly, Replicate offers an easy-to-use platform for running models that can perform text, image, and audio generation, and more. For example, here is how we would run Stable Diffusion:

import replicate

output = replicate.run(
  "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
  input={
    "width": 768,
    "height": 768,
    "prompt": "An astronaut riding a rainbow unicorn, cinematic, dramatic",
    "refine": "expert_ensemble_refiner",
    "scheduler": "K_EULER",
  }
)

print(output)        
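For image models like SDXL, `replicate.run` typically returns a list of URLs for the generated images, so saving the result takes only a few lines (the filename here is just an example):

import requests

# Download the first generated image (output is a list of image URLs)
image_data = requests.get(output[0]).content
with open("astronaut.png", "wb") as f:
    f.write(image_data)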

Replicate also features many community models, so you can try out what others are building.

Building with LLMs using Amazon Bedrock and Together.ai

Services like Amazon Bedrock and Together.ai are reshaping how builders access and utilize LLMs by providing serverless endpoints for different models. Both also integrate with frameworks such as LangChain. Let's explore how we can build with them:

Building with Amazon Bedrock

Amazon Bedrock offers a serverless endpoint to seamlessly integrate LLMs like Claude into your applications. It also enhances this experience by offering a suite of services including guardrails, RAG, fine-tuning, and more, allowing builders to experiment and innovate efficiently.

Here's how you can use Claude on Amazon Bedrock:

import json

import boto3

# Setup bedrock
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)

def claude_prompt_format(prompt: str) -> str:
    # Add headers to start and end of prompt
    return "\n\nHuman: " + prompt + "\n\nAssistant:"

# Call Claude model
def call_claude(prompt):
    prompt_config = {
        "prompt": claude_prompt_format(prompt),
        "max_tokens_to_sample": 4096,
        "temperature": 0.5,
        "top_k": 250,
        "top_p": 0.5,
        "stop_sequences": [],
    }

    body = json.dumps(prompt_config)

    modelId = "anthropic.claude-v2:1"
    accept = "application/json"
    contentType = "application/json"

    response = bedrock_runtime.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())

    results = response_body.get("completion")
    return results

def summarize_text(text):
    """
    Function to summarize text using a generative AI model.
    """
    prompt = f"Summarize the following text in 50 words or less: {text}"
    result = call_claude(prompt)
    return result        
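A quick usage sketch to put it together (the sample text is just a stand-in for your own content):

article = (
    "Serverless GPU platforms bill per second, so builders only pay "
    "for the compute they actually use instead of idle hourly instances."
)
print(summarize_text(article))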

Building with Together.ai

Together.ai provides many open-source models to try, along with features like fine-tuning and custom model hosting.

Here's a simple example of how to use CodeLlama:

import requests

endpoint = "https://api.together.xyz/v1/chat/completions"
res = requests.post(
    endpoint,
    json={
        "model": "codellama/CodeLlama-70b-Instruct-hf",
        "max_tokens": 500,
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50,
        "repetition_penalty": 1,
        "stop": ["<step>"],
        "messages": [
            {
                "content": "python code to sort a list",
                "role": "user",
            }
        ],
    },
    headers={
        "Authorization": "Bearer TOKEN",
    },
)
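The endpoint follows the OpenAI chat-completions response format, so extracting the generated code is one line:

print(res.json()["choices"][0]["message"]["content"])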

Now let's move on to a new product that isn't quite a GPU.

Groq's Language Processing Unit (LPU)

Groq's LPU Inference Engine has posted the fastest inference results to date, up to 18x faster than other providers. This efficiency is evident in two crucial areas:

  • Output Tokens Throughput: Groq achieved an average of 185 tokens/s, significantly outperforming others by 3-18 times.
  • Time to First Token (TTFT): A TTFT of 0.22s ensures consistent and rapid responses, ideal for low-latency applications such as chatbots.
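Both metrics are easy to measure yourself against any OpenAI-compatible streaming endpoint. Here's a minimal sketch; the base URL, model name, and token are assumptions to swap for whichever provider you're testing:

import time
from openai import OpenAI

# Assumed endpoint and model -- substitute the provider you want to benchmark
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="TOKEN")

start = time.time()
first_token_time = None
chunks = 0

stream = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # example model name
    messages=[{"role": "user", "content": "Explain GPUs in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time()  # time to first token
        chunks += 1  # chunk count is a rough proxy for output tokens

elapsed = time.time() - start
print(f"TTFT: {first_token_time - start:.2f}s, ~{chunks / elapsed:.0f} tokens/s")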


You can try it out at https://groq.com/ to experience the speed yourself.

Now let's move on to another exciting upcoming model for video generation.

OpenAI's Sora: A New Era in Video Content Generation

OpenAI's latest model, Sora, is setting a new standard in video content generation. Sora's ability to generate complex scenes with intricate details and realistic motion marks a significant advancement in AI-driven content creation. For instance, it can generate a scene of a vintage SUV driving up a mountain road, capturing the dust from the tires and the glow of sunlight on the vehicle, all while maintaining a coherent and realistic setting.

The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from...

Sora's strengths lie in its deep understanding of language and ability to generate videos with multiple shots, persisting characters, and consistent visual styles. However, it does have limitations, such as challenges in simulating complex physics and spatial details accurately. For example, it might struggle with showing a cookie with a bite mark after someone takes a bite.

As Sora continues to evolve, it's poised to play a crucial role in understanding and simulating the real world, which could unlock a whole host of new applications, like video games, for builders to create.

While we wait for the future, let's turn our attention to a community hackathon that lets you focus on building apps now!

AWS PartyRock Hackathon: Build Generative AI apps without code!

The PartyRock Generative AI Hackathon is a unique opportunity to create innovative apps powered by generative AI. Using PartyRock, an Amazon Bedrock playground, participants will apply prompt engineering and foundation models (FMs) to build functional applications without writing any code.

Haiku App Example

Whether you're interested in creating an interactive learning experience, a creative assistant, experimental entertainment, or something entirely unique, this hackathon offers a platform to showcase your creativity and skills.

You can engage with the community by joining the AWS Community Discord to discuss ideas, seek guidance, and interact with the PartyRock team and fellow builders.

With $20k USD and 100 AWS credits up for grabs, the competition promises to be both challenging and rewarding. The contest concludes on March 11, so gear up and start building!

As we forge ahead in this ever-evolving landscape of generative AI, I encourage each of you to try out many different products. Experiment with them, challenge their limits, and most importantly, let your creativity flourish. Until our next edition, keep building.


