Cloud Native and AI: Better Together

I enjoyed keynoting the first-ever KubeCon in India. Having grown up in Delhi, I felt like I was coming home. It was great to engage with passionate attendees and their strong eagerness to learn and contribute. Even though I moved out of Delhi 25+ years ago, I certainly enjoyed sharing travel tips with others visiting from outside the city.

My keynote topic was “Cloud Native and AI: Better Together.” This article summarizes the overall message.

Credits: Lachlan Evanson

Over the last decade, Cloud Native has adapted to support stateless, stateful, and serverless workloads. Several traits of the platform make it suitable for running a wide variety of workloads:

  • Scalability: compute resources scale easily based on demand
  • Cost efficiency: pay-as-you-go pricing keeps TCO down
  • Containerization: the same package works across different compute environments
  • Harmony: dev, test, staging, and production environments stay consistent
  • High availability: minimizes downtime
  • Microservices: applications break down into smaller components that scale independently based on compute/memory/IO requirements

These are exactly the reasons Cloud Native is an ideal platform for running AI workloads as well. Over the last ten years, a large knowledge base on operating Cloud Native platforms has been built up: running mission-critical systems at scale, design patterns and anti-patterns, managed services from hyperscalers and in data centers, skilled practitioners, and much more. Leveraging that knowledge to run AI workloads is a logical, evolutionary step.

Let's examine Cloud Native AI (CNAI) from three perspectives: Kubernetes, the ML engineer, and the app developer.

Kubernetes

What has been done in Kubernetes to make it AI-friendly?

Up until 1.26, K8s could only handle integer-countable resources such as RAM and CPU. That release introduced a new API called Dynamic Resource Allocation (DRA). This API provides a much richer interface for requesting and configuring generic resources, such as GPUs. DRA is a generalization of the persistent volumes API for generic resources. It allows hardware vendors to extend Kubernetes by writing DRA drivers, which are responsible for the hardware and the user-facing interface.

Existing device plugins limit users to assigning a device to one container. DRA enables GPU devices to be shared across different containers and pods so you can flexibly choose how they’re used. DRA also defines how device resources look from the node to the runtime. This makes K8s GPU-friendly and thus easy to use for your training and inferencing workloads.

The API was introduced as alpha in the 1.26 release and graduated to beta in 1.32.
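
To make this concrete, here is a rough sketch of what requesting a GPU through DRA could look like from Python, using the Kubernetes client to create a ResourceClaim against the resource.k8s.io/v1beta1 API. The device class name gpu.example.com is a hypothetical placeholder; the actual name comes from the DRA driver shipped by your hardware vendor.

```python
# A minimal sketch: request a GPU via a DRA ResourceClaim (resource.k8s.io/v1beta1, Kubernetes 1.32).
# The device class name "gpu.example.com" is a hypothetical placeholder supplied by a vendor's DRA driver.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

resource_claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "single-gpu"},
    "spec": {
        "devices": {
            "requests": [
                {
                    "name": "gpu",
                    "deviceClassName": "gpu.example.com",  # hypothetical; defined by the DRA driver
                }
            ]
        }
    },
}

# ResourceClaim lives in a non-core API group, so the generic CustomObjectsApi path works for it.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="resource.k8s.io",
    version="v1beta1",
    namespace="default",
    plural="resourceclaims",
    body=resource_claim,
)
```

A pod then references the claim by name under spec.resourceClaims, and each container lists it under resources.claims, instead of asking for an integer count of devices.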

ML Engineer

You are an ML engineer and have heard good things about Cloud Native. How do you get started?

Kubeflow is an ecosystem of k8s-based components for each stage of the AI/ML lifecycle with support for best-in-class open-source tools and frameworks. It makes AI/ML simple, portable, and scalable.

Kubeflow

Kubeflow provides tools for data preparation (Spark Operator), model training (Training Operator), hyperparameter optimization (Katib), model serving (KServe), model metadata (Model Registry), workflows (Pipelines), and much more. It uses Kubernetes as the base compute layer, which allows it to run on any hyperscaler, in a data center, or even on your laptop.
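
To give a flavor of the developer experience, here is a minimal, illustrative sketch using the Kubeflow Pipelines v2 SDK: two toy steps defined as Python components and compiled into a pipeline spec that a Kubeflow deployment can run. The step names and contents are placeholders, not a real training workflow.

```python
# An illustrative sketch using the Kubeflow Pipelines (kfp) v2 SDK.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def prepare_data(rows: int) -> str:
    """Pretend to prepare a dataset and return a (hypothetical) dataset URI."""
    return f"s3://my-bucket/dataset-{rows}-rows"  # placeholder URI


@dsl.component(base_image="python:3.11")
def train_model(dataset_uri: str) -> str:
    """Pretend to train a model on the prepared dataset."""
    return f"model trained on {dataset_uri}"


@dsl.pipeline(name="toy-training-pipeline")
def training_pipeline(rows: int = 1000):
    data_task = prepare_data(rows=rows)
    train_model(dataset_uri=data_task.output)


if __name__ == "__main__":
    # Produces a pipeline definition that can be uploaded to the Kubeflow Pipelines UI
    # or submitted programmatically against your cluster's Pipelines endpoint.
    compiler.Compiler().compile(training_pipeline, "toy_training_pipeline.yaml")
```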

If you are an ML engineer and want to leverage the benefits of the Cloud Native platform, Kubeflow is your framework of choice.

You should also look at Kueue, which provides a job queueing system for HPC and AI/ML workloads.
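
As a rough sketch of how that looks in practice: assuming Kueue is installed and a LocalQueue exists in your namespace (team-queue below is a hypothetical name), you label a regular batch Job with the queue name, and Kueue admits it when quota is available.

```python
# A rough sketch: submit a plain batch/v1 Job that Kueue queues and admits.
# Assumes Kueue is installed and a LocalQueue named "team-queue" (hypothetical) exists in the namespace.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="training-job",
        labels={"kueue.x-k8s.io/queue-name": "team-queue"},  # tells Kueue which LocalQueue to use
    ),
    spec=client.V1JobSpec(
        suspend=True,  # Kueue unsuspends the Job once it admits it
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="train",
                        image="python:3.11",
                        command=["python", "-c", "print('training...')"],
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "4", "memory": "8Gi"}
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```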

App Developer

If you are an application developer, then you're looking for an opinionated stack that allows you to integrate GenAI into your applications. You need a blueprint with a pre-configured set of LLMs/SLMs, a vector database, and other necessary components such as an embedding model, a retriever, and a re-ranker. More often than not, you don't even know which components are required. You need a Helm chart, with all the required components, that deploys into your existing k8s cluster with a single click. This is where the Open Platform for Enterprise AI (OPEA) fits in.

OPEA

It provides 30+ component-level microservices, such as LLMs, vector databases, and all the other pieces needed for GenAI. It composes these microservices into GenAI blueprints, or megaservices, that can be deployed in any k8s cluster. For example, ChatQnA provides a chatbot that integrates with your enterprise data using Retrieval Augmented Generation (RAG). It comes with TGI as the text-generation LLM, a Redis vector database, TEI embeddings, and other components. The project has 20+ GenAI blueprints, such as AudioQnA, VideoQnA, Agentic workflow, Code generator, and much more.
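
Once a blueprint such as ChatQnA is deployed, your application talks to the megaservice over plain HTTP. The sketch below is illustrative only; the endpoint path, port, and request schema (here /v1/chatqna on port 8888 with a messages field) depend on the ChatQnA release and on how the service is exposed in your cluster.

```python
# An illustrative sketch of calling a deployed ChatQnA megaservice over HTTP.
# The URL, port, and request/response schema are assumptions; check the OPEA ChatQnA
# documentation for the exact contract of the release you deploy.
import requests

# e.g. a port-forwarded or LoadBalancer-exposed megaservice endpoint (hypothetical)
CHATQNA_URL = "http://localhost:8888/v1/chatqna"

payload = {
    "messages": "What is our refund policy?",  # the question; RAG supplies enterprise context
}

response = requests.post(CHATQNA_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.text)  # generated answer grounded in the retrieved documents
```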

The ChatQnA example is also available on the AWS Marketplace. It runs on top of Amazon EKS and uses OpenSearch as the vector database. It is also integrated with Amazon Bedrock, which gives you access to a wide range of LLMs.

Summary

If you are passionate about how Cloud Native is going to support AI workloads, I recommend joining the AI Working Group. We have already released the Cloud Native AI Whitepaper. Now, we are working on three new whitepapers: Cloud Native AI Scheduling, Cloud Native AI Security, and Cloud Native AI Sustainability. We are also working on validating OPEA samples on ARM architecture. We can provide infrastructure and cloud credits for you to get started; we just need people who are willing to do the work.

Join the CNCF Slack and say hello in the #wg-artificial-intelligence channel.

Let’s make Cloud Native the best platform for AI, together!

P.S.: The graphic is inspired by Cassandra Chin's Phippy and AI book.
