BentoML

Software Development

San Francisco, California · 8,674 followers

Build and Scale Compound AI Systems

About us

Compute orchestration platform for rapid and reliable GenAI adoption, from model inference to advanced AI applications.

Website
https://www.bentoml.com
Industry
Software Development
Company size
11-50 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2019
Specialties
Model Serving, Model Inference, Inference Platform, Compound AI Systems, Multimodality, AI Inference, LLM Inference, LLM Applications, MLOps, and LLMOps

Products

Locations

  • Primary

    650 California St

    6th Fl

    San Francisco, California 94108, US

    Get directions

Employees at BentoML

Updates

  • View BentoML's company page

    This is why Kubernetes isn't enough for scaling LLM inference workloads.

    View Chaoyu Yang's profile

    Founder & CEO of BentoML, compute orchestration platform for rapid and reliable GenAI adoption.

    How to autoscale LLM inference workloads properly? Autoscaling is critical for online LLM inference workloads, as it helps handle the traffic without over-provisioning GPUs that you don't need. But implementing autoscaling for LLM inference is not as straightforward as you may think. Traditional container orchestration platforms (like Kubernetes), which only have access to resource utilization and simple request metrics, don't cut it:

    • GPU memory (DCGM_FI_DEV_FB_USED): Amount of GPU memory used. Doesn't apply to workloads that preallocate GPU memory (e.g. the vLLM KV cache).

    • GPU utilization (DCGM_FI_DEV_GPU_UTIL): Amount of time the GPU is active. Does not measure how much effective compute is being done (e.g. batch size).

    • QPS (Queries Per Second): A simple request-based scaling metric. Not applicable to LLM workloads, because processing time per request varies with input and output token length and cache hits.

    • Queue size: Number of requests pending in an external queue before they're processed. This is easy to implement for workloads without batching; for LLM workloads with continuous batching, additional guardrails on concurrency control are required.

    What's proven to work effectively is concurrency-based autoscaling, where concurrency represents the number of active requests being processed (see image below). It accurately reflects system load, scales precisely, and is easy to configure based on batch size. The only downside: it requires a specialized infrastructure and serving stack, which can be complex and time-consuming to build and optimize:

    • Workload-specific metrics are required for visibility into batch size, queue size, inference latency, and request concurrency. AI teams should ship AI-specific containers that include those metrics, and pair them with infrastructure that leverages those metrics for scaling.

    • Cold start acceleration is necessary for efficient scaling. Pulling large container images and loading large models can drastically slow down scale-up, leading to failed requests or slow responses.

    • Scaling to zero reduces cost by scaling inactive models down to zero replicas to free up compute resources, spinning the model up only when a request is received.

    At BentoML, we've optimized every layer of the inference and serving stack to ensure efficient scaling of private LLM inference workloads, while allowing developers to easily fine-tune scaling behavior for their specific needs.

    Check out our team's learnings on scaling AI inference at BentoML, by Sean Sheng: https://lnkd.in/gZvNgE3i

    BentoML documentation on concurrency and autoscaling: https://lnkd.in/gmVpgW93

    #LLM #Autoscaling #Inference #OpenLLM

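    The concurrency-based approach above maps to a service-level setting in BentoML's traffic configuration (see the "Concurrency and autoscaling" docs linked in the post). Below is a minimal Python sketch, assuming BentoML 1.2+ and its @bentoml.service decorator; the concurrency value, the external_queue flag, and the placeholder generate body are illustrative only and should be tuned to the serving engine's effective batch size.

    # Minimal sketch of concurrency-based autoscaling with BentoML.
    # Assumes BentoML >= 1.2; the service body stands in for a real
    # LLM engine (e.g. vLLM) and the numbers are illustrative only.
    import bentoml

    @bentoml.service(
        resources={"gpu": 1},
        traffic={
            # Target number of in-flight requests per replica. Matching this
            # to the engine's effective batch size lets the platform scale on
            # request concurrency rather than GPU utilization or QPS.
            "concurrency": 32,
            # Buffer excess requests in an external queue instead of
            # overloading replicas while new ones spin up.
            "external_queue": True,
        },
    )
    class LLMService:
        @bentoml.api
        def generate(self, prompt: str) -> str:
            # Placeholder: a real service would call the inference engine here.
            return f"echo: {prompt}"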
  • View BentoML's company page

    Today’s the day! Happening this evening from 5:30 PM - 8:30 PM PDT in San Francisco! There’s still time to register! See you there!

    View BentoML's company page

    Join us for SF #TechWeek: Data & AI Edition on October 8th from 5:30 PM - 8:30 PM PDT! We will co-host this event with Datastrato and Zilliz, featuring key insights on AI and data tech:

    • Multimodal Search with Open-Source Tools, Stefan Webb, DevRel, Zilliz

    • A Guide to Compound AI Systems, Chaoyu Yang, Founder & CEO, BentoML

    • Timeplus - One Single Binary to Tackle Streaming and Historical Analytics, Ken Chen, Co-Founder & Chief Architect, Timeplus

    Don't miss out on the chance to learn, network, and dive into the latest advancements in AI and data technology! Register now to secure your spot! See comments for the link.

    #AI #DataScience #BentoML #Datastrato #Zilliz #Timeplus #OpenSource

  • View BentoML's company page

    Just 3 days to go! On October 8th from 5:30 PM - 8:30 PM PDT in SF, you can expect an evening packed with cutting-edge AI and data insights! Don’t miss out and register now to secure your spot!

  • View BentoML's company page

    Happening today at 9 AM PT! Is LLM deployment draining your budget and patience? Ready to tackle the rising costs, performance issues, and data security headaches? Join us TODAY with BentoML CEO Chaoyu Yang!

    View BentoML's company page

    Join the first LIVE #AGIBuildersMeetup on 10/02, 9 AM PT:

    • Self-hosting #LLM vs #Serverless API

    • Uncover hidden costs & optimize performance

    • Live Q&A with BentoML CEO Chaoyu Yang

    • Can't attend? Register to get the recording

    • Spread the love with likes, shares, and invites

    #AI #GenAI #ML


  • View BentoML's company page

    Only 3 days left! Join our CEO Chaoyu Yang to explore cost-saving strategies and performance optimization for LLMs. Register now to reserve your spot!


  • View BentoML's company page

    We’re thrilled to announce that #BentoCloud is now available on Amazon Web Services (AWS) Marketplace! This opens new opportunities for #AWS customers to leverage a complete platform to build and scale #CompoundAI systems!

    BentoCloud takes the infrastructure complexity out of production #AI workloads. It enables AI teams to run inference with unparalleled efficiency, rapidly iterate on system design, and effortlessly scale in production. With the built-in observability features, BentoCloud empowers you to optimize your AI operations and stay ahead in the fast-paced world of enterprise AI.

    Explore BentoCloud on AWS Marketplace: https://lnkd.in/g8KaiXGN

    Questions? Contact us: https://lnkd.in/gxcD-9k8

    #BentoML #MachineLearning

  • View BentoML's company page

    Exciting times in the open-source AI world again! AI at Meta launched Llama 3.2, introducing multimodal capabilities with Llama Vision and new small models for on-device applications, with over 10 variants from 1B to 90B!

    Try running inference with OpenLLM and BentoML:

    #OpenLLM
    openllm serve llama3.2:1b
    openllm serve llama3.2:3b
    openllm serve llama3.2:11b-vision

    #BentoML
    11B: https://lnkd.in/gaJXB6Md
    90B: https://lnkd.in/ggY5B4Te

    Here is Llama 3.2 Vision on #BentoCloud.

    #AI #MachineLearning #OpenSource #Llama32

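    Per the OpenLLM project, the openllm serve commands above start a local server that exposes an OpenAI-compatible API. Below is a minimal Python sketch of querying it with the official openai client; the port (3000), base URL path, and model identifier are assumptions to verify against your running instance (e.g. via its /v1/models endpoint).

    # Minimal sketch: querying a locally running `openllm serve llama3.2:1b`
    # instance through its OpenAI-compatible API. The base_url, port, and
    # model name are assumptions; check the server's /v1/models output.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:3000/v1",  # assumed default OpenLLM port
        api_key="na",                         # local servers typically need no key
    )

    response = client.chat.completions.create(
        model="llama3.2:1b",  # assumed identifier; list models to confirm
        messages=[
            {"role": "user", "content": "Summarize BentoML in one sentence."}
        ],
    )
    print(response.choices[0].message.content)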

Similar pages

View jobs

Funding