Serverless Functions in Private Cloud—Part (5) of stories behind Meta/Facebook’s systems research papers

This post is part 5 of the series that introduces Meta’s systems research papers to a broader audience. Specifically, this post is about the following paper at SOSP’23: XFaaS: Hyperscale and Low Cost Serverless Functions at Meta. At Meta, more teams write code in the form of serverless functions, without worrying about operational issues, than write code for regular services that engineers operate themselves.

Paper Abstract

Function-as-a-Service (FaaS) has become a popular programming paradigm in Serverless Computing. As the responsibility of resource provisioning shifts from users to cloud providers, the ease of use of FaaS for users may come at the expense of extra hardware costs for cloud providers. Currently, there is no report on how FaaS platforms address this challenge and the level of hardware utilization they achieve.

This paper presents the FaaS platform called XFaaS in Meta's hyperscale private cloud. XFaaS currently processes trillions of function calls per day on more than 100,000 servers. We describe a set of optimizations that help XFaaS achieve a daily average CPU utilization of 66%. Based on our anecdotal knowledge, this level of utilization might be several times higher than that of typical FaaS platforms.

Specifically, to eliminate the cold start time of functions, XFaaS strives to approximate the effect that every worker can execute every function immediately. To handle load spikes without over-provisioning resources, XFaaS defers the execution of delay-tolerant functions to off-peak hours and globally dispatches function calls across datacenter regions. To prevent functions from overloading downstream services, XFaaS uses a TCP-like congestion-control mechanism to pace the execution of functions.
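
To make the congestion-control idea concrete, here is a minimal sketch, assuming an AIMD-style (additive-increase, multiplicative-decrease) pacer that caps the number of in-flight calls to one downstream service. The class name DownstreamPacer, its parameters, and its callbacks are illustrative assumptions for this post, not the actual XFaaS implementation or API.

# Illustrative sketch (not XFaaS code): an AIMD-style pacer that bounds the
# number of in-flight function executions targeting one downstream service.

class DownstreamPacer:
    def __init__(self, initial_window=10, min_window=1,
                 additive_step=1, backoff_factor=0.5):
        self.window = initial_window          # max in-flight executions allowed
        self.min_window = min_window          # never throttle below this floor
        self.additive_step = additive_step    # additive increase on success
        self.backoff_factor = backoff_factor  # multiplicative decrease on overload
        self.in_flight = 0

    def can_dispatch(self):
        # A worker consults this before starting another call to the downstream.
        return self.in_flight < self.window

    def on_dispatch(self):
        self.in_flight += 1

    def on_success(self):
        # Downstream responded normally: grow the window slowly.
        self.in_flight -= 1
        self.window += self.additive_step

    def on_pushback(self):
        # Downstream signalled overload (e.g., a throttling error):
        # shrink the window quickly, like TCP reacting to packet loss.
        self.in_flight -= 1
        self.window = max(self.min_window, int(self.window * self.backoff_factor))

Like TCP, the window ramps up gradually while the downstream service responds normally and backs off sharply as soon as the service signals overload, so a surge of function executions cannot overwhelm its dependencies.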

Noteworthy Aspects of Our Serverless Workloads

In addition to the details in the paper, I would like to highlight two noteworthy aspects of our serverless workloads.

Highly Spiky Workloads

Most publications about serverless functions (i.e., lambda functions) focus on reducing function cold start times, an issue this paper also addresses. However, prior work often overlooks another equally important, or possibly even more critical, issue: highly spiky workloads. Specifically, at Meta, the peak submission rate of serverless functions is 4.3 times the off-peak rate, and overall the XFaaS platform uses more than 100,000 machines to run serverless functions. Consequently, provisioning XFaaS to match peak demand would result in very low resource utilization during off-peak hours. We believe this is a major challenge that also affects other large serverless platforms. XFaaS employs a combination of techniques to effectively smooth out function execution, including deferring the execution of delay-tolerant functions to off-peak hours, delaying the execution of functions that have exceeded their quotas, and globally load-balancing function calls across datacenter regions, among other strategies.
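
To illustrate just the deferral part of this, below is a minimal sketch, assuming each delay-tolerant call carries a completion deadline and a periodic drain loop releases deferred calls when utilization is low or a deadline is about to expire. The DeferringQueue name, the utilization threshold, and the interfaces are simplified assumptions, not the actual XFaaS scheduler.

import heapq
import itertools
import time

# Illustrative sketch (not the actual XFaaS scheduler): latency-sensitive calls
# run immediately, while delay-tolerant calls are parked until utilization is
# low or their completion deadline is about to expire.

class DeferringQueue:
    def __init__(self, utilization_threshold=0.7):
        self.utilization_threshold = utilization_threshold
        self._deferred = []                 # min-heap keyed by deadline
        self._tiebreak = itertools.count()  # avoids comparing callables on ties

    def submit(self, func, deadline, current_utilization):
        # deadline is None for latency-sensitive calls, which always run now.
        if deadline is None or current_utilization < self.utilization_threshold:
            func()
        else:
            heapq.heappush(self._deferred, (deadline, next(self._tiebreak), func))

    def drain(self, current_utilization):
        # Called periodically: release deferred work during off-peak hours,
        # or once a call's deadline is imminent.
        now = time.time()
        while self._deferred:
            deadline, _, func = self._deferred[0]
            if current_utilization < self.utilization_threshold or deadline <= now:
                heapq.heappop(self._deferred)
                func()
            else:
                break

For example, submitting a daily-digest function with a deadline one hour out would park it in the heap during a load spike and release it as soon as utilization dips below the threshold, or at the deadline at the latest.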

Event-driven Workloads

In general, our serverless workloads are mostly event driven and seldom handle user-facing interactive requests that demand sub-second response times, such as newsfeed display or search result ranking. Traditionally, at Meta, these interactive user requests are handled by long-running services, as serverless functions do not offer significant advantages in these scenarios.

Serverless functions are typically lauded for two primary benefits: (1) pay-as-you-go and no upfront capacity planning, and (2) streamlined deployment where developers only write code, and the serverless platform handles deployment automatically. However, for user-facing requests requiring sub-second response times, the first benefit loses relevance because meticulous capacity planning is needed to provide guaranteed capacity and ensure delightful experiences for the billions of users of Meta products. This is very different from the scenario of a small product with a limited user base, where occasional user experience degradation is acceptable, and cloud providers are expected to have spare capacity to accommodate load spikes of small products.

Moreover, at Meta, the second benefit can be achieved through full deployment automation without employing serverless functions. Notably, our continuous deployment tool, Conveyor, already deploys 97% of all services without any human intervention, and even serverless functions are deployed through Conveyor. For these reasons, serverless functions at Meta are rarely used to handle user-facing requests requiring sub-second response times.

Conclusion

With the serverless computing paradigm, the responsibility of resource provisioning shifts from users to cloud providers, and the ease of use of FaaS for users may come at the expense of extra hardware costs for cloud providers. This paper reports how we achieve a daily average CPU utilization of 66% in Meta’s serverless platform, XFaaS. If interested, please read the full paper.
