Serverless Functions in Private Cloud—Part (5) of stories behind Meta/Facebook’s systems research papers

This post is part 5 of the series that introduces Meta’s systems research papers to a broader audience. Specifically, this post is about the following paper at SOSP’23: XFaaS: Hyperscale and Low Cost Serverless Functions at Meta. At Meta, more teams write code in the form of serverless functions, without worrying about operational issues, than write code for regular services that engineers operate themselves.

Paper Abstract

Function-as-a-Service (FaaS) has become a popular programming paradigm in Serverless Computing. As the responsibility of resource provisioning shifts from users to cloud providers, the ease of use of FaaS for users may come at the expense of extra hardware costs for cloud providers. Currently, there is no report on how FaaS platforms address this challenge and the level of hardware utilization they achieve.

This paper presents the FaaS platform called XFaaS in Meta's hyperscale private cloud. XFaaS currently processes trillions of function calls per day on more than 100,000 servers. We describe a set of optimizations that help XFaaS achieve a daily average CPU utilization of 66%. Based on our anecdotal knowledge, this level of utilization might be several times higher than that of typical FaaS platforms.

Specifically, to eliminate the cold start time of functions, XFaaS strives to approximate the effect that every worker can execute every function immediately. To handle load spikes without over-provisioning resources, XFaaS defers the execution of delay-tolerant functions to off-peak hours and globally dispatches function calls across datacenter regions. To prevent functions from overloading downstream services, XFaaS uses a TCP-like congestion-control mechanism to pace the execution of functions.
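
To make the congestion-control idea concrete, here is a minimal sketch, assuming an AIMD-style (additive-increase, multiplicative-decrease) pacer that caps the number of in-flight calls to one downstream service. The class name DownstreamPacer, its parameters, and its callbacks are illustrative assumptions for this post, not the actual XFaaS implementation or API.

# Illustrative sketch (not XFaaS code): an AIMD-style pacer that bounds the
# number of in-flight function executions targeting one downstream service.

class DownstreamPacer:
    def __init__(self, initial_window=10, min_window=1,
                 additive_step=1, backoff_factor=0.5):
        self.window = initial_window          # max in-flight executions allowed
        self.min_window = min_window          # never throttle below this floor
        self.additive_step = additive_step    # additive increase on success
        self.backoff_factor = backoff_factor  # multiplicative decrease on overload
        self.in_flight = 0

    def can_dispatch(self):
        # A worker consults this before starting another call to the downstream.
        return self.in_flight < self.window

    def on_dispatch(self):
        self.in_flight += 1

    def on_success(self):
        # Downstream responded normally: grow the window slowly.
        self.in_flight -= 1
        self.window += self.additive_step

    def on_pushback(self):
        # Downstream signalled overload (e.g., a throttling error):
        # shrink the window quickly, like TCP reacting to packet loss.
        self.in_flight -= 1
        self.window = max(self.min_window, int(self.window * self.backoff_factor))

Like TCP, the window ramps up gradually while the downstream service responds normally and backs off sharply as soon as the service signals overload, so a surge of function executions cannot overwhelm its dependencies.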

Noteworthy Aspects of Our Serverless Workloads

In addition to the details in the paper, I would like to highlight two noteworthy aspects of our serverless workloads.

Highly Spiky Workloads

Most publications about serverless functions (i.e., lambda functions) focus on reducing function cold start times, an issue this paper also addresses. However, prior work often overlooks another equally important, or possibly even more critical, issue: highly spiky workloads. Specifically, at Meta, the peak submission rate of serverless functions is 4.3 times the off-peak rate, and overall the XFaaS platform uses more than 100,000 machines to run serverless functions. Consequently, provisioning XFaaS to match peak demand would result in very low resource utilization during off-peak hours. We believe this is a major challenge that also affects other large serverless platforms. XFaaS employs a combination of techniques to effectively smooth out function execution, including deferring the execution of delay-tolerant functions to off-peak hours, delaying the execution of functions that have exceeded their quotas, and globally load-balancing function calls across datacenter regions, among other strategies.
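
To illustrate just the deferral part of this, below is a minimal sketch, assuming each delay-tolerant call carries a completion deadline and a periodic drain loop releases deferred calls when utilization is low or a deadline is about to expire. The DeferringQueue name, the utilization threshold, and the interfaces are simplified assumptions, not the actual XFaaS scheduler.

import heapq
import itertools
import time

# Illustrative sketch (not the actual XFaaS scheduler): latency-sensitive calls
# run immediately, while delay-tolerant calls are parked until utilization is
# low or their completion deadline is about to expire.

class DeferringQueue:
    def __init__(self, utilization_threshold=0.7):
        self.utilization_threshold = utilization_threshold
        self._deferred = []                 # min-heap keyed by deadline
        self._tiebreak = itertools.count()  # avoids comparing callables on ties

    def submit(self, func, deadline, current_utilization):
        # deadline is None for latency-sensitive calls, which always run now.
        if deadline is None or current_utilization < self.utilization_threshold:
            func()
        else:
            heapq.heappush(self._deferred, (deadline, next(self._tiebreak), func))

    def drain(self, current_utilization):
        # Called periodically: release deferred work during off-peak hours,
        # or once a call's deadline is imminent.
        now = time.time()
        while self._deferred:
            deadline, _, func = self._deferred[0]
            if current_utilization < self.utilization_threshold or deadline <= now:
                heapq.heappop(self._deferred)
                func()
            else:
                break

For example, submitting a daily-digest function with a deadline one hour out would park it in the heap during a load spike and release it as soon as utilization dips below the threshold, or at the deadline at the latest.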

Event-driven Workloads

In general, our serverless workloads are mostly event driven and seldom handle user-facing interactive requests that demand sub-second response times, such as newsfeed display or search result ranking. Traditionally, at Meta, these interactive user requests are handled by long-running services, as serverless functions do not offer significant advantages in these scenarios.

Serverless functions are typically lauded for two primary benefits: (1) pay-as-you-go and no upfront capacity planning, and (2) streamlined deployment where developers only write code, and the serverless platform handles deployment automatically. However, for user-facing requests requiring sub-second response times, the first benefit loses relevance because meticulous capacity planning is needed to provide guaranteed capacity and ensure delightful experiences for the billions of users of Meta products. This is very different from the scenario of a small product with a limited user base, where occasional user experience degradation is acceptable, and cloud providers are expected to have spare capacity to accommodate load spikes of small products.

Moreover, at Meta, the second benefit can be achieved through full deployment automation without employing serverless functions. Notably, our continuous deployment tool, Conveyor, already deploys 97% of all services without any human intervention, and even serverless functions are deployed through Conveyor. For these reasons, serverless functions at Meta are rarely used to handle user-facing requests requiring sub-second response times.

Conclusion

With the serverless computing paradigm, the responsibility of resource provisioning shifts from users to cloud providers, and the ease of use of FaaS for users may come at the expense of extra hardware costs for cloud providers. This paper reports how we achieve a daily average CPU utilization of 66% in Meta’s serverless platform, XFaaS. If interested, please read the full paper.
