Global Capacity Management — Part (3) of stories behind Meta/Facebook’s systems research papers

This post is part 3 of the series that introduces Meta’s systems research papers to a broader audience. To make these posts more engaging for the general audience, I will sprinkle in some behind-the-scenes stories about these papers and their corresponding production systems.

This post is about the following paper: [OSDI’23] Global Capacity Management With Flux

Stories behind the paper

This paper was initially rejected by OSDI'22 and subsequently accepted by OSDI'23 after significant revisions. I will use this experience to explain both the challenges and opportunities of writing a research paper from the industry.

Although hands-on experience with hardware capacity management is rare in both academia and industry, and little research has been published on this topic, nearly every reviewer from OSDI'22 and OSDI'23 recognized the significance of the capacity management problem presented in the Flux paper. However, the initial submission to OSDI'22 was rejected, with one main piece of feedback: while the submission described how the system works, it lacked sufficient insights, justification of design decisions, and comparisons with alternatives. As a result, it did not align well with the characteristics of a "research" paper. This is a common pitfall for papers originating from industry.

Another common critique of industry papers, not specific to the Flux paper, is the absence of a quantitative comparison against alternative solutions, including those documented in prior research publications. On one hand, this issue can be partially addressed by a self-comparison of the industry solution: toggle a feature on and off to assess its significance. On the other hand, I would like to ask academic reviewers to understand that it is often infeasible to introduce alternative solutions into the codebase of a production system solely for performance or functional comparison, because building such a solution in a complex production environment takes extensive time. Moreover, while related open-source solutions may be available, it is typically impractical to run them at large scale in a complex industry environment for direct comparison.

Overall, there might be two conflicting perceptions about writing research papers from the industry, and I think neither perception is fully accurate. At one extreme, some people might perceive that it is much harder to publish high-quality research papers from the industry because the paper selection criteria are primarily set by academic folks, and those criteria do not align well with the reality in the industry, resulting in industry papers being treated harshly. At the other extreme, some people might perceive that it is much easier to publish research papers from the industry simply because of the real systems and production data associated with industry papers.

After experiencing many acceptances and rejections of industry papers myself, I think both perceptions have some truth to them. Overall, systems conferences such as SOSP, OSDI, ASPLOS, and ISCA maintain a healthy balance between papers from academia and industry. Nevertheless, I still wish that academic reviewers had a little more understanding of, and empathy for, the constraints of writing research papers from the industry, as explained above.

Specifically for the Flux paper, after the rejection by OSDI’22, we significantly revised the paper and waited an entire year for resubmission to OSDI instead of rushing to publish it in an easier conference. Again, I deeply appreciate the fact that systems conferences overall maintain a healthy balance between papers from academia and industry.

Paper Abstract

The full paper is available here.

Abstract: Customers of both private and public cloud providers must wrestle with the problem of regionalization: how should service capacity be apportioned across a large number of geo-distributed datacenter regions? This problem is further complicated by the complex service dependency graphs that arise from microservice architectures, as well as capacity availability and hardware mix that can vary greatly by region.

Historically, regionalization has been solved through a slow-moving and manual process, whereby owners of large services directly negotiate capacity allocation and distribution with the cloud provider. However, as both service and cloud footprints continue to grow, these manual processes are becoming untenable, and tend to produce both a great amount of toil for everyone involved, as well as suboptimal results.

At Meta we have built a system, Flux, to automate capacity regionalization, moving it from a bottoms-up, manual process, to a top-down, automated one. Flux employs RPC tracing to identify service capacity models, and uses these to compute an optimal joint capacity and traffic distribution plan that spans 1000s of services across 10s of products, and involves millions of servers. These plans are orchestrated by a system that safely and efficiently rebalances service capacity and product traffic across 10s of regions on a continuous basis.
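To make the regionalization question in the abstract a bit more concrete, here is a toy sketch of my own. It is not Flux's algorithm: the paper describes a joint optimization over thousands of services, while this only splits one service's traffic across regions in proportion to each region's free servers. The function name `regionalize`, the `servers_per_unit` capacity model, and the region names are all hypothetical and chosen purely for illustration.

```python
from typing import Dict


def regionalize(
    total_traffic: float,
    servers_per_unit: float,
    free_servers: Dict[str, int],
) -> Dict[str, Dict[str, float]]:
    """Toy heuristic: split traffic across regions in proportion to free servers.

    total_traffic    -- total demand for one service (arbitrary units)
    servers_per_unit -- hypothetical linear capacity model: servers per unit of traffic
    free_servers     -- servers available to this service in each region
    """
    needed = total_traffic * servers_per_unit
    available = sum(free_servers.values())
    if available <= 0 or needed > available:
        raise ValueError("not enough capacity across regions")

    plan: Dict[str, Dict[str, float]] = {}
    for region, free in free_servers.items():
        share = free / available  # fraction of traffic sent to this region
        plan[region] = {
            "traffic": total_traffic * share,
            "servers": needed * share,  # never exceeds `free`, since needed <= available
        }
    return plan


if __name__ == "__main__":
    example = regionalize(1000.0, 0.5, {"region_a": 400, "region_b": 100})
    for region, alloc in example.items():
        print(region, alloc)
```

Even this toy version shows why capacity and traffic have to be planned together: each region's traffic share is determined by how much capacity it has, and the servers assigned there follow from that share. A real planner also has to account for service dependencies and hardware mix, which this sketch ignores.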

Cantürk İşçi

Eng Leader | Systems/Architecture Researcher | Builder | Defensive Midfielder

1 year ago

I am loving this systems research series. Besides the great research work that they overview, these are a great primer on what it takes to publish top tier industry papers.

Mahesh Balakrishnan

Distributed Systems Researcher

1 year ago

This is insightful commentary! Something I see happen in response to harsh feedback from SOSP/OSDI is that people often submit instead to special industry tracks in other conferences, which have a lower / different bar. This seems superficially better (both industry and academics seem to prefer it) but I think it does a disservice to the authors and the audience, since often that extra round of rewriting and reframing can result in fantastic first-class research results from industry.

Wow, you people are on a roll!
