Building Giant-Scale Services: Key Lessons
Nathaniel Payne, PhD (裴内森)
CEO @ CLD | CTO & Managing Partner @ Dygital9 | Managing Partner @ NOQii & Contivos | Associate Partner @ Btriples | Global Connector | Entrepreneur | PhD (AI)
Over the last few weeks, I have spent a lot of time thinking about giant-scale services. Part of this motivation has come from my recent work in the advanced operating systems course housed in Georgia Tech's amazing OMSCS program, led by Dr. Umakishore Ramachandran. Other motivation has come from the exciting projects that I am leading and supporting at Cardinal Path, where we are using large-scale, high-performing cloud infrastructure to handle cutting-edge data science & enterprise analytics projects. To that end, a couple of weeks ago I came across an article written by Dr. Eric Brewer that I found incredibly useful. It is an older paper called "Lessons from Giant-Scale Services" that fit my work perfectly, and it struck me as still being tremendously relevant to organizations both large and small, despite being published in 2001!
For those of you who do not know him, the author, Dr. Brewer, is a well-known expert in giant-scale services who has spent a significant amount of time driving the evolution of Google's internal infrastructure. For more information, you are welcome to read more about his work here. Over his career, Dr. Brewer has contributed extensively to research on the role of infrastructure & the internet. In 2007, for example, Dr. Brewer was inducted as a Fellow of the Association for Computing Machinery "for the design of scalable, reliable internet services." That same year, he was also inducted into the National Academy of Engineering "for the design of highly scalable internet services." He is someone I look up to and for whom I have a tremendous amount of respect. Thus, in an effort to share, I wanted to explore some of the key arguments from his paper and overlay some thoughts as they relate to the benefits, challenges, and measurement metrics involved in building and maintaining giant-scale services.
Benefits of Infrastructure Planning & Research
As many folks in technology management and engineering know, the past few years have seen a continued increase in the complexity and scale of infrastructure services across many domains, along with rapid and dynamic changes in the way these services are delivered. In general, from a high-level perspective, the fundamental benefits of investment in infrastructure and infrastructure planning, for giant-scale and non-giant-scale services alike, can be articulated as follows:
- Infrastructure provides users and applications access to the internet (and relevant infrastructure) anywhere, anytime. It is a cost that isn't always glamorous. That said, as I have heard from many clients over the past few weeks, it is something that is needed to be competitive and to truly allow one's organization to be "driven by the data"
- Infrastructure allows customers to access your business from a variety of places and devices, delivering a truly multi-channel experience
- Robust infrastructure enables the deployment of powerful groupware which connects users and applications
- Infrastructure research & effective implementation can allow a company to significantly reduce its costs & compete more effectively against competitors
- Infrastructure enables organizations to connect data sources from across their organization in order to enable decision making
- Infrastructure requires organizations to plan for growth and to develop management and evolution plans
Trade-offs & Competing Factors
While the case for infrastructure research and investment is very strong from a benefit perspective, the fact of the matter is that operating infrastructure at "giant scale" engenders a lot of trade-offs. To be clear, for the sake of this post, let us consider giant-scale services to be services that require high scale, availability, and evolution. This includes the infrastructure powering websites such as Google, Facebook, YouTube, Baidu, Amazon, Live, and many others. While these are some of the largest sites by traffic, they have many things in common with smaller sites that need "giant-scale capabilities", including sites that share content, have many users, and need to scale unpredictably. Each of these services requires an organization to wrestle with the trade-offs, or competing factors, that impact the design of system infrastructure, including:
- Cost and performance trade-offs
- Component trade-offs (coupling vs cohesion)
- Scalability trade-offs, including the ability to scale incrementally (which preserves and augments an organization's existing infrastructure investments)
In the next few sections, I will seek to explore, like Dr. Brewer did, the underlying challenges which impact these trade-offs in a giant-scale environment. These include load management, high availability, and growth planning.
Challenge 1: Load Management
Without a doubt, load management is one of the central problems that developers of giant-scale architectures face. Within this type of distributed environment, the load manager, as Dr. Brewer notes, has three responsibilities (a minimal sketch of the third follows the list):
- Provide the external name: the external name can be a domain name or a set of IP addresses
- Load balance the traffic: The goal here is to achieve higher overall utilization & better average response time.
- Isolate faults from clients: This is the hard part, as it requires the manager to detect faults and dynamically change the routing of traffic to avoid down nodes.
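To make the fault-isolation responsibility more concrete, below is a minimal Python sketch of a round-robin load manager that skips nodes marked as down. The node names and the mark_down/mark_up calls are hypothetical illustrations of the idea only, not Dr. Brewer's design or any production load balancer.

```python
import itertools

class LoadManager:
    """Toy round-robin load manager that routes around nodes marked as down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)          # the external name maps to this node set
        self.healthy = set(self.nodes)    # nodes currently believed to be up
        self._rr = itertools.cycle(self.nodes)

    def mark_down(self, node):
        # In practice this would be driven by health checks or heartbeats.
        self.healthy.discard(node)

    def mark_up(self, node):
        self.healthy.add(node)

    def route(self):
        # Skip unhealthy nodes; this is the "isolate faults from clients" step.
        for _ in range(len(self.nodes)):
            node = next(self._rr)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")

# Example: traffic keeps flowing after node2 fails.
lm = LoadManager(["node1", "node2", "node3"])
lm.mark_down("node2")
print([lm.route() for _ in range(4)])   # node2 never appears
```

The faster the mark_down step happens after a real fault, the shorter the window in which clients can hit a dead node, which is exactly what the first metric below captures.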
The most fundamental issue in load management is whether or not the manager understands the distribution of data across the nodes. To that point, there are three approaches to load management: symmetric, asymmetric, or symmetric with affinity (see his article for a deeper dive). Importantly, while all three are used in practice, the choice often depends upon the state of four key properties that can be evaluated when analyzing a current or future system:
- Database aggregation: Does adding nodes increase the database size, or does it simply add replicas for throughput?
- Single-query throughput: Does the throughput of a single query scale with the size of the cluster?
- Query locality: Do nodes in the system receive a subset of the queries, and thus have a smaller working set and better cache performance?
- Even utilization: How effective is the load balancing? As Dr. Brewer notes, pure symmetric systems are much easier to maintain from an even-load perspective.
Now that we have a better understanding of load management, the next issue is how we measure it. In his paper, Dr. Brewer outlines two key availability metrics relating to load management that any organization can use (a small worked example follows the list):
- Fail-over response time: The time it takes the load manager to detect and avoid a faulty node (or link to that node). Reducing this reaction time thus directly contributes to the effective up-time of the service.
- Load-manager availability: Refers to the up-time of the load manager itself. A down load manager may directly result in a down service (if all traffic goes through it), or it may simply increase the failover response time and the variation in node utilization.
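As a small worked example of these two metrics, the sketch below uses made-up monitoring numbers (none of them from the paper) to compute a fail-over response time from fault and reroute timestamps, and a load-manager availability over a one-day window.

```python
# Hypothetical monitoring events (seconds since the start of the window).
fault_occurred_at   = 120.0   # node fault occurred
traffic_rerouted_at = 121.5   # load manager stopped sending traffic to that node

# Fail-over response time: how long clients could still reach the faulty node.
failover_response_time = traffic_rerouted_at - fault_occurred_at

# Load-manager availability: fraction of the window the manager itself was up.
window_seconds           = 24 * 3600
manager_downtime_seconds = 90
load_manager_availability = 1 - manager_downtime_seconds / window_seconds

print(f"fail-over response time:   {failover_response_time:.1f} s")
print(f"load-manager availability: {load_manager_availability:.5f}")
```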
Challenge 2: High Availability
High Availability is an issue that is central to the development of giant-scale services. In fact, as Dr. Brewer notes, "High availability is one of the major driving forces of giant-scale system design. Other infrastructures—such as the telephone, rail, water and electricity systems—have extreme availability goals that should apply to IP-based infrastructure services as well. Most of these systems plan for failure of components and for natural disasters. However, information systems must also deal with constant rapid evolution in feature set (often at great risk) and rapid and somewhat unpredictable growth." These systems must also deal with problems including the presence of people (Dr. Brewer in fact notes that a "prerequisite to low failure rates in general is the removal of people from the machine room!"), the presence of cables, and challenges from their operating environment (including temperature).
As we all know in technology, availability is wonderful. That said, it must be measured to be controlled. In my own work, as in the paper, the traditional metric for availability is up-time: the fraction of time that the site is up. Two related metrics are mean time between failures (MTBF) and mean time to repair (MTTR), since up-time = (MTBF - MTTR) / MTBF. Another key factor is yield, which can be defined as the fraction of queries that are completed: yield = queries completed / queries offered. Harvest is another key measure: harvest = data available / complete data. Dr. Brewer argues that a perfect system would have 100% yield and 100% harvest, where every query would complete and would reflect the entire database. Unfortunately, as he points out, replicated systems tend to map faults to a reduction in capacity (and thus yield at high utilization). Partitioned systems, on the other hand, tend to map faults to a reduction in harvest, as parts of the database temporarily disappear while the capacity in queries/sec remains the same.
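To make these definitions concrete, here is a small worked sketch using made-up numbers. The up-time formula in terms of MTBF and MTTR follows the paper; everything else is simply the definitions above applied to assumed values.

```python
# Made-up numbers purely to illustrate the definitions above.
mtbf_hours = 500.0    # mean time between failures
mttr_hours = 0.5      # mean time to repair

# Up-time expressed in terms of MTBF and MTTR.
uptime = (mtbf_hours - mttr_hours) / mtbf_hours

queries_offered   = 1_000_000
queries_completed =   998_500
yield_ = queries_completed / queries_offered      # fraction of queries answered

data_available = 95.0    # e.g., 95 of 100 partitions reachable
complete_data  = 100.0
harvest = data_available / complete_data          # fraction of the data reflected

print(f"uptime  = {uptime:.4f}")
print(f"yield   = {yield_:.4f}")
print(f"harvest = {harvest:.2f}")
```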
One principle that Dr. Brewer offers as a support tool for measuring and planning around high availability (which I would argue is the best contribution of the paper) is the DQ Principle. It is a principle that any organization today can use, regardless of its size or scale. It is also one of Dr. Brewer's key contributions to my current work (ignoring, of course, his more fundamental contribution relating to the CAP theorem). The DQ principle can be simply stated as:
DQ value = data per query × queries/sec ≈ constant
The DQ value is the total amount of data that has to be moved per second, on average, and it is thus bounded by an underlying physical limitation; in fact, at the high utilization typical of giant-scale systems, it approaches that limitation. The key advantage of this type of measurement is that the DQ value is both measurable and tunable. For example, adding nodes to an online web cluster or rolling out new software can be evaluated by whether it increases or decreases the DQ value (and at what cost), just as the impact of faults can. In general, the absolute DQ value is not that important to most organizations; rather, the relative value under various changes in conditions provides a useful guide. To that end, although without strong empirical support, Dr. Brewer notes that the best possible result under multiple faults is a linear reduction in DQ. He also notes that DQ often scales linearly with the number of nodes, which is useful because it means that early tests on single nodes tend to have predictive power for overall cluster performance. Finally, he notes that all proposed hardware and software changes can be evaluated by their DQ impact, which is critically useful for capacity planning.
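As a rough illustration of the relative-value point, the sketch below models losing k of n nodes under the two designs discussed above. The linear-scaling assumption and all of the numbers are mine, not measurements from the paper: in both cases DQ drops by roughly the fraction of nodes lost, but the replicated design gives up capacity (yield) while the partitioned design gives up harvest.

```python
def dq_after_faults(nodes, failed, data_per_query, queries_per_sec, replicated=True):
    """Rough model of DQ, harvest, and capacity after `failed` of `nodes` fail.

    Assumes DQ scales linearly with node count, as the paper suggests it often does.
    """
    surviving = (nodes - failed) / nodes
    dq = data_per_query * queries_per_sec * surviving   # linear DQ reduction
    if replicated:
        # Replicas: the full data set is still available, but capacity drops.
        return {"dq": dq, "harvest": 1.0, "capacity": surviving}
    else:
        # Partitions: capacity is preserved, but part of the database is missing.
        return {"dq": dq, "harvest": surviving, "capacity": 1.0}

# Losing 2 of 10 nodes costs ~20% of DQ either way; the difference is where it hurts.
print(dq_after_faults(nodes=10, failed=2, data_per_query=1.0,
                      queries_per_sec=10_000, replicated=True))
print(dq_after_faults(nodes=10, failed=2, data_per_query=1.0,
                      queries_per_sec=10_000, replicated=False))
```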
Challenge 3: Growth And Capacity Planning
The third challenge, without a doubt, is more of an operational and strategic challenge, and one that I spoke about at a recent meetup of the Vancouver Enterprise Cloud Computing Users Group. It is one of the most difficult fundamental challenges because of the operational impact that managing infrastructure can bring. One of the traditional tenets of highly available systems is to aim for minimal change. Unfortunately, for giant-scale services, reducing change is often directly in conflict with both the growth rates of these services and "Internet time", the practice of extremely fast service release cycles. Organizations currently using or thinking about developing giant-scale services must therefore plan for continuous growth and frequent updates in functionality. The challenge is that frequent updates mean the software is never perfect, and hard-to-resolve issues such as slow memory leaks and non-deterministic bugs tend to remain unfixed. Thus, the main task becomes trying to "maintain high availability in the presence of expansion and frequent software changes."
One of the central parts of growth and capacity planning that Dr. Brewer discusses is lead-time planning. This remains, and will remain, a challenge, particularly as new technologies continue to be released. Lead-time planning can be broken into different parts, including the time required to make decisions on infrastructure, development time and costs, and overhead planning. While there are many approaches to lead-time planning, the key takeaway from the paper, in my view, is Dr. Brewer's argument that the impact of these potential changes can be measured using the DQ value. In particular, when deciding on a course of action, he argues that the potential impact on a system can be compared with a target DQ value. Even if the value is only relative, this enables an organization to do quality capacity planning, helps it understand whether proposed actions resolve bottlenecks, and takes into account the "excess capacity", or headroom, needed during development.
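To show how a relative DQ target might feed into lead-time and capacity planning, here is a simple back-of-the-envelope sketch. The growth factor, headroom, and fault-tolerance numbers are assumptions for illustration, not figures from the paper; the only idea borrowed from Dr. Brewer is that DQ tends to scale linearly with node count.

```python
import math

def nodes_needed(current_nodes, dq_per_node, growth_factor,
                 tolerated_faults, headroom=0.2):
    """Estimate the node count so that, even with `tolerated_faults` nodes down,
    the cluster still meets the grown DQ target plus headroom.

    Assumes DQ scales linearly with the number of nodes.
    """
    current_dq = current_nodes * dq_per_node
    target_dq = current_dq * growth_factor * (1 + headroom)
    # Require (n - tolerated_faults) * dq_per_node >= target_dq.
    return math.ceil(target_dq / dq_per_node) + tolerated_faults

# Example: 20 nodes today, traffic expected to grow 1.5x before the next
# hardware lead time elapses, and we want to ride out 2 simultaneous faults.
print(nodes_needed(current_nodes=20, dq_per_node=1.0,
                   growth_factor=1.5, tolerated_faults=2))
```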
Final Thoughts
Without a doubt, the evolution of giant-scale services will continue at an increasing pace. Engineering and technology management teams face new challenges every day, including challenges relating to new types of data (such as streaming data), new technology, product changes, changing client & stakeholder requirements, and environmental constraints. Fundamentally though, using a metric like the DQ value that Dr. Brewer suggests, combined with other best-practice metrics, can help any organization deal more effectively with load management, availability, and growth. It can also enable your organization to use infrastructure to solve complex applied problems that can drastically increase your ROI. Finally, it can help provide your organization with a significant competitive advantage.