If Anything Can Go Wrong, It Will!
Dr. Tilman Buchner
Global Leader Innovation Center for Operations | Partner & Director at Boston Consulting Group
The major internet outage on June 8 took down sites like Hulu, CNN, Twitch, Reddit, Spotify, and Vimeo, among many others. The cause was a massive outage at Fastly, a popular content delivery network (CDN) service. The incident showed how important it is to have proper strategies in place to minimize the blast radius of failures.
What makes cloud services so difficult to operate?
We live in a highly distributed world, and our cloud-powered communication backbones rely on "three plus one" physical laws:
First, there is the speed of light: 186 miles per millisecond. No information on earth can be transported faster than the speed of light - that is a law of nature. It means that sending a request from Berlin to a cloud endpoint in San Francisco and back takes approximately 94 ms over fiber. A car driving 50 km/h covers about 1.3 m in that time. This effect is called latency, and it needs to be taken into account when building complex distributed infrastructure systems. A self-driving car, for example, cannot wait for an answer from the cloud. This is the reason why #EdgeComputing was invented: to shift computational power and storage into the "edges" of the network.
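A rough back-of-the-envelope sketch of that calculation in Python (the great-circle distance and the fiber propagation factor are assumptions):

```python
# Back-of-the-envelope latency estimate, assuming a great-circle distance of
# roughly 9,100 km between Berlin and San Francisco and signal propagation in
# optical fiber at about two thirds of the speed of light in vacuum.

SPEED_OF_LIGHT_KM_S = 299_792   # km/s in vacuum
FIBER_FACTOR = 0.66             # glass slows the signal down
DISTANCE_KM = 9_100             # Berlin -> San Francisco (approximate)

round_trip_s = (2 * DISTANCE_KM) / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
print(f"Round-trip latency: {round_trip_s * 1000:.0f} ms")   # roughly 90-95 ms

car_speed_m_s = 50 / 3.6        # 50 km/h in m/s
print(f"Car travels {car_speed_m_s * round_trip_s:.2f} m in that time")  # ~1.3 m
```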
Cloud service providers (CSPs) like AWS, Microsoft, or Google have developed a tiered range of cloud services to address these needs, extending from high-performance data centers and on-premises solutions (e.g., AWS Outposts, MS Azure Stack) to offerings for 5G infrastructure (AWS Wavelength) or for limited connectivity, such as on vessels or oil platforms (e.g., AWS Snowcone, Azure Data Box). To power IoT applications with low latency, edge operating systems are used to connect mechatronic devices and smart sensors to the cloud. Even microcomputers can be connected to the cloud by using FreeRTOS.
In today's world, data is highly distributed across a wide range of services and locations, while there is an increasing demand to mesh different data sources in near real time to generate meaningful insights on the fly.
This leads to the second challenge, the CAP theorem, first postulated by Eric Brewer in 1998 and proven in 2002 by Seth Gilbert and Nancy Lynch at MIT. When designing distributed cloud services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. But it is impossible to guarantee all three at the same time.
A practical example is the financial sector, where consistency is particularly important because it must always be ensured that deposited or withdrawn sums of money are actually booked. The system must also keep that consistency in the face of failures, so partition tolerance is required as well. Availability is less important in this case: if a system fails, it can be temporarily unavailable.
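A minimal toy sketch of this trade-off, assuming a hypothetical two-replica counter (the class, modes, and method names are illustrative, not a real database API): during a partition, a "CP" configuration sacrifices availability, while an "AP" configuration stays available but lets the replicas diverge.

```python
# Toy illustration of the CAP trade-off during a network partition.
# "CP" mode refuses writes when replicas cannot talk to each other
# (consistent but unavailable); "AP" mode keeps accepting writes and
# lets the replicas temporarily disagree (available but inconsistent).

class ReplicatedCounter:
    def __init__(self, mode):
        self.mode = mode                  # "CP" or "AP"
        self.replicas = {"eu": 0, "us": 0}
        self.partitioned = False

    def deposit(self, replica, amount):
        if self.partitioned and self.mode == "CP":
            return False                  # sacrifice availability, stay consistent
        self.replicas[replica] += amount  # in AP mode the replicas may diverge
        return True

cp = ReplicatedCounter("CP")
cp.partitioned = True
print(cp.deposit("eu", 100))   # False: the booking is rejected, consistency holds

ap = ReplicatedCounter("AP")
ap.partitioned = True
print(ap.deposit("eu", 100))   # True: accepted, but the "us" replica no longer agrees
print(ap.replicas)             # {'eu': 100, 'us': 0}
```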
Every cloud service provider (CSP), ranging from large-scale hyperscalers like AWS, Microsoft, or Google to small virtual private server (VPS) hosts like DigitalOcean or GoDaddy, needs to deal with these physical constraints when operating its distributed infrastructure.
How can things fail?
In general, you can distinguish physical and non-physical failures. Physical failures include failing disks, failing network devices, or damage outside the data center. If fiber-optic cables are damaged during construction work, this can quickly lead to outages. Even the weather needs to be taken into account when operating cloud services. Fires (such as the one at OVHcloud's SBG2 data center in Strasbourg on March 10th), earthquakes such as the one in Japan in 2011, or electrical storms can affect data centers or the public power grid and thus lead to larger outages.
In addition to physical causes, there are also non-physical ones. When multiple systems flood the bandwidth of a cloud service provider, we speak of a distributed denial-of-service (DDoS) attack; it is practically impossible to block the attackers without also cutting off legitimate traffic. Besides dedicated cyber attacks like DDoS or malware, software deployment and reconfiguration processes can also introduce problems. Last but not least, there are bugs: a single problematic request can trigger a bug in the system that, in the worst case, leads to the collapse of the entire system.
In addition to the known and obvious threats, the big CSPs prepare themselves for so-called black swan events. A black swan is a metaphor for an event that, first, lies outside the realm of regular expectations, because nothing in the past convincingly points to its possibility. Second, it carries an extreme impact. Third, in spite of its outlier status, human nature makes us concoct explanations for its occurrence after the fact, making it appear explainable and predictable. (source: Wikipedia)
What measures do cloud service providers (CSPs) take to contain failures?
Regional isolation: Cloud service providers strive for a so-called "shared-nothing architecture" across their regional data centers. Each region is separate and provides a distinct stack of cloud services with its own set of endpoints for API requests. If you want to interact with the AWS US West region in Oregon, you use the endpoint ec2.us-west-2.amazonaws.com. If you prefer the region in Frankfurt for latency or data residency reasons, you need to use the ec2.eu-central-1.amazonaws.com endpoint. There is no single global EC2 endpoint. The regions are isolated and don't know about each other. The design principle of regional isolation results in a single-region blast radius - but this would still represent a huge impact for customers operating in just one region.
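A small illustration of this with boto3, the AWS SDK for Python: every client is bound to exactly one regional endpoint, and constructing a client only resolves that endpoint (no credentials or API calls are needed for this sketch).

```python
# Each boto3 client targets exactly one regional endpoint; there is no
# global EC2 endpoint to call. Requires `pip install boto3`.
import boto3

for region in ("us-west-2", "eu-central-1"):
    ec2 = boto3.client("ec2", region_name=region)
    print(region, "->", ec2.meta.endpoint_url)

# us-west-2    -> https://ec2.us-west-2.amazonaws.com
# eu-central-1 -> https://ec2.eu-central-1.amazonaws.com
```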
Multi-availability zones (AZs): Each region sits in a physical location (e.g., Dublin, Frankfurt, etc.) and is composed of multiple data centers that are spread across the region's metropolitan area. These different locations are called availability zones (AZs), and they are cross-connected with high-speed private fiber cable. They are far enough apart from each other that there is a very low possibility of correlated failures, while being geographically close enough for low latency. AWS operates 25 active regions and 80 availability zones, and each AZ is fitted with one or more data centers. This design reduces the blast radius because the possibility of a correlated failure across the entire region is lowered when a multi-AZ architecture is used for an application. An elastic load balancer balances the traffic across the zones. This is a very powerful, fault-tolerant design.
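A minimal sketch of the failover idea, with made-up zone names and a trivial round-robin standing in for a real elastic load balancer:

```python
# When one availability zone fails, traffic is simply spread across the
# remaining healthy zones. The zone names and health flags are illustrative.
from itertools import cycle

health = {"eu-central-1a": True, "eu-central-1b": True, "eu-central-1c": True}

def route(requests):
    healthy = cycle([az for az, ok in health.items() if ok])
    return [next(healthy) for _ in range(requests)]

print(route(4))                    # traffic spread across all three AZs
health["eu-central-1b"] = False    # one data center goes dark
print(route(4))                    # the remaining zones absorb the load
```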
What are strategies to safeguard the workloads within an AZ?
Cell-based architecture: Applications running in AZs can be split into three building blocks: load balancers, compute services, and storage services. Hyperscalers create multiple instantiations of this stack. The stacks are fully isolated and don't know about each other. Each one of these stacks is called a cell. Load balancers distribute the workload across the cells. The cells are an internal structure that is invisible to the customer but provides resilience and fault tolerance.
As an example, imagine eight customers sending requests to one server. One customer introduces a bad workload, triggering a bug in the system and causing the server to crash. The failure affects all eight users. Introducing a cell-based architecture with four cells allows the impact to be reduced from 100% to just 25%. But even this number can be improved significantly by using shuffle sharding.
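A short sketch of that arithmetic, with a deterministic toy assignment of customers to cells (the customer names and the modulo assignment are illustrative):

```python
# Blast-radius arithmetic behind cell-based isolation: eight customers on one
# shared stack vs. the same customers spread across four isolated cells.
# A "poison" request only takes down the cell that processed it.
customers = [f"customer-{i}" for i in range(8)]

# Single stack: everyone shares the fate of the one crashed server.
print("single stack blast radius:", len(customers) / len(customers))   # 1.0 -> 100%

# Four cells, two customers per cell, assigned deterministically.
cells = 4
assignment = {c: i % cells for i, c in enumerate(customers)}
poisoned_cell = assignment["customer-0"]          # customer-0 sends the bad request
affected = [c for c, cell in assignment.items() if cell == poisoned_cell]
print("cell-based blast radius:", len(affected) / len(customers))      # 0.25 -> 25%
```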
Shuffle sharding: The idea is to assign each customer to two nodes, effectively at random. Now the expected blast radius is the number of customers divided by the number of possible two-node combinations out of the eight nodes, because only customers that happen to share exactly the same pair are fully affected.
The mathematical reason lies in the binomial coefficient: C(n, k) = n! / (k! · (n − k)!). If n equals the number of nodes (8) and each customer gets k = 2 nodes to serve its workload, this results in 28 combinations. In a nutshell, shuffle sharding reduces the blast radius per customer in this case to 1/28, i.e. roughly 3.6%! If you increase the number of nodes or the shard size, the blast radius can be reduced even further.
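A small simulation of this effect, assuming shards are drawn uniformly at random (the customer names and shard assignment are illustrative):

```python
# Shuffle sharding sketch: each customer gets a random pair of nodes out of
# eight. A bad workload takes down only that customer's pair, and the only
# other customers that lose both of their nodes are those assigned exactly
# the same pair - on average 1 / C(8, 2) = 1/28, about 3.6%.
import random
from math import comb

nodes = list(range(8))
print("possible shards:", comb(8, 2))             # 28

random.seed(42)
customers = {f"c{i}": frozenset(random.sample(nodes, 2)) for i in range(10_000)}

bad = customers["c0"]                              # c0 poisons its two nodes
fully_affected = [c for c, shard in customers.items() if shard == bad]
print("fully affected share:", len(fully_affected) / len(customers))  # ~0.036
```

Customers that share only one node with the poisoned pair lose part of their capacity but keep serving traffic from their second node.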
Werner Vogels (CTO, AWS): "We never want to touch more than one zone at a time."
Operational excellence: In addition to these architectural safeguards, operational excellence can make a significant contribution to a system's resilience against failure. All professional CSPs have set up processes to make changes to the system as safe as possible. Software deployments are staggered across zones and regions over time, so that new features keep rolling out without causing widespread problems.
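A minimal sketch of such a staggered rollout, with hypothetical deploy and health-check helpers and made-up zone names (not any provider's actual tooling):

```python
# Staggered rollout sketch: push the change to one availability zone at a
# time and stop as soon as a zone looks unhealthy, so a bad deployment never
# touches more than one zone's worth of customers.
ROLLOUT_ORDER = [
    "eu-central-1a",   # canary zone first
    "eu-central-1b",
    "eu-central-1c",   # region complete
    "us-west-2a",      # only then move on to another region
]

def deploy(zone):
    print(f"deploying new version to {zone}")

def zone_healthy(zone):
    return True        # placeholder for real monitoring and alarms

for zone in ROLLOUT_ORDER:
    deploy(zone)
    if not zone_healthy(zone):
        raise RuntimeError(f"rollout halted at {zone}: blast radius contained")
```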
Strengthen your resilience
Five key questions that all IT/OT managers should ask themselves:
- Which design pattern can be used to build resilience into the IT/OT system architecture?
- What is the degree of impact if things go wrong (think in scenarios)?
- How can you determine which customers (workloads, functionalities) are affected?
- How can the failure be contained and the impact kept as small as possible?
- In retrospect, how could you halve the blast radius for similar events?
For more detailed information on how to minimize the blast radius of failures, watch the re:Invent talk by Peter Vosshall, VP & Distinguished Engineer, AWS.