Analyzing CSP outages ...
The transparency shown by Cloud Service Providers (CSPs) in their root cause analyses (RCAs) of outages is very informative, but we could put that information to better use by discussing the kinds of operational issues CSPs tend to face and detecting patterns in them. Such analysis may help us, as cloud service consumers, make more educated architecture decisions.
Looking across the issues CSPs are facing, the primary causes of many of them seem to be as follows:
- Cross-dependency of services - Services built by CSPs have varying degrees of maturity. Some local services tend to depend on global services, which span zones and regions. While that is a legitimate dependency, some of the regional and global services seem to have indirect dependencies on local services, causing global outages due to local failures (see the dependency-check sketch after this list).
- Rerouting of network traffic (sometimes due to misconfiguration during a change) - Most regions provided by CSPs are "peered" and connected over the provider's own network backbone. This is a very good feature; however, the network peerings do not appear to be designed for seamless failover. The network pipes between regions are not designed for automated failover and do not have the necessary capacity to support degraded states. Every time mass rerouting of network traffic occurs, by design or by accident, services slow down or break completely, causing serial outages (the capacity headroom sketch after this list illustrates the concern).
- Manual operational procedures? - Some of the change procedures in the playbooks appear to be manual. CSPs, while making changes to their infrastructure, tend to report that something was misconfigured. In one or two cases I have also seen comments to the effect that an engineer ran a command and, instead of running it locally, ran it globally or against a different set of infrastructure. This leaves me with the feeling that parts, if not all, of these processes are still manual, leading to human-introduced errors and the resulting outages.
- Pace of change in CSP services - Many of these outages seem to occur during changes to core infrastructure services like network, storage, or compute. CSPs are releasing hundreds of services a year and updating thousands. That is a big bundle of complexity, which makes change management a challenge. At this heightened pace of bringing services to market, is it possible that local change-management trade-offs are leading to operational failures? This may be more of a rapidly evolving market problem finding its way into operations.
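To make the cross-dependency point concrete, the sketch below shows the kind of check an architecture review could automate: walk a service catalog and flag any regional or global service that depends, directly or transitively, on a more narrowly scoped one. The catalog, service names, and scopes here are made up purely for illustration.

```python
# Minimal sketch: flag wider-scoped services that depend (directly or
# transitively) on narrower-scoped ones. The catalog is hypothetical.

SCOPE_RANK = {"zonal": 0, "regional": 1, "global": 2}

# service -> (scope, direct dependencies)
CATALOG = {
    "global-iam":         ("global",   []),
    "zonal-config-store": ("zonal",    []),
    "regional-lb":        ("regional", ["global-iam", "zonal-config-store"]),
    "global-dns":         ("global",   ["regional-lb"]),
}

def transitive_deps(service, catalog, seen=None):
    """Return every service reachable from `service` via its dependencies."""
    seen = set() if seen is None else seen
    for dep in catalog[service][1]:
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, catalog, seen)
    return seen

def risky_dependencies(catalog):
    """Yield (service, dependency) pairs where a wider-scoped service rests on
    a narrower-scoped one -- the inversion that turns a local failure into a
    regional or global outage."""
    for svc, (scope, _) in catalog.items():
        for dep in transitive_deps(svc, catalog):
            if SCOPE_RANK[catalog[dep][0]] < SCOPE_RANK[scope]:
                yield svc, dep

if __name__ == "__main__":
    for svc, dep in risky_dependencies(CATALOG):
        print(f"WARNING: {svc} ({CATALOG[svc][0]}) depends on {dep} ({CATALOG[dep][0]})")
```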
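The rerouting concern, in turn, reduces to a capacity question: if one inter-region link fails, can the surviving links absorb the rerouted traffic without saturating? A minimal N-1 headroom sketch follows, using made-up link capacities and a deliberately crude even-spread model of rerouting (real rerouting is path-dependent, but the headroom question is the same).

```python
# Minimal sketch of an N-1 headroom check for inter-region links.
# Capacities and utilizations are illustrative numbers, not real CSP data.

# link -> (capacity_gbps, current_traffic_gbps)
LINKS = {
    "us-east<->us-west": (1000, 600),
    "us-east<->eu-west": (800,  500),
    "us-west<->eu-west": (800,  450),
}

def survives_single_failure(links, max_utilization=0.9):
    """Check whether the traffic of any single failed link could be spread
    evenly over the surviving links without pushing any of them past
    `max_utilization` of its capacity."""
    ok = True
    for failed, (_, failed_traffic) in links.items():
        survivors = {k: v for k, v in links.items() if k != failed}
        extra_per_link = failed_traffic / len(survivors)
        for name, (cap, traffic) in survivors.items():
            util = (traffic + extra_per_link) / cap
            if util > max_utilization:
                print(f"{failed} fails -> {name} at {util:.0%} "
                      f"(over the {max_utilization:.0%} ceiling)")
                ok = False
    return ok

if __name__ == "__main__":
    print("Survives any single link failure:", survives_single_failure(LINKS))
```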
Here are three examples which may be indicative of these patterns:
Rerouting and Manual Playbook Actions:
Here is the update from GCP on the outage of 6/2/2019: https://cloud.google.com/blog/topics/inside-google-cloud/an-update-on-sundays-service-disruption
Basically, Googlers are saying that someone in operations misconfigured the services, which pushed additional traffic onto other regions, which in turn led to the service disruption. Google does not explain whether this error was due to a manual operation or a system fault. GCP had a similar issue a few months earlier, where a change in network configuration caused traffic rerouting, congestion, and bandwidth problems. This is probably the third time GCP has spoken of errors introduced during change management causing network rerouting and congestion.
Here is another GCP outage that may have occurred due to a code defect: https://status.cloud.google.com/incident/cloud-networking/18012
Service Dependencies and Rerouting:
For Azure, the September outage is detailed here: https://blogs.msdn.microsoft.com/vsoservice/?p=17485
The outage was caused by storms and other natural causes. In the "why" analysis section, Microsoft establishes that they need more of what we would call "Availability Zones": basically, datacenter-like facilities "near" each other, sharing the same network, to allow for local failovers (I know purists will hate the way I put it, but that is the bottom line). These allow services to manage failures "locally". It also means that without Availability Zones fully baked in, Azure local failures will show the issues discussed in the blog. If an organization plans to use multi-region DR, that has cost implications, and some availability zone features may come with cost implications as well.
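Whatever the cost trade-off, the consumer-side check is easy to express: does each service span at least two zones in every region it runs in? A minimal sketch, assuming a hypothetical map of which zones a service's instances landed in:

```python
# Minimal sketch: verify that each service's instances span at least two
# zones in every region it runs in. The deployment map is hypothetical.

# (service, region) -> zones its instances landed in
DEPLOYMENT = {
    ("checkout", "north-europe"): ["az-1", "az-2", "az-3"],
    ("billing",  "north-europe"): ["az-1", "az-1"],   # all eggs in one zone
}

def single_zone_risks(deployment, min_zones=2):
    """Return (service, region) pairs that would not survive a zone loss."""
    return [(service, region)
            for (service, region), zones in deployment.items()
            if len(set(zones)) < min_zones]

if __name__ == "__main__":
    for service, region in single_zone_risks(DEPLOYMENT):
        print(f"{service} in {region} runs in a single zone -- a zone- or "
              f"datacenter-level event takes it down")
```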
Another interesting, weather-related Azure outage is covered here: https://www.theregister.co.uk/2018/06/22/azure_north_europe_downed_by_pleasant_weather/
Service Dependencies and Human Error:
Of the AWS outages, this one from 2017 is really worth discussing: https://aws.amazon.com/message/41926/
Clearly, this outage involved human error, compounded by manual processes. The input entered was erroneous, which led to the removal of subsystems, which in turn caused other services to go down. One thing we need to wonder about is how these hundreds of services depend on each other. While we want to trust that AWS playbooks are sound and processes are well verified, the domino effect of failures makes you think.
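One mitigation for this class of error is to put a blast-radius guard in front of the command itself, so a typo-sized request gets rejected before it touches anything. A minimal sketch, with hypothetical names and thresholds (this is an illustration, not AWS's actual tooling):

```python
# Hypothetical sketch of a blast-radius guard for a capacity-removal command.
# Names and thresholds are made up for illustration.

class BlastRadiusError(Exception):
    pass

def remove_capacity(fleet_size, requested, *, max_fraction=0.10, min_remaining=3):
    """Validate an operator's request before acting on it:
    - never remove more than `max_fraction` of the fleet in one command
    - never leave fewer than `min_remaining` servers in the fleet
    """
    if requested <= 0:
        raise BlastRadiusError("nothing to remove")
    if requested > fleet_size * max_fraction:
        raise BlastRadiusError(
            f"request removes {requested}/{fleet_size} servers; "
            f"limit is {max_fraction:.0%} per command")
    if fleet_size - requested < min_remaining:
        raise BlastRadiusError(
            f"fleet would drop below its minimum of {min_remaining} servers")
    return fleet_size - requested

if __name__ == "__main__":
    print(remove_capacity(100, 5))      # fine: 95 servers remain
    try:
        remove_capacity(100, 40)        # oversized request is rejected
    except BlastRadiusError as err:
        print("rejected:", err)
```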
Several of the above-mentioned outages also show that a change was being made in the environment when the outage occurred.
With time, some of these patterns will become more evident while newer patterns emerge. Overall, Cloud Service Providers have built a stack of services for consumers and are working hard to bring more services to market. The question remains: have they built enough controls to manage the complexity and inter-dependencies of their own services? The issue will become more important as the cloud market keeps expanding at this pace.
NOTE: The views expressed in this article are the author's personal views and are based on publicly available information. Please accept my apologies in advance if any of the blogs posted by CSPs have been misinterpreted. Also understand that the number of failures in the cloud space is a fraction of the failures that tend to occur in on-prem data centers. This analysis is purely meant to establish patterns in the operational failures CSPs tend to face and should not be interpreted to mean that CSPs have more failures than on-prem data center facilities.