Analyzing CSP outages ...

The transparency Cloud Service Providers (CSPs) show in their root cause analyses (RCAs) of outages is very informative. We can put that information to good use by discussing the kinds of operational issues CSPs tend to face and looking for patterns in them. Such analysis may help us, as cloud service consumers, make better-informed architecture decisions.

Across the issues CSPs are facing, the primary causes seem to be as follows:

  1. Cross-dependency of services - Services built by CSPs have varying degrees of maturity. Some local services depend on global services, which span zones and regions. While that is a legitimate dependency, some of the regional services seem to have an indirect dependency on local services, so that a local failure causes a global outage.
  2. Rerouting of network traffic (sometimes due to misconfiguration during a change) - Most of the regions provided by CSPs are "peered" and connected over the provider's own network backbone. This is a very good feature; however, it seems the network peerings are not designed for seamless failover. The network pipes between regions are not designed for automated failover and do not have the capacity needed to support degraded states. Every time a mass rerouting of network traffic occurs, by design or by accident, services slow down or break completely, causing serial outages.
  3. Manual operational procedures? - Some of the change procedures in the playbook seem to be manual. CSPs often report that the infrastructure was misconfigured while they were making changes to it. In one or two cases I have also seen comments to the effect that an engineer ran a command globally, or against a different set of infrastructure, instead of running it locally. This leaves me with the feeling that parts of the processes, if not all of them, are still manual, leading to human-introduced errors and the resulting outages.
  4. Pace of change in CSP services - Many of these outages seem to occur during changes to core infrastructure services such as network, storage, or compute. CSPs release hundreds of services a year and update thousands. That is a big bundle of complexity, which makes change management a challenge. At this heightened pace of bringing services to market, is it possible that local change management trade-offs are leading to operational failures? This may be more a problem of a rapidly evolving market finding its way into operations.
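To make the cross-dependency pattern in item 1 concrete, here is a minimal sketch of how a consumer might reason about "blast radius" in a dependency graph. The service names and the dependency map are entirely hypothetical and do not describe any CSP's real architecture; the point is only that an indirect dependency on a zonal component can pull down services that look global.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on.
# None of these names correspond to real CSP services.
DEPENDS_ON = {
    "regional-api": ["global-auth", "regional-storage"],
    "global-auth": ["zone-a-keystore"],
    "regional-storage": ["zone-a-disks"],
    "global-console": ["global-auth"],
    "zone-a-keystore": [],
    "zone-a-disks": [],
}


def blast_radius(depends_on, failed):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted graph.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, "zone-a-keystore")))
```

In this toy map, a failure of the zonal keystore reaches both "global" services through the authentication service, which is the kind of indirect dependency described above.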

Here are three examples which may be indicative of these patterns:

Rerouting and Manual Playbook Actions:

Here is the update from GCP on the outage of June 2, 2019: https://cloud.google.com/blog/topics/inside-google-cloud/an-update-on-sundays-service-disruption

Basically, Google is saying that a configuration error on the operations side sent additional traffic to other regions, which led to the service disruption. Google does not explain whether the error came from a manual operation or from a system fault. GCP had a similar issue a few months earlier, when a change in network configuration rerouted traffic and caused congestion and bandwidth problems. This is probably the third time GCP has spoken of errors introduced during change management causing network rerouting and congestion.
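The capacity question behind rerouting can be illustrated with some simple arithmetic. The sketch below is a rough model under stated assumptions: the region names, capacities, and loads are made up, and real traffic engineering depends on topology and routing policy that this ignores. It only checks whether the surviving regions have enough spare headroom to absorb the load of a region that is drained.

```python
# Hypothetical regions: name -> (capacity, current load), same units (e.g. Gbps).
REGIONS = {
    "region-a": (100, 30),
    "region-b": (200, 150),
    "region-c": (100, 60),
}


def can_absorb(regions, failed):
    """Check whether the other regions can absorb the failed region's load.

    Returns (ok, margin): margin is the spare headroom left after rerouting,
    and is negative when the surviving regions would be oversubscribed.
    """
    displaced = regions[failed][1]
    headroom = sum(
        capacity - load
        for name, (capacity, load) in regions.items()
        if name != failed
    )
    margin = headroom - displaced
    return margin >= 0, margin


if __name__ == "__main__":
    for name in REGIONS:
        print(name, can_absorb(REGIONS, name))
```

With these made-up numbers, draining the small regions is survivable, while draining the large one leaves the remaining regions oversubscribed. That is the degraded state in which rerouted traffic causes congestion.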

Here is another GCP outage that may have been caused by a code defect: https://status.cloud.google.com/incident/cloud-networking/18012

Service Dependencies and Rerouting:

For Azure, the outage in September is detailed here https://blogs.msdn.microsoft.com/vsoservice/?p=17485

The outage was caused by storms and other natural causes. In the "why" analysis section, Microsoft establishes that it needs more of what we would call "Availability Zones": basically datacenter-like facilities "near" each other that share the same network to allow local failovers (I know purists will hate the way I put it, but that is the bottom line). These allow services to manage failures "locally". It also means that until Availability Zones are fully baked in, local failures in Azure will have the issues discussed in the blog. If an organization plans to use multi-region DR, that has cost implications. Some of the availability zone features may also come with cost implications.

Another interesting weather-related Azure outage is described here: https://www.theregister.co.uk/2018/06/22/azure_north_europe_downed_by_pleasant_weather/

Service Dependencies and Human Error:

Of the AWS outages, this one from 2017 is really worth a discussion: https://aws.amazon.com/message/41926/

Clearly this outage involved human error, made possible by manual processes. The input entered was erroneous, which led to the removal of subsystems, which in turn caused other services to go down. One thing we need to wonder about is how these hundreds of services depend on each other. While we want to trust that AWS playbooks are very sound and the processes well verified, the domino effect of failures makes you think.
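One common guard rail against this kind of input error is to have the tooling itself reject a removal that is too large or that would drop a subsystem below its minimum capacity. The sketch below is a hypothetical illustration of that idea, not a description of AWS's tooling; the function name, parameters, and thresholds are all assumptions chosen for the example.

```python
def validate_removal(total, to_remove, min_required, max_batch_fraction=0.05):
    """Validate a capacity-removal request before it is executed.

    total              -- servers currently in the subsystem
    to_remove          -- servers the operator asked to remove
    min_required       -- minimum servers the subsystem needs to stay healthy
    max_batch_fraction -- largest share of the fleet one command may remove

    Returns the remaining server count, or raises ValueError if the
    request looks unsafe. All thresholds here are illustrative.
    """
    if to_remove < 0 or to_remove > total:
        raise ValueError("removal count is out of range")
    if to_remove > total * max_batch_fraction:
        raise ValueError("request exceeds the per-command batch limit")
    remaining = total - to_remove
    if remaining < min_required:
        raise ValueError("request would drop below minimum required capacity")
    return remaining


if __name__ == "__main__":
    print(validate_removal(total=100, to_remove=3, min_required=80))
    try:
        # A mistyped input that asks for far more than intended is rejected.
        validate_removal(total=100, to_remove=30, min_required=80)
    except ValueError as err:
        print("rejected:", err)
```

A check like this does not remove the human from the loop, but it turns a single mistyped argument into a rejected command rather than a cascading failure.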

Some of the outages mentioned above also show that a change was under way in the environment, and that the change led to the outage.

With time, some of these patterns will become more evident while newer patterns will emerge. Overall, Cloud Service Providers have built a stack of services for consumers and are working hard to bring more services to market. The question remains: have they built enough controls to manage the complexity and interdependencies of their own services? The issue will become more important as the cloud market keeps expanding at this pace.

NOTE: The issues expressed in this article are the author's personal views and are based on publicly available information. Please accept my apologies in advance if any of the blogs posted by CSPs have been misinterpreted. Also be sure to understand that the number of failures in the cloud space is a fraction of the failures that tend to occur in on-prem data centers. This analysis is purely meant to establish patterns in the operational failures CSPs tend to face and should not be interpreted to mean that CSPs have more failures than on-prem data center facilities.

