AWS's static stability and recent outage

On January 22nd 2020, between 4:07 PM and 11:20 PM PST, you could not create new resources in a VPC in the AWS Sydney region. In the wake of this outage, AWS has conducted a deep analysis and will be changing its processes to keep this from happening again. Any process change that prevents a similar outage is welcome, but I believe there is a more fundamental issue that should be addressed.

AWS splits its services into a control plane and a data plane. The data plane is responsible for making sure existing resources keep functioning during a system failure, whereas provisioning new resources and applying configuration changes are handled by the control plane. In that light, the data plane is far simpler and more robust, with fewer moving parts than the control plane. Here is an excerpt on this from one of the architecture articles in AWS's Builders' Library.

“The Amazon EC2 data plane has been carefully designed to be statically stable in the face of control plane availability events (such as impairments in the ability to launch EC2 instances). For example, to avoid disruptions in network connectivity, the Amazon EC2 data plane is designed so that the physical machine on which an EC2 instance runs has local access to all of the information it needs to route packets to points inside and outside of its VPC. An impairment of the Amazon EC2 control plane means that during the event the physical server might not see updates like a new EC2 instance added to a VPC, or a new Security Group rule. However, the traffic it had been able to send and receive before the event will continue to work.”

As you can see, the purpose of the data plane is to make sure already provisioned resources keep working in the event of a control plane failure. This works well for static resources like EC2 instances, but not so well for elastic resources, and that is exactly what went wrong during this recent outage. For example, a Lambda function that needs to interact with VPC resources requires a new ENI to be created and attached to it. That means provisioning a new resource, which is handled by the control plane. Unlike the data plane, where the information required to perform an action is stored locally, the control plane relies on a central database to provision new resources. The unavailability of this central database is what caused the control plane to fail and hence led to this outage.
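To make the distinction concrete, here is a minimal toy sketch in Python. It is not how AWS implements any of this; the class names (`CentralDatabase`, `ControlPlane`, `Host`) and the `simulate` function are hypothetical and only model the behaviour described above: existing routes live on the host, while creating a new ENI needs the central database.

```python
class CentralDatabaseUnavailable(Exception):
    """Raised when the (hypothetical) central database cannot be reached."""


class CentralDatabase:
    """Toy stand-in for the central store the control plane depends on."""

    def __init__(self):
        self.available = True
        self.records = {}

    def write(self, key, value):
        if not self.available:
            raise CentralDatabaseUnavailable("central database is down")
        self.records[key] = value


class Host:
    """Toy data plane: routes using only state already stored on the host."""

    def __init__(self, name):
        self.name = name
        self.local_routes = set()

    def can_route(self, eni_id):
        # No call to the control plane or the database is needed here.
        return eni_id in self.local_routes


class ControlPlane:
    """Toy control plane: every provisioning request hits the database."""

    def __init__(self, database):
        self.database = database

    def create_eni(self, eni_id, host):
        # Fails if the database is down, so nothing new gets provisioned.
        self.database.write(eni_id, host.name)
        # Only after a successful write is the state pushed to the host.
        host.local_routes.add(eni_id)


def simulate():
    """Return (existing ENI still routable, new ENI could be created)."""
    database = CentralDatabase()
    control_plane = ControlPlane(database)
    host = Host("host-a")

    # Provision one ENI while everything is healthy.
    control_plane.create_eni("eni-existing", host)

    # Simulate the outage: the central database becomes unavailable.
    database.available = False

    existing_ok = host.can_route("eni-existing")
    try:
        control_plane.create_eni("eni-for-lambda", host)
        new_ok = True
    except CentralDatabaseUnavailable:
        new_ok = False
    return existing_ok, new_ok
```

In this toy model, `simulate()` returns `(True, False)`: the existing ENI keeps routing because the host holds its own state, while the ENI a new Lambda invocation would need cannot be created until the database is back.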

I understand this separation of data plane and control plane, and how it provides greater stability within AWS's systems. But I do believe AWS needs to revise and adapt it to cover serverless resources. Unlike static EC2 instances, Lambda functions can't be static: they require resources to be provisioned dynamically, and that provisioning should not depend on the control plane or on the availability of a central database.
