AWS's static stability and recent outage
On January 22nd, 2020, between 4:07 PM and 11:20 PM PST, you could not create new resources in a VPC in the AWS Sydney region. In the wake of this outage, AWS conducted a deep analysis and will be changing its processes to keep this from happening again. Any process change that prevents a similar outage is welcome, but I believe there is a more fundamental issue that should be addressed.
AWS separates its systems into a control plane and a data plane. The data plane is responsible for making sure existing resources keep functioning during a system failure, whereas provisioning new resources and applying configuration changes are handled by the control plane. In that light, the data plane is far simpler and more robust, with fewer moving parts than the control plane. Here is an excerpt on this from one of the architecture articles in the Amazon Builders' Library.
“The Amazon EC2 data plane has been carefully designed to be statically stable in the face of control plane availability events (such as impairments in the ability to launch EC2 instances). For example, to avoid disruptions in network connectivity, the Amazon EC2 data plane is designed so that the physical machine on which an EC2 instance runs has local access to all of the information it needs to route packets to points inside and outside of its VPC. An impairment of the Amazon EC2 control plane means that during the event the physical server might not see updates like a new EC2 instance added to a VPC, or a new Security Group rule. However, the traffic it had been able to send and receive before the event will continue to work.”
As you can see, the purpose of the data plane is to make sure already provisioned resources keep working in the event of a control plane failure. This works well for static resources like EC2 instances, but not so well for elastic resources, and that is exactly what went wrong during this recent outage. For example, a Lambda function that needs to interact with VPC resources requires a new ENI to be created and attached to it. That means provisioning a new resource, which is handled by the control plane. Unlike the data plane, where the information required to perform an action is stored locally, the control plane relies on a central database to provision new resources. The unavailability of this central database is what caused the control plane to fail and hence led to this outage.
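To make the distinction concrete, here is a minimal toy sketch in Python. It is not AWS's implementation; the class names (CentralDatabase, ControlPlane, Host) and the resource IDs are hypothetical, made up purely for illustration. It only models the idea described above: the data plane answers from local state, while provisioning has to go through a central database.

```python
class CentralDatabase:
    """Hypothetical stand-in for the control plane's central store."""

    def __init__(self):
        self.available = True
        self.records = {}

    def write(self, resource_id, route):
        # Every provisioning request must be recorded here first.
        if not self.available:
            raise ConnectionError("central database unavailable")
        self.records[resource_id] = route


class Host:
    """Data plane: a physical machine holding a local copy of its routes."""

    def __init__(self):
        self.local_routes = {}

    def route_for(self, resource_id):
        # Uses only local state, so it never touches the central database.
        return self.local_routes[resource_id]


class ControlPlane:
    """Provisions new resources; depends on the central database."""

    def __init__(self, db):
        self.db = db

    def provision(self, host, resource_id, route):
        # Fails before touching the host if the database is down.
        self.db.write(resource_id, route)
        host.local_routes[resource_id] = route


db = CentralDatabase()
host = Host()
control_plane = ControlPlane(db)

# An EC2 instance provisioned before the outage.
control_plane.provision(host, "i-existing", "10.0.0.5")

# Simulate the outage: the central database becomes unavailable.
db.available = False

# The existing instance keeps working from local state.
print(host.route_for("i-existing"))

# A Lambda function that needs a fresh ENI cannot get one.
try:
    control_plane.provision(host, "eni-new", "10.0.0.9")
except ConnectionError as err:
    print("provisioning failed:", err)
```

In this toy model the existing instance is unaffected by the outage, while anything that needs a new resource at request time inherits the control plane's availability.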
I understand this separation of data plane and control plane and how it provides greater stability within AWS’s systems. But I do believe AWS needs to revise and adapt it to include serverless resources. Unlike static EC2 instances, Lambda functions can’t be static: they require dynamic provisioning of resources, which should not depend on the control plane or on the availability of a central database.