A date that changed Netflix’s Attitude towards Availability & Resiliency...
This is a throwback post about an incident from 7 years ago that changed Netflix’s attitude towards Availability & Resiliency.
On Christmas Eve 2012, the Netflix streaming service experienced an outage. For full details, see “A Closer Look at the Christmas Eve Outage” by Adrian Cockcroft (ex-VP, Cloud Platform). The incident got a lot of media coverage for obvious reasons; see these horror tweets and news coverage.
Incidents in which an entire AWS Region becomes unavailable are very rare, but they do happen. When they do, the Internet’s biggest sites and applications generally become unavailable, with day-long outages. To mitigate region-based outages, Netflix invested heavily in its Resiliency Engineering and Cloud Platform teams to create a discipline of breaking things on purpose. Here are some examples of how it was done:
Create More Failures
In the early days of the Resiliency journey, Netflix ran Chaos Monkey as a service that kills other services, but that alone was not sufficient to mitigate region outages. Netflix then improved step by step. First came “Chaos Gorilla”, which simulates what happens if an entire Availability Zone (AZ) goes down. Finally they built “Chaos Kong”, which doesn’t just kill a server, it kills an entire AWS Region. By regularly running experiments that simulate a Regional outage, they were able to identify systemic weaknesses early and fix them. When a region actually became unavailable, the systems in the other regions were already strong enough to handle a traffic failover.
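The escalation from instance to zone to region can be pictured with a toy model. The sketch below is only an illustration, not Netflix's tooling: the `Fleet` class, its methods, the region and zone names, and the instance counts are all hypothetical. It shows how the same capacity check can be run after failures of increasing blast radius.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    # Maps (region, zone) -> number of healthy instances (hypothetical toy model).
    instances: dict = field(default_factory=dict)

    def capacity(self) -> int:
        """Total number of healthy instances across all regions."""
        return sum(self.instances.values())

    def kill_instance(self, rng: random.Random) -> None:
        """Chaos Monkey style: terminate one randomly chosen instance."""
        candidates = [key for key, count in self.instances.items() if count > 0]
        self.instances[rng.choice(candidates)] -= 1

    def kill_zone(self, region: str, zone: str) -> None:
        """Chaos Gorilla style: take out one whole availability zone."""
        self.instances[(region, zone)] = 0

    def kill_region(self, region: str) -> None:
        """Chaos Kong style: take out every zone in one region."""
        for key in self.instances:
            if key[0] == region:
                self.instances[key] = 0


def survives(fleet: Fleet, required_instances: int) -> bool:
    """True if the remaining instances can still carry the required load."""
    return fleet.capacity() >= required_instances


def build_fleet() -> Fleet:
    """Two regions, two zones each, ten instances per zone (made-up numbers)."""
    return Fleet({
        ("us-east-1", "a"): 10, ("us-east-1", "b"): 10,
        ("us-west-2", "a"): 10, ("us-west-2", "b"): 10,
    })
```

In this model, a fleet that needs 25 instances survives the loss of a single instance or a single zone, but not the loss of a whole region. That is the kind of weakness a region-level experiment is meant to surface before a real outage does.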
Graceful Degradation
Another principle that Netflix followed is to assume that services will fail all the time, so it's important to design services that are robust enough to tolerate these failures. To that end, Netflix followed principles like:
Fail Fast: Set aggressive timeouts so that failing components don’t make the entire system crawl to a halt.
Fallbacks: Each feature is designed to degrade or fall back to a lower-quality representation. For example, if Netflix cannot generate personalized rows of movies for a user, it falls back to un-personalized results instead of showing no results at all.
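The two principles combine naturally: put a short timeout on the call, and return a degraded result if it does not answer in time. The following is a minimal sketch under stated assumptions, not how Netflix implemented it. The helper `call_with_fallback`, the row titles, and the timeout values are all made up for illustration.

```python
import concurrent.futures
import time

# Hypothetical un-personalized rows used as the degraded response.
UNPERSONALIZED_ROWS = ["Popular Now", "New Releases"]


def call_with_fallback(fn, timeout_s, fallback):
    """Run fn() with a deadline; return fallback on timeout or error."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        # Fail fast: stop waiting once the deadline passes.
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and an exception raised inside fn().
        return fallback
    finally:
        # Do not block the caller on the slow worker thread.
        pool.shutdown(wait=False)


def slow_personalized_rows():
    """Stand-in for a personalization service that is responding slowly."""
    time.sleep(0.3)
    return ["Because you watched ..."]


rows = call_with_fallback(slow_personalized_rows, 0.05, UNPERSONALIZED_ROWS)
```

Here `rows` ends up holding the un-personalized list, because the simulated service takes longer than the 50 ms budget. Note that the slow call keeps running in the background in this sketch; a production version would also need to cancel or bound that work.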
Active-Active
To make the Multi-Regional Resiliency plan a success, one mandatory step is an Active-Active solution in which all the services on the user call path are deployed across multiple AWS Regions. Netflix is Active-Active in 3 regions: us-east-1, us-west-2, and eu-west-1. This means that if they lose any two regions to a catastrophic event, the remaining region is still able to serve all the global traffic. Keeping AWS resources on standby was an expensive decision, but a number of optimizations were made to keep the budget within the desired limits.
Several requirements had to be satisfied for this:
- Services must be stateless: all data/state replication needs to be handled in the data tier.
- Services must access any resource locally, in-Region. This includes resources like S3, SQS, etc.
- There are no cross-regional calls on the user’s call path. Data replication should be asynchronous.
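The capacity side of an Active-Active setup comes down to simple arithmetic: after a failover, the surviving regions must have enough headroom for all the traffic. The sketch below assumes traffic is redistributed in proportion to each surviving region's capacity. The function names and every number are hypothetical, and real traffic steering is considerably more involved.

```python
def fail_over(traffic, capacity, failed):
    """Spread all traffic over surviving regions, proportional to capacity."""
    survivors = [region for region in capacity if region not in failed]
    if not survivors:
        raise ValueError("no healthy region left")
    total_traffic = sum(traffic.values())
    total_capacity = sum(capacity[region] for region in survivors)
    return {
        region: total_traffic * capacity[region] / total_capacity
        for region in survivors
    }


def can_absorb(traffic, capacity, failed):
    """True if no surviving region is pushed past its capacity."""
    routed = fail_over(traffic, capacity, failed)
    return all(load <= capacity[region] for region, load in routed.items())


# Made-up numbers: each region normally serves 100 units of traffic.
traffic = {"us-east-1": 100, "us-west-2": 100, "eu-west-1": 100}

# Sized so that a single region can carry the global load of 300 units.
full_headroom = {"us-east-1": 300, "us-west-2": 300, "eu-west-1": 300}

# Sized to tolerate the loss of one region, but not two.
partial_headroom = {"us-east-1": 200, "us-west-2": 200, "eu-west-1": 200}
```

With `full_headroom`, `can_absorb` stays true even when two regions fail. With `partial_headroom`, it holds for one failed region and fails for two. This makes the cost trade-off concrete: surviving the loss of any two of three regions means provisioning each region for the entire global load.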
The following graph shows the traffic switch for one of the key metrics, SPS (stream starts per second).
A cross-functional program was completed to make sure all services were moving in the same direction and their dependency path was on track.
Conclusion
For any company that is starting a Resiliency journey, it’s important to think about some of these key principles when designing services, and to break services on purpose to find the weaknesses in its systems.