A date that changed Netflix’s Attitude towards Availability & Resiliency...

I’m writing a throwback post about an incident from seven years ago that changed Netflix’s attitude towards availability and resiliency.

On Christmas Eve 2012, the Netflix streaming service experienced an outage. For full details, see “A Closer Look at the Christmas Eve Outage” by Adrian Cockcroft (ex-VP, Cloud Platform). The incident got a lot of media coverage for obvious reasons, from horrified tweets to news stories.

It is very rare for an AWS Region to become unavailable, but it does happen. When it does, some of the Internet’s biggest sites and applications generally become unavailable too, with outages that can last a day. To mitigate region-wide outages, Netflix invested heavily in its Resiliency Engineering and Cloud Platform teams and built a discipline of breaking things on purpose. Here are some examples of how it was done:

Create More Failures

In the early days of the resiliency journey, Chaos Monkey ran as a service that kills other services, but that alone was not sufficient to mitigate region outages. Netflix improved step by step: first it built “Chaos Gorilla” to simulate what happens when an entire Availability Zone (AZ) goes down, and finally “Chaos Kong”, which doesn’t just kill a server, it kills an entire AWS Region. By regularly running experiments that simulate a regional outage, the teams were able to identify systemic weaknesses early and fix them. When a region actually became unavailable, the systems in the other regions were already strong enough to handle a traffic failover.
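
To make the escalation concrete, here is a minimal sketch of the three failure scopes applied to a toy fleet. It is not Netflix's tooling: the fleet layout, the instance counts, and every function name below are assumptions made up for illustration. It only shows how the blast radius grows from one instance to one AZ to one Region, and how a capacity check can be run after each experiment.

```python
import random

# Hypothetical fleet: region -> availability zone -> running instance count.
FLEET = {
    "us-east-1": {"a": 4, "b": 4, "c": 4},
    "us-west-2": {"a": 4, "b": 4, "c": 4},
    "eu-west-1": {"a": 4, "b": 4, "c": 4},
}

def copy_fleet(fleet):
    """Return a copy so each experiment starts from the same baseline."""
    return {region: dict(zones) for region, zones in fleet.items()}

def kill_instance(fleet, rng):
    """Instance-level failure: terminate one randomly chosen instance."""
    fleet = copy_fleet(fleet)
    candidates = [(r, z) for r, zones in fleet.items()
                  for z, count in zones.items() if count > 0]
    region, zone = rng.choice(candidates)
    fleet[region][zone] -= 1
    return fleet

def kill_zone(fleet, region, zone):
    """AZ-level failure: every instance in one zone disappears."""
    fleet = copy_fleet(fleet)
    fleet[region][zone] = 0
    return fleet

def kill_region(fleet, region):
    """Region-level failure: every zone in one region disappears."""
    fleet = copy_fleet(fleet)
    fleet[region] = {zone: 0 for zone in fleet[region]}
    return fleet

def surviving_capacity(fleet):
    """Total number of instances still running."""
    return sum(count for zones in fleet.values() for count in zones.values())

def can_serve(fleet, required_instances):
    """True if the survivors can still carry the required load."""
    return surviving_capacity(fleet) >= required_instances

rng = random.Random(42)
after_monkey = kill_instance(FLEET, rng)
after_gorilla = kill_zone(FLEET, "us-east-1", "a")
after_kong = kill_region(FLEET, "us-east-1")

for name, fleet in [("instance", after_monkey),
                    ("zone", after_gorilla),
                    ("region", after_kong)]:
    print(name, surviving_capacity(fleet), can_serve(fleet, required_instances=20))
```

In a real system the interesting part is what the experiment reveals about dependencies and failover, not the counting itself; the counting just makes the three scopes easy to compare.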

Graceful Degradation

Another principle that Netflix followed is to assume that services will fail all the time, so services must be designed to tolerate those failures. To that end, Netflix applied practices such as:

Fail Fast: Set aggressive timeouts so that a failing component cannot make the entire system crawl to a halt.

Fallbacks: Design each feature to degrade or fall back to a lower-quality representation. For example, if Netflix cannot generate personalized rows of movies for a user, it at least falls back to un-personalized results instead of showing no results at all.
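
The two practices combine naturally, and a minimal sketch might look like the following. This is an illustration only, not Netflix's implementation: `personalized_rows`, `rows_for`, `POPULAR_ROWS`, the row titles, and the timeout values are all hypothetical. The slow dependency is simulated with a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical un-personalized default shown when personalization is unavailable.
POPULAR_ROWS = ["Trending Now", "Top 10", "New Releases"]

def personalized_rows(user_id, delay_s):
    """Stand-in for a remote personalization call; delay_s simulates latency."""
    time.sleep(delay_s)
    return [f"Picked for user {user_id}", "Continue Watching"]

_executor = ThreadPoolExecutor(max_workers=4)

def rows_for(user_id, delay_s, timeout_s=0.2):
    """Fail fast on a slow dependency, then fall back to generic rows."""
    future = _executor.submit(personalized_rows, user_id, delay_s)
    try:
        # Fail fast: wait at most timeout_s for the personalized result.
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Fallback: the dependency is too slow, serve un-personalized rows.
        return POPULAR_ROWS
    except Exception:
        # Fallback: the dependency raised an error, degrade the same way.
        return POPULAR_ROWS

print(rows_for(7, delay_s=0.0, timeout_s=1.0))   # personalized rows
print(rows_for(7, delay_s=0.5, timeout_s=0.05))  # falls back to POPULAR_ROWS
```

Note that the timeout here only stops the caller from waiting; the worker thread still finishes its sleep in the background. A production client would also cancel or bound the underlying call.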

Active-Active 

To make the multi-regional resiliency plan a success, one mandatory step is an Active-Active solution in which all the services on the user call path are deployed across multiple AWS Regions. Netflix is Active-Active in three regions: us-east-1, us-west-2, and eu-west-1. This means that if they lose any two regions to a catastrophic event, the remaining region is still able to serve all the global traffic. Keeping AWS resources on standby was an expensive decision, but a number of optimizations were made to keep the budget within the desired limit.
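
The cost of that guarantee follows from simple arithmetic, sketched below. The function names and the peak value of 90 are assumptions for illustration (the units are arbitrary), and the model ignores the optimizations mentioned above. It only shows why tolerating the loss of two out of three regions means each region must be able to carry the full global peak.

```python
import math

def per_region_capacity(global_peak, regions, tolerated_losses):
    """Capacity each region needs so the survivors can carry the global peak."""
    survivors = regions - tolerated_losses
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return math.ceil(global_peak / survivors)

def overprovision_factor(regions, tolerated_losses):
    """Total provisioned capacity as a multiple of the global peak."""
    return regions / (regions - tolerated_losses)

# Hypothetical global peak of 90 units across three regions.
print(per_region_capacity(90, regions=3, tolerated_losses=1))  # 45
print(per_region_capacity(90, regions=3, tolerated_losses=2))  # 90
print(overprovision_factor(3, tolerated_losses=1))             # 1.5
print(overprovision_factor(3, tolerated_losses=2))             # 3.0
```

Under these assumptions, surviving one regional loss costs 1.5 times the peak in total capacity, while surviving two costs 3 times the peak, which is why standby capacity is described as an expensive decision.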

Several requirements had to be satisfied for this to work:

  • Services must be stateless: all data and state replication needs to be handled in the data tier.
  • Services must access any resource locally, in-Region. This includes resources like S3, SQS, etc.
  • There must be no cross-regional calls on the user’s call path. Data replication should be asynchronous.
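
These requirements can be sketched with a toy in-memory model. It is not Netflix's data tier: the `RegionStore` class, its methods, and the key name are hypothetical, and conflict resolution is ignored (the last replicated write simply wins). The point is that reads and writes on the request path touch only local state, while replication to peer regions happens later, off the request path.

```python
from collections import deque

class RegionStore:
    """Toy per-region data tier with asynchronous replication to peers."""

    def __init__(self, name):
        self.name = name
        self.data = {}          # local state, the only thing requests touch
        self.outbox = deque()   # writes waiting to be shipped to peers
        self.peers = []         # other RegionStore objects

    def write(self, key, value):
        """Request path: write locally and queue the change, no remote call."""
        self.data[key] = value
        self.outbox.append((key, value))

    def read(self, key):
        """Request path: read from local state only."""
        return self.data.get(key)

    def replicate(self):
        """Background path: drain the outbox to every peer region."""
        sent = 0
        while self.outbox:
            key, value = self.outbox.popleft()
            for peer in self.peers:
                peer.data[key] = value
            sent += 1
        return sent

east = RegionStore("us-east-1")
west = RegionStore("us-west-2")
east.peers = [west]

east.write("user:1:position", 42)
before = west.read("user:1:position")   # None, not replicated yet
shipped = east.replicate()              # runs off the request path
after = west.read("user:1:position")    # 42 once replication has run
print(before, shipped, after)
```

The window between `write` and `replicate` is the price of keeping cross-regional calls off the user's call path: another region may briefly serve stale data.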

[Graph: the traffic switch as seen in one of the key metrics, SPS (stream starts per second).]

A cross-functional program was run to make sure all services were moving in the same direction and that their dependency paths were on track.

Conclusion

For any company starting a resiliency journey, it’s important to think about these key principles when designing services, and to break services on purpose to find the weaknesses in its systems.

Source: https://medium.com/netflix-techblog

Gaurav Pal

CEO & Founder, stackArmor | NIST, FedRAMP, FISMA/RMF, AI RMF, CMMC Cloud and AI ATOs

5 years ago

Jaspreet B., thanks for sharing. The Netflix team of cloud engineers and Adrian Cockcroft have played a seminal role in changing our thinking about how to leverage cloud for resilient engineering. As systems developers we need to continue to find ways to incorporate true systems engineering principles that reflect the increasing criticality of IT systems in our daily lives. The next step of evolution would be to engineer systems incorporating these principles based on requirements and SLAs. Engineering to SLAs for IT systems continues to be an art.

Jatinder Singh

Digital Transformation Leader | IT Operations | UVA Darden MBA

5 years ago

You are part of history now.
