How Netflix Is Building Resilience Engineering Into Its Culture

In previous articles, I made a case for an organizational resilience strategy and for putting resilience on the leadership team agenda, so that companies can survive and thrive, regardless of the challenge. I also discussed myths relating to untoward incidents, which drew on some of the principles of resilience engineering. In this article, I discuss the importance of embedding resilience thinking into company culture and explore what best practice looks like by examining how Netflix has adopted the principles of resilience engineering.

What is Resilience Engineering?

Resilience Engineering is a multi-disciplinary approach to designing and managing complex, dynamic and uncertain socio-technical systems. The primary goal of resilience engineering is to make high-risk infrastructure, systems, organisations and communities “more adaptive to internal and external threats and disruptions to system functioning through enhancing their resilience capability.”[i]

Since its conception in 2006, resilience engineering has become an increasingly recognised and important perspective. For example, the excellent Lloyd’s Register Foundation Foresight Review of Resilience Engineering stated that it was now “one of the research priorities identified in the Foundation’s strategy”[i], and the approach is being adopted by many organizations, such as Netflix.

Focus on What Goes Right

Central to Resilience Engineering is the understanding of the complex, dynamic-adaptive nature of systems that cannot be precisely described, specified, codified, predicted, mechanized, or controlled[ii]. Resilience engineering challenges the traditional models of control and standardization of work. 

It is not about the absence of something. It is about the presence of something.

Resilience backwards

According to Hollnagel[iii], the classic approach to thinking about resilience is to investigate malfunctions, errors, failures, incidents, accidents, and near-misses in a backwards-looking fashion to identify the so-called “root causes” and, often, human error. This conventional approach examines resilience by its absence and not by its presence.

Resilience forwards

Resilience engineering shifts attention to what goes right in order to achieve resilience. That is, we need to understand how a system operates effectively, not how it fails. The focus shifts from minimising the risk of failure to preserving critical system functionality under both ‘normal’ and varying, often unanticipated, conditions.

Table 1: Traditional Resilience vs Resilience Engineering, Sources [i][ii][iii][iv]

Sidney Dekker describes[x] some of the things that teams and organizations that are good at resilience engineering tend to do:

  • “They don’t take past success as a guarantee of future safety. Past results are not enough for them to be confident that their adaptive strategies will keep working.
  • They keep a discussion of risk alive even when everything looks safe. That things look safe doesn’t mean they are: the model of what is risky may have become old, wrong, so they keep updating it.
  • They are able to bring in different and fresh perspectives on problems. They listen to minority viewpoints, invite doubt, stay curious and open-minded, complexly sensitized.
  • And they inspire and reward in their people the courage to say “no” to trading chronic safety concerns for acute production pressures; the courage to put the foot down and invest in safety when everybody else says that they can’t. Because that is exactly the time when such investments may be necessary.”

So, who has a ‘proactive’ resilience culture that is both effective and aligned with current academic thinking?

Resilience Engineering at Netflix

Whilst the unrelenting pursuit of high performance at Netflix has drawn some criticism[v], many also laud the company’s culture of “Freedom & Responsibility”[vi]. Netflix’s principles of ‘people over process’, ‘context not control’ and ‘highly aligned, loosely coupled’ have helped to improve the reliability and resilience of Netflix services by focusing on the people within the company, “since it's the normal, everyday work of Netflix employees that creates our availability”[iv]. Talking about what he learned at a previous company, Pure Software, Reed Hastings, the CEO of Netflix, says[v]:

 

“The mistakes in Pure were that every time we had a significant error -- sales call didn't go well, a bug in the code -- we tried to think about in terms of, what process could we put in place to ensure that this doesn't happen again,” he says. “What we failed to understand is by dummy-proofing all the systems, we would have a system where only dummies wanted to work there, which was exactly what happened.”
 

Ryan Kitchens, a senior site reliability engineer at Netflix, explains in a fascinating podcast how resilience engineering has helped the company. He says[vii] that “failure happens all the time” and, as this is the new normal, we must develop the skills to deal with it. Focusing "on how things go right can provide valuable insight into the resilience within your system, e.g. what are people doing every day that helps us overcome incidents". Finding sources of resilience is somewhat “the story of the incident you didn’t have”[vii]. There is "no root cause with complex socio-technical systems as found at Netflix and most modern web-based organisations" [vii]. Instead, teams "must dig a little deeper, and look for what went well, what contributed to the problem, and where are the recurring patterns" [vii].

A Netflix job advert posted on LinkedIn[ix] earlier this year demonstrates how resilience engineering is becoming embedded into the company’s logic, outcomes and principles:

Netflix’s Resilience Engineering logic [ix]

The job advert states:

“Netflix as a socio-technical system is formed from the interaction of people and software. This system has many components and is continually undergoing change. Unforeseen interactions are frequent, and operational surprises arise from perfect storms of events.

Surprises over incidents and recovery more than prevention. We encourage highlighting good catches, the things that help make us better, and the capacity we develop to successfully minimize the consequences of encountering inevitable failure. A holistic view of our work involves paying attention to how we are confronted with surprises every day and the actions we take to cope with them.

Discovering new information and actionable outcomes over tracking stagnant action items. We aspire to pursue the ways that help us learn; not chase after numbers. Building a learning organization is a real way that we are able to proactively and continually improve.”

Netflix’s Resilience Engineering Outcomes [ix]

  • “Increase Netflix's capacity to adapt to changes and surprises
  • Enhance operational expertise at Netflix
  • Advance Netflix as a learning organization
  • Change the ways internal tool builders think about how people and tools interact
  • Improve team health by empowering teams to balance operational responsibilities with development.”

Netflix Resilience Engineering Principles [ix]

  • “‘Exploring contributions’ versus constructing causes
  • ‘I see how that action was reasonable’ versus ‘you shouldn't have done that’
  • ‘Human error’ as symptom versus ‘human error’ as cause
  • Automation as a team player versus automation as a replacement for humans
  • How things went right versus why things went wrong
  • Adapting to new surprises over remediating prior incidents
  • Narrative descriptions of surprising events versus out-of-context quantitative data
  • Deep conversations versus shallow timelines
  • Identifying weak signals versus broadly categorizing incidents
  • Decisions driven by expert judgment versus decisions driven by superficial metrics
  • Influence through developing relationships over exercising authority.”

Netflix has clearly embraced resilience engineering. Do you know of other companies that have developed a culture of resilience? Would your organization benefit from resilience engineering?

 

Sources

[i] https://www.lrfoundation.org.uk/en/publications/resilience-engineering/

[ii] https://www.amazon.co.uk/Resilience-Engineering-Concepts-David-Woods/dp/0754649040

[iii] https://erikhollnagel.com/ideas/resilience-engineering.html

[iv] https://pdfs.semanticscholar.org/a0d3/9cc66adc64e297048a32b71aeee209a451af.pdf

[v] https://www.wsj.com/articles/at-netflix-radical-transparency-and-blunt-firings-unsettle-the-ranks-1540497174

[vi] https://www.slideshare.net/reed2001/culture-1798664/2-Netflix_CultureFreedom_Responsibility2

[vii] https://www.infoq.com/podcasts/netflix-sre-sociotechnical-systems/

[viii] https://drivestartups.com/6-things-you-need-to-know-about-how-netflix-built-its-powerf/

[ix] https://www.dhirubhai.net/jobs/view/sr-resilience-engineering-advocate-at-netflix-1222215926/

[x] https://www.amazon.co.uk/Foundations-Safety-Science-Understanding-Accidents/dp/1138481777


About David Denyer

David Denyer is a highly cited author, engaging keynote speaker and an inspiring educator. He is a Professor of Leadership and Organizational Change, as well as a Commercial Director, at Cranfield School of Management. He runs the Organizational Resilience and Change Leadership Group. David is a trusted advisor to the leaders of some of the world's most renowned companies and government organisations. He helps them to understand issues, identifies their specific needs and then works with them to produce solutions that bring immediate improvement to their business. David also runs the Leading Organisational Resilience Programme at Cranfield, which is consistently rated as one of the world's top providers of executive development.



