Understanding Site Reliability Engineering through Movies and Books
Posters courtesy of imdb.com and appropriate copyright holders

In the past, when asked to explain what Site Reliability Engineering is, I found I sometimes covered the plain facts of the job without conveying the excitement and challenge of the experience. Over the last two years, I’ve started to use movies and books as a frame of reference to describe the role to people interested in understanding what it is like to be a Site Reliability Engineer (SRE) or a manager of SREs.

A clear definition of Site Reliability Engineering has been critical to successfully running an SRE organization. When I started building a new SRE team in 2015, one of the first challenges was explaining what SRE meant to leaders who were not familiar with the term. My SRE book list came into being when the manager of a potential transfer into my team suggested I give the engineer some book recommendations to describe the role. From there, my list expanded as I looked for more succinct, entertaining reads that would explain the concept of Site Reliability Engineering inside and outside the company.

Since 2017, the starting point for reading about SRE has been more obvious. The O’Reilly/Google-published Site Reliability Engineering: How Google Runs Production Systems book is an anthology of short essays on how Google tackles running massive-scale services with an SRE mindset. As the originator of SRE, Google is in an excellent position to evangelize it! However, this is also a book that is intimately about Google as much as it is about SRE, and what works best for Google may not be immediately practical to implement elsewhere. In my experience, a new SRE walking into another company (be it small, medium, or large) and expecting everything to run like Google will be in for a big surprise!

As excellent as the Google SRE book is, it can be too much, too fast. It also didn’t exist in a coherent form back when I started my team, and I had to find alternative ways to explain the role. In picking reading material, I had to be cautious, as the industry was still sorting out its own terminology. My own first exposure to SRE had come out of some celebrated conflicts (Adrian Cockcroft’s Ops, DevOps and PaaS (NoOps) at Netflix blog vs. John Allspaw’s s/NoOps/OpsDoneMaturelyButStillOps/g response). On top of that, the existence of the pliant term DevOps was sure to cause confusion.

First, for new college graduates exploring SRE roles, I led with a TechCrunch article from early 2016 that compared the rising use of the SRE job title to the success of the data scientist role. It set the context for what an SRE did while referencing Google’s influence on the philosophy, but it did so without principally being about life at Google, which was important when recruiting for a company that was not Google.

Second, I shared a list of recommended reading that conveyed what it meant to be an SRE.

Release It!, by Michael Nygard. At the time, this was the one book I most encouraged reading. It's an overview of what it means to build and run services, which are not like putting software in a box. It is comprehensive, and while it is now a little dated, it is an easy starting point for service-oriented software development.

Antifragile, by Nassim Nicholas Taleb. This is not software-specific so much as a set of essays on how to build (or spot) systems that improve as a consequence of failure. Taleb’s ideas are a much deeper, superior version of what the Five Whys attempts to capture, and I far prefer his model for root cause analysis. Taleb is a divisive author, prone to lengthy, articulate rants - think of him as both smart and rude, and be prepared to be entertained by the attitude while looking for the ideas underneath.

The Phoenix Project, by Gene Kim, Kevin Behr, and George Spafford. As a novel about IT operations, The Phoenix Project is probably the world's best-regarded IT heroic fiction (though I like Snow Crash a lot, too). A significant part of my job running SRE is de-Brenting the organization, which will make more sense to you after reading The Phoenix Project. I gather The Goal is very similar and is common MBA reading material.

Continuous Delivery, by Jez Humble and David Farley. The discussion of DevOps versus SRE philosophy has already filled dozens of blog articles, but for my purposes this is the book that addresses the part of the space that is about using SRE time to make developers’ jobs easier and releasing new code to production less painful. Continuous Delivery is an introductory textbook on how to stand up DevOps capabilities. Unlike the prior recommendations, it's a dense read, but I’ve found it to be an excellent reference. If you were standing up a release pipeline from scratch, this would be the book to use as the basis for planning the project.

Finally, while the above books are a great introduction to the role, they can underemphasize what makes it fulfilling to be a site reliability engineer. Contemplating how to explain the role to a group of intern software engineers, I proposed that we should all go watch The Martian. The movie (and book) capture the sense of working under pressure to solve an onrushing series of hard technical problems; better yet, they convey the satisfaction that comes from outsmarting a world that can seem to be working against you. That led to a short list of movies that give the audience the feeling of being a site reliability engineer:

  • The Martian, Apollo 13, Hidden Figures: Movies about space exploration cover an experience familiar to SRE life: pairing technical know-how with resourcefulness and resilience. I start by recommending The Martian; while Andy Weir’s book is much better, the movie gets the point across, and Mars is gorgeous in its loneliness. Apollo 13 and Hidden Figures scratch this same itch from different perspectives, with the benefit of being based in science history rather than science fiction.
  • Arrival: Often, the SRE experience is focused on having a problem, a timer, and no clear path forward. In recent movies, Arrival stands out in portraying the stress, confusion, and urgency of situations where everything is broken and every moment matters in making sense out of the chaos. In covering the impact of fatigue with a deft filmmaker’s touch and serving up a series of puzzles to the viewer, Arrival does not disappoint!

With these books and movies, a new SRE can get a taste of not just what the job is, but also what being an SRE feels like. I’m always looking for additions to the list, and would welcome any suggestions that would make the list better!

Postscript: For the SRE Manager

While I put in my share of time as a hands-on SRE, my primary role is to manage SREs. For the aspiring SRE manager looking for entertainment and inspiration, I’d add two more recommendations:

  • Aliens: In general, it’s an oversimplification to draw a connection between a war movie and the SRE experience. In most SRE work, nobody gets shot at when a service goes down. However, the two minutes of this movie that start with Bill Paxton’s immortal “Yeah, man, but it’s a dry heat” are a parable of how to fail to manage - particularly how to fail to manage an SRE organization, where taking away the team’s tools for doing its job will often lead to creative (and potentially regrettable) improvisation. When I am asking my team to do something under adverse circumstances, I do a mental check: am I the lieutenant sitting in the safety of a bunker, or am I part of the squad that owns solving the problem? Don’t be Lieutenant Gorman.
  • The Annihilation Score, by Charles Stross: Stross’s Laundry Files novels start with a quirky medley of horror, humor, bureaucracy and information technology; this one’s about setting up a new governmental agency. If you’ve never managed, and never want to, you’ve probably already clicked out of this article - but if not, give this book a try. If you’ve ever dealt with the complexity of defining a new organization and explaining it to superiors at the same time you have a job to do, this novel will resonate with you. Plus, the main character is a combat epistemologist - her job is to study hostile philosophies and disrupt them. If you’ve ever done SRE work with development teams, I bet you’ll see parallels immediately!


Neil Laughlin

Vice President, Site Reliability Engineering at AuditBoard

4 years ago

Two+ years later, my intern asked me to extend the book recommendations to cover management. For future reference, here's what I suggested: Management track: Phoenix Project (mentioned in blog), Antifragile (optional; mentioned in blog), Built to Last by Jim Collins, the latest Brené Brown book (whatever it is at the time), Radical Candor by Kim Scott. I may expand on why as an addendum to the blog article at a later date. We will see!

Black Panther is one of the primary color heroes of the day https://goo.gl/LXyCEp

And I am a huge fan of The Imitation Game - how NOT to manage your greatest assets.

Jillian Middleton

Senior IT Manager | Automation Focused | Growth Mindset

7 years ago

Have you read The Goal? Similar to Phoenix Project with more emphasis on designing a system around constraints and pipeline.

Nick Follett

Production & Cloud Platform Engineering

7 years ago

I think The Martian is especially accurate because, as an astronaut, he had to be very "T-shaped" (with a very thick upper-bar) where despite being a botanist he was also very capable in other abilities. Not only that, but he had to overcome obstacles by combining knowledge and ability with novel problem solving. He wouldn't have been able to grow anything in the first place had he not solved for water first, and doing so fell outside the typical skillset of a botanist. Apollo 13 may share a lot of these same points, but The Martian is much more recent in my memory. A note about Arrival: I think it's a great example of why managers must fully trust their best people in difficult situations. The main character required that there be a specific, scientific process followed in order to safely arrive at a solution rather than rush through it and land on something that could have been disastrous.
