What is SRE really?

Hint: It’s not always what Google says

Last year, I presented at SRECon EMEA on the topic of the biases confronting SREs (site reliability engineers). This article is a written summary of that presentation, inspired by some of the other speakers who picked up on a similar theme (h/t Emil Stolarsky). Simply put, we need to take a hard look at the practice of SRE. This necessitates a more critical examination of the Google SRE origin story and subsequent practices. For me to explore that fully, it means going back even further. Fortunately, I’ve been doing this for over 30 years, so here is my (admittedly biased) treatment of Google SRE.


Although it was created in 2004, SRE practice remained nascent for a decade, from my perspective. I was well into DevOps when SRE burst on the scene in the mid-2010s. Because it was created by Google and supported with books and videos, it was rapidly adopted, and Google was anointed as the authoritative source for what SRE should be. But SRE is a derivative of technical operations work, which is itself a derivative of system administration work, whose origins trace back to the Unix Wizards of the 1970s and 1980s. Interestingly, if you keep following this back through history, you will discover that the first practitioners of this skill were likely the six women who ran ENIAC during World War II. But it was the innovation and market dynamics of the 1980s and 1990s that created the conditions for the birth of modern SRE.

[Figure: Dave's path to SRE]

In the 1980s, Unix and the personal computer were both ascendant, opening up the power of computing to a huge new audience. Digital Equipment Corporation had the beloved VAX-11/780s. IBM was still in the glory days of mainframe computing with the System/370. Sun Microsystems brought Unix and graphical interfaces to the market. TCP/IP emerged, and networking practices exploded, leading to interconnected regional networks that were the beginnings of what is now the Internet. All of these systems needed people to operate them. I worked on VAXes and Suns, among others, during that time, and it was straight-up system administration: fixing issues with processes, scripts, and daemons; backup and restore operations; account management; operating system patches and upgrades; building third-party software packages; configuring device drivers for printers and other peripherals; and so on.

It was the combination of Unix and Perl that made the 1990s the golden age of system administration for me. There were finally tools well suited to common system administration tasks, which made those tasks a lot easier to automate. For example, it was not uncommon for me to get a dozen or more workstations that needed to be configured, networked, and distributed to the researchers I was supporting. I remember using remote booting, TFTP, and some Bourne shell scripts to clone and customize the drives and operating system, so I didn’t have to manually configure each one.

Why am I reliving the 1990s? I was using software to solve operational problems in the example above. Sound familiar? The founding concept of SRE is biased in this way and treated as revolutionary, when in fact there had been software and automation defining this role for decades prior. But there is a limit to how far you can extend this method. There are numerous problems that SREs face that are not solvable by automation or code. In my own experience, there have also been plenty of times when I didn’t pursue automation or code because business dynamics precluded it. But to take this even further, the notion that software is the solution reinforces a number of industry tropes that bring a whole different kind of bias into the conversation. So let’s go there now, shall we?

Today, if I look in the Urban Dictionary for sysadmin, the top-rated answer is:

A person with much more power than you and who is bitter enough to use it in ways that please him/her

This kind of stereotype or trope is a bias that casts an entire group of people (sysadmins) in a negative light. I hope it is obvious that this is not universally true, or even somewhat true. But it’s hardly the only one. Consider this fascinating InfoWorld article: Devs don’t want to do ops. Or these, all of which I have heard:

  • Sysadmins are grumpy and jaded
  • Ops people resist change
  • Developers don’t want to support the services they build
  • Devs can do anything ops can do

That last one, though, is important because it is another way of describing Google SRE. What do I mean by that? Here is the most commonly referenced quote about Google SRE’s founding principle:

SRE is what happens when you ask a software engineer to design an operations team

This has always been problematic for me because, even though I can write code, have written code, and have even been a developer in several situations, I identify as a system administration or system operations person. I have no problem using software engineering and automation as a tool; I've been doing that my whole career.

And there is definitely, for sure, absolutely no elitism in that statement—that software engineers will be more effective at solving my problems than I will. None. OK, there is. Why not have an operations expert design the software engineering teams? What biases do you think that would introduce?

Biases don’t have to be bad. There are plenty of examples of positive bias. You might decide to bias your diet in favor of fruits and vegetables and away from sweets and fatty foods, for example. So I have always chosen to interpret that founding statement as an inspiration to write software to automate things where I can increase reliability and reduce toil, in line with the SRE mission. But software engineering is only part of the solution.

SRE is a somewhat arbitrary collection of roles and responsibilities with very specific relationships to conditions at Google, where it was developed. I believe this is a form of origin bias and makes it harder for non-Googlers to implement SRE. To me, the founding statement about software engineering as the solution is at best an oversimplification. SRE, as I have seen it practiced, has three intersecting disciplines: software engineering, reliability engineering, and technical operations. Here’s how they line up against typical SRE responsibilities:

[Figure: SRE responsibilities by discipline]

The Takeaway

SRE practices bring goodness, but they need to be implemented in the context of the organization implementing them instead of following what works for Google. That is pretty straightforward, and I think the industry generally understands this. But also, software is not actually the right solution to every problem. There are situations where humans are needed, and there are situations that are so complex that even a simple cost-benefit analysis indicates the level of return on investment is not there.
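To make the cost-benefit point concrete, here is a minimal sketch in Python of the kind of back-of-the-envelope check I have in mind. The function name, its inputs, and all of the numbers are hypothetical, chosen only for illustration; a real analysis would also weigh risk, opportunity cost, and how long the task is expected to exist.

```python
def automation_break_even_months(build_hours, upkeep_hours_per_month,
                                 manual_hours_per_month):
    """Return the number of months until automation pays for itself.

    Returns None when the automation never pays off, i.e. when keeping
    it running costs at least as much as doing the work by hand.
    """
    monthly_saving = manual_hours_per_month - upkeep_hours_per_month
    if monthly_saving <= 0:
        return None
    return build_hours / monthly_saving


# Hypothetical simple task: cheap to automate, clear monthly saving.
simple_task = automation_break_even_months(
    build_hours=40, upkeep_hours_per_month=1, manual_hours_per_month=9)

# Hypothetical complex task: expensive to build, and the upkeep costs
# more than the manual work it replaces.
complex_task = automation_break_even_months(
    build_hours=400, upkeep_hours_per_month=6, manual_hours_per_month=4)

print(simple_task)   # months to break even
print(complex_task)  # None means the investment is never recovered
```

With these made-up inputs, the simple task pays for itself in a few months, while the complex one never does, which is the situation where leaving the work with humans is the sensible choice.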

Joshua O'Keefe

Former Senior Site Reliability Software Engineer at Adobe Sign — People. Process. Systems.

1 year ago

Thanks for this excellent article, Dave. Something that I think always bears keeping in mind, no matter how you're building or operating a system is that software is one (important!) part of a system. Processes, the behavior of humans, architecture, business needs, and the day-to-day realities of change are all part of a system as well. Just like we can't glue everything together with human effort or fat books of processes, we can't glue everything together with software.

Chitra B.

Sr. Specialist - Growth & Demand Generation || Workforce Management | Incident Management | Content | Growth

1 year ago

Fascinating read! I appreciate the critical look at the Google SRE origin story and the emphasis on adapting practices to the organization's context. Agree that biases, positive or negative, play a role. Looking forward to more insights on evolving SRE practices!

Mike O'Brien

Platform Engineering Leader, Quality and Customer Advocate.

1 year ago

David Owczarek I always enjoy posts, and you did a great job on this one. You might have to trademark "Dave's path to SRE"
