What is SRE really?

David Owczarek

Senior leader, writer, and speaker focused on SRE/DevOps and operating compliant, secure services at scale in the cloud

Published: November 28, 2023

Hint: It’s not always what Google says

Last year, I presented at SRECon EMEA on the topic of the biases confronting SREs (site reliability engineers). This article is a written summary of that presentation, inspired by some of the other speakers who picked up on a similar theme (h/t Emil Stolarsky). Simply put, we need to take a hard look at the practice of SRE. This necessitates a more critical examination of the Google SRE origin story and subsequent practices. To explore that fully, I need to go back even further. Fortunately, I’ve been doing this for over 30 years, so here is my (admittedly biased) treatment of Google SRE.

Although it was created in 2004, SRE practice remained nascent for a decade, from my perspective. I was well into DevOps when SRE burst onto the scene in the mid-2010s. Because it was created by Google and supported with books and videos, it was rapidly adopted and anointed as the authoritative source for what SRE should be. But SRE is a derivative of technical operations work, which is itself a derivative of system administration work, whose origins trace back to the Unix Wizards of the 70s and 80s. Interestingly, if you keep following this back through history, you will discover that the first practitioners of this skill were likely the six women who ran ENIAC during World War II. But it was the innovation and market dynamics of the 1980s and 1990s that created the conditions for the birth of modern SRE.

In the 1980s, Unix and the personal computer were both ascendant, opening up the power of computing to a huge new audience. Digital Equipment Corporation had the beloved VAX-11/780s. IBM was still in the glory days of mainframe computing with the System/370. Sun Microsystems brought Unix and graphical interfaces to the market. TCP/IP emerged, and networking practices exploded, leading to interconnected regional networks that were the beginnings of what is now the Internet. They all needed people to operate them. I worked on VAXes and Suns, among others, during that time, and it was straight-up system administration: fixing issues with processes, scripts, and daemons; backup and restore operations; account management; operating system patches and upgrades; building third-party software packages; configuring device drivers for printers and other peripherals; and so on.

It was the combination of Unix and Perl that made the 1990s the golden age of system administration for me. There were finally tools that were well suited for common system administration tasks, making it a lot easier to automate. For example, it was not uncommon for me to get a dozen or more workstations that needed to be configured, networked, and distributed to each researcher I was supporting. I remember using remote booting, TFTP, and some Bourne shell scripts to clone and customize the drives and operating system, so I didn’t have to manually configure each one.

Why am I reliving the 1990s? I was using software to solve operational problems in the example above. Sound familiar? The founding concept of SRE is biased in this way and treated as revolutionary, when in fact there had been software and automation defining this role for decades prior. But there is a limit to how far you can extend this method. There are numerous problems that SREs face that are not solvable by automation or code. In my own experience, there have also been plenty of times when I didn’t pursue automation or code because business dynamics precluded it. But to take this even further, the notion that software is the solution reinforces a number of industry tropes that bring a whole different kind of bias into the conversation. So let’s go there now, shall we?

Today, if I look in the Urban Dictionary for sysadmin, the top-rated answer is:

A person with much more power than you and who is bitter enough to use it in ways that please him/her

This kind of stereotype or trope is a bias that casts an entire group of people, sysadmins, in a negative light. I hope it is obvious that this is not universally true, or even somewhat true. But it’s hardly the only one. Consider this fascinating InfoWorld article: Devs don’t want to do ops. Or these, all of which I have heard:

Sysadmins are grumpy and jaded
Ops people resist change
Developers don’t want to support the services they build
Devs can do anything ops can do

That last one, though, is important because it is another way of describing Google SRE. What do I mean by that? Here is the most commonly referenced quote about Google SRE’s founding principle:

SRE is what happens when you ask a software engineer to design an operations team

This has always been problematic for me because, even though I can and have written code and have even been a developer in several situations, I identify as a system administration or system operations person. I have no problem using software engineering and automation as a tool—I've been doing that my whole career.

And there is definitely, for sure, absolutely no elitism in that statement—that software engineers will be more effective at solving my problems than I will. None. OK, there is. Why not have an operations expert design the software engineering teams? What biases do you think that would introduce?

Biases don’t have to be bad. There are plenty of examples of positive bias. You might decide to bias your diet in favor of fruits and vegetables and away from sweets and fatty foods, for example. So I have always chosen to interpret that founding statement as an inspiration to write software to automate things where I can increase reliability and reduce toil, in line with the SRE mission. But software engineering is only part of the SRE solution.

SRE is a somewhat arbitrary collection of roles and responsibilities with very specific relationships to conditions at Google, where it was developed. I believe this is a form of origin bias and makes it harder for non-Googlers to implement SRE. To me, the founding statement about software engineering as the solution is at best an oversimplification. SRE, as I have seen it practiced, has three intersecting disciplines: software engineering, reliability engineering, and technical operations. Here’s how they line up against typical SRE responsibilities:

The Take-Away

SRE practices bring goodness, but they need to be implemented in the context of the organization implementing them instead of following what works for Google. That is pretty straightforward, and I think the industry generally understands this. But also, software is not actually the right solution to every problem. There are situations where humans are needed, and there are situations that are so complex that even a simple cost-benefit analysis indicates the level of return on investment is not there.

Joshua O'Keefe

Former Senior Site Reliability Software Engineer at Adobe Sign — People. Process. Systems.

1 year ago

Thanks for this excellent article, Dave. Something that I think always bears keeping in mind, no matter how you're building or operating a system is that software is one (important!) part of a system. Processes, the behavior of humans, architecture, business needs, and the day-to-day realities of change are all part of a system as well. Just like we can't glue everything together with human effort or fat books of processes, we can't glue everything together with software.

Chitra B.

Sr. Specialist - Growth & Demand Generation || Workforce Management | Incident Management | Content | Growth

1 year ago

Fascinating read! I appreciate the critical look at the Google SRE origin story and the emphasis on adapting practices to the organization's context. Agree that biases, positive or negative, play a role. Looking forward to more insights on evolving SRE practices!

Mike O'Brien

Platform Engineering Leader, Quality and Customer Advocate.

1 year ago

David Owczarek I always enjoy posts, and you did a great job on this one. You might have to trademark "Dave's path to SRE"
