A Year of Lessons: SRE, Organizations, and Leadership
Photo by Ocean Ng on Unsplash

To say that the past year or so has been a learning experience would be an understatement. From the job hunt to onboarding at my previous company, working with my former boss, leading the SRE team, and much more -- so much of the last 12 months has been challenging and rewarding, validating and transformative. In that spirit, I'd like to share a few of the things I learned in the past year.

This may come as no surprise, but much like "DevOps", the definition of SRE is a bit vague and depends on who you're talking to. There are the Purists, who think it's not really SRE unless you're following the Site Reliability Engineering book to the letter. On the other end of the spectrum are the Anarchists, who use SRE as a catch-all for anything loosely related to DevOps, cloud, or platform engineering. I find myself toward the Purist end of the spectrum: I see SRE as a distinct kind of job description and way of organizing people, though I'm not quite as orthodox as a true Purist. What's important to understand, as that last sentence hints, is that a proper SRE implementation requires both the right people and the right organization, working together.

People

It's tempting as a senior leader to come across a concept like SRE and think "hey, we can do that", then go out and build a team that does it without looking at the wider organization holistically. No team exists in a bubble, and SRE is no exception. To make SRE work at your company you need both the people and the organization; one without the other is doomed to fail. Trying to instill an SRE culture in a team of people who are ill-equipped to do the job will result in poor outcomes and job dissatisfaction. Trying to instill an SRE culture in an organization that isn't set up to accommodate the fundamental shift in priorities, job duties, and overall scope will result in your SRE leader being replaced multiple times in short order for poor performance, and in overall frustration on the part of senior leadership with the lack of progress.

To have a functional SRE team you need people who can be SREs. This sounds rather obvious on the surface, but I assure you it is foundational to the success or failure of any SRE organization. An SRE is more than just a glorified Tier 1 support engineer: a proper, well-trained SRE is more software developer than support personnel. SREs need to understand and be able to reason about how applications, hardware, cloud providers, virtual machines, containers, networks, CDNs, and so much more actually work and function together. Your average production support engineer doesn't have that greater context, as they're typically focused only on the application's operation and the middleware layer. SREs, the good ones at least, need to understand the code, be able to read it, and, if necessary, write it. Not only that, but they also need to know how to tie observability into applications and infrastructure in a way that serves the ultimate goal of maintaining SLOs and error budgets. Let me be clear: a good SRE is not cheap. You're paying for expertise, experience, wisdom, and leadership, not a warm body to sit in a seat.

Organization

Just as important as quality people is an organization that is ready to receive the kind of help that SRE teams can bring. One of the most common reasons I have seen SRE fail to take hold is overlapping responsibilities between an SRE team and a traditional Tier 1 PE team. There needs to be a hard delineation between any kind of Tier 1 group and an SRE group. A mile-long RACI somewhere in Confluence isn't going to cut it -- if the demarcation between SRE and PE isn't immediately clear, you're going to be in for a bad time. Confusion about who is responsible for what, who to call, what to do, even who alerts and alarms should go to: all of this spells death for any benefit that can be had from implementing SRE. In a production outage, where you want your best people helping triage, any confusion about who is running the incident, who is doing the work, or even whether someone has access to fix anything not only creates an environment of uncertainty and panic, but also means that the impact on your business or customers only grows while your engineering teams try to deconflict in real time. I've been there and it's not pretty.

The second organizational mistake I've personally witnessed is turning your SRE group into an advisory team. This is more pronounced in larger organizations where responsibilities are distributed across multiple teams (cloud operations, pipeline teams, release management, etc.), but it can happen in smaller orgs as well. It happens when SREs aren't empowered and given ownership over the entire application lifecycle, from birth to production deployment. A team with no ownership and no ability to directly effect change hands-on will never be effective at reducing downtime, streamlining deployment processes, better serving customers, or whatever its mission is. Responsibility without agency spells death for your SRE practice as your highly paid, experienced engineers start migrating to other companies.

Leadership

All of this is moot without the proper leadership to back it up. It's not enough to just have a Senior Director or VP say, "go do SRE stuff"; you need your directors and managers to really believe in the project and to understand the scope of what's being asked. Transforming an organization and its mindset takes time -- more time than you would think -- and patience is the key virtue here. Especially if you're taking an existing PE team and converting it into an SRE team, there's going to be a learning curve for the engineers and for the organization. Essentially, you're taking a team that has historically been 100% operational and turning it into a product-oriented team with roadmaps, goals, and obligations beyond just keeping the lights on. In my own SRE team this also involved moving from Kanban to sprints, another change that can be a large adjustment for associates who aren't used to working like that. The transition to SRE, with its change in mindset, team priorities, and team norms, is a big shift, and not having the proper support from leadership will again spell doom for the team. It takes time to put programs, processes, and ideas in place, and even longer to start making decisions based on those things.

Conclusion

I hope this has been helpful for at least someone. I learned a lot of this the hard way, from being on various ends of things both good and bad, so maybe someone can see the mistakes, learn from them (or avoid them), and help bring SRE to their own place of work.



Looking for an engineering leader with a decade of experience in everything from development to IT, SRE, cloud, and more? I'm currently on the market for a new job -- shoot me a message!
