
Accelerating Decision-Making within Site Reliability Engineering

Neil Laughlin

Vice President, Site Reliability Engineering at AuditBoard

Published: June 28, 2018

In my last blog article, I shared my thoughts on books and movies that convey parts of the experience of being a Site Reliability Engineer (SRE). A couple of weeks ago, I had the opportunity to watch “Top Gun,” one of the most memorable ’80s movies, with my children. Viewing the movie as an adult, I was struck by how much of its tension is rooted in hesitance to act. The movie repeatedly puts its fighter pilots in situations where the right action is unclear and they need to make decisions under pressure. It reminded me of a critical concept in the key SRE skill of incident management: decisiveness.

Incident management is a common SRE activity wherein an issue with delivering service has been detected and needs to be dealt with. It’s an SRE’s job to stop the impact to customers, or mitigate the incident, as quickly as possible. In managing an incident, an SRE must be decisive: understand what is failing, compose a plan to mitigate the impact, and then execute the plan successfully. Doing so requires an SRE to make decisions quickly and with limited information, and often with the risk that an error along the way can make a bad situation worse.

Decisiveness, and decision-making in general, are well-studied. One of the models that has most influenced me came out of the same source material that created “Top Gun”: the OODA loop. Developed by US Air Force Colonel John Boyd, a combat pilot and military strategist, the OODA loop breaks decision-making down into four steps in a continuous cycle: observe, orient, decide, act.

Using the OODA loop effectively, an SRE will:

Observe accessible information - what’s happening?
Orient on an understanding of the situation - what does it mean?
Decide on a course of action - what should I do about it?
Act to make the plan reality - now I will do something!
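
The four steps above can be sketched as a small loop in code. This is only an illustration, not a tool from the article: the function names (observe, orient, decide, act, is_mitigated) and the toy service with its queue_depth and workers values are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class LoopRecord:
    """One pass through the loop, kept so later passes can learn from it."""
    observation: Metrics
    assessment: str
    decision: str


def ooda_loop(
    observe: Callable[[], Metrics],
    orient: Callable[[Metrics, List[LoopRecord]], str],
    decide: Callable[[str], str],
    act: Callable[[str], None],
    is_mitigated: Callable[[Metrics], bool],
    max_cycles: int = 10,
) -> List[LoopRecord]:
    """Repeat observe -> orient -> decide -> act until impact stops."""
    history: List[LoopRecord] = []
    for _ in range(max_cycles):
        observation = observe()                   # what's happening?
        if is_mitigated(observation):
            break
        assessment = orient(observation, history)  # what does it mean?
        decision = decide(assessment)              # what should I do about it?
        act(decision)                              # now I will do something!
        history.append(LoopRecord(observation, assessment, decision))
    return history


# Hypothetical toy service: a backlog of queued requests and a worker count.
state = {"queue_depth": 900.0, "workers": 2.0}


def observe() -> Metrics:
    return dict(state)


def orient(obs: Metrics, history: List[LoopRecord]) -> str:
    return "saturated" if obs["queue_depth"] > 100 else "healthy"


def decide(assessment: str) -> str:
    return "add_worker" if assessment == "saturated" else "wait"


def act(decision: str) -> None:
    if decision == "add_worker":
        state["workers"] += 1
        # Assumed drain rate: each worker clears 200 queued requests per cycle.
        state["queue_depth"] = max(0.0, state["queue_depth"] - 200 * state["workers"])


def is_mitigated(obs: Metrics) -> bool:
    return obs["queue_depth"] <= 100


history = ooda_loop(observe, orient, decide, act, is_mitigated)
```

Each pass re-observes before doing anything else, so the loop stops as soon as the impact is gone rather than carrying on with a stale plan. The history list is passed to orient to stand in for experience; this toy orient function ignores it.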

In the original context of a dogfight between military aircraft, a pilot who can get “inside” an opposing pilot’s OODA loop, by deciphering the situation faster and acting more effectively, will have the advantage over an opponent with a slower loop. In the case of an SRE’s work during an incident, the opponent is typically not a human, but an active, dynamic service created by humans - an environment that can be quite hostile to those working in it!

Take this example: a service is queuing up requests unexpectedly under load, causing performance impact for users. Alerts typically detect impact, but only mature alerts can be counted on to tell an SRE the specific cause of impact. One possible response to queuing requests is to add capacity; another is to drop or throttle user requests; a third is to roll back code or other changes that may have triggered the problem. Each of the options takes time and introduces some amount of risk around making further changes; a skilled SRE will use observations and historical knowledge to find the most likely path to success, and then put it into action.

Within the US Air Force, it is the element of historical knowledge that is considered the most critical part of the OODA loop. USAF training room lore stresses the importance of feeding a new pilot’s OODA loop with the experience that comes from surviving the first combat mission, and how every subsequent mission accelerates the loop - a direct parallel to the value of experience in incident management!
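
One way to picture the tradeoff in the queuing example is to put rough numbers on each option and compare them. The sketch below is purely illustrative: the scoring formula, the 60-minute penalty, and every estimate in it are assumptions invented for the example, standing in for the judgment an experienced SRE applies in their head.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mitigation:
    name: str
    minutes_to_effect: float   # how long before the change can help
    success_likelihood: float  # 0..1, informed by observations and past incidents
    risk_of_worsening: float   # 0..1, chance the change makes things worse


def expected_cost(option: Mitigation, penalty_minutes: float = 60.0) -> float:
    """Rough expected minutes of customer impact if this option is tried.

    Assumption: a failed or harmful attempt costs a flat penalty_minutes.
    """
    return (
        option.minutes_to_effect
        + (1.0 - option.success_likelihood) * penalty_minutes
        + option.risk_of_worsening * penalty_minutes
    )


def rank(options: List[Mitigation]) -> List[Mitigation]:
    """Order options from lowest to highest expected cost."""
    return sorted(options, key=expected_cost)


# Hypothetical estimates for the three responses named in the example.
options = [
    Mitigation("add_capacity", minutes_to_effect=10, success_likelihood=0.6, risk_of_worsening=0.05),
    Mitigation("throttle_requests", minutes_to_effect=2, success_likelihood=0.8, risk_of_worsening=0.2),
    Mitigation("roll_back", minutes_to_effect=15, success_likelihood=0.9, risk_of_worsening=0.1),
]

ranked_names = [option.name for option in rank(options)]
```

With these made-up numbers, throttling narrowly edges out rolling back, and adding capacity comes last. Change one estimate, for example raising the rollback's likelihood of success because a deploy went out minutes before the alert, and the order shifts; that is the role historical knowledge plays in the loop.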

The most severe incidents have stakes that can dramatically impact a company, and take SREs into the danger zone. Loss of data, extended customer downtime, or data exposure can be major setbacks to a company’s future; during an incident, these risks draw the attention of many individuals who are interested in the business impact as well as the technical steps to mitigation. In a complex incident where the stakes are high, I recommend clearly identifying the following roles to focus on both mitigation and communication in parallel:

The Subject Matter Expert(s): Focused on mitigation, Subject Matter Experts (SMEs) have the job of getting the incident to mitigation as swiftly as possible. They explain what they are doing to the Incident Captain so that if additional or different SMEs are needed, they can be brought in.
The Incident Captain: The Incident Captain owns the plan for mitigation, summarizing communications on the progress towards completion of the plan, and getting the right SMEs engaged. When a decision involving risk or tradeoffs needs to be made, the Incident Captain has the responsibility for making the necessary calls, always with the goal of getting to mitigation as swiftly as possible.
The Communications Coordinator: The Communications Coordinator handles outward-facing communications about the incident, including to customers and executive stakeholders. The Communications Coordinator pulls information from the Incident Captain, and ensures it is distributed to the people who need to know.

When using this process, I recommend making the following investments in skills and education to be successful:

As an SRE and/or SME: Cultivate your knowledge of the architecture, and particularly its most fragile points. Develop alerting that not only detects symptoms but also points at causes. Own your component and plan for failure - make it resilient from the start!

As an Incident Captain: Follow the SRE/SME advice, but also understand the business impact associated with failure of the product. As an example, you need to be able to make an informed decision when presented with a choice between taking downtime or leaving an impaired service in place.

As a Manager: Remember that your job during an incident is to accelerate the existing OODA loop, not create your own. If there’s no Incident Captain, take the role. If there is an Incident Captain, support them as they manage the incident.
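
The SRE/SME advice above about alerting that points at causes as well as symptoms could look something like the following sketch. The metric names (p99_latency_ms, queue_depth, minutes_since_deploy, cpu_utilization) and all thresholds are hypothetical; a real service would choose its own signals.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    symptom: str
    likely_causes: List[str] = field(default_factory=list)


def evaluate(metrics: Dict[str, float]) -> List[Alert]:
    """Fire on a user-facing symptom, then attach hints about likely causes."""
    alerts: List[Alert] = []
    # Symptom: users are seeing slow responses (threshold is an assumption).
    if metrics.get("p99_latency_ms", 0.0) > 500.0:
        alert = Alert(symptom="p99 latency above 500 ms")
        # Cause hints: secondary signals that narrow down where to look first.
        if metrics.get("queue_depth", 0.0) > 1000.0:
            alert.likely_causes.append("request queue backing up")
        if metrics.get("minutes_since_deploy", float("inf")) < 30.0:
            alert.likely_causes.append("recent deploy")
        if metrics.get("cpu_utilization", 0.0) > 0.9:
            alert.likely_causes.append("CPU saturation")
        alerts.append(alert)
    return alerts


sample = {
    "p99_latency_ms": 820.0,
    "queue_depth": 4200.0,
    "minutes_since_deploy": 12.0,
    "cpu_utilization": 0.55,
}
alerts = evaluate(sample)
```

For the sample metrics, the single alert carries two hints (a backed-up queue and a recent deploy), which shortens the orient step for whoever picks up the page.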

With these distinct roles identified in an incident, SMEs can focus on maintaining a fast OODA loop, supported by an Incident Captain who helps them get the information and resources they need to be successful, and a Communications Coordinator who understands how to manage a customer and/or executive audience interested in the event. Combined with clear roles and processes, a quick OODA loop lets a team of SREs shack* incidents like combat aces!

*: My USAF-veteran peer reviewer assures me “shack” is an Air Force term for a bulls-eye!

Dan Rogers

chief engineer, Brown Dog Fishing

6 years ago

Totally agree Dave - Telemetry to support proactive decision making is critical to making timely calls. I love to put measurements and monitoring in place on cloud services so that site reliability conditions can be measured with precision.


David Beers

30y coaching and creating fun, human, high-trust, inclusive, high-speed environments where diverse teams of Architecture, Product, Engineering, Security and Operations folks can thrive building products customers love.

6 years ago

Good stuff, Neil, and I wholeheartedly agree! My best and most instinctive TDO/Incident Commander colleagues have been ex-military, likely due to some of the training you highlight... I think your thoughts apply equally well - if not more importantly - to the proactive data collection & processing that drive action-oriented decision making as a *preventative* urgency that the best SREs evidence in true custodianship of their service. (I don’t have the telemetry I need - and the freedom to execute decisions quickly, without a bunch of bureaucracy - to prevent things from going wrong, damn it!)

