Accelerating Decision-Making within Site Reliability Engineering
In my last blog article, I shared my thoughts on books and movies that convey parts of the experience of being a Site Reliability Engineer (SRE). A couple of weeks ago, I had the opportunity to watch “Top Gun,” one of the most memorable ’80s movies, with my children. Viewing the movie as an adult, I was struck by how much of its tension is rooted in hesitance to act. The movie repeatedly puts its fighter pilots in situations where the right action is unclear and they need to make decisions under pressure. It reminded me of a critical concept in the key SRE skill of incident management: decisiveness.
Incident management is a common SRE activity wherein an issue with delivering service has been detected and needs to be dealt with. It’s an SRE’s job to stop the impact to customers, or mitigate the incident, as quickly as possible. In managing an incident, an SRE must be decisive: understand what is failing, compose a plan to mitigate the impact, and then execute the plan successfully. Doing so requires an SRE to make decisions quickly and with limited information, and often with the risk that an error along the way can make a bad situation worse.
Decisiveness, and decision-making in general, are well-studied. One of the models that has most influenced me came out of the same source material that created “Top Gun”: the OODA loop. Developed by US Air Force Colonel John Boyd, a combat pilot and military strategist, the OODA loop breaks decision-making down into four steps in a continuous cycle: observe, orient, decide, act.
Using the OODA loop effectively, an SRE will:
- Observe accessible information - what’s happening?
- Orient on an understanding of the situation - what does it mean?
- Decide on a course of action - what should I do about it?
- Act to make the plan reality - now I will do something!
In the original context of a dogfight between military aircraft, a pilot who can get “inside” an opponent’s OODA loop, by reading the situation faster and acting more effectively, has the advantage. In an SRE’s work during an incident, the opponent is typically not a human but an active, dynamic service created by humans - an environment that can be quite hostile to those working in it!
Take this example: a service is queuing up requests unexpectedly under load, degrading performance for users. Alerts typically detect impact, but only mature alerts can be counted on to tell an SRE the specific cause of that impact. One possible response to queuing requests is to add capacity; another is to drop or throttle user requests; a third is to roll back code or other changes that may have triggered the problem. Each option takes time, and each carries the risk that comes with making further changes; a skilled SRE will use observations and historical knowledge to find the most likely path to success, and then put it into action. Within the US Air Force, it is the element of historical knowledge that is considered the most critical part of the OODA loop. USAF training room lore stresses the importance of feeding a new pilot’s OODA loop with the experience that comes from surviving the first combat mission, and how every subsequent mission accelerates the loop - a direct parallel to the value of experience in incident management!
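To make the queuing example concrete, here is a minimal sketch of one way the loop could look in Python. Everything in it is an assumption for illustration: the `Observation` fields, the thresholds, the hypothesis names, and the playbook mapping are hypothetical and do not come from any real monitoring system. In a real incident the "orient" step draws on experience that no lookup table captures.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Observation:
    """Hypothetical snapshot of service telemetry (field names are assumptions)."""
    queue_depth: int                       # requests currently waiting
    cpu_utilization: float                 # 0.0 to 1.0, averaged across instances
    minutes_since_deploy: Optional[float]  # None if nothing changed recently


def orient(obs: Observation) -> str:
    """Orient: turn raw observations into a working hypothesis.

    The thresholds below are illustrative assumptions, not recommendations.
    """
    if obs.queue_depth < 100:
        return "healthy"
    if obs.minutes_since_deploy is not None and obs.minutes_since_deploy < 30:
        return "recent_change"
    if obs.cpu_utilization > 0.85:
        return "capacity_exhausted"
    return "unknown_cause"


def decide(hypothesis: str) -> str:
    """Decide: map the hypothesis to one of the mitigations from the example."""
    playbook = {
        "healthy": "no_action",
        "recent_change": "roll_back",
        "capacity_exhausted": "add_capacity",
        "unknown_cause": "throttle_requests",
    }
    return playbook[hypothesis]


def ooda_loop(observations: Iterable[Observation],
              act: Callable[[str], None]) -> List[str]:
    """Run observe -> orient -> decide -> act until the service looks healthy."""
    decisions = []
    for obs in observations:               # Observe: take the next snapshot
        action = decide(orient(obs))       # Orient, then Decide
        decisions.append(action)
        if action == "no_action":
            break                          # mitigated; stop cycling
        act(action)                        # Act, then loop back to Observe
    return decisions


# Usage: a made-up incident timeline with three successive snapshots.
timeline = [
    Observation(queue_depth=5000, cpu_utilization=0.95, minutes_since_deploy=None),
    Observation(queue_depth=1200, cpu_utilization=0.60, minutes_since_deploy=None),
    Observation(queue_depth=20, cpu_utilization=0.40, minutes_since_deploy=None),
]
print(ooda_loop(timeline, act=lambda action: print("executing:", action)))
```

The point of the sketch is the shape rather than the rules: each pass through the loop re-observes after acting, so a first mitigation that only partly helps leads to a fresh decision instead of a stalled incident.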
The most severe incidents have stakes that can dramatically impact a company, and take SREs into the danger zone. Loss of data, extended customer downtime, or data exposure can be major setbacks to a company’s future; during an incident, these risks draw the attention of many individuals who are interested in the business impact as well as the technical steps to mitigation. In a complex incident where the stakes are high, I recommend clearly identifying the following roles to focus on both mitigation and communication in parallel:
- The Subject Matter Expert(s): Focused on mitigation, Subject Matter Experts (SMEs) have the job of getting the incident to mitigation as swiftly as possible. They explain what they are doing to the Incident Captain so that if additional or different SMEs are needed, they can be brought in.
- The Incident Captain: The Incident Captain owns the plan for mitigation, summarizes progress towards completion of the plan, and gets the right SMEs engaged. When a decision involving risk or tradeoffs needs to be made, the Incident Captain has the responsibility for making the necessary calls, always with the goal of getting to mitigation as swiftly as possible.
- The Communications Coordinator: The Communications Coordinator handles outward-facing communications about the incident, including to customers and executive stakeholders. The Communications Coordinator pulls information from the Incident Captain, and ensures it is distributed to the people who need to know.
When using this process, I recommend making the following investments in skills and education to be successful:
- As an SRE and/or SME: Cultivate your knowledge of the architecture, and particularly its most fragile points. Develop alerting that not only detects symptoms but also points at causes. Own your component and plan for failure - make it resilient from the start!
- As an Incident Captain: Follow the SRE/SME advice, but also understand the business impact associated with failure of the product. As an example, you need to be able to make an informed decision when presented with a choice between taking downtime or leaving an impaired service in place.
- As a Manager: Remember that your job during an incident is to accelerate the existing OODA loop, not create your own. If there’s no Incident Captain, take the role. If there is an Incident Captain, support them as they manage the incident.
With these distinct roles identified in an incident, SMEs can focus on maintaining a fast OODA loop, supported by an Incident Captain who helps them get the information and resources they need to be successful, and a Communications Coordinator who understands how to manage a customer and/or executive audience interested in the event. Combined with clear roles and processes, a quick OODA loop lets a team of SREs shack* incidents like combat aces!
*: My USAF-veteran peer reviewer assures me “shack” is an Air Force term for a bull’s-eye!
chief engineer, Brown Dog Fishing
6 years ago: Totally agree Dave - telemetry to support proactive decision-making is critical to making timely calls. I love to put measurements and monitoring in place on cloud services so that site reliability conditions can be measured with precision.
30y coaching and creating fun, human, high-trust, inclusive, high-speed environments where diverse teams of Architecture, Product, Engineering, Security and Operations folks can thrive building products customers love.
6 years ago: Good stuff, Neil and I wholeheartedly agree! My best and most instinctive TDO/Incident Commander colleagues have been ex-military, likely due to some of the training you highlight... I think your thoughts apply equally well - if not more importantly - to the proactive data collection & processing that drive action-oriented decision-making as a *preventative* urgency that the best SREs evidence in true custodianship of their service. (I don’t have the telemetry I need - and the freedom to execute decisions quickly, without a bunch of bureaucracy - to prevent things from going wrong, damn it!)