Don’t Panic: A Guide to Incident Management
Miles Goldstein
Don’t Panic: A Guide to Incident Management – D M Goldstein, Jan 2022
When everything goes down, your customers have five basic questions:
· What’s happening?
· How does it affect me?
· What are you doing about it?
· When will it be fixed?
· What are you doing to prevent this from happening again?
As a business you provide a product or service to your customers. An “Incident” is when something happens which makes that product or service unavailable, unusable, or dangerous – in short, “down”. In this article I will discuss the coordination and communication steps and roles necessary to keep your customers safe and informed during the Incident and reduce the likelihood of churn as a result. While this discussion is skewed toward enterprise software, many aspects are applicable to just about any product, from a child’s toy to a hand-held smart device, and from business software to automobiles.
Allow me to start with some definitions, so readers don’t get lost in jargon. These aren’t textbook definitions; they are a quick guide for the reader to understand my shorthand as I use them throughout the article.
· Product: This is any product or service. It could be a toy, a piece of home electronics, online business software, a health-care provider, or house-cleaning service. It’s whatever you are selling.
· SaaS software: This means “Software as a Service”; it is software that you connect to on-line (“in the cloud”). Nowadays a vast amount of business software is licensed in this manner, as are things like multi-player on-line games, email providers, search engines, and so on. For my purposes, this is any software that is “hosted” on a system over which the user has no ownership or control.
· On-prem software: This is software hosted “on premises”; the user typically owns and controls the systems where it resides. For my purposes I will extend this to include things like software which runs on your personal computer, smart phone, “digital assistants” like Alexa or Siri, and even the software that is programmed into your car.
· Physical device: This is the physical product, such as a child’s toy, or the device on which the software runs, such as the examples listed above.
· Upload: The process by which software is installed on a physical device; this can be “pushed”, with the vendor automatically pushing the software update onto your device, or “pulled”, where you explicitly request the software to be pulled onto your device. Many software vendors, such as those who provide operating systems (Microsoft, Apple) or browsers (Firefox, Chrome), let you choose whether updates install automatically (push) or wait until you ask (pull). If your automobile repair technician has to “flash” some component in your car, that is pushing a software update onto the component.
· Subscription / Subscriber: Whether it’s a license to use online software, a “pay wall” for an online news source, product warranty registration, or some sort of registered membership, these all constitute some form of subscription. The provider usually knows something about the subscriber, such as name, phone, and an email or physical address, and an account of some form - possibly a login ID and access to the vendor’s web site or a financial arrangement for billing. Recipients of a recurring service, such as medical care, professional services like a CPA, or home-cleaning services, also fall into this category.
· Registered owner: There are situations where you are known to own a product or device but are not required to subscribe to anything. This includes things like being the registered owner of a specific automobile, where you might never have explicitly given the manufacturer any personal information.
An Incident occurs. This could be an interruption in service (outage) for a SaaS application, a discovery of a faulty part on a car or other physical device, a discovery of some other dangerous condition such as possible contamination or exposure, or any other disruption that puts your users or product at risk. This is not “Billy can’t login” or “some cosmetic damage may occur”; this is usually an event which impacts a significant number of customers and could cause physical or financial harm. In this situation waking up the boss is far less risky than not waking them.
In the United States, the Department of Homeland Security (DHS) manages Incidents (emergency situations) through FEMA (the Federal Emergency Management Agency). They have designed an Incident Command System (ICS)*, which is a “standardized management tool for meeting the demands of small or large emergency or nonemergency situations.” This is used by most organizations under their control, such as local fire departments, and defines several key roles and functions which must be provided for any incident, including the one your company is facing. Among the key roles are the Incident Commander, the Public Information Officer, and a Communications Unit. Other roles include Operations, Planning/Resources, Logistics (which includes Communications), and Finance.
Internal verification or validation. This may happen with different timelines or urgency depending on the nature of the problem. In the case of SaaS software, it’s usually quite quick: either through monitoring or customer complaints you become aware that something is down or seriously malfunctioning. In other industries it may be much slower, such as an automobile manufacturer or public agency collecting statistics over time which eventually point to a pattern implicating a faulty component. Either way, the first natural reaction is to verify and validate the problem. Is there evidence to corroborate the Incident? Do you have any information as to the extent of the incident? How many customers/users/owners are at risk of being impacted? To what extent could they be impacted? Do they have an alternate means of using the product? What steps should customers be taking to protect themselves? What do you think is wrong and what are you going to do about it?
Identify your Incident Commander (IC). This is the first step after – or in parallel with – validation, and happens in various ways. In a SaaS company, this is often a rotating duty manager out of Support or Operations, or the product owner/lead. For time-sensitive incidents this person is usually alerted by an automated paging or messaging system, preferably one that escalates automatically if there is no response. They own the IC role unless and until they transfer responsibility to a new IC in line with company policy. Your Incident Commander is the conductor of this orchestra; there can be only one, and there must be one, though the assignment can change as the problem progresses. This person is responsible for ensuring the roles described next are quickly filled and execution begins.
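To make the idea of paging with automatic escalation concrete, here is a minimal Python sketch. The contact names, wait times, and the `contacts_to_page` helper are all hypothetical and are not taken from any particular paging product; your own policy would define the real chain.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    contact: str      # hypothetical on-call role or person
    wait: timedelta   # time allowed for an acknowledgment before escalating


# Hypothetical chain; real names and timings come from company policy.
CHAIN = [
    EscalationStep("duty-manager", timedelta(minutes=5)),
    EscalationStep("backup-duty-manager", timedelta(minutes=5)),
    EscalationStep("head-of-operations", timedelta(minutes=10)),
]


def contacts_to_page(elapsed, chain=CHAIN):
    """Return everyone who should have been paged after `elapsed`
    time has passed with no acknowledgment from anyone."""
    paged = []
    threshold = timedelta(0)
    for step in chain:
        if elapsed < threshold:
            break
        paged.append(step.contact)
        threshold += step.wait  # the next contact is paged once this wait expires
    return paged


if __name__ == "__main__":
    for minutes in (0, 5, 12):
        print(minutes, contacts_to_page(timedelta(minutes=minutes)))
```

Whoever acknowledges first becomes the Incident Commander; the chain only decides who gets alerted, not who owns the role afterward.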
Assign resources. For our purposes, “Operations” and “Planning” overlap. Resources must be identified and a “Strike Team” or “Task Force” assigned. This is the team of one or more people who will investigate and resolve the actual problem. Phrases like “War Room” are used to describe how this team swarms on the problem to brainstorm, trouble-shoot, test, and resolve. With something like SaaS software this happens immediately and often continues non-stop until a resolution is found. For a physical device, such as a car, this may take much longer, but should be met with urgency appropriate to the problem.
Internal communications. While the Strike Team is fixing the problem, it should not be interrupted by a constant barrage of “everybody else” asking for status. An internal Communications person or team must be identified quickly. They will listen in on the War Room and share status with your company-internal audience, along with any questions or requests the Strike Team has about the problem. They will also field questions from that audience and relay them to the Strike Team.
External communications. The worst thing you can do is to not communicate with your customers. The means vary, but the need persists. Your Public Information Officer could come from one of many departments: Customer Support, Product or Customer Marketing, Customer Success, or somewhere else. In the case of something like an automobile recall this could be the Legal department. This person or group is responsible for crafting the message, choosing the methods to be used, and providing timely updates. It is best to keep this role separate from the Communications person. Communication methods include:
· Company web site: many companies (especially SaaS ones) have a web page with a name like “trust.[company].com” or “status.[company].com” where they post real-time information about an Incident. In the absence of that, companies may post an article or headline about the Incident on their home page.
· In-app messaging: many SaaS companies post messages within the actual application to inform active users of a partial disruption or risk. In situations where the application URL is down, they may have configured a “Failover page” for customers to inform them of a problem, which is much more user-friendly than a “404” Page Not Found error.
· Mass Email: companies which have owner, subscriber, or user information often email users to inform them of a problem. Depending on how you send email, users may be able to opt out of receiving such messages; never assume that everybody got the message. For companies with a large customer base, this is typically done using an email marketing system capable of running large email campaigns. Similarly, some companies use SMS messaging or a messaging application to reach subscribers.
· Individualized outreach via phone call or Email: In addition to any broad outreach, it is advisable to personally contact your key customers with an individualized message. This is often done via an “Executive Sponsor” for an account, or through any of: Account Executive, Account Manager, Customer Success Manager, Technical Account Manager, or anybody else deemed to be the “owner” or primary contact for a customer or account. This is typically done for your highest-value or most strategic customers; it does not scale with large customer bases.
· Physical mail: companies which have customers’ physical mail addresses will send a notification if time permits. This is way too slow for something like a SaaS application outage, but very common for things like an automobile recall notice. In these situations, it is typically “registered owners” receiving the notice, as with a recall notice for a physical device.
Clarity and detail matter. Your customer messaging must have the right level of detail and clarity. Saying, “The system is down, we’re working on it,” is content-free and will cause unnecessary problems. Similarly, saying, “There is a problem with the flux capacitor not sending the right frequencies to the beebleblaster,” is perhaps a bit too technical. You want to clearly answer the questions from the top of this article in user-friendly language:
· What’s happening?
· How does it affect me?
· What are you doing about it?
· When will it be fixed?
· What are you doing to prevent this from happening again?
You will also want to set expectations around when the next action or communication is going to take place. “We expect resolution in two hours and will post updates every 30 minutes until then.” These answers and timelines are essential in any of the above communication types. I have provided an email template below for a SaaS software Incident; as discussed above, other products may have different means, such as a recall notice. If emailing something like this template, it is important to post regular updates based on the commitment you make.
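As a rough illustration of turning a commitment like “updates every 30 minutes until resolution” into concrete times, the Python sketch below lists when each update is due. The function name `update_schedule` and the sample times are assumptions made for this example only.

```python
from datetime import datetime, timedelta, timezone


def update_schedule(start, eta, interval):
    """List the times at which updates are due: one per interval after
    `start`, plus a final update at the expected resolution time."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    times = []
    due = start + interval
    while due < eta:
        times.append(due)
        due += interval
    times.append(eta)  # always commit to an update at the ETA itself
    return times


if __name__ == "__main__":
    # Hypothetical incident: started 14:00 UTC, resolution expected in two hours.
    start = datetime(2022, 1, 15, 14, 0, tzinfo=timezone.utc)
    for due in update_schedule(start, start + timedelta(hours=2), timedelta(minutes=30)):
        print(due.strftime("%H:%M %Z"))
```

If the ETA slips, rerun the schedule from the current time with the new ETA, so the published commitment always matches what you intend to deliver.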
+++++++++++++++ EMAIL TEMPLATE for SaaS issue ++++++++++++++++
Dear Customer,

This is to advise you of an incident with your [product] service.

    Issue: [brief problem description, such as “Login service unavailable”]
    Impact: [additional details about what is not working and who is affected, such as “East Coast customers may receive errors logging in. All other customers and services are operational.”]
    Start Date & Time: [be sure to include your reference time-zone and include GMT equivalents for clarity]
    End Date & Time: [or Duration or ETA]
    Status: [Provide a meaningful status of the problem, such as “Engineering is working with our service provider to reroute traffic through the West Coast”]

[Detailed problem description if not covered under Status, and notes on progress since any previous notification. If problem has been resolved, a brief high-level discussion of the fix or root cause analysis may be provided]

[Resolution ETA or plan or workaround if not covered under Status, or preventive measures for next time. For an ongoing problem, include a commitment for your next update such as, “We will provide an update when the current step completes or in 30 minutes, whichever is sooner.”]

We apologize for any inconvenience this situation has caused you. Please reach out to your normal channels if you have any concerns about this incident.

Sincerely,

[company] Team
+++++++++++++++ End SaaS TEMPLATE ++++++++++++++++
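If you send this notice from a script, the header fields of the template above might be filled in as in the following Python sketch. It uses only the standard library; the names `NOTICE`, `format_with_gmt`, and `render_notice`, and the sample values, are hypothetical. It also shows one way to print the start time in a reference time zone together with its GMT equivalent.

```python
from datetime import datetime, timedelta, timezone
from string import Template

# Header portion of the notice; field names are assumptions for this sketch.
NOTICE = Template(
    "Issue: $issue\n"
    "Impact: $impact\n"
    "Start Date & Time: $start\n"
    "Status: $status\n"
)


def format_with_gmt(moment):
    """Show a time in its own zone followed by the GMT equivalent
    (UTC is treated as GMT for this purpose)."""
    if moment.tzinfo is None:
        raise ValueError("moment must carry a time zone")
    local = moment.strftime("%Y-%m-%d %H:%M %Z")
    gmt = moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")
    return f"{local} ({gmt} GMT)"


def render_notice(issue, impact, start, status):
    # substitute() raises KeyError if a field is missing, so an
    # incomplete notice cannot be sent by accident.
    return NOTICE.substitute(
        issue=issue, impact=impact, start=format_with_gmt(start), status=status
    )


if __name__ == "__main__":
    eastern = timezone(timedelta(hours=-5), "EST")  # fixed offset, for illustration
    print(render_notice(
        issue="Login service unavailable",
        impact="East Coast customers may receive errors logging in.",
        start=datetime(2022, 1, 15, 9, 30, tzinfo=eastern),
        status="Engineering is rerouting traffic through the West Coast.",
    ))
```

Rejecting times without a zone is a deliberate choice here: an ambiguous timestamp in a customer notice tends to generate more questions than it answers.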
After the Incident. It is important to let your customers know that the incident has been resolved, including any details as to what customers should look out for. An internal meeting should be held (often called a post-mortem or retrospective) to determine the root cause of the Incident. This meeting should identify any preventive actions which should be taken to avoid this problem in the future or should have been taken to prevent the Incident, as well as any lessons learned from the Incident. This is called a “Root Cause Analysis” (RCA) and should be documented and available for reference. Any action items should be called out, and progress against them should be tracked and updated in the document. Any information on the RCA should be tuned for external use and shared with appropriate customers.
Summing it up. Regardless of the type of product or industry, the best thing you can do for your customers is to provide timely and useful information. While customers will remember the Incident, what they remember more is how you handled it. Informing them as soon as possible and throughout the Incident is far more likely to retain their loyalty, while delays, denial, or insufficient information will cost you in the long run.