Five Timestamps; Four Metrics
David Owczarek
Senior leader, writer, and speaker focused on SRE/DevOps and operating compliant, secure services at scale in the cloud
Introduction
There are five timeline events that are so critical you should record them for every outage. This isn’t always easy because some of these are difficult to identify or problematic to automate.
As always, the reality of incidents can be messy and there can be some nuance to these terms. Here’s a quick rundown on each of them.
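Before going through them, here is a minimal sketch of what a record of these five events might look like. This is illustrative only - the field names are my own, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class IncidentTimestamps:
    """The five timeline events worth recording for every outage.

    Field names are illustrative, not a standard schema.
    """
    initial_fault: Optional[datetime] = None       # when something first went wrong
    first_alert: Optional[datetime] = None         # when monitoring (or a human) flagged it
    impact_start: Optional[datetime] = None        # when customers first felt it
    first_response: Optional[datetime] = None      # when someone (or something) acknowledged it
    initial_resolution: Optional[datetime] = None  # when customer impact ended
```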
The Five Most Important Timestamps
The Initial Fault
Somewhere in your tech stack, something goes wrong. It is quite possible that whatever this fault is, it will not manifest in a way that causes problems until later. Or perhaps it requires an interaction with another component. But there is usually an originating event that can be identified as the initial point of the sequence that will lead to the outage. Once this point happens, there will be an outage; it's just a matter of time. If you are lucky, the initial fault has a clear signature in logs or metrics somewhere.
In many cases, the initial fault is the act of promoting code some time in the past. That can be enough to create sufficient conditions for an outage. The code then lies effectively dormant until it combines with something else, like a sudden increase in an unusual and complex usage pattern, or a feature flag being enabled. Then, poof! All hell breaks loose. Or perhaps there is an impact, but it is extremely light - one use case for a handful of customers - and you find out about it a week later. There are going to be some difficult judgment calls in there. Do you mark the initial fault as the time of the code release, or the time that the feature flag is enabled? Those will have to be decided by the engineering teams closest to the issues, often as a result of discussions that take place in an incident review or in the preparation of the root cause document for the outage.
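When the fault does leave a clear signature in the logs, the initial-fault timestamp can sometimes be reconstructed after the fact by scanning for the first occurrence of that signature. Here is a rough sketch of that idea; it assumes hypothetical log lines that start with an ISO-8601 timestamp and a known error string, both of which will vary in practice.

```python
from datetime import datetime
from typing import Iterable, Optional

def find_initial_fault(log_lines: Iterable[str], signature: str) -> Optional[datetime]:
    """Return the timestamp of the first log line containing the fault signature.

    Assumes each line starts with an ISO-8601 timestamp, e.g.
    '2024-03-01T12:34:56Z service=checkout error=ConnectionReset ...'
    """
    for line in log_lines:
        if signature in line:
            stamp = line.split()[0].replace("Z", "+00:00")
            return datetime.fromisoformat(stamp)
    return None
```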
The First Alert
In many cases, it is a monitoring system that detects some aspect of a fault and escalates appropriately. That makes identifying the time of first alert simple. However, one cannot assume how much detection will occur for any particular outage. As careful as we are, there are invariably outages that monitoring does not detect. In these cases, the “alert” is a human being. It might be an employee who experiences or hears about the issue, or a customer who calls or logs a ticket. That person usually launches the incident response process, and this effectively becomes the time of the first alert.
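Put another way, the first alert is simply the earliest detection signal you have, whether it came from the monitoring system or from a person. A small sketch under that assumption, with hypothetical inputs:

```python
from datetime import datetime
from typing import Optional

def first_alert_time(monitoring_alert: Optional[datetime],
                     human_report: Optional[datetime]) -> Optional[datetime]:
    """Pick the earliest available detection signal as the time of first alert."""
    candidates = [t for t in (monitoring_alert, human_report) if t is not None]
    return min(candidates) if candidates else None
```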
The Start of Customer Impact
The fact that a fault occurred does not mean that customers are impacted, but many problems evolve and grow over time, eventually affecting the customer experience. This can often be determined by examining request latencies, error rates, or some other telemetry stream. After the outage is over, you will have to do some forensic investigation to identify the first point at which customers were impacted. This timestamp is often used to calculate a customer-facing SLA.
Also, bear in mind that the customer impact could happen before the first alert is received. That would be the case if the first awareness of the issue was a customer call.
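One way to do that forensic investigation is to walk the relevant telemetry stream and take the first sample that crossed whatever threshold the team treats as customer-impacting. A sketch of that approach, assuming a hypothetical ordered series of (timestamp, error_rate) samples and a team-chosen threshold:

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

def impact_start(samples: Sequence[Tuple[datetime, float]],
                 threshold: float) -> Optional[datetime]:
    """Return the time of the first sample at or above the impact threshold.

    `samples` is assumed to be ordered oldest to newest.
    """
    for ts, error_rate in samples:
        if error_rate >= threshold:
            return ts
    return None
```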
Something Responds (time of first response)
Alerts should be handled in some way. There can be automatic interventions tied to alerts that attempt to remediate the issue. In that case, the automation will have a timestamp that can be used directly. Many alerts page an on-call engineer. In those cases, the engineer is typically required to “ACK” the alert - acknowledge it - and that becomes the timestamp.
If there isn’t an alert and the incident response process is launched manually, that becomes the time of first response.
Initial Resolution / End of Customer Impact
The initial resolution represents the time when the ongoing customer impact has been resolved and the system is behaving as normal again. There may still be customers who have lasting impact, and there may still be cleanup work to do, but the system itself is back in a nominal state. It's also possible that these are two different timestamps, but let's assume they are one for now.
Note that outages can be crazy. The impact can vary for no discernible reason. You may go back and forth between stable and unstable states for hours. All of those state changes are fair game to be documented in the timeline, and the start and stop times for that series of events are important timestamps for the incident analysis. That makes the calculation of metrics more difficult, though, so we'll keep it simple for this example.
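For what it's worth, if you did want to account for that back-and-forth, total downtime becomes the sum of the individual impact windows rather than a single end-minus-start calculation. A hypothetical sketch, which the rest of this article deliberately avoids by assuming a single window:

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

def total_downtime(impact_windows: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Sum a series of (impact_start, impact_end) windows into total downtime.

    Assumes the windows do not overlap; a flapping incident may need
    its windows merged first.
    """
    return sum((end - start for start, end in impact_windows), timedelta())
```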
Why These Five?
These five timestamps are important because you use them to calculate the most critical metrics for any software as a service (SaaS) shop: time to detect (TTD), time to acknowledge (TTA), downtime, and time to resolve (TTR).
Here they are all together, expressed visually with a reference set of timeline points. Note that I'm using downtime as a proxy for availability, and have labeled it as such. And again, not all outages are the same, and some things could happen in a different sequence than what is represented below.
Once you have those, you can further calculate time-between-failure (TBF), uptime, and availability. And then, of course, at the end of every reporting period, especially monthly and quarterly, you'd gather them up and create averages, so you'd get mean-time-to-detect (MTTD), mean-time-to-acknowledge (MTTA) and so forth. That’s a lot of metrics from five timestamps!
TTD = t2 - t1
TTA = t4 - t2
DOWNTIME = t5 - t3
TTR = t5 - t1
UPTIME = t3 - t0
AVAILABILITY = UPTIME / (UPTIME + DOWNTIME)
TBF = t1 - t0
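Here is how those formulas might translate into code. The t0 through t5 labels follow the reference timeline above; my reading of them, based on the formulas, is that t1 is the initial fault, t2 the first alert, t3 the start of customer impact, t4 the first response, t5 the initial resolution, and t0 the end of the previous outage. The function names are my own.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, Sequence

def incident_metrics(t0: datetime, t1: datetime, t2: datetime,
                     t3: datetime, t4: datetime, t5: datetime) -> Dict[str, float]:
    """Compute per-incident metrics (in seconds) from the reference timestamps.

    t0: end of previous outage        t3: start of customer impact
    t1: initial fault                 t4: first response
    t2: first alert                   t5: initial resolution / end of impact
    """
    ttd = (t2 - t1).total_seconds()        # time to detect
    tta = (t4 - t2).total_seconds()        # time to acknowledge
    downtime = (t5 - t3).total_seconds()   # customer-facing downtime
    ttr = (t5 - t1).total_seconds()        # time to resolve
    uptime = (t3 - t0).total_seconds()     # time fully up since the last outage
    tbf = (t1 - t0).total_seconds()        # time between failures
    return {
        "TTD": ttd, "TTA": tta, "DOWNTIME": downtime, "TTR": ttr,
        "UPTIME": uptime, "TBF": tbf,
        "AVAILABILITY": uptime / (uptime + downtime),
    }

def mean_metrics(incidents: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average per-incident metrics over a reporting period (MTTD, MTTA, MTTR, MTBF)."""
    return {f"M{key}": mean(i[key] for i in incidents)
            for key in ("TTD", "TTA", "TTR", "TBF")}
```

Feed every incident in a month or quarter through incident_metrics, pass the results to mean_metrics, and you have the period's MTTD, MTTA, MTTR, and MTBF.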
Other Important Considerations
There are other timings that help characterize the performance of the organization. Many of these focus on activities after the outage - the time to close critical action items, or to complete the root cause investigation, for example. It would also be lovely to know when, in each outage, the team determined the problem and then pivoted to working on a solution. However, this is not consistently feasible in the real world. Some outages are so confusing that all you can do is try well-behaved corrective actions like restarts, and sometimes you never find out whether they made a difference. Sometimes the behavior spontaneously corrects for no discernible reason. In these situations, there is no inflection point to record as a timestamp. In really confounding outages, the root cause is never conclusively determined.
And finally, organizations define metrics differently, so what I describe as availability or MTTR here may not match how those terms are used in your organization.
The power of those four metrics is considerable. Detection time measures a form of monitoring effectiveness. Acknowledgement time measures on-call readiness. Downtime is a direct measure of the customer experience. And MTTR is a measure of the organization's resiliency and performance against failures. So while there are other things you will want to measure, you can get a solid set of incident metrics using these tactics.
Five timestamps. Four metrics.