Why so mean about MTTR?

Since long before the Accelerate book enshrined it in the gospel according to DORA, MTTR (mean time to restore/recover) has been the de-facto metric for incident response. And what’s not to love? After all, given the choice between two incidents, identical in every way other than their duration, we’d all pick the shorter one. And it naturally follows that measuring time to resolution across multiple incidents seems a sensible idea. While we’re at it, let’s calculate the average, and relax in the comfort of having captured our most important resilience statistic in a single number. We could even plot it on a time series and high-five when the line heads south, and perhaps engage in a bit of root cause analysis (more on this in another post) when it tips north. The board will love it; they may even demand it.

But hold up, this MTTR thing…what job are we hiring it to do for us? Perhaps:

  • To tell us whether our efforts to improve resilience are effective or not?
  • To tell us how incident impact is changing over time?
  • To help inform key decisions about our approach to resilience?
  • To tell us how good our incident response team is?
  • To keep the boss happy?

Unfortunately, MTTR fails at most of these jobs, even if the boss is happy.

Courtney Nash at The VOID, and Google engineer Stepan Davidovic in his report ‘Incident Metrics in SRE’, have done some amazing work to show that while time to resolution may be important, expressing it as a mean is unhelpful and potentially misleading. Here are some of the (perhaps counterintuitive) conclusions:

When TTR changes, it’s unlikely to be visible in the mean

The mean is employed to represent a typical value, or yardstick for a distribution of measures.

For example, one might use the mean to gain an understanding of the typical physical height of a population. The mean is most informative when applied to normal distributions, and the distribution of physical height is such an example: few people are extremely short, few are extremely tall, and most stand somewhere in the middle. Incident durations within organisations, however, tend not to be normally distributed. They tend instead to be heavily skewed to the right, with many short incidents and fewer long ones.


A ‘normal’ incident duration distribution (most don’t look anything like this):

A skewed incident duration distribution (yours probably looks more like this):
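
To see the gap for yourself, here’s a minimal sketch (Python, with made-up lognormal parameters rather than anyone’s real incident data) that simulates a right-skewed set of durations and compares the mean with the median:

    # A minimal sketch with hypothetical parameters: simulate right-skewed
    # incident durations and compare the mean with the median.
    import numpy as np

    rng = np.random.default_rng(seed=1)

    # ~200 incidents, mostly short, with a long right tail (durations in minutes)
    durations = rng.lognormal(mean=np.log(45), sigma=1.0, size=200)

    print(f"mean TTR:   {durations.mean():6.1f} min")
    print(f"median TTR: {np.median(durations):6.1f} min")
    print(f"longest:    {durations.max():6.1f} min")
    # The mean typically lands well above the median, dragged up by the
    # handful of long incidents in the tail.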

This skewed distribution of incident durations results in a fuzzy mean that is less helpful than you might hope, especially if you’re comparing means to tell a story about your incident response capability over time. The VOID Report 2022 describes a Monte Carlo simulation approach that demonstrates this with real-world incident duration data.
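
The sketch below is a rough approximation of that idea rather than the VOID methodology itself: it resamples a synthetic ‘year’ of incidents to show how much a quarterly MTTR can swing even when nothing about your response has changed. All numbers are invented for illustration.

    # Rough Monte Carlo sketch (synthetic data): how much does a quarterly
    # MTTR move around when the underlying process never changes?
    import numpy as np

    rng = np.random.default_rng(seed=2)
    year_of_incidents = rng.lognormal(mean=np.log(45), sigma=1.0, size=80)

    quarterly_mttrs = []
    for _ in range(10_000):
        # Pretend each quarter sees ~20 incidents drawn from the same process
        quarter = rng.choice(year_of_incidents, size=20, replace=True)
        quarterly_mttrs.append(quarter.mean())

    low, high = np.percentile(quarterly_mttrs, [5, 95])
    print(f"MTTR you might plausibly report in a quarter: {low:.0f}-{high:.0f} min")
    # The range is wide enough that a quarter-on-quarter rise or fall in MTTR
    # says very little about whether response capability has actually changed.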

Can I just use the Median instead?

Not really. The median is less influenced by outliers than the mean, so it may give you a more representative picture of a typical incident duration. However, when comparing TTR changes over time, it’s still unlikely to tell a useful story. In incident response, it’s the outliers that we really care about. Do we really want a stat that renders that 2-day monster outage invisible?
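
A toy calculation makes the point (all numbers invented):

    # A month of mostly short incidents plus one 2-day outage (2880 minutes)
    import statistics

    durations_min = [20, 25, 30, 35, 40, 45, 55, 2880]

    print(f"mean:   {statistics.mean(durations_min):.0f} min")    # dominated by the outage
    print(f"median: {statistics.median(durations_min):.0f} min")  # the outage is all but invisible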

Sample sizes tend to be low

Statistically, the distribution of a sample mean tends to become ‘more normal’ as the sample size grows (the central limit theorem at work). However, most organisations thankfully suffer too few incidents for this effect to rescue their MTTR. Given a choice between an informative MTTR and fewer incidents, which would you choose?
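
If you’re wondering how many incidents it would take for the mean to settle down, here’s a small simulation, again on synthetic lognormal durations rather than real data:

    # How wide is the plausible range of MTTR values for a given incident count?
    import numpy as np

    rng = np.random.default_rng(seed=3)

    def mttr_range(n_incidents: int, trials: int = 5_000) -> tuple[float, float]:
        """5th and 95th percentile of the MTTR computed from n_incidents draws."""
        means = rng.lognormal(np.log(45), 1.0, size=(trials, n_incidents)).mean(axis=1)
        return tuple(np.percentile(means, [5, 95]))

    for n in (15, 150, 1500):
        low, high = mttr_range(n)
        print(f"{n:5d} incidents: MTTR likely between {low:.0f} and {high:.0f} min")
    # Most organisations live on the first line, where the mean is noisy;
    # a stable MTTR would require far more incidents than anyone wants.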

Duration != Severity

Even if MTTR did paint an accurate picture of how incident durations were changing over time, it wouldn’t necessarily tell an accurate story about incident impact. Again, the VOID Report 2022 demonstrates a poor correlation between duration and severity, so it’s possible for your MTTR to be decreasing while impact is increasing.
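
If you want to check this against your own incidents, a rank correlation between duration and an impact measure is a reasonable sanity check. The sketch below assumes a hypothetical pandas DataFrame with duration_min and users_affected columns; your schema and impact measure will differ.

    # Illustrative only: how well does duration track impact in your data?
    import pandas as pd

    incidents = pd.DataFrame({
        "duration_min":   [30, 45, 20, 600, 90, 15, 240, 35],
        "users_affected": [500, 40, 9000, 120, 60, 7000, 30, 800],
    })

    # Spearman (rank) correlation copes better with skewed distributions than Pearson
    corr = incidents["duration_min"].corr(incidents["users_affected"], method="spearman")
    print(f"duration vs impact rank correlation: {corr:.2f}")
    # A weak or negative value means a falling MTTR tells you little about
    # whether user impact is actually improving.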

So what?

A useful measure reduces uncertainty in decision making. What might you do if your MTTR increased? How about if it decreased? Unfortunately, MTTR is unlikely to help inform your decision making. Improvements to your incident response are unlikely to manifest in the mean, so you could be doing a great job or a poor job, and your MTTR won’t tell you which. Or worse, it’ll tell you something, you’ll go looking for the cause, and you won’t find it.

So what might you use instead of, or in addition to, MTTR? That’ll be the subject of the next post.


References

The VOID Report 2022

Incident Metrics in SRE, Stepan Davidovic

How to Measure Anything: Finding the Value of Intangibles in Business, Doug Hubbard

