Looking beyond MTTR
In last week's post, we explored how the ubiquitous MTTR (Mean Time To Restore/Recover) metric may not be as useful as you’d hope – if your aim is to measure IT incident response effectiveness.
We described how Courtney Nash, Stepan Davidovic and others have demonstrated that the typically skewed statistical distribution of organisational incident resolution times renders the mean (the 'M' in MTTR) unhelpful, and even misleading, as a signal of change in performance over time.
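To see why, here's a minimal sketch in Python. It uses simulated log-normal data (a common shape for duration measurements) rather than real incident records, so the numbers are illustrative only, but the effect is the point: a single extreme incident drags the mean far more than the median.

```python
# Minimal sketch: why the mean misleads on skewed duration data.
# The resolution times are simulated (log-normal), not real incident data.
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulate 100 incident resolution times (in minutes), heavily right-skewed.
resolution_times = rng.lognormal(mean=3.5, sigma=1.0, size=100)

print(f"mean (MTTR): {resolution_times.mean():.0f} min")
print(f"median:      {np.median(resolution_times):.0f} min")

# One extreme outage (a week-long incident) shifts the mean dramatically,
# while the median barely moves.
with_outlier = np.append(resolution_times, 10_000)
print(f"mean with one outlier:   {with_outlier.mean():.0f} min")
print(f"median with one outlier: {np.median(with_outlier):.0f} min")
```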
So if MTTR isn’t useful, what else can we measure to demonstrate the effectiveness, or otherwise, of our efforts to improve resilience?
First, the bad news: distilling a complex socio-technical phenomenon such as incident response into a single number is likely to be an unfulfilling exercise. The sheer variety of factors influencing IT incidents tends to drown the signal of incident response performance in the noise of variation outside our control.
If a single metric such as MTTR is too blunt an instrument, perhaps a collection of measures will be more effective?
There are plenty of MTT'X' measures that we could deploy in aggregate to provide a more nuanced picture of incident response. Examples include:
- Mean Time To Detect (MTTD) – how long a problem exists before anyone notices it
- Mean Time To Acknowledge (MTTA) – how long before a responder picks it up
- Mean Time To Restore/Recover (MTTR) – how long before service is recovered
- Mean Time Between Failures (MTBF) – how long the system runs between incidents
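To make the family concrete, here's a hedged sketch of how such measures might be computed from incident timestamps. The record structure and field names (started_at, detected_at, and so on) are illustrative assumptions, not taken from any particular incident management tool.

```python
# Hypothetical sketch: computing MTT'X' measures from incident timestamps.
# Field names are invented for illustration.
from datetime import datetime
from statistics import mean, median

incidents = [
    {"started_at": datetime(2024, 5, 1, 9, 0),
     "detected_at": datetime(2024, 5, 1, 9, 12),
     "acknowledged_at": datetime(2024, 5, 1, 9, 20),
     "resolved_at": datetime(2024, 5, 1, 11, 5)},
    {"started_at": datetime(2024, 5, 3, 14, 0),
     "detected_at": datetime(2024, 5, 3, 14, 3),
     "acknowledged_at": datetime(2024, 5, 3, 14, 10),
     "resolved_at": datetime(2024, 5, 3, 14, 45)},
]

def minutes(a, b):
    return (b - a).total_seconds() / 60

ttd = [minutes(i["started_at"], i["detected_at"]) for i in incidents]
tta = [minutes(i["detected_at"], i["acknowledged_at"]) for i in incidents]
ttr = [minutes(i["started_at"], i["resolved_at"]) for i in incidents]

# Reporting the median alongside the mean guards against skew.
print(f"MTTD: mean {mean(ttd):.0f} / median {median(ttd):.0f} min")
print(f"MTTA: mean {mean(tta):.0f} / median {median(tta):.0f} min")
print(f"MTTR: mean {mean(ttr):.0f} / median {median(ttr):.0f} min")
```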
This InfoQ article does a good job of dissecting such metrics. However, it's unclear without further analysis whether they are any less susceptible to the statistical variance that undermines MTTR. What is clear is that they are all lagging indicators, and that they require large sample sizes before their means stabilise. The last thing we want is to need a large number of incidents just to make our measures statistically meaningful.
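To illustrate the sample-size problem, here's a small sketch (again using simulated log-normal durations, as an assumption rather than real data) that bootstraps a 95% confidence interval around the mean resolution time. With only a handful of incidents, the interval is too wide to distinguish signal from noise.

```python
# Sketch: bootstrap confidence intervals around MTTR narrow only slowly
# as the incident count grows. Data is simulated, not real.
import numpy as np

rng = np.random.default_rng(seed=7)

def mttr_ci(sample, n_boot=5_000):
    """95% bootstrap confidence interval for the mean of `sample`."""
    boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
                  for _ in range(n_boot)]
    return np.percentile(boot_means, [2.5, 97.5])

for n in (10, 50, 500):
    sample = rng.lognormal(mean=3.5, sigma=1.0, size=n)
    low, high = mttr_ci(sample)
    print(f"n={n:>3}: MTTR 95% CI = [{low:.0f}, {high:.0f}] min")
```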
Incident response lends itself more naturally to qualitative analysis. That answer is unlikely to satisfy those who crave the deterministic comfort of a single number, or a graph that trends in a positive direction, but that doesn't make it any less true. Savvy organisations already do a lot of qualitative analysis during post-incident reviews (PIRs), where responders and stakeholders gather to share their experience of an incident, and what they learned, from many different perspectives. Such reviews can also generate quantitative data that illustrates a team's efforts to learn and improve following incidents.
For example:
- the proportion of incidents that are followed by a review
- the number and diversity of participants attending each review
- the number of follow-up actions raised, and how many are actually completed
- the elapsed time between an incident and its review being held and shared
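As an illustration, here's a hypothetical sketch of the kind of quantitative data a PIR programme can yield; the schema and figures are invented for the example, not drawn from any real organisation.

```python
# Hypothetical sketch: tracking learning signals from post-incident reviews.
# All field names and numbers are invented for illustration.
reviews = [
    {"incident": "INC-101", "attendees": 9, "actions_raised": 6, "actions_closed": 5},
    {"incident": "INC-107", "attendees": 4, "actions_raised": 3, "actions_closed": 1},
]

raised = sum(r["actions_raised"] for r in reviews)
closed = sum(r["actions_closed"] for r in reviews)

# The completion rate of follow-up actions is one signal of whether the
# organisation actually acts on what it learns in reviews.
print(f"PIRs held: {len(reviews)}")
print(f"follow-up actions closed: {closed}/{raised} ({closed / raised:.0%})")
```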
Qualitative analysis also allows you to dive into subtler aspects of behaviour during incident response that can make the difference between a slow, inflexible response and a collaborative, agile one. Dr Laura Maguire's research into the Cost of Coordination highlights several behaviours that positively contribute to effective incident response, including:
- anticipating the needs of fellow responders before being asked
- building and maintaining common ground across the responding group
- adapting coordination strategies as the incident evolves
- smoothly recruiting and onboarding additional responders as demands grow
Such attributes may be more difficult to measure than “time to resolve” but they do represent aspects of incident response that teams and organisations would do well to nurture, monitor and improve.
So while there are alternatives to MTTR if you're looking to track improvements in your incident response, these measures are best combined with a qualitative approach that allows you to reflect on, and learn from, how your responders act together under conditions of surprise, uncertainty and ambiguity. This is precisely what Uptime Labs is designed to help you experience.