Five Timestamps; Four Metrics

Introduction

There are five timeline events that are so critical you should record them for every outage. This isn’t always easy because some of these are difficult to identify or problematic to automate.

  1. The time of the initial fault
  2. The start of customer impact
  3. The first alert from monitoring
  4. The first response from either a human or automaton
  5. The time of the initial resolution / end of customer impact

As always, the reality of incidents can be messy and there can be some nuance to these terms. Here’s a quick rundown on each of them.

The Five Most Important Timestamps

The Initial Fault

Somewhere in your tech stack, something goes wrong. It is quite possible that whatever this fault is, it will not manifest in a way that causes problems until later. Or perhaps it requires an interaction with another component. But there is usually an originating event that can be identified as the initial point of the sequence that will lead to the outage. Once this point happens, there will be an outage; it's just a matter of time. If you are lucky, the initial fault has a clear signature in logs or metrics somewhere.

In many cases, the initial fault is the act of promoting code some time in the past. That can be enough to create sufficient conditions for an outage. The code then lies effectively dormant until it combines with something else, like a sudden increase in an unusual and complex use pattern, or a feature flag being enabled. Then, poof! All hell breaks loose. Or perhaps there is an impact, but it is extremely light - one use case for a handful of customers - and you find out about it a week later. There are going to be some difficult judgment calls in there. Do you mark the initial fault as the time of the code release, or the time that the feature flag is enabled? Those will have to be decided by the engineering teams closest to the issues, often as a result of discussions that take place in an incident review or the preparation of the root cause document for the outage.

The First Alert

In many cases, it is a monitoring system that detects some aspect of a fault and escalates appropriately. That makes identifying the time of first alert simple. However, you cannot assume that monitoring will catch every outage. As careful as we are, there are invariably outages that go undetected by monitoring. In these cases, the “alert” is a human being. It might be an employee who experiences or hears about the issue, or a customer who calls or logs a ticket. When that happens, that person usually launches the incident response process, and this effectively becomes the time of the first alert.

The Start of Customer Impact

The fact that a fault occurred does not mean that customers are impacted, but many problems evolve and grow over time, eventually impacting the customer experience. This can often be determined by examining request latencies, error rates, or some other telemetry stream. After the outage is over, you will have to do some forensic investigation to identify the first point at which customers were impacted. This timestamp is often used to calculate a customer-facing SLA.

Also, bear in mind that the customer impact could happen before the first alert is received. That would be the case if the first awareness of the issue was a customer call.
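To make that concrete, here is a minimal sketch of the kind of forensic pass described above. It assumes per-minute error-rate samples and a 1% impact threshold; both the data shape and the threshold are illustrative assumptions, not something prescribed here.

from datetime import datetime

# Illustrative per-minute samples of (timestamp, error_rate).
# In practice these would come from your telemetry system.
samples = [
    (datetime(2024, 5, 1, 14, 0), 0.002),
    (datetime(2024, 5, 1, 14, 1), 0.003),
    (datetime(2024, 5, 1, 14, 2), 0.041),  # errors start climbing here
    (datetime(2024, 5, 1, 14, 3), 0.087),
]

ERROR_RATE_THRESHOLD = 0.01  # assumed threshold: 1% of requests failing

def start_of_customer_impact(samples, threshold):
    """Return the first timestamp at which the error rate crosses the threshold."""
    for ts, error_rate in samples:
        if error_rate >= threshold:
            return ts
    return None  # no impact found in this window

print(start_of_customer_impact(samples, ERROR_RATE_THRESHOLD))
# -> 2024-05-01 14:02:00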

Something Responds (Time of First Response)

Alerts should be handled in some way. There can be automatic interventions tied to alerts that attempt to remediate the issue. In that case, the automation will have a timestamp that can be used directly. Many alerts page an on-call engineer. In those cases, the engineer is typically required to “ACK” (acknowledge) the alert, and that acknowledgment becomes the timestamp.

If there isn’t an alert and the incident response process is launched manually, that becomes the time of first response.

Initial Resolution / End of Customer Impact

The initial resolution represents the time when the ongoing customer impact has been resolved and the system is behaving as normal again. There may still be customers who have lasting impact, and there may still be clean-up work to do, but the system itself is back in a nominal state. It's also possible that these are two different timestamps, but let's assume they are one for now.

Note that outages can be crazy. The impact can vary for no discernible reason. You may go back and forth between stable and unstable states for hours. All of those state changes are fair game to be documented in the timeline, and the start and stop times for that series of events are important timestamps for the incident analysis. That makes the calculation of metrics more difficult, though, so we'll keep it simple for this example.
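For a flapping outage like that, one common simplification is to treat the downtime as the sum of the individual impact windows. Here is a small sketch under that assumption; the windows themselves are made-up values you would reconstruct from the timeline.

from datetime import datetime, timedelta

# Made-up impact windows for a flapping outage: (impact_start, impact_end).
# Reconstructing these windows is part of the post-incident forensic work.
impact_windows = [
    (datetime(2024, 5, 1, 14, 2), datetime(2024, 5, 1, 14, 40)),
    (datetime(2024, 5, 1, 15, 10), datetime(2024, 5, 1, 15, 25)),
    (datetime(2024, 5, 1, 16, 5), datetime(2024, 5, 1, 16, 12)),
]

def total_downtime(windows):
    """Sum the duration of every impacted window."""
    return sum((end - start for start, end in windows), timedelta())

print(total_downtime(impact_windows))  # -> 1:00:00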

Why These Five?

These five timestamps are important because you use them to calculate the most critical metrics for any software as a service (SaaS) shop:

  1. time-to-detect (TTD)
  2. time-to-acknowledge (TTA)
  3. time-to-repair (TTR)
  4. downtime

Here they are all together, expressed visually with a reference set of timeline points. Note that I'm using downtime as a proxy for availability, and have labeled it as such. And again, not all outages are the same, and some things could happen in a different sequence than what is represented below.

Five Key Timestamps and Resulting Four Metrics Illustrated

Once you have those, you can further calculate time-between-failure (TBF), uptime, and availability. And then, of course, at the end of every reporting period, especially monthly and quarterly, you'd gather them up and create averages, so you'd get mean-time-to-detect (MTTD), mean-time-to-acknowledge (MTTA) and so forth. That’s a lot of metrics from five timestamps!

TTD = t2 - t1
TTA = t4 - t2
DOWNTIME = t5 - t3
TTR = t5 - t1
UPTIME = t3 - t0
AVAILABILITY = UPTIME / (UPTIME + DOWNTIME)
TBF = t1 - t0        
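Reading those formulas against the illustration, t1 through t5 are the five timestamps in the reference sequence (initial fault, first alert, start of customer impact, first response, initial resolution), and t0 appears to be the end of the previous outage's customer impact, which is what makes uptime and time-between-failure computable. Here is a minimal sketch of the arithmetic in Python; the datetime values are invented for the example.

from datetime import datetime

# Invented timeline for one outage; t0 is assumed to be the end of the
# previous outage's customer impact, per the uptime and TBF formulas above.
t0 = datetime(2024, 4, 28, 9, 0)    # end of previous customer impact
t1 = datetime(2024, 5, 1, 13, 55)   # initial fault
t2 = datetime(2024, 5, 1, 14, 2)    # first alert
t3 = datetime(2024, 5, 1, 14, 5)    # start of customer impact
t4 = datetime(2024, 5, 1, 14, 9)    # first response (ACK)
t5 = datetime(2024, 5, 1, 15, 30)   # initial resolution / end of impact

ttd = t2 - t1                       # time-to-detect
tta = t4 - t2                       # time-to-acknowledge
downtime = t5 - t3                  # downtime (proxy for availability)
ttr = t5 - t1                       # time-to-repair
uptime = t3 - t0                    # uptime since the previous outage
tbf = t1 - t0                       # time-between-failure
availability = uptime / (uptime + downtime)  # dividing timedeltas gives a float

print(f"TTD={ttd}  TTA={tta}  TTR={ttr}  downtime={downtime}")
print(f"availability={availability:.5f}")

The period-level means (MTTD, MTTA, MTTR, and so on) are then just the averages of these per-incident durations across the reporting period.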

Other Important Considerations

There are other timings that help to assess the performance of the organization. Many of these focus on activities after the outage - the time to close critical action items, or to complete the root cause investigation, for example. It would also be lovely to know when, in each outage, the team determined the problem and then pivoted to working on a solution. However, this is not consistently feasible in the real world. Some outages are so confusing that all you can do is try well-behaved corrective actions like restarts, and sometimes you never end up knowing if they made a difference. Sometimes the behavior spontaneously corrects for no discernible reason. In these situations, there is no inflection point to record as a timestamp. In really confounding outages, the root cause is never conclusively determined.

And finally, organizations define metrics differently, so what I write as availability here, or MTTR, may not match how those terms are used in your organization.

The power of those four metrics is considerable. Detection time measures a form of monitoring effectiveness. Acknowledgement time measures on-call readiness. Downtime is a direct measure of the customer experience. And MTTR is a measure of the organization’s resiliency and performance against failures. So while there are other things you will want to measure, you can get a solid set of incident metrics using these tactics.

Five timestamps. Four metrics.
