Zero to 0.99999 - Notes on Metrics and Alarming for Software That Doesn’t Have it Yet

Why?

While there are plenty of tools for implementing alarms, there is little advice on how to design alarms to achieve better availability. Most engineers rely on their own experience and what they have seen in the systems they have worked on, which makes setting up alarming in an organization that does not have it yet challenging. This article lays out general advice for implementing alarms, along with a description of common alarm types and investigation strategies, providing an abstract runbook that can be used as the basis for prioritizing which alarms to build.

General Advice

Know How You Will Respond to the Alarm Going Off

When creating an alarm, the action taken in response is as important as the alarm going off under the right conditions. The alarm should alert an engineer through a channel appropriate for its severity and contain enough information for the engineer to identify the next step in investigating the issue and to judge how urgent it is. This is particularly important if your response includes paging off-hours support engineers who will not have access to colleagues more familiar with that part of the system.

Track Alarms in an Issue Tracker

Integrating your alarms with an issue tracker is essential for measuring your availability outcomes as well as keeping track of issue statuses. Classify issues resulting from alarms by impact and by whether the alarm was a duplicate or a false positive; this provides metrics for monitoring the level of support your system requires.

Avoid Alarm Fatigue

An initial temptation, once you have a good technical solution, is to place alarms on everything and trim them back as they prove to create too many false positives. This can easily lead to alarm fatigue, desensitizing your engineers to alarms going off and spreading their effort across triaging a large number of alarms, leaving less effort for identifying root causes and improving your availability.

Instead, add alarms incrementally so that the level of alerts produced does not exceed what the support engineer can investigate. Whenever possible, merge notifications about the same alarm to minimize noise and make it easier for engineers to identify patterns and track progress. Monitor the rate of false positive and duplicate alarms that engineers receive.

Monitor Using Percentiles of Samples

Monitoring percentiles enables you to focus on the parts of the distribution that are most likely to indicate problems. The median (50th percentile) is a good indicator of the typical experience, while the 90th percentile is good for detecting recurring problems in edge cases. Better-than-median outcomes are generally less interesting, so two or three percentiles in the worse half of the distribution are usually sufficient.
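
As a minimal sketch of what this looks like in practice, the Python snippet below summarizes one monitoring interval's worth of latency samples at the 50th, 90th, and 99th percentiles using only the standard library; the sample data and the choice of percentiles are illustrative.

```python
import statistics

def latency_percentiles(samples_ms):
    """Summarize a window of request latencies (in milliseconds)
    at the percentiles that matter for alarming."""
    # quantiles(n=100) returns the 1st..99th percentile cut points;
    # "inclusive" keeps them within the observed sample range.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],   # typical experience
        "p90": cuts[89],   # recurring problems in edge cases
        "p99": cuts[98],   # worst-case outliers
    }

# Example: one monitoring interval's worth of samples.
window = [42, 45, 47, 51, 55, 61, 70, 95, 130, 800]
print(latency_percentiles(window))
```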

Work Backwards from the Highest Impact Problems

To prioritize, think of the problems that would be most important to address quickly and build alarms for those first. It is tempting to instrument and alarm on whatever is convenient first, but engineer attention for responding to alarms is limited, as is implementation time, so focusing on the alarms that are most needed is the path of least overall effort to the optimal alarm set.

Is it Down?

The most important question your alarms will answer about your system is whether or not it is available - that is, whether it is "down". There are many different ways in which a system might not be in a usable state, so alarming on multiple metrics is necessary to ensure availability.

Lots of Errors

Lots of errors are generally a good indicator of unavailability, so this is a fairly obvious alarming condition. For a web application, you will want to monitor internal errors through both 5xx responses and occurrences of error-level logging. You will also want to monitor the rates of error responses to external inputs, such as data validation failures or failed logins, which for a web app will be its 4xx responses. If the user cannot interact successfully, the system is down for them even if the specific problem is on the client side.
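
A minimal sketch of an error-rate check, assuming response status codes are collected per monitoring interval; the threshold values and function name are illustrative rather than prescriptive.

```python
from collections import Counter

# Illustrative thresholds - tune to your traffic and tolerance.
SERVER_ERROR_RATE_THRESHOLD = 0.02   # 2% of responses are 5xx
CLIENT_ERROR_RATE_THRESHOLD = 0.20   # 20% of responses are 4xx

def check_error_rates(status_codes):
    """Return a list of alarm messages for one monitoring interval."""
    total = len(status_codes)
    if total == 0:
        return []  # "no activity" is a separate alarm (see below)
    counts = Counter(code // 100 for code in status_codes)
    alarms = []
    if counts[5] / total > SERVER_ERROR_RATE_THRESHOLD:
        alarms.append(f"5xx rate {counts[5] / total:.1%} over {total} requests")
    if counts[4] / total > CLIENT_ERROR_RATE_THRESHOLD:
        alarms.append(f"4xx rate {counts[4] / total:.1%} over {total} requests")
    return alarms

print(check_error_rates([200, 200, 404, 500, 200, 200, 503, 200, 200, 200]))
```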

What to do about it?

This will generally be investigated using the application logs, which should contain detailed enough logging to identify the cause of the error.

No Activity

Your system can be down due to issues that are upstream of where you can monitor, in which case unavailability will be indicated by an unexpected lack of activity. For a web app, this would be reflected in your request count metrics going down substantially. If your application does not have continuous traffic, a canary process which interacts with the system but does not meaningfully change it can be created to provide events to monitor.
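
A minimal canary sketch, assuming a read-only HTTP endpoint (the URL and interval below are placeholders); it generates predictable traffic and logs success, failure, and latency so that a lack of activity stands out.

```python
import time
import logging
import urllib.request
import urllib.error

# Hypothetical read-only endpoint; a canary should not meaningfully change state.
CANARY_URL = "https://example.com/health"
INTERVAL_SECONDS = 60

logging.basicConfig(level=logging.INFO)

def run_canary():
    """Hit the endpoint on a fixed schedule so that 'no activity' alarms
    always have a baseline of expected traffic to compare against."""
    while True:
        started = time.perf_counter()
        try:
            with urllib.request.urlopen(CANARY_URL, timeout=10) as response:
                elapsed_ms = (time.perf_counter() - started) * 1000
                logging.info("canary ok status=%s latency_ms=%.0f",
                             response.status, elapsed_ms)
        except (urllib.error.URLError, TimeoutError) as exc:
            logging.error("canary failed: %s", exc)
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    run_canary()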

What to do about it?

If you have a canary, check its logs. If you do not have a canary, manually test the system to generate activity and observe the results to narrow down the root cause.

Very High Latency

High latency can also cause effective downtime if the system hits a timeout or latency is high enough that users give up. For a web app, server-side latency is the most readily available metric, but client latency from an address outside your network should also be monitored if it can easily be obtained.
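
One way to capture server-side latency is to time each request as it is handled. The decorator below is a hedged sketch that uses an in-memory list as a stand-in for a real metrics sink; client-side latency can be gathered similarly by timing requests from outside your network, as in the canary above.

```python
import time
import functools

# In-memory sink for demonstration; in practice, emit to your metrics system.
latency_samples_ms = []

def record_latency(handler):
    """Wrap a request handler to record server-side latency per request."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            latency_samples_ms.append((time.perf_counter() - started) * 1000)
    return wrapper

@record_latency
def handle_request(payload):
    # Stand-in for real request handling.
    time.sleep(0.01)
    return {"ok": True}

handle_request({"example": 1})
print(latency_samples_ms)
```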

What to do about it?

Check dependency latencies to see if the issue is outside the system. Check your system resource metrics (CPU, DB connections, memory, throttling) - if you see a bottleneck, increase the capacity to mitigate the issue. For investigating root causes, a sampling profiler will allow the source of the latency to be identified in the code.

Next Steps

These alarms answer whether your system is available and help you identify the most common sources of issues. To improve availability further, alarms can be used to detect problems before they become availability issues and assist with narrowing down root causes quickly. If your availability is meeting your goals, monitoring outcomes will be your next priority.

Detecting Problems in the Making

Many problems occur over time, getting gradually worse until the impact becomes noticeable. Detecting these problems early allows you to prevent downtime.

Running Out of Capacity

For resources with quantifiable capacity, alarms should be used to notify support engineers when a resource is running out. For resources where hitting the limit will result in immediate problems, the alarm should be set at a lower threshold and a shorter interval, while a longer duration and maximum utilization should be used for resources that occasionally spike, like CPU usage.
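
The sketch below illustrates these two alarm styles, assuming utilization samples between 0 and 1 are collected periodically; the thresholds and window lengths are illustrative and would need tuning per resource.

```python
from statistics import mean

def capacity_alarm(samples, *, hard_limit):
    """Decide whether a resource's recent utilization warrants an alarm.

    samples: utilization fractions (0.0-1.0), oldest to newest.
    hard_limit: True for resources where hitting the limit causes immediate
    failures (disk, connection pools); False for spiky resources like CPU.
    """
    if hard_limit:
        # Alarm early: average over a short window against a low threshold.
        return mean(samples[-5:]) > 0.80
    # Spiky resource: require utilization to stay near maximum
    # for a longer window so brief spikes do not trigger the alarm.
    return len(samples) >= 30 and min(samples[-30:]) > 0.90

print(capacity_alarm([0.70, 0.78, 0.82, 0.85, 0.88], hard_limit=True))   # True
print(capacity_alarm([0.95] * 10 + [0.40] + [0.95] * 19, hard_limit=False))  # False
```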

What to do about it?

Increase the capacity of the resource to mitigate the issue - this can often be automated. Investigate whether the increase in resource usage is due to a change in the quantity of activity or to higher per-event resource demands, and whether the change is expected or may indicate a deeper problem.

Getting Behind

In systems with asynchronous processing, the average processing rate may be slower than the average rate of submission, causing processing to become increasingly delayed. The rate of submission is likely to vary more, so it is not uncommon for it to safely exceed the processing rate for short periods, which makes it difficult to alarm quickly on a problem without false positives from spikes in activity. Alternate metrics that may provide a more reliable alarm include the age of the oldest unprocessed item and the current number of unprocessed items - the former is also an important performance metric, since it directly represents the impact of the delayed processing.
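
A sketch of alarming on these backlog metrics rather than on rates, assuming the enqueue time of each unprocessed item is available as a timezone-aware datetime; the thresholds are illustrative.

```python
from datetime import datetime, timezone

# Illustrative thresholds for a queue that should drain within minutes.
MAX_OLDEST_AGE_SECONDS = 15 * 60
MAX_BACKLOG_SIZE = 10_000

def backlog_alarms(pending_enqueue_times):
    """pending_enqueue_times: enqueue timestamps (UTC) of unprocessed items."""
    alarms = []
    if not pending_enqueue_times:
        return alarms
    now = datetime.now(timezone.utc)
    oldest_age = (now - min(pending_enqueue_times)).total_seconds()
    if oldest_age > MAX_OLDEST_AGE_SECONDS:
        alarms.append(f"oldest unprocessed item is {oldest_age / 60:.1f} minutes old")
    if len(pending_enqueue_times) > MAX_BACKLOG_SIZE:
        alarms.append(f"{len(pending_enqueue_times)} items waiting to be processed")
    return alarms
```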

What to do about it?

Same as for Running Out of Capacity

Out of Balance

In systems with flows of some quantity in and out, the net accumulation in the system can be measured by the difference between the flows. This difference is often worth monitoring to detect unexpected accumulations or flows that are being missed by monitoring, and alarms can be constructed by balancing the flows against the expected value. In the case of a queue, compare the items reported submitted to the queue with the items processed from the queue - a large difference can indicate insufficient processing capacity or defects that cause duplicate processing or submissions. In the case of money or inventory flows, use double-entry bookkeeping principles: whenever a quantity moves, record a transaction representing the withdrawal from the source and a transaction representing the deposit into the destination. The flows at the system boundary for a time period should balance with the change in the account balances between its start and end. A periodic balancing process that performs the calculation as of a specific time, combined with an approximation over a sliding window of real-time data, can provide both speed and sensitivity.
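
A minimal sketch of the double-entry style check, using a hypothetical Entry record where each transfer's entries should sum to zero (the withdrawal from the source matching the deposit into the destination).

```python
from dataclasses import dataclass

@dataclass
class Entry:
    transfer_id: str
    account: str
    amount_cents: int   # negative = withdrawal, positive = deposit

def unbalanced_transfers(entries):
    """Return the transfers whose entries do not sum to zero."""
    totals = {}
    for e in entries:
        totals[e.transfer_id] = totals.get(e.transfer_id, 0) + e.amount_cents
    return {tid: total for tid, total in totals.items() if total != 0}

entries = [
    Entry("t1", "inventory", -500), Entry("t1", "shipped", 500),
    Entry("t2", "inventory", -300), Entry("t2", "shipped", 299),  # off by one
]
print(unbalanced_transfers(entries))   # {'t2': -1}
```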

What to do about it?

Look for the specific inconsistency. For queues, look at whether the submission rate increased or the processing rate dropped. For flows, check each withdrawal-deposit pair to see if it matches.

Anomaly Detection

Anomaly detection uses statistical methods to identify variations in metrics relative to previous data. This can detect undesired conditions that are not practical to express as a fixed alarm threshold, as well as unexpected cases that may cause problems. Since the alarm only detects unusualness, not a specific undesirable condition, it is important to ensure that the alarm has specific follow-up steps to determine whether a concrete problem is occurring, so that it is actionable. Anomaly detection may also be useful when investigating issues by highlighting things that are behaving unusually.
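
A simple illustration of the idea using a z-score against a trailing window; real systems often use more robust methods (seasonality-aware baselines, for example), so treat this as a sketch.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag the latest observation if it is more than `threshold`
    standard deviations away from the trailing window's mean."""
    if len(history) < 10:
        return False   # not enough data to establish a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

history = [120, 118, 125, 121, 119, 124, 122, 120, 123, 121]
print(is_anomalous(history, 180))   # True - worth a follow-up check
print(is_anomalous(history, 123))   # False
```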

What to do about it?

This will generally require some knowledge of the specific system to interpret what may be causing the anomalous event. Document common patterns and their causes in the runbook.

Assisting with Investigating Issues

In addition to identifying issues, alarms can be useful for identifying unusual conditions that are likely to be related to more directly observable problems, speeding up issue resolution.

Dependencies

Monitoring dependencies allows the sources of issues to be narrowed down quickly. Latency and error metrics should be monitored from the perspective of your system's interactions. Comparison of these metrics with the values from the dependency's internal metrics can quickly identify where in the integration a problem is occurring.
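
A sketch of recording dependency latency and errors from your system's side of the integration, using a generic wrapper; the dependency name and logging sink here are illustrative.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)

def call_dependency(name, func, *args, **kwargs):
    """Invoke a dependency while recording latency and errors
    as seen from this system's side of the integration."""
    started = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        logging.info("dependency=%s outcome=success latency_ms=%.0f",
                     name, (time.perf_counter() - started) * 1000)
        return result
    except Exception:
        logging.error("dependency=%s outcome=error latency_ms=%.0f",
                      name, (time.perf_counter() - started) * 1000)
        raise

# Usage: wrap each outbound call, e.g.
# profile = call_dependency("user-service", fetch_profile, user_id)
```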

What to do about it?

Contact the owners of the dependency - provide them with your metrics so they can check them against the values they are observing from their system.

Log Scanning

Alarms based on scanning logs can help narrow down root causes based on how often matching events are logged. Alarming on log level is useful for highlighting parts of the system with unexpected errors and warnings. This is particularly useful for monitoring issues where the error handling catches the error and resumes processing.
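
A sketch of counting error- and warning-level events per component from a log file; the log format assumed by the regular expression is hypothetical and would need to match your own logging layout.

```python
import re
from collections import Counter

# Assumes log lines like: "2024-05-01 12:00:00 ERROR billing.worker ..."
LOG_PATTERN = re.compile(r"\b(ERROR|WARNING)\b\s+(\S+)")

def scan_log(path):
    """Count error and warning events per component so an alarm can fire
    when a component logs more of them than its usual baseline."""
    counts = Counter()
    with open(path) as log_file:
        for line in log_file:
            match = LOG_PATTERN.search(line)
            if match:
                level, component = match.groups()
                counts[(component, level)] += 1
    return counts

# for (component, level), count in scan_log("app.log").most_common(10):
#     print(f"{component} {level}: {count}")
```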

What to do about it?

Query the logs for the matching events and examine the details; the query should be included in the runbook and preferably automated.

Monitoring Outcomes

In addition to monitoring the behavior of the system itself, monitoring outcomes is important to ensure that it is fulfilling its larger role.

Responsiveness

While very high latency likely indicates a system availability issue, smaller amounts of latency still impact the user experience negatively. Ideally, measure the total time for the action to complete from the user's perspective to capture the delay as experienced. Set the latency threshold based on what would feel unresponsive to a user; 100ms is a good starting point for user interactions, but it is valuable to test the user experience with simulated delays. Use a lower percentile in the distribution and more data points than your availability metrics to reduce noise. Monitoring for changes after deployments is useful and is more sensitive than monitoring absolute values for detecting changes in the application itself.
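
A sketch of detecting a latency regression after a deployment by comparing the 90th percentile of samples taken before and after; the 20% threshold and the sample data are illustrative.

```python
import statistics

def regression_after_deploy(before_ms, after_ms, max_increase=0.20):
    """Compare the 90th percentile latency before and after a deployment;
    flag a regression when it grows by more than `max_increase` (20% here)."""
    p90_before = statistics.quantiles(before_ms, n=10)[8]
    p90_after = statistics.quantiles(after_ms, n=10)[8]
    increased = p90_after > p90_before * (1 + max_increase)
    return increased, p90_before, p90_after

before = [80, 85, 90, 95, 100, 105, 110, 120, 130, 150]
after = [85, 95, 100, 110, 120, 135, 150, 170, 190, 230]
print(regression_after_deploy(before, after))
```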

What to do about it?

Check the Running Out of Capacity and Dependency metrics. For Dependencies, check error rates as well as latency, since delays in retry logic can convert intermittent failures into higher latency. If the metrics do not show an obvious cause, use a profiler to identify where specifically in the action the latency is occurring.

Problem Domain Metrics

Problem domain metrics describe the system in terms of the broader system in which it fulfills a role. For example, orders placed, product page views, and abandoned checkouts would all be problem domain metrics to monitor for an eCommerce storefront. Monitoring these metrics can identify the presence of issues that are not easily captured otherwise. Problem domain metrics are also impacted by problems not directly related to the system's behavior, so alarms should have relatively high thresholds.

What to do about it?

Inform colleagues in other roles that are also responsible for these outcomes. They should also investigate the root cause and mitigation strategies within their area of responsibility. Investigate whether the root cause is within the system by checking related metrics and logging.
