Monitoring load balancers using Amazon CloudWatch anomaly detection alarms
Originally published on the AWS networking and content delivery blog, reproduced here for visibility.
Load balancers are a critical component in the architecture of distributed software services. AWS Elastic Load Balancing (ELB) provides highly performant automatic distribution for any scale of incoming traffic across many compute targets (Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), AWS Lambda, etc.), while enabling developers to adopt security best practices at the network boundary (among many other features).
Because load balancers sit high up in the service stack, the metrics they emit provide crucial and unique insight into service health, service performance, and end-to-end network performance. Monitoring these metrics provides visibility into many kinds of incidents across the service stack and the network. This visibility can result in quick detection and mitigation of an incident rather than a prolonged outage.
This post begins with a brief overview of AWS Network Load Balancer (NLB) monitoring. This is followed by a look at the NLB metric TCP_Target_Reset_Count and why conventional Amazon CloudWatch alarms using static thresholds can't be used for monitoring this class of metrics. Then a brief look at CloudWatch anomaly detection alarms is presented, followed by a deep dive into how they can be used for monitoring TCP_Target_Reset_Count. In conclusion, we highlight some of the situations where this monitoring can be useful.
NLB TCP reset count metrics
TCP reset counts by target or client are a set of metrics emitted by NLBs. The TCP reset flag is summarized next, followed by the definition of these metrics and what they might indicate.
TCP reset flag
Every packet in a TCP connection contains a TCP header. Each header contains a bit known as the "reset" (RST) flag. Setting this bit to 0 has no effect; setting it to 1, however, indicates to the receiver that the given TCP connection shouldn't be used anymore. A reset closes the TCP connection instantly.
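As a minimal sketch of what "the RST bit" means in practice, the snippet below checks that bit in a raw TCP header. It assumes the standard header layout in which the flags byte sits at offset 13 and RST is the 0x04 bit. The names hasRstFlag and headerWithFlags are hypothetical helpers written for this illustration, not part of any AWS API.

```typescript
// Standard TCP header layout: bytes 0-3 ports, 4-7 sequence number,
// 8-11 acknowledgment number, 12 data offset, 13 flags.
const TCP_FLAGS_OFFSET = 13;

// Flag bits in that byte: FIN=0x01, SYN=0x02, RST=0x04, PSH=0x08, ACK=0x10.
const TCP_FLAG_RST = 0x04;

// Returns true when the RST bit is set in a raw TCP header.
function hasRstFlag(tcpHeader: Uint8Array): boolean {
  if (tcpHeader.length < 20) {
    throw new RangeError("A TCP header is at least 20 bytes long");
  }
  return (tcpHeader[TCP_FLAGS_OFFSET] & TCP_FLAG_RST) !== 0;
}

// Hypothetical helper: builds a minimal 20-byte header carrying the given flags.
function headerWithFlags(flags: number): Uint8Array {
  const header = new Uint8Array(20);
  header[12] = 5 << 4; // data offset of 5 32-bit words, i.e. no options
  header[TCP_FLAGS_OFFSET] = flags;
  return header;
}
```

A header built with headerWithFlags(0x14) (RST and ACK together) is reported as a reset, while one built with headerWithFlags(0x10) (ACK only) is not.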
TCP_Target_Reset_Count, TCP_Client_Reset_Count and TCP_ELB_Reset_Count
TCP_Target_Reset_Count is an ELB metric published in CloudWatch. It counts the total number of reset (RST) packets sent from a target (Amazon EC2 host) to a client. A reset packet is one with no payload and with the RST bit set in the TCP header flags. These resets are generated by the target and forwarded by the load balancer. Sum is the most useful statistic for this metric. Similarly, the NLB also emits metrics corresponding to resets generated by the load balancer itself (TCP_ELB_Reset_Count) and resets generated by the client (TCP_Client_Reset_Count).
For a generic system comprising an NLB and underlying compute (such as Amazon EC2 hosts), TCP connections are short-lived (as governed by the time-to-live (TTL) configurations). Therefore, these reset metrics are expected to have a baseline value greater than 1 (in a given time period), as TCP connections are opened and closed continuously.
Spikes in these reset metrics can occur when the target, client or load balancer is closing more connections than usual. Some situations when this can occur:
Monitoring TCP reset count metrics
As explained previously, NLB reset count metrics can highlight critical issues in the client → NLB → target communication. This can lead to increased errors and a detrimental customer experience. Accurate alarming of these NLB reset count metrics can notify the service owner and enable them to activate mitigation strategies.
Static threshold alarming
Conventional CloudWatch alarms monitor a metric with a static threshold. For example, the alarm is triggered when a metric has a value greater than a threshold X, for Y data points in a given time duration. The threshold X is configured in advance and is a constant.
This kind of alarming strategy will fail for those metrics where a static threshold can't represent normal operating conditions of the system. This is the case if the safe values of the metric (indicating normal operating conditions) change frequently. For example, the threshold is dependent on the daily traffic pattern or the size of the auto-scaled service fleet.
The reset count metrics (TCP_Target_Reset_Count) fall under this category. By definition the threshold of the metric is dependent on the number of hosts in the underlying fleet (among other factors).
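The static-threshold rule described above ("greater than X for Y data points in a given duration") can be sketched in a few lines. This is an illustrative model only, not CloudWatch's implementation. StaticAlarmConfig and isStaticAlarmBreached are hypothetical names, and the evaluation window is assumed to be the most recent N data points.

```typescript
interface StaticAlarmConfig {
  threshold: number;         // X: constant chosen in advance
  datapointsToAlarm: number; // Y: breaching data points required to alarm
  evaluationPeriods: number; // N: number of most recent data points inspected
}

// Returns true when at least Y of the last N data points exceed the threshold.
function isStaticAlarmBreached(datapoints: number[], config: StaticAlarmConfig): boolean {
  const recent = datapoints.slice(-config.evaluationPeriods);
  const breaching = recent.filter((value) => value > config.threshold).length;
  return breaching >= config.datapointsToAlarm;
}

// Example configuration: alarm when 2 of the last 3 data points exceed 1365.
const exampleConfig: StaticAlarmConfig = {
  threshold: 1365,
  datapointsToAlarm: 2,
  evaluationPeriods: 3,
};
```

Nothing in this rule depends on traffic pattern or fleet size, which is exactly why it cannot track a baseline that moves.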
For example, in the previous figure, a CloudWatch snapshot of TCP_Target_Reset_Count over six consecutive days is shown. Region A and Region B indicate anomalies in the system (irregular spikes or dips in RST count) while Region C is healthy.
An alarm threshold value of 1365 is sufficient to detect the spike in Region A, but this value fails to capture the dip shown in Region B. One possible solution could be to create a separate alarm which triggers if the metric falls below a new lower threshold value of 1200. However, both of these alarms will be static and will fail to adapt to changes in the contributing factors (for example, the host count).
The previous example is a small snapshot (six days), and over a longer period (months) this metric can have even more variation. TCP reset metrics thus can’t be monitored by a static threshold.
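To make the host-count dependence concrete, the sketch below applies the two static thresholds from the example (1365 and 1200) to a fleet before and after it doubles in size. The linear model (every healthy host contributing the same number of resets per period) and the figures of 65 resets per host and 20 or 40 hosts are assumptions chosen purely for illustration.

```typescript
type StaticVerdict = "OK" | "ALARM_HIGH" | "ALARM_LOW";

// Thresholds taken from the example in the text.
const UPPER_THRESHOLD = 1365;
const LOWER_THRESHOLD = 1200;

// Classifies a single reset count against the two static thresholds.
function classifyWithStaticThresholds(resetCount: number): StaticVerdict {
  if (resetCount > UPPER_THRESHOLD) return "ALARM_HIGH";
  if (resetCount < LOWER_THRESHOLD) return "ALARM_LOW";
  return "OK";
}

// Hypothetical model: every healthy host contributes the same resets per period.
function healthyBaseline(hostCount: number, resetsPerHost: number): number {
  return hostCount * resetsPerHost;
}

const RESETS_PER_HOST = 65; // assumed value, for illustration only

// 20 hosts give a healthy baseline of 1300, inside the static band.
const beforeScaling = classifyWithStaticThresholds(healthyBaseline(20, RESETS_PER_HOST));

// 40 hosts give a healthy baseline of 2600, which the static band flags as a spike.
const afterScaling = classifyWithStaticThresholds(healthyBaseline(40, RESETS_PER_HOST));
```

Under these assumptions the fleet is equally healthy in both cases, yet the second one sits permanently in alarm until someone re-tunes the thresholds by hand.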
CloudWatch anomaly detection alarms
CloudWatch anomaly detection alarms solve the above problem by building a statistical model of the underlying metric. This enables the creation of dynamic alarm thresholds with both an upper and a lower limit. These statistical models are continuously re-trained, which accounts for changing trends in the metrics (the different regions in the previous figure).
In the previous figure, a CloudWatch anomaly detection alarm is used for monitoring TCP_Target_Reset_Count. The anomaly detection dynamic threshold is denoted by the grey band, which continuously adjusts to changes in the metric trends. The alarm is triggered (shown in red) in a more interesting set of situations:
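The idea of a dynamic band with an upper and a lower limit can be illustrated with a deliberately simple stand-in: a band of mean plus or minus a number of standard deviations, recomputed over recent history. This is not the model CloudWatch uses. It is only a sketch of why a band derived from the data adapts where a constant cannot. Band, trailingBand and isAnomalous are hypothetical names.

```typescript
interface Band {
  lower: number;
  upper: number;
}

// Computes mean +/- width * (population standard deviation) over the history.
function trailingBand(history: number[], width: number): Band {
  if (history.length === 0) {
    throw new Error("history must not be empty");
  }
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  const variance =
    history.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / history.length;
  const stdDev = Math.sqrt(variance);
  return { lower: mean - width * stdDev, upper: mean + width * stdDev };
}

// A value is anomalous when it falls outside the band built from recent history.
function isAnomalous(value: number, history: number[], width: number): boolean {
  const band = trailingBand(history, width);
  return value < band.lower || value > band.upper;
}

// Illustrative histories: the same fleet before and after its baseline doubles.
const smallFleetHistory = [1290, 1300, 1310, 1300, 1295, 1305];
const largeFleetHistory = [2580, 2600, 2620, 2600, 2590, 2610];
```

With a width of 2, a value of 1400 is anomalous against smallFleetHistory, while 2605 is normal against largeFleetHistory, because the band is rebuilt from whichever history is current.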
Creating an anomaly detection alarm for TCP_Target_Reset_Count using AWS Cloud Development Kit
AWS Cloud Development Kit (AWS CDK) is an open source software development framework to define your cloud application resources using familiar programming languages.
Here are some things to note in the following implementation:
Anomaly detection alarm class
typescript code
import {CfnAlarm, CfnAnomalyDetector, Metric, TreatMissingData} from "@aws-cdk/aws-cloudwatch";
import {Construct, Duration} from "@aws-cdk/core";

export interface AnomalyDetectionAlarmProps {
  readonly alarmName: string;
  readonly alarmDescription: string;
  readonly metric: Metric;
  readonly comparisonOperator: string;
  readonly evaluationPeriods: number;
  readonly period: Duration;
  // Width of the anomaly detection band, in standard deviations.
  readonly standardDeviation: number;
  readonly alarmActions?: string[];
  readonly modelConfiguration?: CfnAnomalyDetector.ConfigurationProperty;
}

export class AnomalyDetectionAlarm extends Construct {
  constructor(scope: Construct, id: string, props: AnomalyDetectionAlarmProps) {
    super(scope, id);

    const metricName = props.metric.metricName || "";
    const anomalyDetectorMetricId = `anomalyDetectorMetricId`;
    const anomalyDetectorId = `anomalyDetectorId`;

    // Extract the namespace, statistic and dimensions from the supplied metric.
    const metricStats = props.metric.toMetricConfig().metricStat;
    const namespace = metricStats?.namespace || "";
    const stats = metricStats?.statistic || "";
    const dimensions = metricStats?.dimensions || undefined;
    const alarmActions = props.alarmActions || [];

    // The anomaly detector trains the model for this metric.
    new CfnAnomalyDetector(this, anomalyDetectorId, {
      configuration: props.modelConfiguration,
      namespace,
      metricName,
      stat: stats,
      dimensions,
    });

    // The alarm compares the metric "m1" against the band computed from it.
    new CfnAlarm(this, props.alarmName, {
      alarmName: props.alarmName,
      alarmDescription: props.alarmDescription,
      comparisonOperator: props.comparisonOperator,
      evaluationPeriods: props.evaluationPeriods,
      thresholdMetricId: anomalyDetectorMetricId,
      treatMissingData: TreatMissingData.MISSING,
      metrics: [
        {
          expression: `ANOMALY_DETECTION_BAND(m1, ${props.standardDeviation})`,
          id: anomalyDetectorMetricId,
        },
        {
          id: "m1",
          metricStat: {
            metric: {
              namespace,
              metricName,
              dimensions,
            },
            period: props.period.toSeconds(),
            stat: stats,
          },
        },
      ],
      alarmActions,
    });
  }
}
Instantiate the alarm
typescript code
private createNLBAnomalyDetectionAlarm(alarmName: string): void {
  const nlbName = loadBalancerNameFromListenerArn(Fn.importValue("ServiceLoadBalancer"));
  const metricName = "TCP_Target_Reset_Count";

  // Sum of target-generated resets per 5-minute period for this NLB.
  const metric = new Metric({
    statistic: "Sum",
    label: nlbName,
    metricName,
    namespace: "AWS/NetworkELB",
    period: Duration.minutes(5),
    dimensions: {
      LoadBalancer: nlbName,
    },
  });

  // Alarm when the metric stays below the band for 3 consecutive periods.
  new AnomalyDetectionAlarm(this, `${metricName}_Alarm`, {
    alarmName,
    alarmDescription: "TCP_Target_Reset_Count below the anomaly detector threshold",
    metric,
    comparisonOperator: "LessThanLowerThreshold",
    evaluationPeriods: 3,
    period: Duration.minutes(5),
    standardDeviation: 8,
  });
}
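Because the props above accept comparisonOperator as a plain string, a typo would only surface at deployment time. One possible guard, sketched below, is a string-literal union of the three comparison operators that CloudWatch reserves for anomaly detection alarms, plus a small builder for the band expression. The names AnomalyComparisonOperator, isAnomalyComparisonOperator and anomalyBandExpression are hypothetical, and the positivity check on the band width is an assumption made for this sketch.

```typescript
// Comparison operators used only by alarms based on anomaly detection models.
type AnomalyComparisonOperator =
  | "GreaterThanUpperThreshold"
  | "LessThanLowerThreshold"
  | "LessThanLowerOrGreaterThanUpperThreshold";

const ANOMALY_OPERATORS: readonly string[] = [
  "GreaterThanUpperThreshold",
  "LessThanLowerThreshold",
  "LessThanLowerOrGreaterThanUpperThreshold",
];

// Type guard: narrows an arbitrary string to a valid anomaly detection operator.
function isAnomalyComparisonOperator(value: string): value is AnomalyComparisonOperator {
  return ANOMALY_OPERATORS.indexOf(value) !== -1;
}

// Builds the metric math expression used as the alarm's threshold metric.
function anomalyBandExpression(metricId: string, standardDeviations: number): string {
  // Assumption for this sketch: a band width must be a positive number.
  if (!(standardDeviations > 0)) {
    throw new RangeError("standardDeviations must be a positive number");
  }
  return `ANOMALY_DETECTION_BAND(${metricId}, ${standardDeviations})`;
}
```

Typing the comparisonOperator prop as AnomalyComparisonOperator instead of string would move this check to compile time. Choosing "LessThanLowerOrGreaterThanUpperThreshold" would cover both spikes and dips with a single alarm.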
Conclusion
We presented an overview of NLB reset count metrics and their utility. This was followed by a description of why conventional CloudWatch alarms can't be used for monitoring these metrics. Finally, we took a deep dive into using CloudWatch anomaly detection alarms and AWS CDK to monitor these metrics.
These alarms can be used in conjunction with conventional NLB alarms, such as unhealthy host count. This setup is being used by a software development team in Prime Video to improve time-to-detection (and time-to-mitigation) for certain incidents (mentioned above) by more than one hour.
Originally published on - https://aws.amazon.com/blogs/networking-and-content-delivery/monitoring-load-balancers-using-amazon-cloudwatch-anomaly-detection-alarms/