
Monitoring load balancers using Amazon CloudWatch anomaly detection alarms

Varun Jewalikar

Published: March 16, 2023

Originally published on the AWS networking and content delivery blog, reproduced here for visibility.

Reading time: 6 minutes

Load balancers are a critical component in the architecture of distributed software services. AWS Elastic Load Balancing (ELB) provides highly performant automatic distribution for any scale of incoming traffic across many compute targets (Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), AWS Lambda, etc.), while enabling developers to adopt security best practices at the network boundary (among many other features).

Because load balancers sit high in the service stack, the metrics they emit provide crucial and unique insight into service health, service performance, and end-to-end network performance. Monitoring these metrics provides visibility into many kinds of incidents across the service stack and the network. This visibility can result in quick detection and mitigation of an incident rather than a prolonged outage.

This post begins with a brief overview of AWS Network Load Balancer (NLB) monitoring. This is followed by a look at the NLB metric TCP_Target_Reset_Count and why conventional Amazon CloudWatch alarms using static thresholds can’t be used for monitoring this class of metrics. Then a brief look at CloudWatch anomaly detection alarms is presented, followed by a deep-dive into how they can be used for monitoring TCP_Target_Reset_Count. In conclusion, we highlight some of the situations where this monitoring can be useful.

NLB TCP reset count metrics

The TCP reset counts by target, client, and load balancer are a set of metrics emitted by NLBs. The TCP reset flag is summarized next, followed by the definition of these metrics and what they might indicate.

TCP reset flag

Every packet in a TCP connection contains a TCP header. Each header contains a bit known as the “reset” (RST) flag. Setting this bit to 0 has no effect; setting it to 1 tells the receiver that the given TCP connection shouldn’t be used anymore. A reset closes the TCP connection instantly.
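To make the flag concrete, here is a minimal sketch that checks the RST bit in a raw TCP header. It assumes the buffer starts at the first byte of the TCP header; the function name `hasRstFlag` and the sample headers are hypothetical and only illustrate the bit layout.

```typescript
// TCP control flags live in byte 13 of the TCP header (0-indexed).
const TCP_FLAGS_OFFSET = 13;
// Flag bit values within that byte: FIN=0x01, SYN=0x02, RST=0x04.
const RST_MASK = 0x04;

// Returns true when the RST bit is set in a raw TCP header.
// Assumption: `header` starts at the first byte of the TCP header.
function hasRstFlag(header: Uint8Array): boolean {
    if (header.length < 20) {
        throw new Error("A TCP header is at least 20 bytes long");
    }
    return (header[TCP_FLAGS_OFFSET] & RST_MASK) !== 0;
}

// Hypothetical 20-byte headers with only the flags byte populated.
const rstAck = new Uint8Array(20);
rstAck[TCP_FLAGS_OFFSET] = 0x14; // RST + ACK
const synOnly = new Uint8Array(20);
synOnly[TCP_FLAGS_OFFSET] = 0x02; // SYN only

console.log(hasRstFlag(rstAck));  // true
console.log(hasRstFlag(synOnly)); // false
```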

TCP_Target_Reset_Count, TCP_Client_Reset_Count and TCP_ELB_Reset_Count

TCP_Target_Reset_Count is an ELB metric published in CloudWatch. It monitors the total number of reset (RST) packets sent from a target (Amazon EC2 host) to a client. A reset packet is one with no payload and with the RST bit set in the TCP header flags. These resets are generated by the target and forwarded by the load balancer. Sum is the most useful statistic for this metric. Similarly, the NLB also emits metrics corresponding to resets generated by the load balancer itself (TCP_ELB_Reset_Count) and resets generated by the client (TCP_Client_Reset_Count).

For a generic system comprising an NLB and underlying compute (such as Amazon EC2 hosts), TCP connections are short lived (represented by the time-to-live (TTL) configurations). Therefore, these reset metrics are expected to have a baseline value which is greater than 1 (in a given time-period) as TCP connections are opened and closed continuously.

Spikes or dips in these reset metrics can occur when the target, client, or load balancer closes more or fewer connections than usual. Some situations when this can occur:

Breakdown or delay in the ‘client → NLB → target’ communication
For example, a networking issue which prevents the targets, NLB, or the clients from successfully communicating with one another. This would lead to a dip in reset metrics, lasting for the duration of the underlying issue. This scenario indicates either a full or partial outage of the service (such as a spike in 4xx errors on the client side), which should be alarmed on and mitigated appropriately.
A code-deployment on the underlying targets
This leads to a spike in reset metrics as the targets will send a reset packet before shutting down the application and starting the deployment. This is expected behavior and not an issue.

Monitoring TCP reset count metrics

As explained previously, NLB reset count metrics can highlight critical issues in the client → NLB → target communication. This can lead to increased errors and a detrimental customer experience. Accurate alarming of these NLB reset count metrics can notify the service owner and enable them to activate mitigation strategies.

Static threshold alarming

Conventional CloudWatch alarms monitor a metric with a static threshold. For example, the alarm is triggered when a metric has a value greater than a threshold X, for Y data points in a given time duration. The threshold X is configured in advance and is a constant.

Figure 1. Static threshold alarm for request duration
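The static rule described above can be sketched in a few lines. This is a simplified illustration rather than CloudWatch's actual evaluation logic: `threshold` plays the role of X, `datapointsToAlarm` the role of Y, and the function name and sample values are hypothetical.

```typescript
// Returns true when at least `datapointsToAlarm` of the most recent
// `evaluationPeriods` values are strictly greater than `threshold`.
function staticAlarmFires(
    values: number[],
    threshold: number,
    evaluationPeriods: number,
    datapointsToAlarm: number,
): boolean {
    const recent = values.slice(-evaluationPeriods);
    const breaching = recent.filter((v) => v > threshold).length;
    return breaching >= datapointsToAlarm;
}

// Illustrative request durations in milliseconds, not real measurements.
const requestDurationMs = [120, 135, 410, 430, 450];

console.log(staticAlarmFires(requestDurationMs, 400, 3, 3)); // true
console.log(staticAlarmFires(requestDurationMs, 400, 5, 5)); // false
```

The threshold of 400 never changes, whatever the traffic level or fleet size happens to be.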

This kind of alarming strategy will fail for those metrics where a static threshold can’t represent normal operating conditions of the system. This is the case if the safe values of the metric (indicating normal operating conditions) change frequently. For example, the threshold is dependent on the daily traffic pattern or the size of the auto-scaled service fleet.

The reset count metrics (TCP_Target_Reset_Count) fall under this category. By definition the threshold of the metric is dependent on the number of hosts in the underlying fleet (among other factors).

For example, in the previous figure, a CloudWatch snapshot of TCP_Target_Reset_Count over six consecutive days is shown. Region A and Region B indicate anomalies in the system (irregular spikes or dips in RST count) while Region C is healthy.

An alarm threshold value of 1365 is sufficient to detect the spike in Region A, but this value fails to capture the dip shown in Region B. One possible solution could be to create a separate alarm which triggers if the metric falls below a lower threshold value of 1200. However, both of these alarms will be static and will fail to adapt to changes in the contributing factors (for example, the host count).

The previous example is a small snapshot (six days), and over a longer period (months) this metric can have even more variation. TCP reset metrics thus can’t be monitored by a static threshold.

CloudWatch anomaly detection alarms

CloudWatch anomaly detection alarms solve the above problem by building a statistical model of the underlying metric. This enables the creation of dynamic alarm thresholds with both an upper and a lower limit. These statistical models are continuously re-trained, which lets them account for changing trends in the metrics (the different regions in the previous figure).

In the previous figure, a CloudWatch anomaly detection alarm is used for monitoring TCP_Target_Reset_Count. The anomaly detection dynamic threshold is denoted by the grey band, which continuously adjusts to changes in the metric trends. Some notable points in the figure:

The alarm is triggered (in red) for the extreme peaks and valleys of the metric, indicating either an increased rate of failure (on the target) or some kind of a networking issue respectively.
The alarm is already triggered while the metric has started a rapid descent or ascent. This leads to an earlier detection of the event, allowing service owners to trigger mitigations earlier and shorten the time-to-mitigation.
The width of the threshold band can be controlled by a single parameter: large values lead to a thicker band while small values lead to a thinner band. A wider band is less sensitive than a narrower one.
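The effect of the band-width parameter can be illustrated with a deliberately simplified stand-in for the model: a band of mean ± width × standard deviation over a recent window. This is an assumption made for illustration only; CloudWatch's actual anomaly detection model is more sophisticated, and the names `simpleBand` and `isAnomalous` as well as the sample values are hypothetical.

```typescript
interface Band {
    lower: number;
    upper: number;
}

// Simplified band: mean ± width * (population) standard deviation of `history`.
function simpleBand(history: number[], width: number): Band {
    if (history.length === 0) {
        throw new Error("history must not be empty");
    }
    const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
    const variance =
        history.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / history.length;
    const spread = width * Math.sqrt(variance);
    return { lower: mean - spread, upper: mean + spread };
}

// A value is anomalous when it falls outside the band on either side.
function isAnomalous(value: number, band: Band): boolean {
    return value < band.lower || value > band.upper;
}

// Illustrative reset counts per period, not real measurements.
const recentResetCounts = [1250, 1300, 1280, 1310, 1260];
const narrowBand = simpleBand(recentResetCounts, 2);
const wideBand = simpleBand(recentResetCounts, 8);

console.log(isAnomalous(1365, narrowBand)); // true
console.log(isAnomalous(1365, wideBand));   // false
```

With the wider band, the same value of 1365 no longer counts as an anomaly, which is the reduced sensitivity described above.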

Creating an anomaly detection alarm for TCP_Target_Reset_Count using AWS Cloud Development Kit

AWS Cloud Development Kit (AWS CDK) is an open source software development framework to define your cloud application resources using familiar programming languages.

Here are some things to note in the following implementation:

This assumes the NLB ARN is being exported from the corresponding stack, which isn’t necessary if you’re creating the NLB in the same AWS CDK package being used to create the alarm.
The standard deviation for the anomaly detection model is set to 8. This should be tuned depending on the desired sensitivity of the alarm. Increasing it makes the anomaly detection band larger, and thus the alarm becomes less sensitive to small changes in the TCP_Target_Reset_Count metric.

Anomaly detection alarm class

typescript code

import {CfnAlarm, CfnAnomalyDetector, Metric, TreatMissingData} from "@aws-cdk/aws-cloudwatch";
import {Construct, Duration} from "@aws-cdk/core";

export interface AnomalyDetectionAlarmProps {
    readonly alarmName: string;
    readonly alarmDescription: string;
    readonly metric: Metric;
    readonly comparisonOperator: string;
    readonly evaluationPeriods: number;
    readonly period: Duration;
    readonly standardDeviation: number;
    readonly alarmActions?: string[];
    readonly modelConfiguration?: CfnAnomalyDetector.ConfigurationProperty;
}

export class AnomalyDetectionAlarm extends Construct {
    constructor(scope: Construct, id: string, props: AnomalyDetectionAlarmProps) {
        super(scope, id);

        const metricName = props.metric.metricName || "";
        const anomalyDetectorMetricId = `anomalyDetectorMetricId`;
        const anomalyDetectorId = `anomalyDetectorId`;
        const metricStats = props.metric.toMetricConfig().metricStat;
        const namespace = metricStats?.namespace || "";
        const stats = metricStats?.statistic || "";
        const dimensions = metricStats?.dimensions || undefined;
        const alarmActions = props?.alarmActions || [];

        new CfnAnomalyDetector(this, anomalyDetectorId, {
            configuration: props.modelConfiguration,
            namespace,
            metricName,
            stat: stats,
            dimensions,
        });

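        // Create the alarm as a child of this construct. A constructor should not
        // return a different object, so the result is not returned.
        new CfnAlarm(this, props.alarmName, {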
            alarmName: props.alarmName,
            alarmDescription: props.alarmDescription,
            comparisonOperator: props.comparisonOperator,
            evaluationPeriods: props.evaluationPeriods,
            thresholdMetricId: anomalyDetectorMetricId,
            treatMissingData: TreatMissingData.MISSING,
            metrics: [
                {
                    expression: `ANOMALY_DETECTION_BAND(m1, ${props.standardDeviation})`,
                    id: anomalyDetectorMetricId,
                },
                {
                    id: "m1",
                    metricStat: {
                        metric: {
                            namespace,
                            metricName,
                            dimensions,
                        },
                        period: props.period.toSeconds(),
                        stat: stats,
                    },
                },
            ],
            alarmActions,
        });
    }
}

Instantiate the alarm

typescript code

private createNLBAnomalyDetectionAlarm(alarmName: string) {
    const nlbName = loadBalancerNameFromListenerArn(Fn.importValue("ServiceLoadBalancer"));
    const metricName = "TCP_Target_Reset_Count";
    const metric = new Metric({
        statistic: "Sum",
        label: nlbName,
        metricName,
        namespace: "AWS/NetworkELB",
        period: Duration.minutes(5),
        dimensions: {
            LoadBalancer: nlbName,
        },
    });
    new AnomalyDetectionAlarm(this, `${metricName}_Alarm`, {
        alarmName,
        alarmDescription: "TCP_Target_Reset_Count below the anomaly detector threshold",
        metric,
        comparisonOperator: "LessThanLowerThreshold",
        evaluationPeriods: 3,
        period: Duration.minutes(5),
        standardDeviation: 8,
    });
}
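The instantiation above alarms only on dips (LessThanLowerThreshold). Since both spikes and dips were described as anomalies, one possible variation is to alarm on either side of the band. The sketch below does not use AWS CDK; it is a self-contained illustration that builds a plain object shaped like the properties of an AWS::CloudWatch::Alarm resource. The helper name `anomalyAlarmProperties`, the metric id `band`, and the load balancer value are assumptions, not part of the original implementation.

```typescript
type AnomalyComparison =
    | "LessThanLowerThreshold"
    | "GreaterThanUpperThreshold"
    | "LessThanLowerOrGreaterThanUpperThreshold";

interface MetricDataQuery {
    Id: string;
    Expression?: string;
    MetricStat?: {
        Metric: {
            Namespace: string;
            MetricName: string;
            Dimensions: { Name: string; Value: string }[];
        };
        Period: number;
        Stat: string;
    };
}

// Hypothetical helper: builds alarm properties for TCP_Target_Reset_Count.
function anomalyAlarmProperties(
    alarmName: string,
    loadBalancer: string,
    comparisonOperator: AnomalyComparison,
    bandWidth: number,
) {
    const metrics: MetricDataQuery[] = [
        // The band expression is evaluated against the metric with id "m1".
        { Id: "band", Expression: `ANOMALY_DETECTION_BAND(m1, ${bandWidth})` },
        {
            Id: "m1",
            MetricStat: {
                Metric: {
                    Namespace: "AWS/NetworkELB",
                    MetricName: "TCP_Target_Reset_Count",
                    Dimensions: [{ Name: "LoadBalancer", Value: loadBalancer }],
                },
                Period: 300, // five minutes, matching the example above
                Stat: "Sum",
            },
        },
    ];
    return {
        AlarmName: alarmName,
        ComparisonOperator: comparisonOperator,
        EvaluationPeriods: 3,
        ThresholdMetricId: "band",
        TreatMissingData: "missing",
        Metrics: metrics,
    };
}

// "net/my-nlb/0123456789abcdef" is a placeholder, not a real load balancer.
const spikeOrDipAlarm = anomalyAlarmProperties(
    "TcpTargetResetCountAnomaly",
    "net/my-nlb/0123456789abcdef",
    "LessThanLowerOrGreaterThanUpperThreshold",
    8,
);
console.log(JSON.stringify(spikeOrDipAlarm, null, 2));
```

Alarming on both sides also fires for expected spikes such as deployments, so a single-sided operator may still be the better fit for some services.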

Conclusion

We presented an overview of NLB reset count metrics and their utility. This was followed by describing why conventional CloudWatch alarms can’t be used for monitoring these metrics. Finally, we conducted a deep-dive into using CloudWatch anomaly detection alarms and AWS CDK to monitor these metrics.

These alarms can be used in conjunction with conventional NLB alarms, such as unhealthy host count. This setup is being used by a software development team in Prime Video to improve time-to-detection (and time-to-mitigation) for certain incidents (mentioned above) by more than one hour.

References

Originally published at https://aws.amazon.com/blogs/networking-and-content-delivery/monitoring-load-balancers-using-amazon-cloudwatch-anomaly-detection-alarms/
