Implementing Cloudwatch alarm re-triggering

  • Introduction
  • Solution (TL/DR)
  • Real life flow
  • Miscellaneous


Introduction

While improving my Serverless Alert System I encountered an obstacle: CloudWatch alarms are not repetitive. Once the condition is met and the actions are performed, the alarm stays in the "ALARM" state and the actions never fire again. We want a solution that repeats the actions every 5 minutes for as long as the alarm remains in the "ALARM" state.


The solution proposed on the official AWS blog is to use Step Functions:

https://aws.amazon.com/blogs/mt/how-to-enable-amazon-cloudwatch-alarms-to-send-repeated-notifications/

This solution is fine for a single alarm, but the Alert System Lambda handles dozens of them. For instance, the Postgres monitoring alone watches 109 RDS and 5 self-monitoring metrics. Replicating the Step Functions approach would mean many, many Lambdas and state machines to manage.

So the obvious solution is to use persistent storage for the state of all the alarms handled by this Lambda. Any of these could do the job: Redis, S3, EFS, RDS, DynamoDB.

I am using DynamoDB:

  1. It requires zero maintenance and almost no configuration compared to the others.
  2. Values can be changed quickly in the AWS Console (see the Cooldown field below).


Solution

So how does it work? (TL/DR)

  1. AWS Lambda receives an alarm event.
  2. The Lambda fires the notification and stores the alarm's next notification time in DynamoDB (see the sketch below).
  3. EventBridge triggers the Lambda once a minute. Once the NextAlarm time has passed (after 5 such triggers with the default cooldown), it ->
  4. Resets the CloudWatch alarm state to "OK" and sets the alarm's next notification time to 0 in DynamoDB, and loops back to ->
  5. AWS Lambda receives an alarm event.
  6. The Lambda fires the notification and stores the alarm's next notification time in DynamoDB.
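
Here is a minimal sketch of the alarm-event side of this loop (steps 1-2), assuming the Lambda is written in Python with boto3 and receives the alarm as an EventBridge "CloudWatch Alarm State Change" event. The table name, constants and notification stub are illustrative, not the actual Alert System code; the scheduled side (steps 3-4) is sketched after the real-life flow below.

from datetime import datetime, timezone

import boto3

# Illustrative names - the article does not publish its actual code.
TABLE_NAME = "alert-system-alarms"
DEFAULT_COOLDOWN = 300  # seconds, the article's default

table = boto3.resource("dynamodb").Table(TABLE_NAME)


def send_notifications(alarm_name):
    # Placeholder for the Slack/Email/Jira/Opsgenie fan-out.
    print(f"Notifying about {alarm_name}")


def handle_alarm_event(event):
    """Steps 1-2: notify and store the next notification time in DynamoDB.

    Assumes an EventBridge 'CloudWatch Alarm State Change' event.
    """
    alarm_name = event["detail"]["alarmName"]
    send_notifications(alarm_name)

    # NextAlarm = alarm state change time + Cooldown (epoch seconds, UTC).
    state_change = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    state_change = state_change.replace(tzinfo=timezone.utc)
    table.put_item(Item={
        "AlarmName": alarm_name,
        "Cooldown": DEFAULT_COOLDOWN,
        "NextAlarm": int(state_change.timestamp()) + DEFAULT_COOLDOWN,
    })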


DynamoDB record format:

{
    "AlarmName": str,
    "Cooldown": int,
    "NextAlarm": int
}

  1. AlarmName - name of the alarm to monitor.
  2. Cooldown - time between notifications, in seconds. You can change it manually per alarm: while a known issue is being fixed, you can set it to 3600 so the alarm notifies only once an hour (see the snippet after this list).
  3. NextAlarm - epoch UTC time at which the alarm should be triggered again. Calculated as: alarm state change time + Cooldown.
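
For example, temporarily raising the Cooldown of one alarm to an hour could look like the boto3 sketch below. The article itself does this through the AWS Console; the table and alarm names here are illustrative.

import boto3

# Illustrative table and alarm names.
table = boto3.resource("dynamodb").Table("alert-system-alarms")

table.update_item(
    Key={"AlarmName": "rds-cluster-cpu-high"},
    UpdateExpression="SET Cooldown = :c",
    ExpressionAttributeValues={":c": 3600},  # notify once an hour
)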


Real life flow

[1] RDS Cluster reports 95% CPUUsage at 12:00 AM

[2] The CloudWatch alarm state changes to "ALARM".

[3] The Alert System Lambda gets the event and performs:

a. Notifications are fired towards Slack/Email and Jira/Opsgenie.

b. The alarm name is saved in DynamoDB with the default 300-second cooldown and NextAlarm = 12:05 AM.

[4] EventBridge triggers the Alert System Lambda 4 times. The Lambda sees the NextAlarm time has not been reached yet and does nothing.

[5] EventBridge triggers the Alert System Lambda a fifth time and it performs:

a. Sees the NextAlarm time stored in DynamoDB has been reached.

b. Resets the CloudWatch alarm state to "OK" (see the sketch after this flow).

c. Sets NextAlarm to 0 in DynamoDB.

[6] RDS Cluster reports 95% CPUUsage at 12:05 AM and the cycle repeats.
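
A minimal sketch of the scheduled side of the cycle (steps [4] and [5]), again assuming Python with boto3; the table name, alarm names and function name are illustrative. Scanning the whole table on every tick is acceptable at this scale (around a hundred alarm records).

import time

import boto3

# Illustrative names - not the actual Alert System code.
table = boto3.resource("dynamodb").Table("alert-system-alarms")
cloudwatch = boto3.client("cloudwatch")


def handle_scheduled_event():
    """Steps [4]-[5]: runs on every one-minute EventBridge tick."""
    now = int(time.time())
    for record in table.scan()["Items"]:
        if 0 < record["NextAlarm"] <= now:
            # Cooldown expired: reset the alarm so CloudWatch can fire it again.
            cloudwatch.set_alarm_state(
                AlarmName=record["AlarmName"],
                StateValue="OK",
                StateReason="Reset by Alert System Lambda after cooldown",
            )
            table.update_item(
                Key={"AlarmName": record["AlarmName"]},
                UpdateExpression="SET NextAlarm = :zero",
                ExpressionAttributeValues={":zero": 0},
            )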


Miscellaneous

[1] The solution is integrated with other alert sources, not only CloudWatch.

OpenSearch and Grafana manage their own "Cooldown" mechanisms.

Jenkins sends build status notifications with no need for a "Cooldown".

[2] SES/SESv2 metrics, bounces and rejects are also monitored, but they are not covered in this article.


