Implementing Cloudwatch alarm re-triggering

  • Introduction
  • Solution (TL/DR)
  • Real life flow
  • Miscellaneous


Introduction

While improving my Serverless Alert System I encountered an obstacle: CloudWatch alarms are not repetitive. Once the condition is met and the actions are performed, the alarm stays in the "ALARM" state and the actions never fire again. We want a solution that repeats the actions every 5 minutes for as long as the alarm remains in the "ALARM" state.


The solution proposed on the official AWS blog is to use Step Functions:

https://aws.amazon.com/blogs/mt/how-to-enable-amazon-cloudwatch-alarms-to-send-repeated-notifications/

This solution is fine for a single alarm, but the Alert System Lambda handles dozens of them. For instance, the Postgres monitoring alone watches 109 RDS and 5 self-monitoring metrics. Replicating the Step Functions approach would mean many, many Lambdas and state machines to manage.

So the obvious solution is to use persistent storage for the state of all the alarms handled by this Lambda. Any of these could do the job: Redis, S3, EFS, RDS, DynamoDB.

I am using DynamoDB:

  1. It requires zero maintenance and almost no configuration compared to the others.
  2. Values can be changed quickly in the AWS Console (see the Cooldown field below).


Solution

So how does it work? (TL/DR)

  1. AWS Lambda receives an alarm event.
  2. The Lambda fires the notification and stores the alarm's next notification time in DynamoDB (see the sketch below).
  3. EventBridge triggers the Lambda once a minute. Once the NextAlarm time has passed (after 5 such triggers with the default cooldown), it ->
  4. Resets the CloudWatch alarm state to "OK" and sets the alarm's next notification time to 0 in DynamoDB, and loops back to ->
  5. AWS Lambda receives an alarm event.
  6. The Lambda fires the notification and stores the alarm's next notification time in DynamoDB.
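
Here is a minimal sketch of the alarm-event side of this loop (steps 1-2), assuming the Lambda is written in Python with boto3 and receives the alarm as an EventBridge "CloudWatch Alarm State Change" event. The table name, constants and notification stub are illustrative, not the actual Alert System code; the scheduled side (steps 3-4) is sketched after the real-life flow below.

from datetime import datetime, timezone

import boto3

# Illustrative names - the article does not publish its actual code.
TABLE_NAME = "alert-system-alarms"
DEFAULT_COOLDOWN = 300  # seconds, the article's default

table = boto3.resource("dynamodb").Table(TABLE_NAME)


def send_notifications(alarm_name):
    # Placeholder for the Slack/Email/Jira/Opsgenie fan-out.
    print(f"Notifying about {alarm_name}")


def handle_alarm_event(event):
    """Steps 1-2: notify and store the next notification time in DynamoDB.

    Assumes an EventBridge 'CloudWatch Alarm State Change' event.
    """
    alarm_name = event["detail"]["alarmName"]
    send_notifications(alarm_name)

    # NextAlarm = alarm state change time + Cooldown (epoch seconds, UTC).
    state_change = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    state_change = state_change.replace(tzinfo=timezone.utc)
    table.put_item(Item={
        "AlarmName": alarm_name,
        "Cooldown": DEFAULT_COOLDOWN,
        "NextAlarm": int(state_change.timestamp()) + DEFAULT_COOLDOWN,
    })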


DynamoDB record format:

{
    "AlarmName": str,
    "Cooldown": int,
    "NextAlarm": int
}

  1. AlarmName - name of the alarm to monitor.
  2. Cooldown - time between notifications, in seconds. You can change it manually per alarm: while a known issue is being fixed, you can set it to 3600 so the alarm notifies only once an hour (see the snippet after this list).
  3. NextAlarm - epoch UTC time at which the alarm should be triggered again. Calculated as: alarm state change time + Cooldown.
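
For example, temporarily raising the Cooldown of one alarm to an hour could look like the boto3 sketch below. The article itself does this through the AWS Console; the table and alarm names here are illustrative.

import boto3

# Illustrative table and alarm names.
table = boto3.resource("dynamodb").Table("alert-system-alarms")

table.update_item(
    Key={"AlarmName": "rds-cluster-cpu-high"},
    UpdateExpression="SET Cooldown = :c",
    ExpressionAttributeValues={":c": 3600},  # notify once an hour
)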


Real life flow

[1] RDS Cluster reports 95% CPUUsage at 12:00 AM

[2] The CloudWatch alarm state changes to "ALARM".

[3] The Alert System Lambda gets the event and performs:

a. Notifications are fired towards Slack/Email and Jira/Opsgenie.

b. The alarm name is saved in DynamoDB with the default 300-second cooldown and NextAlarm = 12:05 AM.

[4] EventBridge triggers the Alert System Lambda 4 times. The Lambda sees the NextAlarm time has not been reached yet and does nothing.

[5] EventBridge triggers the Alert System Lambda a fifth time and it performs:

a. Sees the NextAlarm time stored in DynamoDB has been reached.

b. Resets the CloudWatch alarm state to "OK" (see the sketch after this flow).

c. Sets NextAlarm to 0 in DynamoDB.

[6] RDS Cluster reports 95% CPUUsage at 12:05 AM and the cycle repeats.
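
A minimal sketch of the scheduled side of the cycle (steps [4] and [5]), again assuming Python with boto3; the table name, alarm names and function name are illustrative. Scanning the whole table on every tick is acceptable at this scale (around a hundred alarm records).

import time

import boto3

# Illustrative names - not the actual Alert System code.
table = boto3.resource("dynamodb").Table("alert-system-alarms")
cloudwatch = boto3.client("cloudwatch")


def handle_scheduled_event():
    """Steps [4]-[5]: runs on every one-minute EventBridge tick."""
    now = int(time.time())
    for record in table.scan()["Items"]:
        if 0 < record["NextAlarm"] <= now:
            # Cooldown expired: reset the alarm so CloudWatch can fire it again.
            cloudwatch.set_alarm_state(
                AlarmName=record["AlarmName"],
                StateValue="OK",
                StateReason="Reset by Alert System Lambda after cooldown",
            )
            table.update_item(
                Key={"AlarmName": record["AlarmName"]},
                UpdateExpression="SET NextAlarm = :zero",
                ExpressionAttributeValues={":zero": 0},
            )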


Miscellaneous

[1] The solution is integrated with other alert sources, not only CloudWatch.

OpenSearch and Grafana manage their own "Cooldown" mechanisms.

Jenkins sends build status notifications with no need for a "Cooldown".

[2] SES/SESv2 metrics, bounces and rejects are also monitored, but they are not covered in this article.


