Implementing CloudWatch alarm re-triggering
Introduction
While improving my Serverless Alert System I encountered an obstacle: CloudWatch alarms do not repeat. Once the condition is met and the actions are performed, the alarm stays in the “In Alarm” state and the actions are not fired again. We want a solution that repeats the actions every 5 minutes while the alarm remains in the “In Alarm” state.
The solution proposed on the official AWS site is to use Step Functions.
That solution is good for one alarm, but the Alert System Lambda handles dozens of them. For instance, Postgres monitoring watches 109 RDS and 5 self-monitoring metrics. Covering all of them that way would mean a great many Lambdas and state machines to manage.
So the obvious solution is to use persistent storage to hold the states of all the alarms handled by this Lambda. Any of these can do the job: Redis, S3, EFS, RDS, DynamoDB.
I am using DynamoDB.
Solution
So how does it work? (TL;DR)
DynamoDB record format:
{
    "AlarmName": str,
    "Cooldown": int,
    "NextAlarm": int
}
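As a rough illustration of the record above, here is a minimal Python sketch that builds one. It assumes NextAlarm is stored as a Unix timestamp in seconds and that 0 means "nothing pending"; the article only says the field is an int. The names `AlarmRecord` and `new_alarm_record` are hypothetical, not taken from the actual Alert System code.

```python
import time
from typing import Optional, TypedDict

DEFAULT_COOLDOWN_SECONDS = 300  # default cooldown mentioned in the article


class AlarmRecord(TypedDict):
    AlarmName: str
    Cooldown: int   # seconds to wait before the alarm may fire again
    NextAlarm: int  # assumption: Unix timestamp in seconds, 0 = nothing pending


def new_alarm_record(
    alarm_name: str,
    now: Optional[int] = None,
    cooldown: int = DEFAULT_COOLDOWN_SECONDS,
) -> AlarmRecord:
    """Build the record to save when an alarm first fires (hypothetical helper)."""
    current = int(time.time()) if now is None else now
    return {
        "AlarmName": alarm_name,
        "Cooldown": cooldown,
        "NextAlarm": current + cooldown,
    }
```

Passing `now` explicitly keeps the helper easy to test; in a Lambda it would normally be left out so the current time is used.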
Real life flow
[1] The RDS cluster reports 95% CPU usage at 12:00 AM.
[2] The CloudWatch alarm state changes to “ALARM”.
[3] The Alert System Lambda receives the event and:
a. Fires notifications to Slack/Email and Jira/Opsgenie.
b. Saves the alarm name in DynamoDB with the default cooldown of 300 seconds and NextAlarm = 12:05 AM.
[4] EventBridge triggers the Alert System Lambda 4 times. Each time, the Lambda sees that the NextAlarm time has not yet been reached and does nothing.
[5] EventBridge triggers the Alert System Lambda a fifth time, and it:
a. Sees that the NextAlarm time stored in DynamoDB has been reached.
b. Resets the CloudWatch alarm state to “OK”.
c. Sets NextAlarm to 0 in DynamoDB.
[6] The RDS cluster reports 95% CPU usage at 12:05 AM and the cycle repeats.
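The flow above can be sketched in Python roughly as follows. This is a simplified model, not the actual Lambda: a plain dict stands in for the DynamoDB table, and `notify` and `reset_alarm` are hypothetical callables passed in by the caller. In a real deployment `reset_alarm` could wrap boto3's CloudWatch `set_alarm_state` call with `StateValue="OK"`, and the dict reads and writes would become DynamoDB operations. Timestamps are assumed to be Unix seconds.

```python
from typing import Callable, Dict, List

DEFAULT_COOLDOWN_SECONDS = 300  # default cooldown from the article


def on_alarm_event(
    table: Dict[str, dict],
    alarm_name: str,
    now: int,
    notify: Callable[[str], None],
) -> None:
    """Step 3: send notifications, then store when the alarm may fire again."""
    notify(alarm_name)
    table[alarm_name] = {
        "AlarmName": alarm_name,
        "Cooldown": DEFAULT_COOLDOWN_SECONDS,
        "NextAlarm": now + DEFAULT_COOLDOWN_SECONDS,
    }


def on_schedule_tick(
    table: Dict[str, dict],
    now: int,
    reset_alarm: Callable[[str], None],
) -> List[str]:
    """Steps 4-5: reset every alarm whose NextAlarm time has been reached."""
    reset_names = []
    for name, record in table.items():
        next_alarm = record["NextAlarm"]
        if next_alarm == 0 or now < next_alarm:
            continue  # nothing pending, or still cooling down
        reset_alarm(name)        # alarm goes back to OK so it can fire again
        record["NextAlarm"] = 0  # mark as handled until the next ALARM event
        reset_names.append(name)
    return reset_names
```

With a tick every 60 seconds (an assumption consistent with four idle triggers followed by a fifth that acts), the ticks at 60, 120, 180 and 240 seconds return an empty list and the tick at 300 seconds resets the alarm. If the metric is still over the threshold, CloudWatch moves the alarm back to “ALARM” and `on_alarm_event` runs again.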
Miscellaneous
[1] The solution is integrated with other alert sources, not only CloudWatch.
OpenSearch and Grafana manage their own “Cooldown” mechanisms.
Jenkins sends build status notifications without the need for a “Cooldown”.
[2] SES/SESv2 metrics, bounces, and rejects are also monitored but are not presented in this article.