Monitoring Oracle RDS with CloudWatch Alarms
One of the things I really enjoy about AWS is how easy it is to set up decent monitoring on any service, and for Oracle databases running on RDS, CloudWatch gives you a potent set of tools to start with. Adding RDS Performance Insights to the mix for real-time SQL monitoring makes the tooling complete. And if that is still not enough, deploying the OEM Cloud Control agent via an option group is just a few clicks away.
CloudWatch alarms deployment architecture
Every AWS service emits a variety of metrics that can be monitored and alerted on out of the box: CPU usage, number of connections, disk IO and throughput, swap usage, and the like.
There are a few details I want to mention:
Here is a sample architecture:
Setting up centralized SNS topics
There are two caveats with the proposed approach. First, the centralized SNS topics must allow CloudWatch alarms from other accounts to publish notifications. This can be achieved by adding the following statement to the SNS topic's access policy in the shared account:
```json
...,
{
    "Effect": "Allow",
    "Principal": {
        "AWS": "*"
    },
    "Action": [
        "SNS:Publish",
        "SNS:GetTopicAttributes"
    ],
    "Resource": "arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db",
    "Condition": {
        "ArnLike": {
            "aws:SourceArn": [
                "arn:aws:cloudwatch:eu-central-1:111111111111:alarm:*",
                "arn:aws:cloudwatch:eu-central-1:222222222222:alarm:*",
                "arn:aws:cloudwatch:eu-central-1:333333333333:alarm:*",
                "arn:aws:cloudwatch:eu-central-1:444444444444:alarm:*",
                "arn:aws:cloudwatch:eu-central-1:555555555555:alarm:*",
                "arn:aws:cloudwatch:eu-central-1:666666666666:alarm:*",
                "arn:aws:cloudwatch:eu-central-1:777777777777:alarm:*"
            ]
        }
    }
},
...
```
In this example, 999999999999 is the monitoring (shared) account, and 111111111111 to 777777777777 are the business accounts.
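To keep this policy in sync as business accounts come and go, it can be generated programmatically. Here is a minimal sketch, assuming the topic name and account IDs from the example above; the actual boto3 call is left commented out, since it needs credentials in the monitoring account:

```python
import json

MONITORING_ACCOUNT = "999999999999"                     # shared/monitoring account
BUSINESS_ACCOUNTS = ["111111111111", "222222222222"]    # trimmed list for the sketch
TOPIC_ARN = f"arn:aws:sns:eu-central-1:{MONITORING_ACCOUNT}:SNS_Critical_db"

def topic_policy_statement(topic_arn, accounts, region="eu-central-1"):
    """Build the cross-account publish statement shown above."""
    return {
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": ["SNS:Publish", "SNS:GetTopicAttributes"],
        "Resource": topic_arn,
        "Condition": {
            "ArnLike": {
                "aws:SourceArn": [
                    f"arn:aws:cloudwatch:{region}:{acct}:alarm:*" for acct in accounts
                ]
            }
        },
    }

if __name__ == "__main__":
    statement = topic_policy_statement(TOPIC_ARN, BUSINESS_ACCOUNTS)
    # Applying it requires boto3 and credentials in the monitoring account:
    # import boto3
    # policy = {"Version": "2012-10-17", "Statement": [statement]}
    # boto3.client("sns", region_name="eu-central-1").set_topic_attributes(
    #     TopicArn=TOPIC_ARN, AttributeName="Policy", AttributeValue=json.dumps(policy))
    print(json.dumps(statement, indent=2))
```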
Secondly, you can set the SNS topic as the alarm's target, but you will always see a warning like this in the console:
This is fine. It simply means that the alarm (in the business account) cannot verify that it has the right permissions on the SNS topic (in the monitoring account). Once the access policy is properly configured, it will work fine.
CloudWatch Alarm features
There are a lot of parameters you can tweak in an alarm. I find the following ones most useful.
Statistic (max, min, average, p90, etc.) - for most of the alarms I use average. Nevertheless, I love how many options we get out of the box.
Period (10 sec, 1 min, 5 min, etc.) - how often the alarm should be evaluated. This impacts the cost of the alarms: high-resolution alarms (periods under one minute) are billed at a higher rate. Typically, 1 minute is fine, especially if the alarm has to be acted upon by humans.
Alarm condition (greater, greater or equal, lower, etc.) - what check you want for the value. E.g. I want to be notified if my CPU goes above 60%. Most of the time, greater is the trigger I use.
Value - the threshold that triggers the alarm. The alarm enters the Alarm state once the metric crosses the threshold a given number of times (see the next paragraph), and returns to the OK state once it crosses the threshold in the other direction a given number of times. Please note that you can have different actions for the transitions to the Alarm and OK states.
Datapoints to alarm - this is a neat addition that helps to ignore short-lived spikes. For instance, I want to be notified if my CPU goes above 70% average in 3 out of 5 consecutive checks (this is 3 out of 5 minutes).
Missing data treatment - sometimes the RDS system may fail to report fresh data for a given minute. Here you can define what happens to the alarm in such cases: it can be set to the Insufficient data state, set directly to Alarm or OK, or the failed check can be ignored and the alarm keeps its current state. Typically I use "Treat missing data as ignore".
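The parameters above map directly onto the CloudWatch put_metric_alarm API. Here is a minimal boto3-oriented sketch matching the examples in this section (70% average CPU, 3 out of 5 one-minute datapoints, missing data ignored); the instance name is a placeholder and the alarm naming convention is illustrative:

```python
def cpu_alarm_config(db_instance_id: str, topic_arn: str,
                     threshold: float = 70.0) -> dict:
    """Alarm settings mirroring the parameters above: average over 1-minute
    periods, greater-than comparison, 3 of 5 datapoints, missing data ignored."""
    return {
        "AlarmName": f"{db_instance_id}-cpu-high",           # illustrative naming
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",                              # statistic
        "Period": 60,                                        # period, in seconds
        "ComparisonOperator": "GreaterThanThreshold",        # alarm condition
        "Threshold": threshold,                              # value
        "DatapointsToAlarm": 3,                              # datapoints to alarm...
        "EvaluationPeriods": 5,                              # ...out of 5 periods
        "TreatMissingData": "ignore",                        # missing data treatment
        "AlarmActions": [topic_arn],                         # e.g. the shared SNS topic
        "OKActions": [topic_arn],                            # OK actions may differ
    }

if __name__ == "__main__":
    config = cpu_alarm_config(
        "my-oracle-db", "arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db")
    # Requires boto3 and CloudWatch permissions in the business account:
    # import boto3
    # boto3.client("cloudwatch").put_metric_alarm(**config)
    print(config["AlarmName"])
```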
How to find the right thresholds
The list of metrics is relatively easy to pick. The actual thresholds are a different topic. Even for a seasoned DBA, it may require some trial and error to find the best values. You want to be notified of any unexpected database behavior before the end customer even notices any impact. On the other hand, the monitoring team should not receive too many false-positive notifications, to avoid alarm fatigue.
Generally, I start with some conservative numbers. Then, if I notice alarms that do not get acted upon (while everything still runs fine), I bump the threshold up.
Remember to automate the deployment and keep it as consistent as possible across different DBs. There will be exceptions - your dev/test instances may not need critical thresholds, for example - so your Terraform should be customizable.
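I use Terraform for this, but the merge logic is tool-agnostic. Here is a minimal Python sketch of the "shared defaults plus explicit exceptions" idea, with all names and numbers hypothetical:

```python
# Hypothetical fleet: shared defaults plus explicit per-instance exceptions.
DEFAULTS = {"cpu_warning": 55, "cpu_critical": 75, "swap_critical_gb": 4}

FLEET = {
    "orcl-prod-1": {},                       # inherits every default
    "orcl-prod-2": {"cpu_critical": 65},     # very critical DB: tighter threshold
    "orcl-dev-1":  {"cpu_critical": None},   # dev instance: no critical alarm
}

def effective_thresholds(overrides: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge per-instance overrides over the defaults; None disables a threshold."""
    merged = {**defaults, **overrides}
    return {name: value for name, value in merged.items() if value is not None}

for db, overrides in FLEET.items():
    print(db, effective_thresholds(overrides))
```

The same shape translates naturally into a Terraform module with a defaults map and per-instance override variables.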
What parameters should I monitor for RDS Oracle
RDS provides dozens of metrics. Not all are suitable for alarms. I recommend the following list as a starting point.
CPUUtilization
Here you should know your typical CPU utilization and set the warning somewhere above it. The idea is not to be triggered by "normal" spikes (avoiding too many false positives), but to get notified when things consistently go south. For example, if your typical workload uses between 10 and 45% CPU, set the warning to fire above 55% for 5 out of 7 minutes.
The critical threshold can be in the 70-85% range. Even there, you may have very little time to react. On some very critical DBs we keep it as low as 65%.
ReadIOPS
I always advocate for using PIOPS storage for critical databases. Bearing in mind that Oracle does way more reads than writes, alarming on ReadIOPS makes sense.
Again, the warning should be somewhere above the typical usage of the specific database. Critical can be around the provisioned number of IOPS (or the IOPS you get for your DB size on gp2).
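For gp2 volumes, the baseline IOPS follows directly from the allocated size (3 IOPS per GiB, with a floor of 100 and a cap of 16,000), so the critical threshold can be computed rather than guessed:

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume: 3 IOPS/GiB, min 100, max 16,000."""
    return min(max(3 * size_gib, 100), 16_000)

def iops_critical_threshold(size_gib: int, pct: float = 0.9) -> int:
    """Suggested critical threshold as a fraction of the gp2 baseline."""
    return int(gp2_baseline_iops(size_gib) * pct)

# e.g. a 500 GiB gp2 volume: baseline 1500 IOPS, critical at 1350
```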
DiskQueueDepth
Disk queuing is a sign of an overloaded IO subsystem. And an overloaded IO subsystem is a sign of an overloaded database.
I have seen databases survive short bursts of 100+ IOs in the queue without visible impact on end customers. Other workloads are more response-time-critical, and even a queue length of 30 starts to get noticed. So this metric is very workload-dependent.
ReadThroughput
AWS will let you go above the provisioned IOPS every now and then. However, you can never go above your allocated storage throughput, and once you hit this hard limit, all DB operations become slow. I elaborate on those parameters here.
Once you know your available throughput, you can set the critical threshold at around 80-90% of it. E.g. a good db.r5b RDS instance with PIOPS storage typically delivers 2,000 MB/s, so 1,600-1,800 MB/s is your critical threshold.
SwapUsage
Now, I have never witnessed an RDS instance reaching unhealthy swap levels. Still, it could happen - either through a wrong SGA size setting (typically the default is fine) or through using too much PGA (not unheard of). And a heavily swapping DB server is a dead DB server.
You can set something in the single-digit gigabyte range for critical. If your instance is big enough, even a low double-digit value will work. If you get notifications and your DB still responds fine, bump it up.
DatabaseConnections
Unfortunately, we do not have an Active sessions metric - that would be more useful. Still, a typical OLTP application has a more or less predictable number of connections to the DB. Generally we put a hefty margin in the DB, setting the SESSIONS parameter to at least double the typical usage. It cannot be unlimited, however - every session is an OS process (unless DRCP is set up), and every process means memory. So if you set SESSIONS to double your normal usage, your warning and critical thresholds can go at around 60% and 80% of the value you set in the DB. A sudden change in the connection count is usually a sign of trouble and worth investigating.
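The sizing rule of thumb above is easy to codify. A small sketch with hypothetical numbers:

```python
def connection_thresholds(typical_connections: int, headroom: float = 2.0,
                          warning_pct: float = 0.60, critical_pct: float = 0.80):
    """SESSIONS at double the typical usage; warn at 60%, critical at 80% of it."""
    sessions = int(typical_connections * headroom)
    return sessions, int(sessions * warning_pct), int(sessions * critical_pct)

# A workload that typically holds ~500 connections:
# SESSIONS = 1000, warning at 600, critical at 800.
```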
Conclusion
For an experienced Oracle DBA, CloudWatch Alarms is a great tool with lots of options and a shallow learning curve. With SNS integration, the sky is the limit for notifications. And, as with every monitoring and alerting tool, the devil is in the thresholds. Don't forget the danger of alarm fatigue.