Monitoring Oracle RDS with CloudWatch Alarms

One of the things I really enjoy about AWS is how easy it is to set up decent monitoring on any service. For monitoring Oracle databases running on RDS, CloudWatch can help with:

  • Alarms to notify us when a metric goes above its normal threshold
  • Dashboards to visualize database behavior at a glance
  • Logs / Log Insights to monitor the alert log and other files for errors
  • Custom metrics to extend CloudWatch with business thresholds or third-party metrics (e.g. GoldenGate lag)

This is quite a potent set of tools to start with. Adding RDS Performance Insights into the mix for real-time SQL monitoring makes the tooling complete. And if this is still not enough, deploying the OEM Cloud Control agent via an option group is just a few clicks away.

CloudWatch alarms deployment architecture

Every AWS service emits a variety of metrics that can be monitored and alerted on out-of-the-box. Those are things like CPU usage, number of connections, disk IO and throughput, swap usage and the like.

There are a few details I want to mention:

  • Setting up an alarm via the AWS console is easy. But after you create 5, it gets boring. Having more than 10, for each database, means manual setup is a no-go. Do yourself a favor and use parametrized Terraform or CloudFormation - it makes your life easier for both mass deployment and further tweaking (see the sketch after this list)
  • CloudWatch Alarms can detect an event, but you need SNS to send the notification. I find it good practice to deploy the SNS topic(s) in a centralized monitoring account, where you can easily change the recipient(s) of all alarms
  • Just like with any notifications, I find it good practice to have at least 2 levels: warning and critical. The warning SNS topic should send information to a general repository, e.g. a PDL, where it can be acted upon during normal business hours. The critical topic is for things that need immediate attention from the on-duty DBA - e.g. through an SNS-to-PagerDuty integration
  • CloudWatch alarms can also take corrective actions - e.g. scaling storage. This can be useful for non-critical setups (but then you might as well have autoextend enabled on the storage). Call me a laggard, but I still do not feel comfortable automating this on business-critical Oracle databases. Maybe it is more suitable for stateless objects like an EKS cluster, where driving the herd is a viable strategy.
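
To make the parametrized approach concrete, here is a minimal Terraform sketch of per-database deployment. The module path, variable names, and topic names are my illustrative assumptions, not a canonical module - the idea is simply one module call per database, pointing at the centralized warning and critical topics:

    # Hypothetical wrapper module around aws_cloudwatch_metric_alarm resources;
    # one call per database keeps mass deployment and later tweaking manageable
    module "rds_alarms" {
      source   = "./modules/rds-alarms"                 # assumed local module
      for_each = toset(["orcl-prod-1", "orcl-prod-2"])  # your RDS instance identifiers

      db_instance_identifier = each.value
      # Centralized topics in the shared monitoring account (see next section)
      warning_sns_topic_arn  = "arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"
      critical_sns_topic_arn = "arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db"
    }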

Here is a sample architecture:

[Image: sample deployment architecture diagram]

Setting up centralized SNS topics

There are two caveats with the proposed approach. First, the centralized SNS topics should allow CloudWatch alarms from other accounts to post notifications. This can be achieved by adding a statement like the following to the SNS topic's access policy in the shared account:

    ...,
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "SNS:Publish",
        "SNS:GetTopicAttributes"
      ],
      "Resource": "arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": [
            "arn:aws:cloudwatch:eu-central-1:111111111111:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:222222222222:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:333333333333:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:444444444444:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:555555555555:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:666666666666:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:777777777777:alarm:*"
          ]
        }
      }
    },
    ...

In this example, 999999999999 is the monitoring (shared) account, and 111111111111 to 777777777777 are the business accounts.
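
If the shared account is also managed with Terraform, the same policy can be generated rather than hand-written. A minimal sketch, assuming a variable that holds the list of business account IDs:

    variable "business_accounts" {
      type    = list(string)
      default = ["111111111111", "222222222222"]  # extend with all business accounts
    }

    resource "aws_sns_topic" "critical_db" {
      name = "SNS_Critical_db"
    }

    # Allow CloudWatch alarms from the business accounts to publish to the topic
    resource "aws_sns_topic_policy" "critical_db" {
      arn = aws_sns_topic.critical_db.arn
      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = { AWS = "*" }
          Action    = ["SNS:Publish", "SNS:GetTopicAttributes"]
          Resource  = aws_sns_topic.critical_db.arn
          Condition = {
            ArnLike = {
              "aws:SourceArn" = [
                for acct in var.business_accounts :
                "arn:aws:cloudwatch:eu-central-1:${acct}:alarm:*"
              ]
            }
          }
        }]
      })
    }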

Second, you can set the SNS topic as a target of the alarm, but you will always see a warning like this in the console:

[Image: console warning that the alarm cannot confirm access to the SNS topic]

This is fine. It simply means that the alarm (in the business account) cannot verify that it has the right privileges on the SNS topic (in the monitoring account). Once the access policy is properly configured, it will work fine.

CloudWatch Alarm features

There are a lot of parameters you can tweak in an alarm. I find the following ones most useful.

Statistic (max, min, average, p90, etc.) - for most of the alarms I use average. Nevertheless, I love how many options we get out-of-the-box.

Period (10 sec, 1 min, 5 min, etc.) - how often the alarm should be evaluated. This impacts the cost of the alarms. Typically, 1 minute is fine, especially if the alarm has to be acted upon by humans.

Alarm condition (greater, greater or equal, lower, etc.) - what check you want for the value. E.g. I want to be notified if my CPU goes above 60%. Most of the time, greater is the trigger I use.

Value - the threshold that triggers the alarm. The alarm goes into the Alarm state once the metric passes the threshold a given number of times (see the next paragraph), and back into the OK state once it passes the threshold in the other direction a given number of times. Note that you can configure different actions for the transitions to the Alarm and OK states.

Datapoints to alarm - this is a neat addition that helps ignore short-lived spikes. For instance, I want to be notified if my CPU goes above 70% average in 3 out of 5 consecutive checks (i.e., 3 out of 5 minutes).

Missing data treatment - sometimes the RDS instance may fail to report fresh data for a given minute. Here you define what happens to the alarm in such cases: it can go into the Insufficient data state, be set directly to Alarm or OK, or the failed check can be ignored and the current alarm state retained. Typically I treat missing data as ignore.
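
All of the knobs above map one-to-one to arguments of Terraform's aws_cloudwatch_metric_alarm resource. A hedged sketch (the instance name, topic ARN, and thresholds are placeholders):

    resource "aws_cloudwatch_metric_alarm" "cpu_warning" {
      alarm_name  = "mydb-cpu-warning"
      namespace   = "AWS/RDS"
      metric_name = "CPUUtilization"
      dimensions  = { DBInstanceIdentifier = "mydb" }

      statistic           = "Average"              # Statistic
      period              = 60                     # Period: one-minute checks
      comparison_operator = "GreaterThanThreshold" # Alarm condition
      threshold           = 70                     # Value
      datapoints_to_alarm = 3                      # Datapoints to alarm:
      evaluation_periods  = 5                      #   3 out of 5 checks must breach
      treat_missing_data  = "ignore"               # Missing data treatment: retain state

      # Different actions are possible for the Alarm and OK transitions
      alarm_actions = ["arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"]
      ok_actions    = ["arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"]
    }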

How to find the right thresholds

The list of metrics is relatively easy to pick; the actual thresholds are a different topic. Even for a seasoned DBA, it may require some trial and error to find the best values. You want to be notified of any unexpected database behavior before the end customer even notices an impact. On the other hand, the monitoring team should not get too many false-positive notifications, to avoid alarm fatigue.

Generally I start with some conservative numbers. Then, if I notice alarms that do not get acted upon (while everything still runs fine), I bump the threshold up.

Remember: automate the deployment and keep it as consistent as possible across different DBs. There will be exceptions - your dev/test instances may lack critical thresholds, for example - so your Terraform should be customizable, as in the sketch below.
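
One way to keep the code shared while allowing such exceptions is a conditional count - for example, creating critical (pager) alarms only in production. The variable name and threshold here are illustrative:

    variable "environment" {
      type    = string
      default = "prod"
    }

    # Dev/test instances get no critical alarm at all
    resource "aws_cloudwatch_metric_alarm" "cpu_critical" {
      count               = var.environment == "prod" ? 1 : 0
      alarm_name          = "mydb-cpu-critical"
      namespace           = "AWS/RDS"
      metric_name         = "CPUUtilization"
      dimensions          = { DBInstanceIdentifier = "mydb" }
      statistic           = "Average"
      period              = 60
      comparison_operator = "GreaterThanThreshold"
      threshold           = 80
      evaluation_periods  = 5
      datapoints_to_alarm = 3
      treat_missing_data  = "ignore"
      alarm_actions       = ["arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db"]
    }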

What parameters should I monitor for RDS Oracle

RDS provides dozens of metrics, but not all are suitable for alarms. I recommend the following list as a starter.

CPUUtilization

Here you should know your typical CPU utilization and set the warning somewhere above it. The idea is not to be triggered by "normal" spikes (avoiding too many false positives), but to get notified when things go consistently south. For example, if your typical workload uses between 10 and 45% CPU, set the warning to fire if it stays above 55% for 5 out of 7 minutes.

The critical threshold can be in the 70-85% range. Even there, you may have very little time to react. On some very critical DBs we keep it as low as 65%.

ReadIOPS

I always advocate using PIOPS storage for critical databases. Keeping in mind that Oracle does way more reads than writes, alarming on ReadIOPS makes sense.

Again, the warning should be somewhere above the typical usage for the specific database. Critical can be around the provisioned number of IOPS (or the IOPS your storage size gives you on gp2).
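
The critical threshold can then be derived from the provisioned figure rather than hard-coded. A tiny sketch (the IOPS value is a placeholder; feed the result into the threshold argument of the alarm resource pattern shown earlier):

    locals {
      provisioned_iops   = 10000
      read_iops_critical = local.provisioned_iops * 0.9  # around the provisioned ceiling
    }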

DiskQueueDepth

Disk queuing is a sign of an overloaded IO subsystem. And an overloaded IO subsystem is a sign of an overloaded database.

I have seen databases survive short bursts of 100+ IOs in the queue without visible impact on end customers. Other workloads are more response-time-critical, and even a queue length of 30 starts to get noticed. So this metric is very workload-dependent.

ReadThroughput

AWS will allow you to go above the provisioned IOPS every now and then. However, you can never go above your allocated storage throughput. And once you reach this hard limit, all DB operations become slow. I elaborate on those parameters here.

Once you know your available throughput, you can put the critical threshold at around 80-90% of it. E.g. a good db.r5b RDS instance with PIOPS storage typically delivers 2000 MB/s, so 1600-1800 MB/s is your critical threshold.
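
One caveat when wiring this into an alarm: CloudWatch reports the RDS ReadThroughput metric in bytes per second, so the MB/s numbers need converting. A small sketch of the arithmetic, using the 2000 MB/s example from above (plug in your own figure):

    locals {
      max_throughput_mb_per_s = 2000  # what your instance/storage combo can deliver
      # ReadThroughput is reported in bytes/second, hence the conversion;
      # take 85% of the hard limit as the critical threshold
      read_throughput_critical = local.max_throughput_mb_per_s * 0.85 * 1024 * 1024
    }

local.read_throughput_critical then goes into the threshold argument of the same alarm resource pattern shown earlier.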

SwapUsage

Now, I have never witnessed an RDS instance reaching unhealthy swap levels. Still, it could happen - either through a wrong SGA size setting (the default is typically fine) or through too much PGA usage (not unheard of). And a heavily swapping DB server is a dead DB server.

You can set the critical threshold somewhere in the single-digit gigabytes range. If your instance is big enough, even low double digits will work. If you get notifications and your DB still responds fine, bump it up.

DatabaseConnections

Unfortunately, we do not have an active sessions metric - that would be more useful. Still, a typical OLTP application has a more or less predictable number of connections to the DB. Generally we leave a hefty margin in the DB, setting the SESSIONS parameter to at least double the typical usage. It cannot be unlimited, though - every session is an OS process (unless DRCP is set up), and every process means memory. So if you set SESSIONS to double your normal usage, your warning and critical thresholds can go around 60% and 80% of that value. A sudden change may be an indication of:

  • an application-level connection leak, usually occurring after a fresh deployment - which can happen at any time with CI/CD. Depending on how popular the piece of code that leaks sessions is, it may be quite rapid - exhausting the database SESSIONS in minutes, even seconds - or it can be slow and take days to detect.
  • a broken execution plan of a popular query. This is bad, as it can exhaust the CPU or IO capacity of the instance quite rapidly. Even worse, it can come out of the blue - even a system running stable for months can suddenly break as a result of fresh stats being gathered, a change in data distribution, some DDL invalidating a cursor, an automatic resize of the shared pool, or the moon going into Aquarius. In all cases, a typical app will start adding more and more connections to the DB, trying to compensate for the stalling ones. If you are lucky, you can detect the increase in sessions before any business impact.
  • locking in the database. Locks do not waste a notable amount of CPU or IO. However, a lock on a very popular table can run your app's session pool dry in minutes, even seconds - depending on how popular the locked resource is. (As a side note, Oracle Resource Manager can kill sessions that sit idle while blocking others for a given time - a neat feature for fighting application bugs and/or humans who forget to commit before going out for lunch. In recent releases there is even a DB parameter for it: MAX_IDLE_BLOCKER_TIME, introduced with Oracle 21c.)
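
Pulling the starter list together, the thresholds can live in one map and drive all warning alarms for an instance via for_each. Every number below is either an example from this article or a pure placeholder, to be tuned per database:

    locals {
      # metric => example warning/critical thresholds; units vary per metric
      rds_alarm_thresholds = {
        CPUUtilization      = { warning = 55, critical = 80 }       # percent
        ReadIOPS            = { warning = 8000, critical = 10000 }  # near provisioned IOPS
        DiskQueueDepth      = { warning = 30, critical = 100 }      # very workload-dependent
        ReadThroughput      = { warning  = 1600 * 1024 * 1024,
                                critical = 1800 * 1024 * 1024 }     # bytes/s
        SwapUsage           = { warning  = 2 * 1024 * 1024 * 1024,
                                critical = 8 * 1024 * 1024 * 1024 } # bytes
        DatabaseConnections = { warning = 600, critical = 800 }     # 60/80% of SESSIONS=1000
      }
    }

    # One warning alarm per metric; a critical twin would differ only in
    # threshold and target topic
    resource "aws_cloudwatch_metric_alarm" "warning" {
      for_each            = local.rds_alarm_thresholds
      alarm_name          = "mydb-${each.key}-warning"
      namespace           = "AWS/RDS"
      metric_name         = each.key
      dimensions          = { DBInstanceIdentifier = "mydb" }
      statistic           = "Average"
      period              = 60
      comparison_operator = "GreaterThanThreshold"
      threshold           = each.value.warning
      evaluation_periods  = 5
      datapoints_to_alarm = 3
      treat_missing_data  = "ignore"
      alarm_actions       = ["arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"]
    }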

Conclusion

For an experienced Oracle DBA, CloudWatch Alarms is a great tool with lots of options and a shallow learning curve. With SNS integration, the sky is the limit for notifications. And, as with every monitoring and alerting tool, the devil is in the thresholds. Don't forget the danger of alarm fatigue.
