Monitoring Oracle RDS with CloudWatch Alarms

One of the things I really enjoy about AWS is how easy it is to set up decent monitoring on any service. For monitoring Oracle databases running on RDS, CloudWatch can help with:

  • Alarms to notify us when a metric goes above its normal threshold
  • Dashboards to visualize database behavior at a glance
  • Logs / Log Insights to monitor the alert log and other files for errors
  • Custom metrics to extend CloudWatch with business thresholds or third-party metrics (e.g. GoldenGate lag)

This is quite a potent set of tools to start with. Adding RDS Performance Insights into the mix for real-time SQL monitoring makes the tooling complete. And if this is still not enough, deploying the OEM Cloud Control agent via an option group is just a few clicks away.

CloudWatch alarms deployment architecture

Every AWS service emits a variety of metrics that can be monitored and alerted on out-of-the-box. Those are things like CPU usage, number of connections, disk IO and throughput, swap usage and the like.

There are a few details I want to mention:

  • Setting up an alarm via the AWS console is easy. But after you create 5, it gets boring. Having more than 10, for each database, means manual setup is a no-go. Do yourself a favor and use parametrized Terraform or CloudFormation - it makes your life easier for both mass deployment and further tweaking (see the sketch after this list)
  • CloudWatch Alarms can detect an event, but you need SNS to send the notification. I find it good practice to deploy the SNS topic(s) in a centralized monitoring account, where you can easily change the recipient(s) of all alarms
  • Just like with any notifications, I find it good practice to have at least 2 levels: warning and critical. The warning SNS topic should send information to a general repository, e.g. a PDL, where it can be acted upon during normal business hours. The critical topic is for things that need immediate attention from the on-duty DBA - e.g. through an SNS-to-PagerDuty integration
  • CloudWatch alarms can also take corrective actions - e.g. scaling storage. This can be useful for non-critical setups (but then you might as well have autoextend enabled on the storage). Call me a laggard, but I still do not feel comfortable automating this on business-critical Oracle databases. Maybe it is more suitable for stateless objects like an EKS cluster, where driving the herd is a viable strategy.
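
To make the parametrized approach concrete, here is a minimal Terraform sketch of per-database deployment. The module path, variable names, and topic names are my illustrative assumptions, not a canonical module - the idea is simply one module call per database, pointing at the centralized warning and critical topics:

    # Hypothetical wrapper module around aws_cloudwatch_metric_alarm resources;
    # one call per database keeps mass deployment and later tweaking manageable
    module "rds_alarms" {
      source   = "./modules/rds-alarms"                 # assumed local module
      for_each = toset(["orcl-prod-1", "orcl-prod-2"])  # your RDS instance identifiers

      db_instance_identifier = each.value
      # Centralized topics in the shared monitoring account (see next section)
      warning_sns_topic_arn  = "arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"
      critical_sns_topic_arn = "arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db"
    }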

Here is a sample architecture:

[Image: sample deployment architecture diagram]

Setting up centralized SNS topics

There are two caveats with the proposed approach. First, the centralized SNS topics should allow CloudWatch alarms from other accounts to post notifications. This can be achieved by adding a statement like the following to the SNS topic's access policy in the shared account:

    ...,
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "SNS:Publish",
        "SNS:GetTopicAttributes"
      ],
      "Resource": "arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": [
            "arn:aws:cloudwatch:eu-central-1:111111111111:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:222222222222:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:333333333333:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:444444444444:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:555555555555:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:666666666666:alarm:*",
            "arn:aws:cloudwatch:eu-central-1:777777777777:alarm:*"
          ]
        }
      }
    },
    ...

In this example, 999999999999 is the monitoring (shared) account, and 111111111111 to 777777777777 are the business accounts.
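
If the shared account is also managed with Terraform, the same policy can be generated rather than hand-written. A minimal sketch, assuming a variable that holds the list of business account IDs:

    variable "business_accounts" {
      type    = list(string)
      default = ["111111111111", "222222222222"]  # extend with all business accounts
    }

    resource "aws_sns_topic" "critical_db" {
      name = "SNS_Critical_db"
    }

    # Allow CloudWatch alarms from the business accounts to publish to the topic
    resource "aws_sns_topic_policy" "critical_db" {
      arn = aws_sns_topic.critical_db.arn
      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = { AWS = "*" }
          Action    = ["SNS:Publish", "SNS:GetTopicAttributes"]
          Resource  = aws_sns_topic.critical_db.arn
          Condition = {
            ArnLike = {
              "aws:SourceArn" = [
                for acct in var.business_accounts :
                "arn:aws:cloudwatch:eu-central-1:${acct}:alarm:*"
              ]
            }
          }
        }]
      })
    }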

Second, you can set the SNS topic as a target of the alarm, but you will always see a warning like this in the console:

[Image: console warning that the alarm cannot confirm access to the SNS topic]

This is fine. It simply means that the alarm (in the business account) cannot verify that it has the right privileges on the SNS topic (in the monitoring account). Once the access policy is properly configured, it will work fine.

CloudWatch Alarm features

There are a lot of parameters you can tweak in an alarm. I find the following ones most useful.

Statistic (max, min, average, p90, etc.) - for most of the alarms I use average. Nevertheless, I love how many options we get out-of-the-box.

Period (10 sec, 1 min, 5 min, etc.) - how often the alarm should be evaluated. This impacts the cost of the alarms. Typically, 1 minute is fine, especially if the alarm has to be acted upon by humans.

Alarm condition (greater, greater or equal, lower, etc.) - what check you want for the value. E.g. I want to be notified if my CPU goes above 60%. Most of the time, greater is the trigger I use.

Value - the threshold that triggers the alarm. The alarm goes into the Alarm state once the metric passes the threshold a given number of times (see the next paragraph), and back into the OK state once it passes the threshold in the other direction a given number of times. Note that you can configure different actions for the transitions to the Alarm and OK states.

Datapoints to alarm - this is a neat addition that helps ignore short-lived spikes. For instance, I want to be notified if my CPU goes above 70% average in 3 out of 5 consecutive checks (i.e., 3 out of 5 minutes).

Missing data treatment - sometimes the RDS instance may fail to report fresh data for a given minute. Here you define what happens to the alarm in such cases: it can go into the Insufficient data state, be set directly to Alarm or OK, or the failed check can be ignored and the current alarm state retained. Typically I treat missing data as ignore.
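
All of the knobs above map one-to-one to arguments of Terraform's aws_cloudwatch_metric_alarm resource. A hedged sketch (the instance name, topic ARN, and thresholds are placeholders):

    resource "aws_cloudwatch_metric_alarm" "cpu_warning" {
      alarm_name  = "mydb-cpu-warning"
      namespace   = "AWS/RDS"
      metric_name = "CPUUtilization"
      dimensions  = { DBInstanceIdentifier = "mydb" }

      statistic           = "Average"              # Statistic
      period              = 60                     # Period: one-minute checks
      comparison_operator = "GreaterThanThreshold" # Alarm condition
      threshold           = 70                     # Value
      datapoints_to_alarm = 3                      # Datapoints to alarm:
      evaluation_periods  = 5                      #   3 out of 5 checks must breach
      treat_missing_data  = "ignore"               # Missing data treatment: retain state

      # Different actions are possible for the Alarm and OK transitions
      alarm_actions = ["arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"]
      ok_actions    = ["arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"]
    }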

How to find the right thresholds

The list of metrics is relatively easy to pick; the actual thresholds are a different topic. Even for a seasoned DBA, it may require some trial and error to find the best values. You want to be notified of any unexpected database behavior before the end customer even notices an impact. On the other hand, the monitoring team should not get too many false-positive notifications, to avoid alarm fatigue.

Generally I start with some conservative numbers. Then, if I notice alarms that do not get acted upon (while everything still runs fine), I bump the threshold up.

Remember: automate the deployment and keep it as consistent as possible across different DBs. There will be exceptions - your dev/test instances may lack critical thresholds, for example - so your Terraform should be customizable, as in the sketch below.
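
One way to keep the code shared while allowing such exceptions is a conditional count - for example, creating critical (pager) alarms only in production. The variable name and threshold here are illustrative:

    variable "environment" {
      type    = string
      default = "prod"
    }

    # Dev/test instances get no critical alarm at all
    resource "aws_cloudwatch_metric_alarm" "cpu_critical" {
      count               = var.environment == "prod" ? 1 : 0
      alarm_name          = "mydb-cpu-critical"
      namespace           = "AWS/RDS"
      metric_name         = "CPUUtilization"
      dimensions          = { DBInstanceIdentifier = "mydb" }
      statistic           = "Average"
      period              = 60
      comparison_operator = "GreaterThanThreshold"
      threshold           = 80
      evaluation_periods  = 5
      datapoints_to_alarm = 3
      treat_missing_data  = "ignore"
      alarm_actions       = ["arn:aws:sns:eu-central-1:999999999999:SNS_Critical_db"]
    }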

What parameters should I monitor for RDS Oracle

RDS provides dozens of metrics, but not all are suitable for alarms. I recommend the following list as a starter.

CPUUtilization

Here you should know your typical CPU utilization and set the warning somewhere above it. The idea is not to be triggered by "normal" spikes (avoiding too many false positives), but to get notified when things go consistently south. For example, if your typical workload uses between 10 and 45% CPU, set the warning to fire if it stays above 55% for 5 out of 7 minutes.

The critical threshold can be in the 70-85% range. Even there, you may have very little time to react. On some very critical DBs we keep it as low as 65%.

ReadIOPS

I always advocate using PIOPS storage for critical databases. Keeping in mind that Oracle does way more reads than writes, alarming on ReadIOPS makes sense.

Again, the warning should be somewhere above the typical usage for the specific database. Critical can be around the provisioned number of IOPS (or the IOPS your storage size gives you on gp2).
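
The critical threshold can then be derived from the provisioned figure rather than hard-coded. A tiny sketch (the IOPS value is a placeholder; feed the result into the threshold argument of the alarm resource pattern shown earlier):

    locals {
      provisioned_iops   = 10000
      read_iops_critical = local.provisioned_iops * 0.9  # around the provisioned ceiling
    }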

DiskQueueDepth

Disk queuing is a sign of an overloaded IO subsystem. And an overloaded IO subsystem is a sign of an overloaded database.

I have seen databases survive short bursts of 100+ IOs in the queue without visible impact on end customers. Other workloads are more response-time-critical, and even a queue length of 30 starts to get noticed. So this metric is very workload-dependent.

ReadThroughput

AWS will allow you to go above the provisioned IOPS every now and then. However, you can never go above your allocated storage throughput. And once you reach this hard limit, all DB operations become slow. I elaborate on those parameters here.

Once you know your available throughput, you can put the critical threshold at around 80-90% of it. E.g. a good db.r5b RDS instance with PIOPS storage typically delivers 2000 MB/s, so 1600-1800 MB/s is your critical threshold.
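
One caveat when wiring this into an alarm: CloudWatch reports the RDS ReadThroughput metric in bytes per second, so the MB/s numbers need converting. A small sketch of the arithmetic, using the 2000 MB/s example from above (plug in your own figure):

    locals {
      max_throughput_mb_per_s = 2000  # what your instance/storage combo can deliver
      # ReadThroughput is reported in bytes/second, hence the conversion;
      # take 85% of the hard limit as the critical threshold
      read_throughput_critical = local.max_throughput_mb_per_s * 0.85 * 1024 * 1024
    }

local.read_throughput_critical then goes into the threshold argument of the same alarm resource pattern shown earlier.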

SwapUsage

Now, I have never witnessed an RDS instance reaching unhealthy swap levels. Still, it could happen - either through a wrong SGA size setting (the default is typically fine) or through too much PGA usage (not unheard of). And a heavily swapping DB server is a dead DB server.

You can set the critical threshold somewhere in the single-digit gigabytes range. If your instance is big enough, even low double digits will work. If you get notifications and your DB still responds fine, bump it up.

DatabaseConnections

Unfortunately, we do not have an active sessions metric - that would be more useful. Still, a typical OLTP application has a more or less predictable number of connections to the DB. Generally we leave a hefty margin in the DB, setting the SESSIONS parameter to at least double the typical usage. It cannot be unlimited, though - every session is an OS process (unless DRCP is set up), and every process means memory. So if you set SESSIONS to double your normal usage, your warning and critical thresholds can go around 60% and 80% of that value. A sudden change may be an indication of:

  • an application-level connection leak, usually occurring after a fresh deployment - which can happen at any time with CI/CD. Depending on how popular the piece of code that leaks sessions is, it may be quite rapid - exhausting the database SESSIONS in minutes, even seconds - or it can be slow and take days to detect.
  • a broken execution plan of a popular query. This is bad, as it can exhaust the CPU or IO capacity of the instance quite rapidly. Even worse, it can come out of the blue - even a system running stable for months can suddenly break as a result of fresh stats being gathered, a change in data distribution, some DDL invalidating a cursor, an automatic resize of the shared pool, or the moon going into Aquarius. In all cases, a typical app will start adding more and more connections to the DB, trying to compensate for the stalling ones. If you are lucky, you can detect the increase in sessions before any business impact.
  • locking in the database. Locks do not waste a notable amount of CPU or IO. However, a lock on a very popular table can run your app's session pool dry in minutes, even seconds - depending on how popular the locked resource is. (As a side note, Oracle Resource Manager can kill sessions that sit idle while blocking others for a given time - a neat feature for fighting application bugs and/or humans who forget to commit before going out for lunch. In recent releases there is even a DB parameter for it: MAX_IDLE_BLOCKER_TIME, introduced with Oracle 21c.)
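
Pulling the starter list together, the thresholds can live in one map and drive all warning alarms for an instance via for_each. Every number below is either an example from this article or a pure placeholder, to be tuned per database:

    locals {
      # metric => example warning/critical thresholds; units vary per metric
      rds_alarm_thresholds = {
        CPUUtilization      = { warning = 55, critical = 80 }       # percent
        ReadIOPS            = { warning = 8000, critical = 10000 }  # near provisioned IOPS
        DiskQueueDepth      = { warning = 30, critical = 100 }      # very workload-dependent
        ReadThroughput      = { warning  = 1600 * 1024 * 1024,
                                critical = 1800 * 1024 * 1024 }     # bytes/s
        SwapUsage           = { warning  = 2 * 1024 * 1024 * 1024,
                                critical = 8 * 1024 * 1024 * 1024 } # bytes
        DatabaseConnections = { warning = 600, critical = 800 }     # 60/80% of SESSIONS=1000
      }
    }

    # One warning alarm per metric; a critical twin would differ only in
    # threshold and target topic
    resource "aws_cloudwatch_metric_alarm" "warning" {
      for_each            = local.rds_alarm_thresholds
      alarm_name          = "mydb-${each.key}-warning"
      namespace           = "AWS/RDS"
      metric_name         = each.key
      dimensions          = { DBInstanceIdentifier = "mydb" }
      statistic           = "Average"
      period              = 60
      comparison_operator = "GreaterThanThreshold"
      threshold           = each.value.warning
      evaluation_periods  = 5
      datapoints_to_alarm = 3
      treat_missing_data  = "ignore"
      alarm_actions       = ["arn:aws:sns:eu-central-1:999999999999:SNS_Warning_db"]
    }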

Conclusion

For an experienced Oracle DBA, CloudWatch Alarms is a great tool with lots of options and a shallow learning curve. With SNS integration, the sky is the limit for notifications. And, as with every monitoring and alerting tool, the devil is in the thresholds. Don't forget the danger of alarm fatigue.
