Critical Alerts System using Datadog

One of the tenets of monitoring an enterprise’s network infrastructure is knowing when critical changes occur. In our Infrastructure team, Monitoring 101 begins with alerting on what matters. Once metrics and measurements make the environment observable, alerts are the mechanisms that draw attention to particular observations and call for intervention and action.

We build our alerting system with Datadog, the IT Infrastructure team’s infrastructure monitoring tool. Using monitors, integrations, network endpoints, and notifications, the critical alerts system is designed to alert liberally and to page on symptoms rather than causes. This guide explains why critical alerts are necessary and how you can sign up for automated critical alerts that fit your needs.

2.0 Importance and Need

Ensuring an enterprise’s business continuity means guarding against disruption in every way possible. To respond in real time when disruptions occur, our team considers actionable alerting one of its most powerful and essential tools.

Figure I: When to Alert (Le-Quoc, 2015)

Actionable alerting is critical enough that the critical alerts system brings together the measurement, monitoring, and management of all key metrics with multiple real-time alerting channels, so that users can act without losing time. Datadog’s alerting also offers machine learning (ML) based detection, such as anomaly and forecast monitors, which supports the monitoring and management of the infrastructure environment.

 

3.0 How It Works

The mechanism of the critical alerts system is relatively simple. The IT Infrastructure team is responsible for creating the alerting triage, and alerts are configured and sent to recipients based on their opt-in form sign-up. Recipients are members who want to be alerted to critical events affecting their systems.

Process

  1. The IT Infrastructure team will devise a list of critical events.
  2. The list will be distributed to all users via Slack and email as an opt-in Google form.
  3. Recipients fill in the opt-in form for the events that apply to them and thereby sign up for critical alerts.
  4. When the network detects a critical event that matches a recipient’s selected events, automated alerts will be sent to the recipient via email and SMS notifications.
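
The matching in step 4 can be pictured with a short sketch. This is only an illustration of the opt-in idea, not the team's actual implementation: the `Recipient` class, the event names, and the contact details are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Recipient:
    """One opt-in form response (hypothetical structure)."""
    name: str
    email: str
    phone: str
    events: set[str] = field(default_factory=set)  # events ticked on the form


def recipients_for(event: str, recipients: list[Recipient]) -> list[Recipient]:
    """Return only the recipients who opted in to this critical event."""
    return [r for r in recipients if event in r.events]


# Made-up sign-ups for illustration.
recipients = [
    Recipient("Ana", "ana@example.com", "+15550100", {"vpn_down", "dns_failure"}),
    Recipient("Ben", "ben@example.com", "+15550101", {"dns_failure"}),
]

for r in recipients_for("vpn_down", recipients):
    print(f"notify {r.name} via email ({r.email}) and SMS ({r.phone})")
```

Keeping the opt-in selections as a set per recipient makes opting out a matter of removing an event from that set.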

 

Opt-In Form

The opt-in form will be distributed so that users can sign up for the critical events relevant to them. The IT Infrastructure team will configure the alerts each recipient selects and enter the recipient’s contact details, namely email address and phone number, so that automated email and SMS notifications are sent when critical events are identified on the network. The alerts list will be revised and updated regularly as the IT Infrastructure team onboards new services and receives recipient requests.

Opt-Out Form

The opt-out form is designed for existing recipients who wish to stop receiving alerts after signing up. It will be distributed to existing users periodically via Slack and email.

Methods of Alerting

Critical alert notifications are sent to recipients via three methods:

  1. SMS notification,
  2. Email notification, and
  3. Slack notification.

Phone call notifications are the team’s next goal: the IT Infrastructure team will configure them and add them to the alerting methods shortly.

How You Receive Alerts

Critical alert notifications are configured by integrating multiple products, one for each alerting channel. This keeps any single product from becoming a single point of failure and ensures that recipients are still notified if one alerting system fails. Furthermore, to ensure that alerts affecting service reliability and user experience are handled immediately, the team uses Datadog’s webhooks integration to interact with internal and third-party services as changes occur within the infrastructure environment.

SMS notifications for critical alerts are configured by integrating Twilio with Datadog through webhooks [1]. Real-time communication for critical alerts is enabled through synthetic monitoring. The webhooks integration in Datadog allows actions to be triggered automatically in other services in response to metric alerts, anomalies, event monitors, and forecasts. Because a synthetic test calls a webhook via the Twilio SMS API, critical alerts can be evaluated automatically and recipients notified in real time.
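
To make the SMS path concrete, here is a minimal sketch of the HTTP request that a webhook would ultimately send to Twilio's Messages endpoint. It only builds the request and does not send it. The account SID, auth token, and phone numbers are placeholders, and the helper name `build_twilio_sms_request` is an assumption for illustration; in the actual setup Datadog's webhook, not a Python script, issues the call.

```python
import base64
import urllib.parse
import urllib.request


def build_twilio_sms_request(account_sid, auth_token, from_number, to_number, body):
    """Build (but do not send) a POST to Twilio's Messages resource."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Messages.json"
    # Twilio expects form-encoded To, From, and Body fields.
    form = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Body": body}
    ).encode()
    # HTTP Basic auth with the account SID as user and auth token as password.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    request = urllib.request.Request(url, data=form, method="POST")
    request.add_header("Authorization", f"Basic {credentials}")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


# Placeholder credentials and numbers; replace with real values before sending.
request = build_twilio_sms_request(
    "ACxxxxxxxxxxxxxxxx", "your_auth_token", "+15550199", "+15550100",
    "CRITICAL: VPN gateway unreachable",
)
print(request.get_method(), request.full_url)
# To actually send: urllib.request.urlopen(request)
```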

Slack notifications for critical alerts are configured through the Slack integration with Datadog [2]. This integration enables incident declaration, @-mentions for monitor alerts, and delivery of Datadog alerts within Slack to specific users and user groups.

Email notifications for critical alerts are configured using Datadog’s email notifications [3]. Both Datadog users and non-Datadog users can be notified by email.
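
In Datadog, the Slack, email, and webhook recipients described above are addressed from the monitor's notification message using @-handles, and conditional blocks control what is sent on alert versus recovery. The sketch below assembles such a message as a plain string. The helper name, channel name, webhook name, and email address are hypothetical, and `{{host.name}}` assumes a multi-alert monitor grouped by host.

```python
def build_monitor_message(summary, emails, slack_channels, webhooks):
    """Assemble a Datadog monitor message with @-handles for each channel."""
    handles = (
        [f"@{e}" for e in emails]                    # email recipients
        + [f"@slack-{c}" for c in slack_channels]    # Slack channels
        + [f"@webhook-{w}" for w in webhooks]        # named webhooks (e.g. SMS)
    )
    mentions = " ".join(handles)
    # Template variables are kept in plain strings so the braces stay literal.
    return (
        "{{#is_alert}}\n"
        + f"CRITICAL: {summary} on " + "{{host.name}}\n"
        + mentions + "\n"
        + "{{/is_alert}}\n"
        + "{{#is_recovery}}\n"
        + f"RECOVERED: {summary}\n"
        + mentions + "\n"
        + "{{/is_recovery}}"
    )


message = build_monitor_message(
    "VPN gateway unreachable",
    emails=["ana@example.com"],
    slack_channels=["infra-alerts"],
    webhooks=["twilio-sms"],
)
print(message)
```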

Webhooks enable the team to alert services when metric alerts trigger. A key advantage is that, when multiple webhook endpoints are notified, a separate webhook queue is created per service. This lets the team configure multiple alerting systems in Datadog without them affecting each other and securely manage multiple endpoints for delivering critical alerts to recipients.
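
The isolation that per-service queues provide can be illustrated with a small sketch. This is not Datadog's internal implementation; the `WebhookDispatcher` class and the service names are hypothetical and only show why a failing endpoint does not hold up the others.

```python
from collections import defaultdict, deque


class WebhookDispatcher:
    """Keeps one FIFO queue per service so a failure in one cannot block another."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def enqueue(self, service, payload):
        self.queues[service].append(payload)

    def drain(self, service, send):
        """Deliver queued payloads for one service; stop at the first failure."""
        delivered = 0
        queue = self.queues[service]
        while queue:
            try:
                send(queue[0])
            except Exception:
                break  # payload stays queued for retry; other queues are untouched
            queue.popleft()
            delivered += 1
        return delivered


def failing_send(payload):
    raise ConnectionError("SMS endpoint unavailable")


dispatcher = WebhookDispatcher()
dispatcher.enqueue("sms", {"text": "CRITICAL: VPN down"})
dispatcher.enqueue("slack", {"text": "CRITICAL: VPN down"})

print(dispatcher.drain("sms", failing_send))  # nothing delivered, payload kept
print(dispatcher.drain("slack", print))       # Slack delivery still proceeds
```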

 

4.0 What Will The Critical Alerts Measure

Our Infrastructure team’s critical alerts system is developed to notify members and recipients through monitor management and third-party service integration.

The Datadog Notifications service is the key component of the critical alerts system. With it, the team can set up multi-alert monitors, tag variables, triggering scope identification, renotification, and priority notification for critical alerts.

Figure II: Test Notifications for Monitors (Datadog, 2021) [4]

Recipients are grouped according to their critical alert requirements using variables, including tag variables, multi-alert grouping by host, tag keys containing a period, facets for log monitors, conditional variables, and composite monitor variables.

The critical alerts also rely on metric monitors, which support multiple detection methods [5]:

  1. The threshold alert compares metric values against a static threshold and notifies when the threshold is crossed over a given period.
  2. The change alert compares the absolute or relative change in value between N minutes ago and now against a threshold.
  3. The anomaly detection alert compares a metric’s current behavior with its past behavior to flag abnormal behavior.
  4. The outlier alert detects members of a group (hosts, partitions, availability zones, etc.) that behave abnormally compared with the rest of the group.
  5. The forecast alert predicts a metric’s future behavior and compares the prediction against a static threshold; it suits metrics with recurring patterns and strong trends.
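
The two simplest detection methods, threshold and change alerts, can be sketched in plain Python. This is a simplified illustration rather than Datadog's evaluation logic: it assumes the window is aggregated by average and that the change is measured as an absolute difference, and the sample values are made up.

```python
from statistics import mean


def threshold_alert(values, threshold):
    """True when the average over the evaluation window exceeds a static threshold."""
    return mean(values) > threshold


def change_alert(value_n_minutes_ago, value_now, threshold):
    """True when the absolute change between then and now exceeds the threshold."""
    return (value_now - value_n_minutes_ago) > threshold


cpu_last_5m = [72, 88, 95, 97]            # made-up CPU percentages
print(threshold_alert(cpu_last_5m, 85))   # window average is 88, above 85
print(change_alert(40, 95, 50))           # a rise of 55 exceeds the limit of 50
```

Anomaly, outlier, and forecast detection depend on historical data and statistical models, so they are not reduced to a few lines here.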

Log monitoring and management is another tool the IT Infrastructure team uses for critical alerts. A log monitor alerts when the number of logs of a specified type exceeds a user-defined threshold over a given period [6].
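
The log monitor condition amounts to counting matching logs inside a time window and comparing the count with a threshold. The following sketch shows that idea only; the function name, the `(timestamp, status)` tuple layout, and the sample logs are assumptions for illustration.

```python
from datetime import datetime, timedelta


def log_monitor_triggered(logs, status, window, threshold, now):
    """True when more than `threshold` logs with `status` fall inside the window."""
    cutoff = now - window
    count = sum(1 for ts, s in logs if s == status and ts >= cutoff)
    return count > threshold


now = datetime(2021, 6, 1, 12, 0, 0)
logs = [
    (now - timedelta(minutes=1), "error"),
    (now - timedelta(minutes=2), "error"),
    (now - timedelta(minutes=3), "info"),
    (now - timedelta(minutes=30), "error"),  # outside a 5-minute window
]

# Two errors in the last 5 minutes exceed a threshold of 1.
print(log_monitor_triggered(logs, "error", timedelta(minutes=5), 1, now))
```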

Alerts are also raised when specific tags stop reporting [7]. Setting up a metric monitor makes it possible to generate an alert when one or more tags disappear from the system.
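
Conceptually, detecting that a tag has stopped reporting means checking how long it has been since each tag last sent data. The sketch below shows that check in isolation; the function name, tag values, and timestamps are hypothetical, and in Datadog this is handled by the monitor configuration rather than custom code.

```python
def silent_tags(last_seen, now, max_silence_seconds):
    """Return tags whose most recent data point is older than the allowed silence."""
    return sorted(
        tag for tag, ts in last_seen.items() if now - ts > max_silence_seconds
    )


# Made-up epoch timestamps of the last data point per tag.
last_seen = {
    "host:web-01": 1_000_000,
    "host:web-02": 1_000_540,
    "host:db-01": 1_000_590,
}

# With a 5-minute limit at t=1_000_600, only web-01 has gone quiet.
print(silent_tags(last_seen, now=1_000_600, max_silence_seconds=300))
```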

 

[1] https://www.datadoghq.com/blog/send-alerts-sms-customizable-webhooks-twilio/

[2] https://docs.datadoghq.com/integrations/slack/?tab=slackapplicationus#mentions-in-slack-from-monitor-alert

[3] https://docs.datadoghq.com/monitors/notifications/?tab=dashboards

[4] https://docs.datadoghq.com/monitors/notifications/?tab=is_alert

[5] https://docs.datadoghq.com/monitors/monitor_types/metric/?tab=forecast

[6] https://docs.datadoghq.com/monitors/monitor_types/log/

[7] https://docs.datadoghq.com/monitors/faq/how-can-i-setup-an-alert-for-when-a-specific-tag-stops-reporting/
