How to Fix Alerts Hell in the Cloud World Using DevOps Intelligence
As an engineer who has worn multiple hats for over a decade, I have gone from building systems to operating them in production, from data centers to cloud environments. From that experience I can say that traditional IT teams and modern DevOps engineers, especially those handling operational support for their applications and production infrastructure, are exasperated by the alert floods and monitoring fatigue generated by their operational management systems in this 24x7 uptime world.
I have been through it since 2006, and I know many of us have become numb to the monitoring fatigue and alerts hell: the plethora of notifications, emails, and chats generated by events pulls our focus away from looking at every alert that comes in. Many have resorted to creating filtering rules that mark all events as ‘read’ until there is a critical incident or an application outage.
The problem of too many alerts is a well-known issue in the data center world, popularly called Ops fatigue. The traditional split of a NOC team watching alert emails, an IT support team reviewing and responding to tickets, and engineers digging into the critical problems is broken in the cloud world, where DevOps engineers handle all of these tasks.
Managing alerts hell has been widely discussed in the technical community, and it has bothered both engineering and operations teams for decades. Large companies like Google, Facebook, and Amazon have built systems to handle the events hell in their large infrastructures so that engineers are bothered only when there is a real problem, or an anticipated critical issue, rather than on every symptom. In this context, I strongly recommend reading “My Philosophy on Alerting” by Rob Ewaschuk, who was an SRE at Google. The notes from the Facebook team on self-healing are also a great read.
With the increased adoption of cloud and the emergence of microservices architecture for building new-generation systems, we are quadrupling the number of metrics we monitor (server metrics, container metrics, app/web/DB server metrics, application metrics) thanks to monitoring hell, the need to monitor far more things than we did in the traditional world. And the problem of alerts hell is only going to grow for most of us.
What are DevOps and Cloud Engineers interested in instead of alert emails?
- Understanding of signal over noise: We are all interested in the actual problem or potential issue, not in scouring endless alert emails. Most of the time we lose track of real signals because of the flood of noisy alerts in our production environments. Wouldn’t it be great if we could reduce engineers’ ops fatigue by eliminating the noise?
- Need scope-aware alerting to reduce the flood: All we need is one alert when a service goes down, instead of an alert from every instance in the service cluster, so we can cut the noise and focus on the problem at hand (a minimal grouping sketch follows this list).
- Alerts intelligence and event diagnostics over emails: What we need is to understand why we have an alert, not just a notification saying, “Your server CPU is high” or “Your application service is down.” Wouldn’t it be great if alerts came with diagnostics: why is the CPU high, and what caused the application service to go down? Humans also cannot remember event information and patterns over a long period, so we need intelligent analytics: is this alert a known issue? What is its pattern? Should it even be sent to an engineer in the first place? (A sketch of this kind of known-issue check follows this list.)
- Event remediation with workflow handlers: Most of us have written scripts for handling known events, so that when a web server goes down it is restarted automatically instead of an engineer seeing an alert and restarting it manually. However, defining workflow rules and triggers that wire those scripts to every event is cumbersome, and most operations engineers don’t have the expertise, bandwidth, or resources to do it. (A minimal handler sketch follows this list.)
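To make the scope-aware alerting idea concrete, here is a minimal sketch, assuming instance-level alerts arrive as simple records with hypothetical service, instance, and check fields; it collapses them into one notification per service-level problem instead of one email per instance:

```python
from collections import defaultdict

# Hypothetical instance-level alerts; the field names are assumptions for this sketch.
raw_alerts = [
    {"service": "checkout-api", "instance": "i-0a1", "check": "http_5xx", "state": "alarm"},
    {"service": "checkout-api", "instance": "i-0b2", "check": "http_5xx", "state": "alarm"},
    {"service": "checkout-api", "instance": "i-0c3", "check": "http_5xx", "state": "alarm"},
    {"service": "billing-worker", "instance": "i-9f4", "check": "cpu_high", "state": "alarm"},
]

def group_by_scope(alerts):
    """Collapse instance-level alerts into one bucket per (service, check)."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["service"], alert["check"])].append(alert["instance"])
    return grouped

def notify(grouped):
    """Emit a single, scope-aware notification per service-level problem."""
    for (service, check), instances in grouped.items():
        print(f"[ALERT] {service}: {check} on {len(instances)} instance(s): {', '.join(instances)}")

if __name__ == "__main__":
    # Two service-level notifications instead of four instance-level emails.
    notify(group_by_scope(raw_alerts))
```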
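For the known-issue question above, here is a rough sketch of the kind of analytics we mean, assuming each alert carries a hypothetical fingerprint and timestamp; it labels an incoming alert as a recurring known issue (route to automation) or a new signal (page an engineer):

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alert history; fingerprints and timestamps are assumptions for this sketch.
history = [
    {"fingerprint": "checkout-api:http_5xx", "at": datetime(2016, 12, 1, 2, 5)},
    {"fingerprint": "checkout-api:http_5xx", "at": datetime(2016, 12, 2, 2, 7)},
    {"fingerprint": "checkout-api:http_5xx", "at": datetime(2016, 12, 3, 2, 4)},
    {"fingerprint": "db-primary:disk_full", "at": datetime(2016, 12, 3, 14, 30)},
]

def classify(alert, history, window=timedelta(days=7), known_threshold=3):
    """Label an incoming alert as a known recurring issue or a new signal."""
    cutoff = alert["at"] - window
    recent = Counter(h["fingerprint"] for h in history if h["at"] >= cutoff)
    seen = recent[alert["fingerprint"]]
    if seen >= known_threshold:
        return f"known issue (seen {seen}x in the last {window.days} days): route to automation"
    return "new or rare signal: page an engineer with diagnostics"

incoming = {"fingerprint": "checkout-api:http_5xx", "at": datetime(2016, 12, 4, 2, 6)}
print(classify(incoming, history))
```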
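And finally, a minimal sketch of an event remediation handler, assuming a hypothetical event format and a web server managed by systemd; known event types map to handler scripts, and anything unhandled (or any failed remediation) is escalated to an engineer:

```python
import subprocess

def restart_web_server(event):
    """Remediation handler: try to restart the affected service before paging anyone."""
    service = event.get("service", "nginx")  # hypothetical default service name
    result = subprocess.run(["systemctl", "restart", service], capture_output=True)
    return result.returncode == 0

# Map known event types to remediation handlers; unknown events fall through to an engineer.
HANDLERS = {
    "web_server_down": restart_web_server,
}

def handle(event):
    handler = HANDLERS.get(event["type"])
    if handler and handler(event):
        print(f"auto-remediated: {event['type']} on {event.get('service')}")
    else:
        print(f"escalating to on-call engineer: {event}")

if __name__ == "__main__":
    handle({"type": "web_server_down", "service": "nginx"})
```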
At Botmetric, we have faced these problems too. Hence, we have been working on an application as part of our Ops & Automation offering, so that engineers can easily understand alert events through intelligence. The application will also tell engineers why an alert is happening and whether there is a pattern to the problem.
We want engineers to focus on solving their noisiest issues, diagnosing events, and defining auto-remediation handlers, especially for periodic known issues.
On December 12th, we are rolling out the beta launch of Ops Intelligence in Botmetric. Please write to us at [email protected] if you are interested in testing it out. We would love to hear how we can, together as a DevOps community, find a better alternative and fix monitoring alerts hell to help engineers get their time back!