Event Based IT Systems
Amit Sengupta
Portfolio Lead (Associate Director) - Cloud and Open Source Capability Unit at CAPGEMINI AMERICA, INC
Signal Vs Noise
- How to figure out what matters
- Solve Problems Faster
The business world is in the midst of a dramatic digital transformation, and IT is at the center of it. Agile and SaaS are becoming the prevalent ways to develop and deliver software. Cloud and virtualization are key initiatives of nearly every IT director’s strategy. Modern, web-based applications are highly distributed and rely upon many services and components.
All of these changes, while positive, also make services and applications more susceptible to downtime. The dynamic nature of these environments means that these already complex applications are in constant flux, making it much harder to pinpoint the cause of an outage or failure. As today’s environments compound the problem, IT teams find themselves stuck with brittle and inflexible tools that drown them in a flood of events, false positives and irrelevant alerts.
This alert exhaustion during an outage makes it hard to establish event ownership, escalation paths and issue resolution, as the picture above shows.
Another challenge is that these events carry no context. For example, let’s say 11 servers have reported problems in the last five minutes, generating over 100 events. Out of these 100 events, how do you decide which ones are false positives, which ones need to be prioritized and which ones relate to a business-critical service? Service context is critically important!
What’s missing in the 100 events is visibility into how the infrastructure maps to the business service and context of how an event impacts that service. You need to be alerted on the findings that require immediate action, so you can spend less time sifting through the alerts that don’t require any action.
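To make the idea of service context concrete, here is a minimal sketch that groups incoming events by the business service they affect and puts the most critical service first. The host names, service names and criticality ranks are all hypothetical; in a real environment the mapping would come from something like a CMDB or a service model, not a hard-coded dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical mapping: host -> (business service, criticality rank).
# Rank 1 is the most critical. None of these names come from a real system.
SERVICE_MAP = {
    "db-01": ("checkout", 1),
    "web-03": ("checkout", 1),
    "batch-07": ("reporting", 3),
}


@dataclass
class Event:
    host: str
    message: str


def prioritize(events):
    """Group events by business service, most critical service first.

    Events from hosts that are not in SERVICE_MAP land in an "unmapped"
    bucket with the lowest priority, so they stay visible without
    crowding out known business-critical services.
    """
    by_service = defaultdict(list)
    for ev in events:
        service, rank = SERVICE_MAP.get(ev.host, ("unmapped", 99))
        by_service[(rank, service)].append(ev)
    ordered = sorted(by_service.items(), key=lambda item: item[0])
    return [(service, rank, evs) for (rank, service), evs in ordered]


events = [
    Event("batch-07", "disk almost full"),
    Event("db-01", "slow queries"),
    Event("web-03", "HTTP 5xx spike"),
    Event("cache-09", "high eviction rate"),
]
for service, rank, evs in prioritize(events):
    print(service, rank, [e.message for e in evs])
```

With this ordering, the two events that touch the assumed "checkout" service surface ahead of everything else, even though they arrived after the reporting event.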
Bad Data = Bad IT = Bad Operations
To maintain high service levels and availability, IT organizations have adopted event management as a central practice. To prevent outages and recover quickly when things break down, event management tool sets attempt to collect events from across the IT environment and then combine, correlate, filter and deduplicate them.
However, these tools demand extensive administration and rules management to avoid event overload. This is cumbersome at a time when IT infrastructures are evolving to agile frameworks and sustaining workloads through small component failures. In such dynamic infrastructures, even more effort is needed to filter and deduplicate with new rules. As a result, the constant need to keep these tools updated becomes unmanageable.
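As an illustration of the deduplication step, the sketch below collapses repeated events from the same host and check into a single record with a count, as long as they fall within a time window. The tuple layout, the 300-second window and the host names are assumptions made for this example, not the behavior of any particular event management product.

```python
def deduplicate(events, window_seconds=300):
    """Collapse repeats of the same (host, check) seen within a time window.

    events: iterable of (timestamp_seconds, host, check) tuples,
            assumed to be sorted by timestamp.
    Returns a list of [first_timestamp, host, check, count] records.
    """
    latest_group = {}  # (host, check) -> index of its most recent record
    result = []
    for ts, host, check in events:
        key = (host, check)
        idx = latest_group.get(key)
        if idx is not None and ts - result[idx][0] <= window_seconds:
            # Same problem reported again inside the window: just count it.
            result[idx][3] += 1
        else:
            # First sighting, or the window has expired: open a new record.
            latest_group[key] = len(result)
            result.append([ts, host, check, 1])
    return result


raw = [
    (0, "web-01", "cpu"),
    (30, "web-01", "cpu"),
    (60, "db-01", "disk"),
    (400, "web-01", "cpu"),
]
for record in deduplicate(raw):
    print(record)
```

Here four raw events become three records: the two "cpu" events 30 seconds apart merge, while the one at 400 seconds starts a new record because it falls outside the window. Every such rule (the key, the window) is something an administrator has to choose and maintain.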
Adding to the complexity, an infrastructure component can be responsible for causing a service outage even without generating an event. For example, let’s say you have 10 databases, but only two of them are performing poorly. This indicates that 80% of the databases in your environment are performing well — an acceptable performance threshold. But if the two poorly performing databases serve a critical service and no alerts or events are triggered, all hell breaks loose. Your IT is reactive, your customers are irate and your business suffers. In this scenario, simply finding clever ways to reduce the volume of events — without any context into your services — is doomed to fail.
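The database example can be sketched in a few lines. The fleet-wide number looks acceptable, while the per-service view shows that one service has no healthy database at all. The database names and the "payments" and "internal-tools" services are hypothetical stand-ins for the critical and non-critical services in the scenario.

```python
# Hypothetical inventory: database name -> (service it backs, is it healthy?)
DATABASES = {f"db-{i:02d}": ("internal-tools", True) for i in range(1, 9)}
DATABASES["db-09"] = ("payments", False)
DATABASES["db-10"] = ("payments", False)


def fleet_health(dbs):
    """Fraction of all databases that are healthy, ignoring what they serve."""
    healthy = sum(1 for _service, ok in dbs.values() if ok)
    return healthy / len(dbs)


def service_health(dbs):
    """Fraction of healthy databases per business service."""
    totals = {}
    for service, ok in dbs.values():
        good, count = totals.get(service, (0, 0))
        totals[service] = (good + int(ok), count + 1)
    return {service: good / count for service, (good, count) in totals.items()}


print(fleet_health(DATABASES))
print(service_health(DATABASES))
```

The fleet-wide figure is 0.8, which passes an 80% threshold, yet the per-service figure for the assumed "payments" service is 0.0. A check that is not scoped to the service never fires.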
These “single-source-of-truth” event management solutions struggle to provide service-level insights on problems and incidents, because they weren’t built to deliver context to events. These tools:
- Are unable to handle any and all kinds of data and formats
- Are brittle and make it difficult to maintain integrations that often do not communicate well with each other
- Can’t scale due to big data volumes, resulting in aggregated and pre-normalized data
- Require untenable deployments, often taking months or years
Retrofitting various solutions with complex integrations and interoperability only adds to the chaos by overloading IT teams with irrelevant data and unnecessary overhead. You essentially drown in the data. As a result, there continues to be confusion and siloed responses, extended war room scenarios, duplicated efforts, unclear ownership, buck passing and a reliance on manually determining what’s broken.
Saving Event Analytics (and Our Sanity)
When there’s a lot of data and no knowledge, it’s difficult to act on anything. Every IT organization’s #1 goal is to deliver reliable services. When an outage does happen on one of their services, finding and fixing the problem quickly and with the right priority is business critical.
Rather than follow conventional approaches that only attempt to manage event storms, IT organizations need to focus on fixing the business-critical issue with a solution that:
- Provides a single, unified repository of all data at scale, for a full understanding of the problem
- Combines data with machine learning capabilities to discover patterns, baseline normal behavior and anomalous activity
- Notifies on meaningful and impactful events that are at human-scale and actionable
- Frees IT from the mundane task of managing events, so they can focus on quickly managing the incident and finding and fixing what’s broken
Resolving outages faster means better event analytics. And that requires IT to stop pointing at people, and instead point to machine data — including logs, metrics, events and wire data. It means using artificial intelligence (AI) capabilities to elicit patterns and service dependencies.
To truly help your IT teams fix what’s broken, it’s critical to be able to discern normal vs. abnormal behavior, link causal relationships to find the root cause of issues, and triage problems quickly, proactively and effectively. Additional qualities of an effective event analytics solution include:
Computational power and mathematical sophistication, using artificial intelligence and machine learning to:
- Establish baselines and identify normal behavior to adapt to ever-changing thresholds
- Group high-volume and low-value events to generate alerts that are manageable at a human-scale rate
- Use those patterns to detect departures from normal behavior and highlight anomalies
- Reduce the need for rules management and the vulnerabilities that come with scripting, empowering IT with the right data and context to make better and faster data-driven decisions
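The baselining idea in the list above can be illustrated with a deliberately simple statistical stand-in: compare each new value against the mean and standard deviation of a trailing window, instead of against a fixed threshold. This is a sketch and not the machine learning a real AIOps product would apply; the window size, the threshold of three standard deviations and the sample latencies are assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value departs from the trailing-window baseline.

    The baseline for index i is the previous `window` values. A value is
    flagged when it lies more than `threshold` standard deviations from
    the baseline mean, so the notion of "normal" moves with the data.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is a departure.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
print(find_anomalies(latency_ms))
```

One limitation worth noting: once the spike enters the trailing window it inflates the baseline, so a follow-up anomaly could be missed. More robust approaches exclude flagged points from the baseline or use medians.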
Contextual insights, so IT can align tightly with the business by prioritizing events according to business needs, ensuring that IT’s focus is on the right tasks at the right time.
Use case coverage, because the benefit of greater intelligence in operating and delivering IT and business services is real. IT must deliver in-depth and sophisticated visualizations across many different data sources to provide service-level insights. In addition, IT needs to use AI to detect patterns, baseline behavior, establish correlations and detect anomalies to help accelerate incident resolution and root cause analysis, and operate with a business lens to drive service excellence.
Data versatility, so IT can easily and quickly combine data, including logs, events, metrics, wire data and other types of data from across the enterprise, including any heterogeneous data source, at scale. This empowers IT organizations to see correlations and make observations that are otherwise hidden.
New solutions that provide a service-centric view of IT are crucial to achieving these benefits of effective event analytics. An AIOps tool stack built from products such as Sumo Logic, Logstash, Prometheus and Moogsoft delivers the power of artificial intelligence to simplify service operations with advanced event analytics and service context. This empowers IT to prioritize incident investigation, simplify triage and accelerate resolution.
The Path Forward
If you’re part of IT, you’re faced with endless demands on your time and resources. So it’s critical that you have the right tools to do your job better, including one that helps you identify and solve the issues that matter most. This can only be done with a new approach that solves the event management problems that have plagued IT for decades. By deploying a new event analytics solution that includes artificial intelligence, you give your IT teams the gift of data-led insights.