We have a Bot for that! Reinventing Root Cause Analysis with vuRCABot 2.0
Our AIOps series focuses on how VuNet’s flagship product, vuSmartMaps, provides the platform upon which a rich set of ML models can be built, enabling enterprises to graduate to complete automation.
In the first blog of the series, we examine how enterprises can deal with Murphy’s Law in their IT Operations Management (ITOM) by employing techniques for effective root cause analysis (RCA). Our proprietary tool – vuRCABot 2.0 – makes RCA faster, more explainable, and more accurate.
“Anything that can go wrong, will go wrong.”- Murphy’s law
It was 8 PM on a Friday evening, and the Network Operations Center (popularly called the War Room) of a large private bank was buzzing with activity. Twenty-odd IT Operations personnel were huddled around a cluster of screens, sifting through service dashboards and frantically correlating alert data to determine the root cause of an issue that had started two hours earlier. Social media was abuzz with a trending hashtag expressing user outrage: payments and net banking transactions were failing, and some users were not receiving the OTP needed to authenticate their transactions. Tempers were running high, the pressure was mounting, and the clock was ticking. But one thing everyone in the room knew for certain – they were not going home that night until the issue was fixed.
At 2:15 AM, the issue was finally attributed to an erroneously triggered reporting batch job that was overloading the database, resulting in abnormally long query execution and turnaround times between the critical payments application and the database. The problem was fixed by force-stopping the reporting jobs running on the database – which required the lead of the respective team to be woken from his slumber – and all was well again.
Everyone in the troubleshooting team who finally went home that night had the same questions running through their minds: could the issue have been caught sooner, and could it have been fixed faster?
What ITOM Teams are facing today
The above incident is the norm in the life of an IT Operations Manager or Site Reliability Engineer. Digital-first enterprises have complex and heterogeneous deployment environments, distributed across a hybrid software stack, with a single transaction traversing multiple touchpoints, internal as well as external APIs, and microservices. Add to that the pressure of negative customer feedback on social media, and there is little room to think first and react later. In such a situation, when things go wrong, ITOM and Site Reliability Engineering (SRE) teams are hard-pressed to answer two important questions: how quickly can an issue be detected (Mean Time to Detect, MTTD), and how quickly can it be resolved (Mean Time to Resolve, MTTR)?
MTTD and Anomaly Detection
MTTD depends on the volume of logs and metrics that are collected and alerted upon. Typically, IT Operations teams tackle issues by sifting through multiple dashboards, each dedicated to a data silo. Alerts flow in based on static thresholds on infrastructure- and transaction-level metrics. Infrastructure metrics include CPU and memory usage, active threads, query execution times, and the like, typically collected by intrusive agents. Transaction-level metrics are collected from logs and used to track turnaround times (TATs), business and IT error codes, and response times. Alerts on these metrics, if any, give ITOM teams an idea of the state of the system.
Alerts do not necessarily mean that systems are in the red – static thresholds are not only subjective but also misleading. For instance, an upper threshold of 95% on CPU utilization will raise high-CPU alerts even while transactions continue to complete successfully, with no errors or slowness.
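To make that contrast concrete, here is a minimal sketch (in Python, with made-up values and names – not VuNet’s code) of why a static threshold is noisy, while a simple rolling-baseline check on a journey metric such as TAT surfaces the moment behaviour actually changes.

```python
# A minimal sketch (not VuNet's implementation) contrasting a static CPU
# threshold with a rolling-baseline check on transaction TAT.
# All metric values below are hypothetical, for illustration only.
from statistics import mean, stdev

cpu_util = [88, 91, 96, 97, 96, 95, 97, 96]             # % CPU, assumed samples
tat_ms   = [210, 215, 205, 220, 212, 900, 1150, 1400]   # turnaround time (ms), assumed

# 1. Static threshold: fires on every sample above 95%, even though
#    transactions may still be completing normally.
static_alerts = [i for i, v in enumerate(cpu_util) if v > 95]

# 2. Rolling baseline on the journey metric (TAT): flag points that deviate
#    sharply from the recent window, with no fixed number hard-coded.
def rolling_anomalies(series, window=4, z_limit=3.0):
    flagged = []
    for i in range(window, len(series)):
        base = series[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma and abs(series[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged

print("static CPU alerts at samples:", static_alerts)        # noisy
print("TAT anomalies at samples:", rolling_anomalies(tat_ms))  # the real signal
```

In this toy run the static rule fires on five of the eight samples, while the rolling check flags only the sample where TAT jumps away from its recent baseline.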
MTTR and Issue Resolution
MTTR, on the other hand, depends on how quickly ITOM teams can correlate alerts across services – by combing through log files and transaction traces – to arrive at possible root causes for an issue. These issues can originate either internally, in the enterprise’s own infrastructure, or externally, in one of the many microservices and APIs a transaction interacts with. An added layer of complexity comes from the fact that errors can be business or technical in nature.
Not only does downtime result in revenue loss and customer churn, but MTTR requirements have also become more stringent in recent years, with regulatory authorities stating that enterprises will be held responsible for transaction failures that are technical in nature. A penalty is also levied if TATs exceed a certain threshold. This means that MTTR must not only be reduced; issues must also be remediated as quickly as possible, before transaction failures start to occur.
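As a small worked example of the two metrics (timestamps are assumed, loosely following the incident narrated above), MTTD and MTTR for a single incident are simply deltas from the moment the issue began:

```python
# An illustrative calculation of MTTD and MTTR for one incident.
# The timestamps are assumptions based on the story above, not real customer data.
from datetime import datetime

issue_start = datetime(2024, 3, 1, 18, 0)   # first failed transactions (assumed)
detected_at = datetime(2024, 3, 1, 20, 0)   # war room begins triage (assumed)
resolved_at = datetime(2024, 3, 2, 2, 15)   # batch jobs force-stopped (assumed)

mttd = detected_at - issue_start   # time to detect this incident
mttr = resolved_at - issue_start   # time to resolve this incident

print(f"MTTD: {mttd}")   # 2:00:00
print(f"MTTR: {mttr}")   # 8:15:00
```

In practice both numbers are averaged across incidents; the point here is that most of the 8 hours 15 minutes in this example went into finding the cause, not into applying the fix.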
Some drawbacks of the existing approaches to issue detection and resolution are already apparent from the above: subjective static thresholds that generate noisy or misleading alerts, siloed dashboards and tools that force manual correlation, and a reactive posture that surfaces problems only after transactions have already started failing.
A new paradigm for optimizing MTTD and MTTR is clearly the need of the hour.
Enter vuRCABot 2.0
In what way is vuRCABot 2.0 faster, smarter, better?
vuRCABot 2.0 helps address the most common challenges faced by ITOM teams in multi-party transaction environments with complex and heterogeneous deployment architectures and multiple monitoring tools. It correlates anomalies in the context of a BUSINESS JOURNEY and presents them, using the principles of Unified Observability, on a single Incident Resolution dashboard that cuts across infrastructure, cloud, applications, and transactions. The use case below illustrates how this translates into predictive analytics and ML-driven incident triage.
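As a rough illustration of what “correlating anomalies in the context of a business journey” means (a sketch with an assumed journey map and field names, not the product’s actual data model), alerts arriving from siloed tools can be bucketed by the journey they affect and by time window, so the operator sees one journey-level incident instead of a scatter of component alerts:

```python
# A hypothetical sketch of journey-context grouping: raw alerts from different
# tools are bucketed by the business journey they affect and by a coarse time
# window. Journey topology, components and field names are assumed.
from collections import defaultdict

# Which components participate in which business journey (assumed topology).
journey_map = {
    "payments":   ["payment-app", "payments-db", "otp-gateway"],
    "netbanking": ["netbanking-app", "payments-db"],
}

raw_alerts = [  # alerts as they might arrive from siloed monitoring tools
    {"component": "payments-db", "signal": "query_time_high", "minute": 1136},
    {"component": "payment-app", "signal": "tat_anomaly",     "minute": 1137},
    {"component": "otp-gateway", "signal": "timeouts",        "minute": 1139},
]

def group_by_journey(alerts, window_min=15):
    grouped = defaultdict(list)
    for alert in alerts:
        bucket = alert["minute"] // window_min
        for journey, components in journey_map.items():
            if alert["component"] in components:
                grouped[(journey, bucket)].append(alert["signal"])
    return grouped

for (journey, bucket), signals in group_by_journey(raw_alerts).items():
    print(f"{journey}: {len(signals)} correlated signals -> {signals}")
```

Here the three component alerts collapse into one correlated group for the payments journey (plus a weaker one for net banking, since the database is shared), rather than three unrelated alarms.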
How does vuRCABot 2.0 fix the above issue?
vuRCABot 2.0 was tested in a similar customer environment on a similar issue, and here is how it performed:
2. Previously, the team had to manually correlate golden signals across service and tool dashboards AFTER the transaction declines had already started. vuRCABot, on the other hand, flagged anomalous behaviour in the transaction turnaround times (“Significant drop in TAT starting at 18:56”) and correlated it to an anomalous reporting batch job running on the database. This is an example of proactive anomaly detection on leading journey indicators, which directly reduces MTTD.
3. vuRCABot 2.0 uses a combination of transformer learning and a hierarchical approach to capture the most likely root causes, thereby decreasing MTTR. In this case, indicative golden signals across services were grouped together. Consequently, the overload on the database due to the large number of batch jobs was flagged as the most probable root cause.
4. vuRCABot 2.0 prescribed healing actions that could be performed to mitigate the incident. Instead of just being passive messengers of information, the ITOM team could initiate remediation and stop the rogue reporting batch jobs through integrations with IT Automation solutions.
5. vuRCABot allows collaboration between teams as well as the closing of feedback loops via the user interface. For instance, in the above case, root cause suggestions can be ranked by operators, which makes the algorithm either reinforce or relearn previously drawn hypotheses; a toy sketch of this ranking-and-feedback idea follows this list.
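The hierarchical ranking in point 3 and the feedback loop in point 5 can be sketched, very roughly, as follows. This is a toy illustration with hand-set scores, weights, and candidate names (the actual system relies on learned models): candidate causes are scored by how strongly their anomalies overlap the journey-level symptom, and operator feedback nudges the weights used for the next incident.

```python
# A toy sketch of root-cause ranking plus an operator feedback loop.
# Scores, weights and candidate names are assumptions for illustration;
# a real system would learn these rather than hard-code them.

# Anomaly evidence gathered for the incident, grouped per candidate cause.
candidates = {
    "db_batch_job_overload": {"anomaly_score": 0.92, "time_overlap": 0.95},
    "network_latency":       {"anomaly_score": 0.40, "time_overlap": 0.30},
    "otp_gateway_errors":    {"anomaly_score": 0.65, "time_overlap": 0.55},
}

# Prior weights per cause type, refined by past operator feedback.
weights = {name: 1.0 for name in candidates}

def rank(candidates, weights):
    # Score each candidate by evidence strength, scaled by its learned weight.
    scored = {
        name: weights[name] * ev["anomaly_score"] * ev["time_overlap"]
        for name, ev in candidates.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

def apply_feedback(weights, confirmed, rejected, step=0.1):
    """Reinforce the confirmed hypothesis and dampen the rejected ones."""
    new = dict(weights)
    new[confirmed] = new[confirmed] * (1 + step)
    for name in rejected:
        new[name] = max(0.1, new[name] * (1 - step))
    return new

print(rank(candidates, weights))            # db_batch_job_overload ranks first
weights = apply_feedback(weights, "db_batch_job_overload", ["network_latency"])
print(rank(candidates, weights))            # its lead widens for future incidents
```

In this toy run the overloaded batch job on the database already ranks first on evidence alone, and confirming it (while rejecting the network hypothesis) widens its lead for similar incidents in the future.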
Results Obtained via vuRCABot 2.0 Deployment in Test Environments
vuRCABot performed significantly better than conventional RCA in the test environment mentioned in the above use case. The following table captures the time taken for each stage of the RCA using traditional monitoring and observability tools (for a typical issue), compared to the time taken using vuRCABot’s Incident Resolution Dashboard. Further, with incident feedback loops and a larger store of incident analyses, the modules learn over time and make root cause analysis even faster, reducing the surface area of the unknown unknowns.
Conclusion
vuRCABot 2.0 uses lead indicators on journey metrics and AI/ML approaches to correlate anomalies and reduce alert noise. This not only optimizes MTTD, but also provides a means to generate intelligent root cause recommendations that reduce the time for issue triage and resolution, i.e. MTTR, by up to 75% over conventional methods.
So how does vuRCABot do this? We will explore this, along with vuCoreMLOps – VuNet’s ML Operating System layer that sits on top of vuSmartMaps and provides the AI/ML infrastructure and accelerators needed to automate your enterprise – in the next blog of the series: Under the Hood of vuRCABot 2.0 – What makes VuNet’s RCA Assistant tick.