We have a Bot for that! Reinventing Root Cause Analysis with vuRCABot 2.0
Our AIOps series focuses on how VuNet’s flagship product, vuSmartMaps, provides the platform upon which a rich set of ML models can be built, enabling enterprises to graduate to complete automation.
In the first blog of the series, we examine how enterprises can deal with Murphy’s Law in their IT Operations Management (ITOM) by employing techniques for effective root cause analysis (RCA). Our proprietary tool – vuRCABot 2.0 – makes RCA faster, more explainable, and more accurate.
“Anything that can go wrong, will go wrong.”- Murphy’s law
It was 8 PM on a Friday evening, and the Network Operations Center (popularly called the War Room) of a large private bank was buzzing with activity. Twenty-odd IT Operations personnel were huddled around a cluster of screens, sifting through service dashboards and frantically correlating alert data to determine the root cause of an issue that had started two hours earlier. Social media was abuzz with a trending hashtag expressing user outrage: payments and net banking transactions were failing, and some users were not receiving the OTP needed to authenticate their transactions. Tempers were running high, the pressure was mounting, and the clock was ticking. But one thing everyone in the room knew for certain – they were not going home that night until the issue was fixed.
At 2:15 AM, the issue was finally attributed to an erroneously triggered reporting batch job that was overloading the database, resulting in abnormally long query execution and turnaround times between the critical payments application and the database. The problem was fixed by force-stopping the reporting jobs running on the database – which required the lead of the respective team to be woken from his slumber – and all was well again.
Everyone in the troubleshooting team who finally went home that night had the same questions running through their minds: could the issue have been caught sooner, and could it have been fixed faster?
What ITOM Teams are facing today
The above incident is the norm in the life of an IT Operations Manager or Site Reliability Engineer. Digital-first enterprises have complex and heterogeneous deployment environments, distributed across a hybrid software stack, with a single transaction traversing multiple touchpoints, internal as well as external APIs, and microservices. Add to that the pressure of negative customer feedback on social media, and there is little room to think first and react later. In such a situation, when things go wrong, ITOM and Site Reliability Engineering (SRE) teams are hard-pressed to answer two important questions: how quickly can an issue be detected (Mean Time to Detect, MTTD), and how quickly can it be resolved (Mean Time to Resolve, MTTR)?
MTTD and Anomaly Detection
MTTD depends on the volume of logs and metrics that are collected and alerted upon. Typically, IT Operations teams tackle issues by sifting through multiple dashboards, each dedicated to a data silo. Alerts flow in based on static thresholds on infrastructure- and transaction-level metrics. Infrastructure metrics include CPU and memory usage, active threads, query execution times, and the like, typically collected by intrusive agents. Transaction-level metrics are collected from logs and used to track turnaround times (TATs), business and IT error codes, and response times. Alerts on these metrics, if any, give ITOM teams an idea of the state of the system.
Alerts do not necessarily mean that systems are in the red – static thresholds are not only subjective but also misleading. For instance, an upper threshold of 95% on CPU utilization will raise high-CPU alerts even while transactions continue to complete successfully, with no errors or slowness.
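To make that contrast concrete, here is a minimal sketch (in Python, with made-up values and names – not VuNet’s code) of why a static threshold is noisy, while a simple rolling-baseline check on a journey metric such as TAT surfaces the moment behaviour actually changes.

```python
# A minimal sketch (not VuNet's implementation) contrasting a static CPU
# threshold with a rolling-baseline check on transaction TAT.
# All metric values below are hypothetical, for illustration only.
from statistics import mean, stdev

cpu_util = [88, 91, 96, 97, 96, 95, 97, 96]             # % CPU, assumed samples
tat_ms   = [210, 215, 205, 220, 212, 900, 1150, 1400]   # turnaround time (ms), assumed

# 1. Static threshold: fires on every sample above 95%, even though
#    transactions may still be completing normally.
static_alerts = [i for i, v in enumerate(cpu_util) if v > 95]

# 2. Rolling baseline on the journey metric (TAT): flag points that deviate
#    sharply from the recent window, with no fixed number hard-coded.
def rolling_anomalies(series, window=4, z_limit=3.0):
    flagged = []
    for i in range(window, len(series)):
        base = series[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma and abs(series[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged

print("static CPU alerts at samples:", static_alerts)        # noisy
print("TAT anomalies at samples:", rolling_anomalies(tat_ms))  # the real signal
```

In this toy run the static rule fires on five of the eight samples, while the rolling check flags only the sample where TAT jumps away from its recent baseline.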
MTTR and Issue Resolution
MTTR, on the other hand, depends on how quickly ITOM teams can correlate alerts across services – by combing through log files and transaction traces – to arrive at possible root causes for an issue. These issues can originate either internally, in the enterprise’s own infrastructure, or externally, in one of the many microservices and APIs a transaction interacts with. An added layer of complexity comes from the fact that errors can be business or technical in nature.
Not only does downtime result in revenue loss and customer churn, but MTTR requirements have also become more stringent in recent years, with regulatory authorities stating that enterprises will be held responsible for transaction failures that are technical in nature. A penalty is also levied if TATs exceed a certain threshold. This means that MTTR must not only be reduced; issues must also be remediated as quickly as possible, before transaction failures start to occur.
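As a small worked example of the two metrics (timestamps are assumed, loosely following the incident narrated above), MTTD and MTTR for a single incident are simply deltas from the moment the issue began:

```python
# An illustrative calculation of MTTD and MTTR for one incident.
# The timestamps are assumptions based on the story above, not real customer data.
from datetime import datetime

issue_start = datetime(2024, 3, 1, 18, 0)   # first failed transactions (assumed)
detected_at = datetime(2024, 3, 1, 20, 0)   # war room begins triage (assumed)
resolved_at = datetime(2024, 3, 2, 2, 15)   # batch jobs force-stopped (assumed)

mttd = detected_at - issue_start   # time to detect this incident
mttr = resolved_at - issue_start   # time to resolve this incident

print(f"MTTD: {mttd}")   # 2:00:00
print(f"MTTR: {mttr}")   # 8:15:00
```

In practice both numbers are averaged across incidents; the point here is that most of the 8 hours 15 minutes in this example went into finding the cause, not into applying the fix.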
Some drawbacks of the existing approaches to issue detection and resolution are already apparent from the above: subjective static thresholds that generate noisy or misleading alerts, siloed dashboards and tools that force manual correlation, and a reactive posture that surfaces problems only after transactions have already started failing.
A new paradigm for optimizing MTTD and MTTR is clearly the need of the hour.
Enter vuRCABot 2.0
In what way is vuRCABot 2.0 faster, smarter, better?
vuRCABot 2.0 helps address the most common challenges faced by ITOM teams in multi-party transaction environments with complex and heterogeneous deployment architectures and multiple monitoring tools. It correlates anomalies in the context of a BUSINESS JOURNEY and presents them, using the principles of Unified Observability, on a single Incident Resolution dashboard that cuts across infrastructure, cloud, applications, and transactions. The use case below illustrates how this translates into predictive analytics and ML-driven incident triage.
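As a rough illustration of what “correlating anomalies in the context of a business journey” means (a sketch with an assumed journey map and field names, not the product’s actual data model), alerts arriving from siloed tools can be bucketed by the journey they affect and by time window, so the operator sees one journey-level incident instead of a scatter of component alerts:

```python
# A hypothetical sketch of journey-context grouping: raw alerts from different
# tools are bucketed by the business journey they affect and by a coarse time
# window. Journey topology, components and field names are assumed.
from collections import defaultdict

# Which components participate in which business journey (assumed topology).
journey_map = {
    "payments":   ["payment-app", "payments-db", "otp-gateway"],
    "netbanking": ["netbanking-app", "payments-db"],
}

raw_alerts = [  # alerts as they might arrive from siloed monitoring tools
    {"component": "payments-db", "signal": "query_time_high", "minute": 1136},
    {"component": "payment-app", "signal": "tat_anomaly",     "minute": 1137},
    {"component": "otp-gateway", "signal": "timeouts",        "minute": 1139},
]

def group_by_journey(alerts, window_min=15):
    grouped = defaultdict(list)
    for alert in alerts:
        bucket = alert["minute"] // window_min
        for journey, components in journey_map.items():
            if alert["component"] in components:
                grouped[(journey, bucket)].append(alert["signal"])
    return grouped

for (journey, bucket), signals in group_by_journey(raw_alerts).items():
    print(f"{journey}: {len(signals)} correlated signals -> {signals}")
```

Here the three component alerts collapse into one correlated group for the payments journey (plus a weaker one for net banking, since the database is shared), rather than three unrelated alarms.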
How does vuRCABot 2.0 fix the above issue?
vuRCABot 2.0 was tested in a similar customer environment on a similar issue, and here is how it performed:
2. Previously, the team had to manually correlate golden signals across service and tool dashboards AFTER the transaction declines had already started. vuRCABot, on the other hand, flagged anomalous behaviour in the transaction turnaround times (“Significant drop in TAT starting at 18:56”) and correlated it to an anomalous reporting batch job running on the database. This is an example of proactive anomaly detection on leading journey indicators, which directly reduces MTTD.
3. vuRCABot 2.0 uses a combination of transformer learning and a hierarchical approach to capture the most likely root causes, thereby decreasing MTTR. In this case, indicative golden signals across services were grouped together. Consequently, the overload on the database due to the large number of batch jobs was flagged as the most probable root cause.
4. vuRCABot 2.0 prescribed healing actions that could be performed to mitigate the incident. Instead of just being passive messengers of information, the ITOM team could initiate remediation and stop the rogue reporting batch jobs through integrations with IT Automation solutions.
5. vuRCABot allows collaboration between teams as well as the closing of feedback loops via the user interface. For instance, in the above case, root cause suggestions can be ranked by operators, which makes the algorithm either reinforce or relearn previously drawn hypotheses; a toy sketch of this ranking-and-feedback idea follows this list.
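The hierarchical ranking in point 3 and the feedback loop in point 5 can be sketched, very roughly, as follows. This is a toy illustration with hand-set scores, weights, and candidate names (the actual system relies on learned models): candidate causes are scored by how strongly their anomalies overlap the journey-level symptom, and operator feedback nudges the weights used for the next incident.

```python
# A toy sketch of root-cause ranking plus an operator feedback loop.
# Scores, weights and candidate names are assumptions for illustration;
# a real system would learn these rather than hard-code them.

# Anomaly evidence gathered for the incident, grouped per candidate cause.
candidates = {
    "db_batch_job_overload": {"anomaly_score": 0.92, "time_overlap": 0.95},
    "network_latency":       {"anomaly_score": 0.40, "time_overlap": 0.30},
    "otp_gateway_errors":    {"anomaly_score": 0.65, "time_overlap": 0.55},
}

# Prior weights per cause type, refined by past operator feedback.
weights = {name: 1.0 for name in candidates}

def rank(candidates, weights):
    # Score each candidate by evidence strength, scaled by its learned weight.
    scored = {
        name: weights[name] * ev["anomaly_score"] * ev["time_overlap"]
        for name, ev in candidates.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

def apply_feedback(weights, confirmed, rejected, step=0.1):
    """Reinforce the confirmed hypothesis and dampen the rejected ones."""
    new = dict(weights)
    new[confirmed] = new[confirmed] * (1 + step)
    for name in rejected:
        new[name] = max(0.1, new[name] * (1 - step))
    return new

print(rank(candidates, weights))            # db_batch_job_overload ranks first
weights = apply_feedback(weights, "db_batch_job_overload", ["network_latency"])
print(rank(candidates, weights))            # its lead widens for future incidents
```

In this toy run the overloaded batch job on the database already ranks first on evidence alone, and confirming it (while rejecting the network hypothesis) widens its lead for similar incidents in the future.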
Results Obtained via vuRCABot 2.0 Deployment in Test Environments
vuRCABot performed significantly better than conventional RCA in the test environment mentioned in the above use case. The following table captures the time taken for each stage of the RCA using traditional monitoring and observability tools (for a typical issue), compared to the time taken using vuRCABot’s Incident Resolution Dashboard. Further, with incident feedback loops and a larger store of incident analyses, the modules learn over time and make root cause analysis even faster, reducing the surface area of the unknown unknowns.
Conclusion
vuRCABot 2.0 uses lead indicators on journey metrics and AI/ML approaches to correlate anomalies and reduce alert noise. This not only optimizes MTTD, but also provides a means to generate intelligent root cause recommendations that reduce the time for issue triage and resolution, i.e. MTTR, by up to 75% over conventional methods.
So how does vuRCABot do this? We will explore this, along with vuCoreMLOps – VuNet’s ML Operating System layer that sits on top of vuSmartMaps and provides the AI/ML infrastructure and accelerators needed to automate your enterprise – in the next blog of the series: Under the Hood of vuRCABot 2.0 – What makes VuNet’s RCA Assistant tick.