Why are pre-built ML models a must for your AIOps platform?
In one of our earlier blogs, we discussed the need to address the challenges of digitization through a journey-based AIOps platform (such as the vuSmartMaps platform illustrated in Figure 1). vuSmartMaps, as a full-stack AIOps product, brings together its platform enablers (Data pipeline for Unified Visibility, Correlation Engine for Unified Visibility, MLOps for ML Driven KPIs, and Integration for Automation) to enable organizations to reach their AIOps maturity goals. We demystified the platform enablers as the fundamental building blocks of your AIOps solution. We discussed the need for unified visibility and reasoned that correlating your business transactions across distributed systems is key to business and customer journey visibility.
However, an AIOps platform is not limited to unified visibility through customized dashboards and correlation of your business transactions. After all, when your applications and infrastructure generate so many logs, metrics, events and traces, they are of no use if you can’t turn them into knowledge, insights, actions and automation through machine learning.
Pedro Domingos, in his best-selling book “The Master Algorithm”, offers a nice analogy using farming as a metaphor: “Learning algorithms [ML algorithms] are the seeds, data is the soil, and the learned programs are the grown plants. The machine-learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.”
For an AIOps platform, ML algorithms are the seeds, monitoring and observability data is the soil, and MLOps (Machine Learning for Operations) models for AIOps are the grown plants. Imagine you have bought a new house: you are likely to head to a nursery to buy a few grown plants (along with some seeds) for your garden, so that you can focus on irrigating and fertilizing the soil and get a quick return from the garden, along with the pleasure it provides.
MLOps (Machine Learning for Operations) applies ML (Machine Learning) algorithms to the operations data lake built by the data pipeline, and streamlines the lifecycle of the machine learning models.
Let us understand the equivalent of “grown plants” that MLOps can enable in your AIOps platform. To get there, we first need to understand why your AIOps platform needs MLOps in the first place.
AIOps Exchange, a not-for-profit private forum, surveyed 100 IT executives representing large enterprise organizations, as well as IT industry analysts and academics. The industries represented include financial services, transportation, technology, education, and healthcare. The participant survey revealed that 26% of participants deal with 50 or more monitoring tools in their enterprise, while 40% of organizations are flooded with more than 1 million events every day (Figure 2).
The survey states: “it’s clear that those IT leaders charged with overseeing monitoring hold decision-making power. Among AIOps Exchange participants, 49% indicated in our survey that monitoring drove their decision to deploy AIOps.”
The survey also captured the organizations’ key concerns about downtime, as shown in Figure 3. About 45% of the concerns relate to the inability to root-cause or predict incidents, about 20% to the inability to remediate incidents faster, and another 20% to the lack of automation for resolving routine tasks.
We can summarize the key asks of the participants as below:
It is simply impossible for humans to make sense of the petabytes of operational data generated by their IT systems. It is small wonder that Gartner observes “there is no future of IT operations that does not include AIOps. This is due to the rapid growth in data volumes and pace of change (exemplified by rate of application delivery and event-driven business models) that cannot wait on humans to derive insights.”
“Data is the new oil” is a popular refrain, and as with oil, refining the operational data generated by IT systems is a huge challenge.
Enter MLOps.
Machine learning requires statistical thinking, whereas traditional programming mostly requires deterministic thinking. Organizations that attempt to deploy an AIOps platform without an MLOps team and proper MLOps practices in place will face problems with the quality and continuity of their machine learning models, and those problems will hurt the business. You need an AIOps platform vendor who has pre-built ML models that address the top concerns, such as automated RCA, anomaly detection, self-healing and intelligent alerting.
Let us take a brief look at how an AIOps platform such as the vuSmartMaps platform can enable MLOps models to address the concerns raised in the AIOps Exchange survey.
Automated RCA is a self-service capability that automatically identifies the suspects behind an operations issue and drastically reduces the time spent finding the root cause of performance issues and failures.
The fundamental approach of an automated RCA is to speed up troubleshooting across applications and infrastructure by summarizing request tracing, logs, analysis, and metadata. The automated RCA service needs to be service-centric. A dynamic service topology graph as illustrated in Figure 4 can help the service owner determine the root cause of service issues by looking at application, network, database and server metrics that measure infrastructure utilization.
With MLOps, you can understand much more complex phenomena than before. Before MLOps, you would have used a very limited range of models, such as linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most IT operational problems are more complex and nonlinear. MLOps opens up a vast new world of nonlinear models. It’s like turning on the lights in a room where only a sliver of moonlight filtered in before. Using causality algorithms, MLOps can help identify the suspects behind performance issues and failures.
Causality algorithms identify changes in critical nodes in physical or logical topologies to assess and understand the impacts of alerts. They also help your operations teams understand which events have the highest probability to be the root cause, guiding teams to the best starting point for troubleshooting and remediation.
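To make the idea of topology-guided suspect ranking concrete, here is a minimal Python sketch. It is not a causality algorithm and not vuSmartMaps' implementation; it is a simple heuristic that assumes a known, acyclic dependency map. The function names, the `depends_on` structure and the service names are all hypothetical. An alerting service is treated as a candidate root cause when none of the services it depends on is alerting, and candidates are ranked by how many other alerting services sit upstream of them.

```python
def reachable(depends_on, start):
    """All services that `start` depends on, directly or transitively."""
    seen, stack = set(), [start]
    while stack:
        for dep in depends_on.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def rank_suspects(depends_on, alerting):
    """Return candidate root causes, most widely felt first."""
    alerting = set(alerting)
    deps = {s: reachable(depends_on, s) for s in alerting}
    # A candidate is an alerting service with no alerting dependency below it.
    candidates = [s for s in alerting if not (deps[s] & alerting)]
    # Score: how many other alerting services depend on the candidate.
    explained = {c: sum(1 for s in alerting if c in deps[s]) for c in candidates}
    # Sort by score (descending), then by name for a stable order.
    return sorted(candidates, key=lambda c: (-explained[c], c))


# Hypothetical topology: each service maps to the services it calls.
depends_on = {
    "checkout": ["payments", "catalog"],
    "payments": ["db"],
    "catalog": ["db", "cache"],
}
print(rank_suspects(depends_on, ["checkout", "payments", "db", "cache"]))
# -> ['db', 'cache']
```

In this example `db` ranks first because two alerting services (`checkout` and `payments`) depend on it, which gives the operations team a reasonable starting point for troubleshooting.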
Anomalies are patterns in data that do not conform to a well-defined notion of normal behavior. To spot an anomaly, identification of what is normal is required. Anomalies can be any linear/nonlinear combination of various attributes - seasonality, trend, auto-correlation, noise, and so on. Unfortunately, no single statistical/ML approach can take care of all scenarios.
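As a small illustration of "defining normal before spotting the abnormal", the sketch below flags points that sit far from a rolling baseline. It is only one simple statistical technique (rolling median with median absolute deviation), chosen here as an assumption for illustration. The function name, window size, threshold and sample latency values are all hypothetical.

```python
from statistics import median


def detect_anomalies(values, window=12, threshold=3.5):
    """Return indices of points that deviate strongly from the recent baseline."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = median(history)
        # Median absolute deviation: a spread estimate that outliers barely move.
        mad = median(abs(v - baseline) for v in history)
        # 1.4826 makes MAD comparable to a standard deviation for normal data;
        # the small floor avoids division by zero on perfectly flat signals.
        scale = max(1.4826 * mad, 1e-9)
        if abs(values[i] - baseline) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 100, 450, 101, 99]
print(detect_anomalies(latency_ms))  # -> [13]
```

A baseline like this knows nothing about seasonality or trend, so a regular Monday-morning peak would be flagged every week. That limitation is exactly why no single technique covers every signal.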
VuNet’s approach to anomaly detection consists of online time series classification to identify signature/behavior of time series and dynamically decide on techniques to use. It uses both deep learning models as well as statistical models that can scale to thousands of signals and dimensions.
Tom Limoncelli, co-author of “The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems” and a former Site Reliability Engineer at Google, relates the following story on monitoring:
“When people ask me for recommendations on what to monitor, I joke that in an ideal world, we would delete all the alerts we currently have in our monitoring system. Then, after each user-visible outage, we’d ask what indicators would have predicted that outage and then add those to our monitoring system, alerting as needed. Repeat. Now we only have alerts that prevent outages, as opposed to being bombarded by alerts after an outage already occurred.”
Tom’s story will resonate with a lot of operations engineers - set thresholds too low and you get a deluge of spurious alerts. Then you overcompensate and relax the tuning to the point of missing important health warnings. The end result is a lack of trust in alerts.
The vuSmartMaps platform reduces the number of alerts that IT systems generate on a daily basis by using ML algorithms to bubble up important alerts and reduce noise.
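To give a feel for what noise reduction means in practice, here is a deliberately simple, rule-based sketch in Python. It is a stand-in for illustration, not an ML model and not how vuSmartMaps works. It collapses repeated alerts from the same host and check into one group when they arrive within a time window. The field names (`ts`, `host`, `check`) and the five-minute window are assumptions.

```python
def group_alerts(alerts, window_seconds=300):
    """Collapse repeats of the same (host, check) alert into a single group."""
    groups = []
    latest = {}  # fingerprint -> most recent group with that fingerprint
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        fingerprint = (alert["host"], alert["check"])
        group = latest.get(fingerprint)
        if group is not None and alert["ts"] - group["last_ts"] <= window_seconds:
            # Same problem still firing: count it instead of raising it again.
            group["count"] += 1
            group["last_ts"] = alert["ts"]
        else:
            group = {
                "host": alert["host"],
                "check": alert["check"],
                "first_ts": alert["ts"],
                "last_ts": alert["ts"],
                "count": 1,
            }
            groups.append(group)
            latest[fingerprint] = group
    return groups


# Hypothetical alert stream; timestamps are seconds since some start point.
alerts = [
    {"ts": 0, "host": "web-1", "check": "cpu"},
    {"ts": 30, "host": "db-1", "check": "disk"},
    {"ts": 60, "host": "web-1", "check": "cpu"},
    {"ts": 120, "host": "web-1", "check": "cpu"},
    {"ts": 1000, "host": "web-1", "check": "cpu"},
]
for g in group_alerts(alerts):
    print(g["host"], g["check"], g["count"])
```

Five raw alerts become three groups here. An ML-driven approach would go further, for example by learning which groups tend to precede user-visible outages, but the goal is the same: fewer, more trustworthy alerts.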
Enterprises typically have ITSM (IT Service Management) tools and processes to deal with incident management, problem management and change management. Automating these processes helps reduce downtime and cost when done effectively. The key to automating these processes effectively and efficiently is clean and accurate data. Without robust integration with ITSM processes, automation remains a dream. The key function of an incident management tool is to automatically assign incidents to the correct resolution group, help bring together stakeholders to investigate issues and restore services swiftly.
The vuSmartMaps platform, for example, provides runbook automation, where scripts and recipes can be invoked on alerts to take corrective action and collect additional data.
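The general pattern behind runbook automation can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the vuSmartMaps API. The registry, the decorator, the alert types and the alert fields are all hypothetical, and the handlers only describe the action they would take rather than executing anything.

```python
RUNBOOKS = {}  # alert type -> function that handles it


def runbook(alert_type):
    """Decorator that registers a function as the runbook for an alert type."""
    def register(func):
        RUNBOOKS[alert_type] = func
        return func
    return register


@runbook("disk_full")
def purge_temp_files(alert):
    # A real runbook would invoke a script; this sketch only reports the action.
    return f"would purge temp files on {alert['host']}"


@runbook("service_down")
def restart_service(alert):
    return f"would restart {alert['service']} on {alert['host']}"


def handle(alert):
    """Run the matching runbook, or fall back to a human when none exists."""
    action = RUNBOOKS.get(alert["type"])
    if action is None:
        return "no runbook registered; escalate to on-call"
    return action(alert)


print(handle({"type": "disk_full", "host": "db-1"}))
print(handle({"type": "cert_expiry", "host": "web-1"}))
```

Keeping an explicit fallback for unknown alert types matters: automation should handle the routine cases and hand everything else to people, not guess.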
Key Takeaways
Srikanth Narasimhan, the author of the article, is a Technical Advisor at VuNet Systems. He is an Enterprise Architect and has served as a distinguished engineer at Cisco.
VuNet Systems is a deep tech AIOps startup revolutionizing digital transactions. VuNet's platform, vuSmartMaps, is a next-generation full-stack deep observability product built using big data and ML models in innovative ways for monitoring and analytics of business journeys to provide superior customer experience. Monitoring more than 3 billion transactions per month, VuNet's platform is improving the digital payment experience and accelerating digital transformation initiatives across BFSI, FinTechs, Payment Gateways and other verticals.
To learn more about VuNet Systems, visit https://www.vunetsystems.com/