Understanding AIOps and its linkage to High Availability and Observability

Technology never sleeps.

Businesses need high availability, especially for mission-critical applications. When setting uptime targets, five nines (99.999%) is an expensive affair, and each additional 9 you commit to in your SLA raises the cost you must be willing to invest.
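To make the cost of each extra 9 concrete, here is a minimal sketch that converts a number of nines into a yearly downtime budget. The function name is illustrative, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001 (99.9% uptime)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):,.2f} minutes/year")
```

Under these assumptions, three nines leave roughly 8.8 hours of downtime a year, while five nines leave a little over five minutes, which is why each step up tends to demand more redundancy and automation.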

Technology leadership aims to define SLAs for the different aspects of availability: target goals for redundancy, failover, rollback and scaling, and the degree of automation for each. There is also the angle of composite availability, since in the cloud-native era of distributed systems no system works in a silo; each depends on upstream systems or integrates through external interfaces.
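Composite availability can be estimated with two standard formulas, sketched below. The sketch assumes that components fail independently, which real systems only approximate, and the SLA figures are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Every component is required: the chain is up only when all links are up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Any one replica suffices: the group is down only when all replicas are down."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical SLAs for an application, its database and an external API.
app, database, external_api = 0.9995, 0.9999, 0.999
print(f"serial chain:   {serial_availability([app, database, external_api]):.5f}")
print(f"two replicas:   {redundant_availability([0.99, 0.99]):.5f}")
```

The serial result is always lower than the weakest dependency, so an SLA cannot meaningfully promise more than the chain of systems behind it can deliver.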

An aspect closely linked to resilient, high-uptime systems is “observability”: the ability to learn what is happening in your system and so avoid extended outages.

The three pillars of observability are metrics, logs, and traces.

The scope of observability is, to a large extent, about helping you identify a problem as soon as it happens, and sometimes even before an incident happens. This is where AIOps fits in: it tries to provide proactive alerts and responses based on the event and telemetry data captured in the IT environment.

AIOps can be seen as a part of observability. Assuming you have complete observability data, viz. M.E.L.T. (metrics, events, logs and traces), you can leverage AIOps and the power of AI/ML to correlate events, identify problems and the causes of incidents, and suggest what can be done to fix them.

AIOps, its benefits and what it involves

The term “AIOps” stands for “Artificial Intelligence for IT Operations.” Originally coined by Gartner, it refers to the way an IT team manages data and information from an application environment, in this case using AI.

The definition of AIOps by Gartner says “AIOps platforms utilise big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight.”

AIOps is a system that relies heavily on data and learning to provide proactive predictions, alerts and decisions, whose accuracy should improve over time.

The benefits AIOps can boast of are:

This is of enormous value to IT operations teams, in turn helping them achieve high availability and adhere to uptime metrics.

Typical AIOps use cases include decreasing MTTR (mean time to repair) and its associated cost, proactive performance monitoring, and driving faster and better decision-making for the team. The broad categories where we can see an impact are incident and problem management, IT operations analytics, infrastructure management, and capacity management.
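Since MTTR is the metric most often used to justify AIOps, it helps to be explicit about how it is computed. The sketch below averages the time from detection to resolution; the incident timestamps are hypothetical, and teams differ on whether the clock starts at detection or at the start of the outage.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (resolved - detected) across incidents, as a timedelta."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents as (detected, resolved) pairs.
incidents = [
    (datetime(2023, 1, 5, 10, 0), datetime(2023, 1, 5, 10, 45)),
    (datetime(2023, 2, 11, 2, 30), datetime(2023, 2, 11, 4, 0)),
    (datetime(2023, 3, 2, 16, 15), datetime(2023, 3, 2, 16, 30)),
]
print("MTTR:", mean_time_to_repair(incidents))
```

Tracking this figure before and after introducing AIOps gives a simple baseline for measuring the improvement.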

The Road to AIOps

At the heart of AIOps are machine learning and the telemetry data collected. Data ingestion is critical: clean, usable telemetry data that you can depend on is a prerequisite. The other important step at the start of the AIOps journey is to understand what your enterprise needs to collect and measure, and to focus on those aspects when building out your plan. Explainability is another key aspect of this initiative, as AIOps augments the decision-making capability of the IT Ops team and therefore needs to earn the trust of the team using it. Finally, the IT environment and an intelligent infrastructure are crucial for capturing the right telemetry and event data.

The five main functions of AIOps, as described by Gartner, are:

  1. Data Ingestion – this function ingests, indexes and normalises events from across devices and vendors, gathering data and telemetry such as syslog, config changes, SNMP, NetFlow and others
  2. Topology – this function maintains a list of the various devices and the relationships between them, to understand the context between the end user and the resource they are trying to access
  3. Correlation – this function correlates the telemetry data between devices
  4. Recognition – this is where issues are detected or predicted based on the machine learning model. This is the stage at which to identify the use cases and decide your focus and what you want to achieve with AIOps
  5. Remediation – this function makes a recommendation based on the situation, or automates a response to an external system
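The five functions above can be sketched as a toy pipeline. This is an illustration under stated assumptions, not how any particular platform works: the topology, runbook and event fields are hypothetical, a single shared baseline is used for all devices, and a simple z-score threshold stands in for the machine learning model.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical dependency map: each device points at the upstream it relies on.
TOPOLOGY = {"web-1": "app-1", "app-1": "db-1", "db-1": None}
# Hypothetical runbook: suggested action per probable-cause device.
RUNBOOK = {"db-1": "fail over to the standby database", "app-1": "restart the app pool"}


def ingest(raw_events):
    """1. Data ingestion: normalise field names and types across sources."""
    return [
        {"device": e["host"].lower(), "ts": int(e["ts"]), "value": float(e["value"])}
        for e in raw_events
    ]


def depth(device):
    """2. Topology: number of hops from a device to the root of its chain."""
    hops = 0
    while TOPOLOGY.get(device):
        device, hops = TOPOLOGY[device], hops + 1
    return hops


def chain_root(device):
    """2. Topology: follow dependencies to the most upstream device."""
    while TOPOLOGY.get(device):
        device = TOPOLOGY[device]
    return device


def correlate(events, window=60):
    """3. Correlation: group events in the same time window and dependency chain."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["ts"] // window, chain_root(e["device"]))].append(e)
    return list(groups.values())


def recognise(group, baseline, threshold=3.0):
    """4. Recognition: flag values far from the baseline; return the most upstream device."""
    mu, sigma = mean(baseline), stdev(baseline)
    flagged = [e for e in group if abs(e["value"] - mu) > threshold * sigma]
    if not flagged:
        return None
    return min(flagged, key=lambda e: depth(e["device"]))["device"]


def remediate(device):
    """5. Remediation: recommend an action (a real system might automate it)."""
    return RUNBOOK.get(device, "escalate to the on-call engineer")


baseline = [20, 22, 19, 21, 20, 23, 18, 21]  # hypothetical normal latencies (ms)
raw = [
    {"host": "WEB-1", "ts": "1000", "value": "180"},
    {"host": "APP-1", "ts": "1010", "value": "175"},
    {"host": "DB-1", "ts": "1015", "value": "160"},
]
for group in correlate(ingest(raw)):
    cause = recognise(group, baseline)
    if cause:
        print(cause, "->", remediate(cause))
```

In this sketch three alerts collapse into one probable cause, the database, because it is the most upstream device in the affected dependency chain. That reduction of alert noise is the main point of combining topology with correlation.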

There may be a lot of false positives in the beginning, as the system relies heavily on learning and improves over time with a supervised learning model.
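One hedged way to watch the false-positive rate fall is to track alert precision from operator feedback, the same labels a supervised model would learn from. The feedback data and function name below are hypothetical.

```python
def alert_precision(feedback):
    """Share of raised alerts that operators confirmed as real incidents."""
    if not feedback:
        return 0.0
    return sum(feedback) / len(feedback)


# Hypothetical operator labels: True = real incident, False = false positive.
weekly_feedback = {
    "week 1": [True, False, False, False],
    "week 4": [True, True, False, True],
}
for week, labels in weekly_feedback.items():
    print(f"{week}: precision {alert_precision(labels):.2f}")
```

A rising precision suggests the model is learning from the labels; a flat one is a cue to revisit the quality of the telemetry or the feedback itself.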

AIOps Platform – the right time to get on one

The AIOps platform market is relatively new, and most vendors are in the process of introducing more use cases to their machine learning models. The features provided by most platforms include:

  • Identifying meaningful patterns that provide insights (Pattern Discovery)
  • Providing Automated Insights - using correlation, asset relationships and dependencies between events associated with an incident (Root Cause Analysis and Predictive Analysis)
  • Probable Adaptive Remediation - As the technology matures, users will be able to leverage prescriptive advice from the platform, enabling the action stage

The road map for adopting an AIOps platform could start with incremental goals for observability, such as setting up a metrics program with practical outcomes, and then move to an all-inclusive AIOps platform.

Many companies are also trying a do-it-yourself (DIY) architectural approach to AIOps, using strategies and tools such as data lakes, a transport layer, a data pipeline (e.g., using Kafka), analytics and visualisation.

Conclusion

To summarise, meeting the high availability requirements of mission-critical applications depends heavily on observability. Given the amount of data captured by modern IT environments and infrastructure, it is of great value to the IT Ops team to automate, and to plan the adoption of AIOps in an incremental manner.

The best way to start on the AIOps journey is to identify the use case to begin with, ideally a small, focused area where you want to ensure high availability and failover. This can be built upon over time, and at the right moment an AIOps platform can be introduced, with improved MTTR (mean time to repair) and improved customer experience justifying the ROI.

“The journey of a thousand miles begins with a single step” and it's never too late to take the first step towards AIOps.
