A journey to Autonomous AI Ops
GURDEEP SINGH CHOPRA
Digital Transformation | AI Ops | SecOpsAI | Intelligent Automation | Artificial Intelligence | Lean Six Sigma Black Belt | Process Re-engineering | Product Owner | PRINCE2 | TOGAF | CISSP | ISC2
For a long time AI Ops has been defined as the identification, analysis and rectification of incidents in IT Ops by leveraging AI. In the ever-evolving world of AI and today's complex infrastructure landscape, it is not sustainable to be limited to that.
We need to go further and integrate AI Ops into the overall IT landscape, shifting our capabilities from reactive maintenance towards preventive maintenance.
Most incident management processes are based on AI fundamentals with human intervention, owing to the lack of an overall view of the infrastructure landscape at the application, service, host and data centre levels. A lack of integration between ITSM, application management, development and digital transformation initiatives results in silos, which further increases the risk of AI Ops turning manual and less efficient.
Before moving forward, let's first understand what AI Ops is.
What is AI Ops?
AI Ops, short for Artificial Intelligence for IT Operations, involves leveraging artificial intelligence and machine learning to enhance and automate various aspects of IT operations. The components of AI Ops typically include:
· Monitoring and Analytics: Continuous monitoring of IT infrastructure, applications and performance metrics, along with advanced analytics to detect patterns and anomalies (a minimal sketch follows this list)
· Preventive Maintenance: Building an open, observable and transparent ecosystem that gives IT Ops, the AM team, the Dev team and business operations a proactive view of system performance and potential issues
· Incident Management: Using AI to identify and prioritise incidents, and to automate responses to common issues, working with the monitoring and preventive maintenance framework so incidents move effectively from identification to registration and on to resolution
· Root Cause Analysis: MTTI (mean time to identify) and MTTR (mean time to resolve) are key IT Ops performance parameters, and the ability to automatically analyse data to determine the underlying causes of problems and prevent them from recurring drives both
· Automation: Implementing automation for routine tasks and processes, reducing manual intervention and improving efficiency; for example, assigning incidents to the right assignment groups, follow-ups, notifications to the user base, and evaluating the CIs (configuration items) impacted by an incident
· Collaboration and Communication: Integrating AI into collaboration tools to facilitate communication and coordination among teams, streamlining operations with clear accountability
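As a taste of the monitoring-and-analytics component, here is a minimal sketch that flags values deviating sharply from a rolling baseline. It assumes a plain stream of response-time samples; the window size and threshold are illustrative, not recommendations.

```python
# Minimal sketch: flag anomalies in a stream of response-time samples
# using a rolling mean and standard deviation (z-score). Window size and
# threshold are illustrative assumptions.
from collections import deque
import statistics

def detect_anomalies(samples, window=30, z_threshold=3.0):
    """Yield (index, value, z_score) for samples that deviate strongly
    from the recent baseline."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history) or 1e-9  # avoid divide-by-zero
            z = (value - mean) / stdev
            if abs(z) > z_threshold:
                yield i, value, z
        history.append(value)

# Example: steady ~200 ms response times with one spike
metrics = [200 + (i % 5) for i in range(60)] + [950] + [200] * 10
for idx, val, z in detect_anomalies(metrics):
    print(f"sample {idx}: {val} ms (z={z:.1f}) looks anomalous")
```

Production AI Ops tools use far richer models than this, but the principle of comparing live telemetry against a learned baseline is the same.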
Real-life Example
As a real-life example, if we leverage RUM (Real User Monitoring) and APM (Application Performance Monitoring) for our website, these are the steps we can follow to adopt an AI Ops architecture:
RUM evaluates how customers experience and engage with the website against parameters such as end-user response time, drop-off rate, total page load time, stalled pages, most visited pages, and customer demographics and segmentation by device, browser, operating system and geography.
Any drop in performance is associated with the website and the dependent applications that drive it: services, backend applications, hosts, integration services, data centres and so on. APM then kicks in and evaluates each supplier application's performance against the various health rules and baselines that support the website.
Anomalies are identified proactively from historical data, current baseline performance, rules and machine learning models that predict the propensity of a potential issue, and a potential incident is created in your ITSM tool (for example, ServiceNow) so the right team is allocated before the problem materialises. For instance, an increase in traffic to the website might raise the load on backend servers for a particular service of a legacy system, risking a failure of services or latency in the legacy system's response time that would impact other parts of the organisation as well as customers. The incident is raised proactively rather than after the issue is experienced: this is preventive maintenance in action.
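To make that hand-off concrete, below is a minimal sketch of the proactive-incident step using ServiceNow's Table API. The instance name, credentials, field values and propensity threshold are placeholders; your ITSM tool, fields and authentication mechanism will differ.

```python
# Minimal sketch: raise a proactive incident in ServiceNow via its Table API
# when the predicted failure propensity crosses a threshold. Instance name,
# credentials and field values are placeholders, not real endpoints.
import requests

INSTANCE = "your-instance"  # placeholder ServiceNow instance
URL = f"https://{INSTANCE}.service-now.com/api/now/table/incident"

def raise_proactive_incident(service, propensity, details, threshold=0.8):
    """Create an incident only when the model's predicted propensity of
    failure exceeds the threshold."""
    if propensity < threshold:
        return None
    payload = {
        "short_description": f"[Predicted] {service}: elevated failure risk",
        "description": details,  # structured, machine-generated text
        "urgency": "2",
        "category": "software",
    }
    resp = requests.post(
        URL,
        auth=("api_user", "api_password"),  # placeholder credentials
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["result"]["sys_id"]  # id of the created record

# Example: the model predicts a 0.91 propensity of latency on a legacy service
# sys_id = raise_proactive_incident(
#     "legacy-billing-api", 0.91,
#     "Traffic up 3x; backend load trending toward saturation within 2h.")
```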
Once the incident is raised, rule engines and automation are triggered to identify the potential cause and route it to the right AM team. When an incident is raised by a human user, the description is subjective and the information unstructured; here, because it is auto-generated by the system, the information is standardised against predefined conventions. Routing and triaging are therefore faster, and since the potential root cause has already been identified by the APM tool and shared up front, much of the RCA cycle is covered by the tool itself. This improves overall MTTI and automates the process.
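A minimal routing sketch, assuming the auto-generated incident arrives as a dictionary of standardised fields; the rules and group names are invented for illustration.

```python
# Minimal sketch: route a machine-generated incident to an assignment group
# using simple rules over its standardised fields. Rules and group names
# are illustrative assumptions.
ROUTING_RULES = [
    # (predicate over the incident dict, assignment group)
    (lambda inc: inc["source"] == "apm" and "database" in inc["ci_type"], "DBA-Team"),
    (lambda inc: inc["source"] == "apm", "App-Management"),
    (lambda inc: inc["source"] == "rum", "Web-Operations"),
]

def route_incident(incident: dict) -> str:
    """Return the first matching assignment group; fall back to a triage queue."""
    for predicate, group in ROUTING_RULES:
        if predicate(incident):
            return group
    return "Service-Desk-Triage"

incident = {
    "source": "apm",
    "ci_type": "database-host",
    "root_cause_hint": "connection pool exhaustion",  # pre-filled by the APM tool
}
print(route_incident(incident))  # -> DBA-Team
```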
Once the incident is identified, the fix from the AM team is expedited because the right team received it directly. An automated follow-up and communication protocol is initiated, and once the issue is fixed and resolved, the RCA, the fix, potential failovers and risks are updated by the automated workflow and stored in the system logs for future reference. Future notifications can then link back to these past incidents for faster resolution and consideration.
Last but not least comes collaboration and communication with affected parties, and measurement of overall performance: MTTI, MTTR, business hours lost and customer impact are captured, calculated, stored and communicated to the right parties, providing a single performance dashboard and common performance standards.
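For the measurement step, a minimal sketch of computing MTTI and MTTR from incident timestamps. The field names and the identified-to-resolved definition of MTTR are assumptions; adapt them to your own ITSM export.

```python
# Minimal sketch: compute MTTI and MTTR from incident timestamps.
# Field names and the MTTR definition are illustrative assumptions.
from datetime import datetime

incidents = [
    {"occurred": "2024-05-01T10:00", "identified": "2024-05-01T10:04", "resolved": "2024-05-01T11:30"},
    {"occurred": "2024-05-02T09:00", "identified": "2024-05-02T09:01", "resolved": "2024-05-02T09:40"},
]

def _minutes(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# MTTI: occurrence to identification; MTTR here: identification to resolution.
mtti = sum(_minutes(i["occurred"], i["identified"]) for i in incidents) / len(incidents)
mttr = sum(_minutes(i["identified"], i["resolved"]) for i in incidents) / len(incidents)
print(f"MTTI: {mtti:.1f} min, MTTR: {mttr:.1f} min")
```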
Different Stages in the AI Ops Deployment Lifecycle
Now that we have understood the potential end-to-end lifecycle, let me clear the air: it is not as easy as it sounds. If we want to adopt AI Ops, we need a few basics in place before selecting the right tool and planning the way forward:
· Discovery Phase: Assess the current IT landscape and ecosystem at all levels to gain visibility, evaluate the siloed monitoring tools, and separate monitoring from logging. Evaluate the variation in monitoring/logging parameters, identify the minimal viable requirement for logging, and set up an orchestrator in between to consume and consolidate all the information
· Data Readiness and Harnessing: Build a common data baseline and harnessing framework, identify systems that deviate from it in capability or current setup, and either move them closer to the baseline, generate synthetic data, or define exclusion rules
· Identification and Consolidation of Tools: Identify RUM and APM tool capabilities based on the requirements and outputs of the discovery phase to select the right tool. The choice will vary with the maturity of the current infrastructure ecosystem and of the organisation's data handling
· Training Machine Learning Models: Most RUM and APM tools are good at identifying anomalies, variations, errors and potential issues, but they are weaker at leveraging that data to project the propensity of incidents from baselines, historical data and current infrastructure behaviour (a minimal training sketch follows this list)
· Integration: Once information is available from health rules, identified anomalies and ML model predictions, it must be wired into the incident management process, tuning out false positives and integrating with ITSM tools and processes. Failure or leakage here increases false positives and erodes trust from the IT operations and AM teams, and can inflate incident volumes if not rolled out in a controlled, phased approach
· Continuous Improvement: The cycle is evolutionary, not revolutionary, and needs regular optimisation. We should retrospect on the process, feedback, models, data and analytics at regular intervals
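Referring back to the model-training stage, here is a minimal sketch using scikit-learn on synthetic data. The features (deviations from baseline), labels and thresholds are invented for illustration; a real model would train on your own telemetry and incident history.

```python
# Minimal sketch: train a model that scores the propensity of an incident
# from baseline-deviation features. Data is synthetic; features, labels
# and thresholds are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
# Hypothetical features per time window: load deviation from baseline,
# error-rate delta, and latency z-score for a service.
X = np.column_stack([
    rng.normal(0, 1, n),  # load deviation
    rng.normal(0, 1, n),  # error-rate delta
    rng.normal(0, 1, n),  # latency z-score
])
# Synthetic label: incidents tend to follow combined load + latency stress.
y = ((0.8 * X[:, 0] + 1.1 * X[:, 2] + rng.normal(0, 0.5, n)) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.2f}")

# Score the current window; a high propensity would open a proactive incident.
current_window = np.array([[2.1, 0.3, 2.8]])
print(f"incident propensity: {model.predict_proba(current_window)[0, 1]:.2f}")
```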
Challenges in Deploying AI Ops
AI Ops can seem like the poster child of IT Operations: it promises end-to-end visibility, automation of manual tasks, and a paradigm shift from reactive logging to preventive maintenance. However, as we all know, life in IT Operations is not that simple. It comes with the baggage of legacy systems, application variations, siloed operations, closed ecosystems and a spiderweb of interdependent applications with limited or no visibility, making things even more complex.
Hence the next step is to understand the major challenges in deploying autonomous AI Ops.
A. Holistic view and integration of the ecosystem: In today's complex ecosystems, where applications are developed and onboarded continuously, maintaining a holistic overview of all applications, their relationships, interdependencies and the impact of operational issues is a big challenge for IT operations, and it can delay the ROI on AI Ops investments
B. Data quality and availability: A typical IT estate has evolved over decades of old and new systems with different data architectures, so extracting data for monitoring and building a commonly understood baseline that fits the end-to-end value chain is a big task. We should be careful to set up a minimal viable arrangement that extracts the best value from existing datasets, rather than creating a new setup with huge investments
C. Multiple logging/monitoring systems: Organisations have diversified development and application management setups, each with its own budget. Since monitoring and logging are not controlled and governed centrally, they are left to individual development teams; with multiple vendors involved it gets even more complicated, as each team and vendor uses different tools, duplicating the effort of data collection and consolidation across the end-to-end value chain. Ideally we should standardise monitoring tools and guardrails as a framework; where that is not available, rather than reinventing the wheel with a new monitoring setup, we should have a unified logging layer consuming information from the multiple systems and presenting baseline details in the defined framework for consumption (a minimal normalisation sketch follows this list)
D. Skill gap: This is one of the biggest challenges in today's ecosystem, given the technological diversity and the required mix of business acumen, AI and ML understanding, and the ability to connect these moving parts into a well-functioning operational setup. It is a rare skill set that takes time to build, given the exponential evolution of the field. It is a high risk when adopting AI Ops, since a lack of such resources increases dependence on vendors or consultants, leading to vendor lock-in or knowledge leakage in the longer run
E. Change management: With so many moving parts, teams and stakeholders, change management is a big challenge: getting stakeholders on the same page, driving change, and communicating continuously and proactively. Change management cost is usually not considered when we talk about AI Ops, but I strongly believe a considerable budget and team should be allocated to it for a smooth transition and scaling
F. Ethical and regulatory considerations: If AI Ops is applied to internal tools and internal data there is no major issue, keeping in mind that contractual obligations with some vendors and applications prohibit monitoring by a unified AI Ops tool outside their own approved tooling.
It gets complicated the moment customer or employee data is involved, for example in RUM, where customer details, IDs, names, IP addresses and geographic locations can be tracked for performance monitoring of websites or mobile applications. Here we need to be highly considerate about consent and our declared data-handling policies, and about our responsibilities not just as consumers of ML and AI tools but as developers and deployers of them. Missteps can lead to legal issues, so this must be well validated, with governance set up around responsible AI (a pseudonymisation sketch follows this list).
G. Costs: Observability and AI Ops integration across multiple systems with the right unified tool looks costly at first glance, but this is largely a perception that needs to be evaluated against the many small and large monitoring and logging tools already in the organisation, their overlapping capabilities, and the time and effort spent on non-unified dashboards. Adopting AI Ops is certainly more expensive than normal operations, but the value and ROI generated are significant; we should identify whether and when we break even on the investment and whether that falls within our agreed baseline
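On challenge C, here is a minimal sketch of a unified logging adapter layer: two hypothetical tools with different event shapes are normalised into one common schema. All field names on both sides are assumptions for illustration.

```python
# Minimal sketch: normalise events from two hypothetical monitoring tools
# into one common schema so a single AI Ops layer can consume them.
from datetime import datetime, timezone

COMMON_FIELDS = ("timestamp", "service", "metric", "value", "source")

def from_tool_a(event: dict) -> dict:
    """Tool A (hypothetical) reports epoch seconds and flat keys."""
    return {
        "timestamp": datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
        "service": event["svc"],
        "metric": event["name"],
        "value": float(event["val"]),
        "source": "tool_a",
    }

def from_tool_b(event: dict) -> dict:
    """Tool B (hypothetical) nests the payload and already uses ISO timestamps."""
    return {
        "timestamp": event["time"],
        "service": event["meta"]["application"],
        "metric": event["meta"]["kpi"],
        "value": float(event["reading"]),
        "source": "tool_b",
    }

raw = [
    ({"ts": 1714557600, "svc": "checkout", "name": "latency_ms", "val": 412}, from_tool_a),
    ({"time": "2024-05-01T10:01:00+00:00", "reading": 0.03,
      "meta": {"application": "checkout", "kpi": "error_rate"}}, from_tool_b),
]
unified = [adapter(event) for event, adapter in raw]
for record in unified:
    assert set(record) == set(COMMON_FIELDS)  # every source maps to one schema
    print(record)
```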
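And on challenge F, a minimal sketch of pseudonymising RUM events before they enter the AI Ops pipeline. The hashing and IP-truncation choices here are illustrative only, not a compliance recommendation.

```python
# Minimal sketch: pseudonymise user identifiers in RUM events before they
# reach the AI Ops pipeline. Salting and truncation choices are illustrative.
import hashlib
import ipaddress

SALT = "rotate-me-regularly"  # placeholder; manage via a secrets store

def pseudonymise(value: str) -> str:
    """One-way hash so sessions can be correlated without exposing identity."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

def truncate_ip(ip: str) -> str:
    """Drop the host part of an IPv4 address, keeping coarse geography only."""
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(net.network_address)

rum_event = {"user_id": "alice@example.com", "ip": "203.0.113.42", "page_load_ms": 1840}
safe_event = {
    "user_ref": pseudonymise(rum_event["user_id"]),
    "ip_prefix": truncate_ip(rum_event["ip"]),
    "page_load_ms": rum_event["page_load_ms"],
}
print(safe_event)
```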
Now that we have understood the end-to-end lifecycle of autonomous AI Ops deployment, a real-life example, and the challenges, we can envision our future roadmap for the journey. It is important to understand that practical deployment is far more complex because of the challenges highlighted; it needs strong support from all levels, a strong commitment to the deployment, and a clear vision of the end goal or target state.
There are misconceptions about AI Ops being a silver bullet that solves every issue and takes IT Ops to the pinnacle of stability; that is not true. At the same time, the misconception that it is fully autonomous with no need for a human in the loop adds unrealistic expectations. As IT Ops leads and AI Ops adopters, we should focus on practical adoption and a journey of evolution, without expecting a revolution.