AIOPS?

What is AIOPS?

Does it mean 'Artificial Intelligent Operation Support'? No, that is a misunderstanding. The term AIOPS means 'Algorithmic IT Operation & Support' (Gartner, which coined the term as 'Algorithmic IT Operations', now expands it as 'Artificial Intelligence for IT Operations'). Its purpose is to enhance IT Operations capabilities by integrating Machine Learning technologies and building a system which can Collect, Analyze, Predict and Act on the IT issues or incidents raised within the support system.

I have played the IT Operation Delivery Manager role for a Telecom giant. In my experience, introducing such systems is a difficult process, all the more so in an existing infrastructure. Moreover, building such a product is highly complex, but given the demands of the next-generation digital experience, enterprises need to consider AIOPS seriously to increase customer delight.

Why introduce AIOPS?

  1. Reduce the TTR (time-to-resolution) by 50% or more.
  2. Predict possible outages and act before they cause business impact.
  3. Provide related data upfront to Support Engineers and SEMs so they can solve issues quickly.
  4. Reduce inbound customer complaint calls on IVR and other channels.
  5. Quick introduction of new business services with less operational overhead.
  6. Multinational support environments need a better collaborative working environment.
  7. Automate repeated incidents and manual processes.
  8. Enhance vigilance in IT Monitoring.
  9. Improve Security Operations and detect fraud, attacks and vulnerabilities before they have an impact.
  10. Reduce OPEX by >30%.
  11. And more.

Enterprises use Big Data to analyze their business growth, customer revenue, offers, recommendations, sentiments, etc., but they don't spend much on implementing Big Data or Machine Learning for Operational Efficiency.

In a survey of IT leaders, Gartner found that 18% report currently using AI/ML to analyze big data with another 42% planning to implement this by the end of 2019, while 41% say they have no plans to use AI/ML within the next two years (see Gartner graphic below).

[Image: Gartner survey graphic]

Only 6% use AI/ML to enhance APM and business productivity, but a promising percentage plan to introduce it in the near future.

How to change from Reactive to Proactive IT?

No doubt AIOPS will drive the change, but the biggest question is how to implement AIOPS and become Proactive.

  • Way#1: Big Bang and Big Money approach: Purchase a so-called AIOPS platform like Moogsoft, Splunk, etc. and integrate it with the existing infrastructure to collect, analyze (machine learning) and act (Collaboration, Monitoring, etc.). This software will certainly give better control and outputs, but while the idea is to reduce OPEX, introducing these tools itself needs additional operational expenditure.
  • Way#2: Small and Incremental Approach: Use the existing infrastructure and introduce Machine Learning in a targeted area of ITIL.

Small and Incremental Approach

Identify the targeted processes and systems from ITIL and convert them into small action packs.

Each of the ITIL processes needs a Machine Learning system which can analyze and act.

Service Design

Service Design is one of the critical processes in ITIL, but in practice it receives less focus when enhancements are made to existing services. AIOPS can be introduced in the following life-cycle stages to strengthen Service Design delivery.

  • Capacity Management: A Machine Learning algorithm built on Queueing Theory & Scheduling Theory helps to predict process arrival patterns, execution queues and system schedules, and suggests ways to improve service efficiency.
  • Risk Management: A Machine Learning algorithm can study the existing service risks and current system behaviour, and update the risk levels, priorities and recent findings with possible mitigations.
  • Availability Management: Manually calculated formulas are always debatable, like GDP calculation. Machine Learning can predict and calculate Availability from the various incidents and events that occur around the system. This gives the System Designer higher visibility and confidence when comparing promised vs actual availability.
  • Information and Security Management: Due to increased Digital exposure and the associated threats, it is nearly impossible to provide proactive security measures, even for highly skilled Sec-Ops. Machine Learning can be introduced to analyze security threats like DDoS, and to drive Vulnerability Report Analysis, a Patch recommendation system and a License Renewal and Usage recommendation system.
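As a rough illustration of the Capacity and Availability points above, the sketch below computes textbook M/M/1 queue metrics and a measured availability figure from outage durations. It is not a Machine Learning model, only the kind of baseline calculation such a model could build on. The function names, the arrival and service rates, the outage durations and the 99.9% promised figure are all assumptions made for this example.

```python
def mm1_metrics(arrival_rate: float, service_rate: float) -> dict:
    """Steady-state metrics for an M/M/1 queue (rates in requests per second)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    utilization = arrival_rate / service_rate
    return {
        "utilization": utilization,
        # Average number of requests in the system: L = rho / (1 - rho)
        "avg_in_system": utilization / (1 - utilization),
        # Average time a request spends in the system: W = 1 / (mu - lambda)
        "avg_time_in_system": 1 / (service_rate - arrival_rate),
    }


def measured_availability(period_minutes: float, outage_minutes: list) -> float:
    """Availability in percent: uptime divided by the total period."""
    downtime = sum(outage_minutes)
    return 100.0 * (period_minutes - downtime) / period_minutes


# Hypothetical inputs: 8 requests/s arriving, 10 requests/s served,
# and three outages (in minutes) during a 30-day month.
metrics = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
actual = measured_availability(period_minutes=30 * 24 * 60, outage_minutes=[12, 45, 7])
promised = 99.9  # assumed SLA target, in percent

print(metrics)
print(f"actual={actual:.3f}% promised={promised}% breach={actual < promised}")
```

In practice, the arrival and service rates would be estimated from monitoring data rather than fixed by hand, and that estimation is where a learning component could come in.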

I will cover the remaining ITIL processes (Operation and Transition) and the possible implementation of AIOPS in the next part of the article, with a detailed implementation architecture and open-source alternatives.

If the enterprise is already using Big Data and generating graphs and reports, then implementing Machine Learning will be less complex.

Data Collection and Processing:

Collect the APM and Event data for the targeted systems using Nagios or a similar event monitoring tool. (This is not that straightforward, but considering the benefits, the data is worth collecting.)

Stage the collected data in flat files for Big Data processing. Big Data platforms like Ab Initio or Apache Spark can process the data and convert it into correlated events and patterns. The collected events can be staged in a middleware system like Kafka.
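To make "correlated events" concrete, here is a minimal sketch in plain Python that groups monitoring events from the same host when they fall within a short time window. It stands in for what a Spark or Ab Initio job would do at scale. The event tuples, the field layout and the five-minute window are assumptions made for this example, not the output format of Nagios or any other tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical monitoring events: (timestamp, host, check name, state).
RAW_EVENTS = [
    ("2019-03-01 10:00:05", "db01", "disk_usage", "CRITICAL"),
    ("2019-03-01 10:00:40", "db01", "query_latency", "WARNING"),
    ("2019-03-01 10:01:10", "app01", "http_5xx", "CRITICAL"),
    ("2019-03-01 10:30:00", "db01", "disk_usage", "CRITICAL"),
]


def correlate(events, window=timedelta(minutes=5)):
    """Group events on the same host that occur within `window` of the group's first event."""
    by_host = defaultdict(list)
    for ts, host, check, state in events:
        by_host[host].append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), check, state))

    groups = []
    for host, items in by_host.items():
        items.sort()  # order each host's events by timestamp
        current = [items[0]]
        for item in items[1:]:
            if item[0] - current[0][0] <= window:
                current.append(item)
            else:
                groups.append((host, current))
                current = [item]
        groups.append((host, current))
    return groups


for host, group in correlate(RAW_EVENTS):
    print(host, [check for _, check, _ in group])
```

Grouping by host and time is only one possible correlation key. A real deployment might also correlate by service, by topology or by change ticket.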

Machine Learning:

No doubt ML needs training and a dataset. The initial data collection can be used to prepare the Model and the training dataset.

Once the Model is prepared and trained, processed data from the middleware can be analyzed continuously to generate recommendations. (This is complex, but achievable with currently available frameworks and models.)
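The train-then-score loop described here can be sketched with a deliberately simple statistical baseline in place of a real Machine Learning model. The class name, the threshold of three standard deviations, the metric name and the sample response times are all assumptions made for this example. A production system would more likely use an established ML framework.

```python
from statistics import mean, stdev


class BaselineModel:
    """A simple stand-in for a trained model: flags values far from the training mean."""

    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold
        self.mean = None
        self.std = None

    def fit(self, samples):
        """'Train' on historical samples by learning their mean and standard deviation."""
        self.mean = mean(samples)
        self.std = stdev(samples)
        if self.std == 0:
            raise ValueError("training data has no variance")
        return self

    def score(self, value: float) -> float:
        """Number of standard deviations between `value` and the training mean."""
        return abs(value - self.mean) / self.std

    def recommend(self, metric: str, value: float):
        """Return a recommendation string for anomalous values, otherwise None."""
        if self.score(value) > self.threshold:
            return f"investigate {metric}: value {value} deviates from baseline"
        return None


# Hypothetical historical response times in milliseconds (the training dataset).
history = [120, 118, 125, 130, 122, 119, 127, 121, 124, 126]
model = BaselineModel().fit(history)

# Hypothetical new observations arriving from the middleware.
for value in (123, 240):
    print(value, model.recommend("response_time_ms", value))
```

The first observation stays within the learned baseline and yields no recommendation, while the second is far outside it and is flagged.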

Act on the Recommendations:

A Machine Learning model can generate different recommendations, action plans, suggestions, categorized information, etc. Converting these and delivering them to the appropriate audience and systems is what makes the entire process effective.

Tools like Apache Airflow (from Airbnb) help to design workflow automation that delivers the recommendations to the targeted systems or audience in different formats. The action could be executing a purge in a DB, retrying commands on network elements, restarting middleware components, sending a consolidated observation report to System Owners, raising incidents, etc.
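The routing of recommendations to actions can be sketched as a small dispatcher in plain Python. This is not Airflow code. The handler functions, the action names and the recommendation dictionaries are all hypothetical, and each handler only returns a message where a real one would call a database, a network element or a ticketing system.

```python
from typing import Callable, Dict, List


# Hypothetical action handlers; real ones would call external systems.
def purge_db(target: str) -> str:
    return f"purge scheduled on {target}"


def restart_component(target: str) -> str:
    return f"restart requested for {target}"


def raise_incident(target: str) -> str:
    return f"incident raised for {target}"


# Maps a recommendation's action name to the handler that carries it out.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "purge": purge_db,
    "restart": restart_component,
    "incident": raise_incident,
}


def dispatch(recommendations: List[dict]) -> List[str]:
    """Run the handler for each recommendation; unknown actions fall back to raising an incident."""
    results = []
    for rec in recommendations:
        handler = ACTIONS.get(rec["action"], raise_incident)
        results.append(handler(rec["target"]))
    return results


# Hypothetical recommendations produced by the model.
for line in dispatch([
    {"action": "purge", "target": "billing_db"},
    {"action": "restart", "target": "mq-broker"},
    {"action": "unknown", "target": "db01"},
]):
    print(line)
```

Falling back to an incident for unrecognized actions keeps a human in the loop whenever automation has no safe handler. In a workflow tool such as Airflow, each handler could plausibly become its own task, with retries and scheduling handled by the tool.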

Note: AIOPS needs investment in infrastructure for machine learning and for generating recommendations quickly. Considering the benefits AIOPS can generate, enterprises need to invest. If they already have Big Data and time-series DB infrastructure, then they might already have 30-40% of the infrastructure needed for Machine Learning.

An existing IT Operations environment needs a complete study and an incremental approach to introduce intelligent & cost-effective AIOPS; it is not like introducing tools into a DevOps process.






