AIOPS Design Principles

Along with the business-level guiding principles, the responsible team should collaborate to develop a set of design principles for each functional area into which AIOPS will be integrated. The following are some examples of design principles:

GENERAL

  • The architecture should maximize the available information for training learning algorithms so that each region can take advantage of relevant experience from other regions.
  • Predictions should be based on operational data.
  • Unsupervised learning algorithms should learn in-line (that is, without requiring a full pass through the historical data to learn new patterns).
  • Models should learn from human feedback in-line, i.e., without requiring parameter tuning or feature engineering by a data scientist.
  • The architecture should support arbitrary unsupervised and supervised learning algorithms.
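
As an illustration of what "in-line" learning can mean in practice, the following is a minimal sketch, not a description of any particular product. It assumes a single numeric metric and uses Welford's algorithm to update a running mean and variance one reading at a time, so no pass over historical data is needed. The class name `StreamingDetector`, the z-score threshold of 3.0, and the warm-up of 5 readings are all hypothetical choices made for the example.

```python
import math


class StreamingDetector:
    """Hypothetical in-line learner: running mean/variance via Welford's algorithm."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # assumed z-score cutoff
        self.warmup = warmup        # readings to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations from the mean

    def update(self, x):
        """Score x against what has been seen so far, then learn from it."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0:
                is_anomaly = abs(x - self.mean) / std > self.threshold
        # Welford update: constant work per reading, no stored history
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly


detector = StreamingDetector()
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 25.0]
flags = [detector.update(r) for r in readings]
print(flags)
```

Only the last reading is flagged here. Note that this sketch also learns from the points it flags; a real system would need to decide whether flagged points should update the model.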

FAULT/EVENT MANAGEMENT

Unsupervised event and log clustering should learn patterns across tools and domains. When separate systems each use their own clustering model, the clusters they produce carry different meanings and cannot be compared directly.
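
To make the cross-tool point concrete, here is a minimal sketch under simplifying assumptions. It stands in for real clustering with a much simpler technique: variable tokens (IP addresses, hex values, integers) are masked to form a template, and messages are grouped by that template. Because one shared model handles every source, the same pattern gets the same cluster key regardless of which tool emitted it. The names `template_of` and `SharedClusterer`, the source labels, and the sample messages are hypothetical.

```python
import re
from collections import defaultdict

# Tokens treated as variable: IPv4-like addresses, hex literals, plain integers.
_VARIABLE = re.compile(r"\b(?:\d{1,3}(?:\.\d{1,3}){3}|0x[0-9a-fA-F]+|\d+)\b")


def template_of(message):
    """Reduce a message to a template by masking its variable tokens."""
    return _VARIABLE.sub("<*>", message.lower()).strip()


class SharedClusterer:
    """Hypothetical single model shared by all tools; learns one message at a time."""

    def __init__(self):
        self.clusters = defaultdict(list)

    def add(self, source, message):
        key = template_of(message)
        self.clusters[key].append((source, message))
        return key


clusterer = SharedClusterer()
k1 = clusterer.add("syslog", "Connection timeout to 10.0.0.5 after 30 seconds")
k2 = clusterer.add("apm", "connection timeout to 10.0.0.9 after 45 seconds")
k3 = clusterer.add("syslog", "Disk /dev/sda1 usage at 91 percent")
print(k1 == k2, len(clusterer.clusters))
```

The first two messages come from different tools but land in the same cluster, which is the property the principle asks for.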

PERFORMANCE MANAGEMENT

Unsupervised anomaly detection should learn patterns across tools and domains. When separate systems each use their own detection model, the anomalies they report carry different meanings and cannot be compared directly.

CONFIGURATION MANAGEMENT

Learning algorithms should be able to take advantage of the special topological relationships among objects in order to maximize performance and deliver root cause inferences.
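
One simple way topology can inform root cause inference is sketched below. It is a rule-based illustration rather than a learning algorithm, and it assumes the configuration data can be expressed as a map from each object to its direct dependencies. An alerting object is treated as a root cause candidate when none of its direct dependencies are also alerting. The function name `root_cause_candidates` and the sample topology are hypothetical.

```python
def root_cause_candidates(depends_on, alerting):
    """Return alerting nodes whose direct dependencies are all healthy."""
    alerting = set(alerting)
    return sorted(
        node
        for node in alerting
        if not any(dep in alerting for dep in depends_on.get(node, ()))
    )


# Hypothetical topology: each key depends on the objects in its list.
topology = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": ["storage"],
    "cache": [],
    "storage": [],
}

candidates = root_cause_candidates(topology, ["web", "api", "db"])
print(candidates)
```

With "web", "api", and "db" all alerting, only "db" remains as a candidate, because the other two each depend on something that is itself alerting. The sketch checks direct dependencies only; following transitive dependencies would need a graph traversal.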

INCIDENT MANAGEMENT

Align major incidents with services and the right resources at the right time.

Enable consistent and reliable incident data for ongoing ML training.

Provide context from event monitors, anomalies, and metadata to reduce MTTR.

CHANGE MANAGEMENT

Accurately evaluate risk based on historical context.
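
As a rough sketch of what evaluating risk from historical context could look like, the example below estimates, for each category of change, how often past changes in that category led to an incident. It assumes change records are available as (category, caused_incident) pairs and applies add-one (Laplace) smoothing so that categories with little or no history get a neutral score rather than 0 or 1. The function names, category labels, and records are hypothetical.

```python
from collections import Counter


def build_history(records):
    """Count total and incident-causing changes per category."""
    totals, failures = Counter(), Counter()
    for category, caused_incident in records:
        totals[category] += 1
        if caused_incident:
            failures[category] += 1
    return totals, failures


def risk_score(category, totals, failures):
    """Smoothed incident rate: (failures + 1) / (total + 2)."""
    return (failures[category] + 1) / (totals[category] + 2)


# Hypothetical change records: (category, did the change cause an incident?)
records = [
    ("db-schema", True),
    ("db-schema", False),
    ("db-schema", True),
    ("config", False),
    ("config", False),
    ("config", False),
]

totals, failures = build_history(records)
for category in ("db-schema", "config", "firmware"):
    print(category, risk_score(category, totals, failures))
```

A category that has never been seen, such as "firmware" here, scores 0.5, which reflects the absence of evidence rather than a claim that the change is safe.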

PROBLEM MANAGEMENT

Reliably provide root cause details for accurate problem resolution.

Enable and enforce a feedback loop from known errors and problem resolutions to prevent future incidents.

OPERATIONAL KNOWLEDGE MANAGEMENT

Leverage ALL classified learning (automated and manual) for ML training.

RUNBOOK AUTOMATION

Automate preventative tasks for pre-incident conditions.

Allow scripting for full resolution, not just service restoration.

AN EXAMPLE OF A LOGICAL ARCHITECTURE MIGHT LOOK LIKE THE VISUALIZATION BELOW:

[Image: logical architecture diagram]


Learn more about Grok’s design principles using AIOPS by starting a free trial here.
