ML and CI/CD Pipelines for Unstructured Datasets: An Efficiency and Optimization Investigation
Introduction: Today a complete ML development pipeline includes three levels where changes can occur: Data, ML Model, and Code. This means that in ML-based systems, the trigger for a build might be a combination of a code change, a data change, or a model change. Unstructured data shows its true traits when its anomalies are exposed during metadata management and log processing. In cloud-native transformations, some of these anomalies emanate from the cumulative effects of varied data formats, stack overflows, misconfiguration, directory issues, data pipeline hazards, and the limitations of AI/ML algorithms on the actual runtime engine. These problems are further aggravated when we analyze the end-to-end performance of workflows in pressing use cases: temporal dependencies versus topological dependencies, and workflow scaling for structured data versus real-time MOS scores for voice, video, and images.

The good news is that the enterprise transformation industry has embraced CI/CD and ML pipeline architectures. The CI/CD system is expected to perform fast and reliable ML model deployments in production. We now automatically build, test, and deploy the Data, ML Model, and ML training pipeline components so that workloads scale better and more cost-effectively, with business resiliency and a richer user experience. Unstructured datasets often contain components that require different feature extraction and processing pipelines. The problem becomes more complicated when (a) the dataset consists of heterogeneous data types (e.g., raster images and text captions), and (b) the dataset is stored in a fixed, predefined dataframe (for example, a pandas.DataFrame) where different columns require different processing pipelines; a sketch of this pattern appears below. This paper also outlines how to make ML models effective in clouds using K8s, Kubeflow, metadata, and correlation logic. While the role of ML pipeline automation in continuous training of the models is critical, the level of automation also includes data and model validation steps, implying that whenever new data is available, model retraining must be triggered in real time.
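To make the column-routing pattern concrete, here is a minimal sketch using scikit-learn's ColumnTransformer. It is an illustration only, not the exact setup used in our pipelines: the column names and toy data are assumptions, with a text-caption column vectorized via TF-IDF and numeric image metadata scaled separately.

```python
# Minimal sketch: different DataFrame columns routed through different
# processing pipelines. Column names and toy data are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "caption": ["a cat on a mat", "dog in the park", "sunset over hills"],
    "width_px": [640, 1280, 800],
    "height_px": [480, 720, 600],
})

preprocess = ColumnTransformer(transformers=[
    ("text", TfidfVectorizer(), "caption"),                 # 1-D text column
    ("meta", StandardScaler(), ["width_px", "height_px"]),  # numeric block
])

features = preprocess.fit_transform(df)
print(features.shape)  # (3, n_tfidf_terms + 2)
```

In a real pipeline the raster images themselves would typically be handled by a dedicated feature extractor outside the dataframe; this sketch only shows how heterogeneous columns can be kept in one fit/transform step.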
Decoding AIOps: Data generated by business applications across multiple sources and layers of the stack, throughout the development and run lifecycles, needs accurate, actionable insights. Sample insights derived from monitoring and analysis include detecting anomalous behaviors, solving issues via root cause analysis on the anomalies detected, predicting outages before they incur revenue loss, and ultimately executing automated and autonomous remediation. Servers, databases, applications, networks, and storage systems are all part of a multi-tiered AIOps; these constitute the sources of data. Each source can emit different types of data: incidents, logs, metrics, and traces. Incidents include human-generated problem descriptions of what issue was seen and on which components; they also contain actions, i.e., resolution comments on how the problem was solved and what did or did not work. Metrics reveal key health indicators of a component. Logs give a peek into messages generated within a system, while traces reveal instances of API calls that span microservices. The three key prerequisites of a good AIOps program are outlined as follows.
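As a concrete aside, the four data types above could be represented as typed records; the hedged sketch below uses illustrative field names, not a standard schema.

```python
# Hedged sketch of the four AIOps data types as typed records.
# Field names are illustrative, not a standard schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Incident:
    component: str
    description: str                 # human-written problem text
    resolution_comments: list[str] = field(default_factory=list)

@dataclass
class Metric:
    component: str
    name: str                        # e.g. "cpu_utilization"
    value: float
    timestamp: datetime

@dataclass
class LogLine:
    component: str
    message: str                     # raw, unstructured text
    timestamp: datetime

@dataclass
class Trace:
    trace_id: str
    spans: list[str]                 # API calls spanning micro-services
```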
Limitations of Current AI Systems: Anomaly detection for storage systems using metric anomalies, incident categorization for server incidents, and root cause analysis for databases using transaction logs are no longer adequate for today's dense VM and container clusters. The real issue is how to exploit the right properties in the anomalies to correlate and create causal links across these dependencies. The structured, semi-structured, and completely unstructured data present in logs, traces, and incidents contains problem descriptions, application behavior details such as error codes, server/virtual machine/container names, and trace IDs. A Big Data approach can remediate the situation better: it can pinpoint issues in unstructured data by exploiting the voluminous nature of its logs and traces, which are generally full of ambiguity. A sketch of pulling such linkable entities out of a log line follows.
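The hedged sketch below extracts host names, trace IDs, and error codes from a raw log line; the log format and regular expressions are assumptions for illustration, not those of any particular system.

```python
# Illustrative only: extracting linkable entities from a raw log line
# so anomalies can be correlated across logs, traces, and incidents.
# Patterns and log format are assumptions.
import re

line = "2024-05-01T12:03:44Z host=vm-web-07 trace=ab12cd34 ERR-5003 connection refused"

patterns = {
    "host":       re.compile(r"host=(\S+)"),
    "trace_id":   re.compile(r"trace=([0-9a-f]+)"),
    "error_code": re.compile(r"\b(ERR-\d+)\b"),
}

entities = {name: m.group(1)
            for name, p in patterns.items()
            if (m := p.search(line))}
print(entities)
# {'host': 'vm-web-07', 'trace_id': 'ab12cd34', 'error_code': 'ERR-5003'}
```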
Why Unstructured Data Is Different and How It Impacts AIOps
Other Considerations: Linking unstructured data is an old problem made hard by the nuances of human language. While IT data is not as open-ended, complex, and nuanced as day-to-day human language, it still contains non-standard ways of expressing the same content. For example, similar entities can be expressed in different ways, and different entities can seem similar but are not. Because of the number of words expressed in a log message, the feature space can be very high-dimensional, so smart linking of entities and reduction of the feature space are essential pieces of this puzzle. While structured data correlation gives a good view of when and where things are happening, it does not give enough clues about the root cause of an issue. AIOps must always expose possible root causes of an anomaly arising from temporal or topological dependencies, or from the architectural limitations of a given data format, object file, file type, or directory type. With unstructured data, anomalies that occur within the same time window can be related using temporal associations, as the sketch below illustrates.
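A minimal sketch of the temporal association idea: anomalies whose timestamps fall in the same fixed time window become candidates for a related link. The window size and anomaly records are illustrative choices.

```python
# Group anomalies into fixed time buckets; anomalies sharing a bucket
# are temporally associated. Window size is an illustrative assumption.
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute buckets

anomalies = [
    {"id": "a1", "source": "metrics", "ts": 1_700_000_000},
    {"id": "a2", "source": "logs",    "ts": 1_700_000_060},
    {"id": "a3", "source": "traces",  "ts": 1_700_000_900},
]

windows = defaultdict(list)
for a in anomalies:
    windows[a["ts"] // WINDOW_SECONDS].append(a["id"])

related = [ids for ids in windows.values() if len(ids) > 1]
print(related)  # [['a1', 'a2']]
```

In practice a sliding window or decay-weighted association would be more robust than hard buckets, but the grouping principle is the same.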
Issue Context Graph for Unstructured Data: Typically, the source anomalies come from multiple systems specialized in detecting anomalies on metrics, logs, traces, etc. Representing all the data from these anomalies as a graph creates a path for different learning algorithms to be applied. Unstructured data brings a richness to the graph that would not have been possible with structured metadata alone. The issue context graph has the advantage of serving as the base for many graph algorithms used today as downstream analytics; a sketch of such a graph follows. Use cases include but are not limited to the following.
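The hedged sketch below, using the networkx library, shows one hypothetical way an issue context graph could be assembled: anomalies and entities extracted from unstructured data become nodes, and temporal/topological relations become edges. All node and relation names are illustrative.

```python
# Hypothetical issue context graph: anomaly and entity nodes,
# temporal/topological relation edges, then a simple downstream
# graph analytic (connected components as one "issue").
import networkx as nx

g = nx.Graph()

# Anomaly nodes from different detectors (metrics, logs, traces).
g.add_node("metric:cpu_spike", kind="anomaly")
g.add_node("log:err_5003",     kind="anomaly")
g.add_node("trace:slow_call",  kind="anomaly")

# Entity node pulled out of unstructured log/incident text.
g.add_node("vm-web-07", kind="host")

# Edges: temporal association and shared-entity links.
g.add_edge("metric:cpu_spike", "log:err_5003", relation="same_window")
g.add_edge("log:err_5003", "vm-web-07",        relation="mentions")
g.add_edge("trace:slow_call", "vm-web-07",     relation="runs_on")

# All three anomalies fall into one connected component, i.e. one issue.
print(list(nx.connected_components(g)))
```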
In short, unstructured data carries immense context and metadata which, when harnessed in a disciplined way, can lead to more insights while troubleshooting and preventing failures of full-stack applications. Based on Fig 1, which outlines the platform technology, early successes were achieved using a column-transformer-based algorithm in Python. But executing ML models on context-rich (buffer-delay-sensitive) images and video still suffers from dependency issues.
Ground Rules for AI/ML Models: As a first rule, automate and enhance the entire incident response process.
Fig 1 below outlines the various components of a tiered AIOps. Fig 2 below shows what an AIOps pipeline contains.
Fig 1: AIOps Platform Technology Components
Fig 2: An overview of our AIOps pipeline
Fig 2 above shows what steps are taken in an AIOps pipeline and how model training and model evaluation are executed.
Implementation Details of MLOps
Today MLOps is very capable of handling unstructured datasets in cloud-ready deployments due to the following key features.
Leveraging the Python Libraries
Source of Data
Operations on Unstructured Data
To train models with unstructured data, the data requires cleansing and sorting. However, this must be done differently from structured data, depending on the nature of the data: text, audio, image, or video. The cleansing will also depend on the end goal of the analysis. The following are good tips, starting with the text-cleansing sketch below.
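As one illustration for the text modality only, the hedged sketch below applies a few typical cleansing rules; image, audio, and video would each need their own steps (resizing, resampling, re-encoding), and these rules are assumptions rather than a prescribed recipe.

```python
# Illustrative text-cleansing pass; the rules are assumptions and
# would change with the end goal of the analysis.
import re

def cleanse_text(raw: str) -> str:
    text = raw.lower()                        # normalize case
    text = re.sub(r"<[^>]+>", " ", text)      # strip stray HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(cleanse_text("  <b>ERROR:</b> Disk /dev/sda1 FULL!! "))
# "error disk dev sda1 full"
```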
Please remember that NoSQL databases such as MongoDB, as well as big-data platforms such as Hadoop, can help keep the data in JSON format.
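For example, a record could be persisted as a JSON document in MongoDB via pymongo, as in this hypothetical snippet; the connection string, database, and collection names are placeholders.

```python
# Hypothetical example: persisting an unstructured-data record as a
# JSON document in MongoDB. Connection string and names are placeholders.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["aiops"]["unstructured_events"]

collection.insert_one({
    "source": "syslog",
    "message": "ERR-5003 connection refused on vm-web-07",
    "ingested_at": datetime.now(timezone.utc),
})
```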
Features Required to Handle Unstructured Data in Clouds
Fig 3: A Typical MLOps Production Pipeline where ML models are used (a reference Big Data approach)
Fig 3 shows how a composite production pipeline looks under a Big Data architectural model and how ML models are trained with data and several AI/ML algorithms.
Fig 4a below shows the Kubeflow Pipelines architecture in a typical cloud using Prometheus, metadata management, and logging/syslogs. Fig 4b depicts and highlights the ML lifecycle using Kubeflow and K8s, with YAML as the data serializer; a hypothetical pipeline sketch follows.
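As a hypothetical sketch of that lifecycle, the Kubeflow Pipelines SDK (kfp v2 assumed) lets components be defined in Python and the pipeline compiled to the YAML that the Kubeflow engine executes on K8s; component names and logic below are placeholders.

```python
# Hypothetical Kubeflow pipeline (kfp v2 SDK assumed): Python components
# compiled to the YAML definition that Kubeflow/K8s executes.
from kfp import dsl, compiler

@dsl.component
def cleanse(raw_uri: str) -> str:
    # placeholder: cleanse/sort unstructured data (text, audio, image, video)
    return raw_uri

@dsl.component
def train(clean_uri: str) -> str:
    # placeholder: train a model on the cleansed data, return a model URI
    return "model://latest"

@dsl.pipeline(name="unstructured-mlops")
def unstructured_mlops(raw_uri: str):
    cleaned = cleanse(raw_uri=raw_uri)
    train(clean_uri=cleaned.output)

# Serialize the pipeline definition to YAML for the Kubeflow engine.
compiler.Compiler().compile(unstructured_mlops, "unstructured_mlops.yaml")
```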
Conclusion
AIOps leverages ML models to make sense of the massive data of large-scale systems, covering both structured and unstructured data. However, due to the nature of unstructured operational data, AIOps modeling faces several challenges: data-splitting issues, data anomalies, imbalanced data, data leakage, and erroneous correlation logic leading to concept drift. Inefficient data pipelines and deficient ML and CI/CD pipelines become barriers to producing correct modeling decisions on unstructured datasets. Such issues also threaten the scaling of live migration in a cloud topology despite the best automation efforts at the VM-cluster and container-cluster level, and they make predicting job failures from trace data in a large-scale cluster environment impossible.

An ideal MLOps is one where machine learning assets are treated consistently with all other software assets within a CI/CD environment. ML models must be deployed alongside the services that wrap them and the services that consume them, as part of a unified release process. By codifying these practices we can accelerate the adoption of ML/AI in software systems and the fast delivery of intelligent software. Concepts such as iterative-incremental development, automation, continuous delivery, versioning, testing, reproducibility, and monitoring are the most vital MLOps tasks.

With sloppy MLOps, predicting real-time buffer underflows and overflows for HTML5 video and images, or predicting disk failures from disk monitoring data in a large-scale cloud storage environment, will not materialize. Data leakage and transmission delay will create severe performance issues such as poor MOS scores for voice and video traffic. While time-based splitting of training and validation datasets (sketched below) can significantly reduce such data leakage, counting solely on ML models to fix concept drift will disappoint and will not meet SLA requirements.

Going forward, business continuity and digital enterprise transformations will continue to see an explosion of traffic from streaming video, HTML5 video, 3D images, media clips, and audio files, and hence there is an acute urgency for a holistic approach that drives continuous efficiency enhancement of AI/ML and CI/CD pipelines. The analysis of unstructured data is of utmost importance when it derives intelligence together with structured metadata and identifies the customer's KPIs and action points to maximize the customer's business resiliency and success.
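For completeness, here is a hedged sketch of the time-based split mentioned above: all training rows strictly precede a validation cutoff, so future information cannot leak into training. The cutoff date and columns are illustrative.

```python
# Hedged sketch of a time-based train/validation split to reduce
# data leakage; cutoff date and columns are illustrative.
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-02", "2024-03-20"]),
    "feature": [0.2, 0.5, 0.1, 0.9],
    "label": [0, 1, 0, 1],
})

cutoff = pd.Timestamp("2024-03-01")
train_df = df[df["ts"] < cutoff]   # strictly past data only
valid_df = df[df["ts"] >= cutoff]  # strictly later data

print(len(train_df), len(valid_df))  # 2 2
```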