ML and CI/CD Pipelines for Unstructured Datasets: An Efficiency and Optimization Investigation

Introduction: Today a complete ML development pipeline includes three levels where changes can occur: Data, ML Model, and Code. This means that in ML-based systems, the trigger for a build might be a combination of a code change, a data change, or a model change. Unstructured data shows its true traits when its anomalies are exposed during metadata management and log processing. In cloud-native transformation, some of these anomalies emanate from the cumulative effects of varied data formats, stack overflows, misconfiguration, directory issues, data-pipeline hazards, and the related limitations of AI and ML algorithms on the actual runtime engine. These problems are further aggravated when we analyze end-to-end workflow performance in pressing use cases: temporal versus topological dependencies, and workflow scaling for structured data versus real-time MOS scores for voice, video, and images. The good news is that the enterprise-transformation industry has embraced CI/CD and ML pipeline architectures. The CI/CD system is expected to perform fast and reliable ML model deployments in production. We now automatically build, test, and deploy the data, the ML model, and the ML training-pipeline components so that we can scale workloads more cost-effectively, with business resiliency and a richer user experience. Unstructured datasets often contain components that require different feature-extraction and processing pipelines. The problem becomes more complicated when (a) the dataset consists of heterogeneous data types (e.g. raster images and text captions), and (b) the dataset is stored in a pre-fixed dataframe (for example a pandas.DataFrame) and different columns require different processing pipelines. This paper also outlines how to make ML models effective in clouds using K8s, Kubeflow, metadata, and correlation logic.
While the role of ML pipeline automation in continuous training of models is critical, the level of automation also includes data- and model-validation steps, implying that whenever new data is available, model retraining must be triggered in real time.
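To make the heterogeneous-column problem above concrete, here is a minimal sketch of routing different columns of a record to different processing pipelines, in the spirit of per-column pipelines over a pandas.DataFrame. The column names and the two stand-in extractors are illustrative assumptions, not part of any specific library:

```python
# Sketch: per-column processing pipelines for a heterogeneous dataset.
# Each record mixes data types (an image reference and a text caption),
# so each column is routed to its own feature extractor.

def image_features(path):
    """Stand-in image pipeline: derives a cheap feature from the path."""
    return {"format": path.rsplit(".", 1)[-1].lower()}

def text_features(caption):
    """Stand-in text pipeline: lowercase token count."""
    return {"n_tokens": len(caption.lower().split())}

# Dispatch table: column name -> processing pipeline for that column.
PIPELINES = {"image": image_features, "caption": text_features}

def process_record(record):
    return {col: pipeline(record[col]) for col, pipeline in PIPELINES.items()}

record = {"image": "cat_001.PNG", "caption": "A cat on a mat"}
print(process_record(record))
```

A real system would swap the stand-ins for image decoders and NLP tokenizers, but the dispatch shape stays the same.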

Decoding AIOps: Data generated by business applications across multiple sources and layers of the stack, throughout the development and run lifecycles, needs accurate, actionable insights. Sample insights derived from monitoring and analysis include detecting anomalous behaviors, solving issues via root-cause analysis on the anomalies detected, predicting outages before incurring revenue loss, and ultimately executing automated, autonomous remediation. Servers, databases, applications, networks, and storage systems are all part of a multi-tiered AIOps; these constitute the sources of data. Each source can emit different types of data: incidents, logs, metrics, and traces. Incidents include human-generated descriptions of what issue is seen and on which components; they also contain actions, i.e. resolution comments describing how the problem was solved and what did or did not work. Metrics reveal key health indicators of a component. Logs give a peek into messages generated within a system, while traces reveal instances of API calls that span microservices. The three key prerequisites of a good AIOps program are outlined as follows.

  • Collection and aggregation of multiple sources of data based on design principles and a big data architecture
  • Observability and monitoring efficiency of all involved pipelines
  • The most efficient techniques to derive insights out of the collected data
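As a toy illustration of the first prerequisite, the sketch below merges metric, log, and incident streams (each already sorted by timestamp) into one time-ordered event stream for downstream analysis. The field names and payloads are invented for the example:

```python
import heapq

# Sketch: aggregate metrics, logs, and incidents into a single
# time-ordered stream. Each record is (timestamp, source, payload).

metrics   = [(1, "metric", "cpu=97%"), (5, "metric", "disk=81%")]
logs      = [(2, "log", "ERR-42 file system is full"), (6, "log", "retrying mount")]
incidents = [(4, "incident", "INC-7 app latency spike")]

# heapq.merge keeps the union sorted by the first tuple element (timestamp),
# without materializing one giant list first.
unified = list(heapq.merge(metrics, logs, incidents))

for ts, source, payload in unified:
    print(ts, source, payload)
```

In production the same shape would be fed by collectors rather than literals, but a single timestamp-ordered stream is what makes cross-source correlation possible at all.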

Limitations of current AI systems: Anomaly detection for storage systems using metric anomalies, incident categorization for server incidents, and root-cause analysis for databases using transaction logs are not adequate for today's dense VM and container clusters. The real issue is how to exploit the right properties in the anomalies that help correlate and create causal links across these dependencies. The structured, semi-structured, and completely unstructured data present in logs, traces, and incidents contains descriptions of the problem, application-behavior details in terms of error codes and problem descriptions, server/virtual-machine/container names, and trace IDs. A Big Data approach remediates the situation better: it can pinpoint problems in unstructured data by exploiting the voluminous nature of logs and traces, which are generally full of ambiguity.

Why Unstructured Data is Different and How it Impacts Operations

  • There are no predefined rows, columns, values, or features in unstructured data, and it is messier than structured data. Structured data has predefined feature sets and focuses on the target attribute, whereas unstructured data may throw diverse aspects at us.
  • Beyond the structured metadata, the message portion of data types like logs and incident tickets is unstructured and contains rich information: error codes, error symptoms ("file system is full"), conditions ("setting option offline"), metrics embedded in the text that indicate problem conditions, detailed component names, and regions.
  • Metrics extracted from unstructured data become very helpful when anomalies appear in, say, a storage available-capacity metric or a server disk-space metric.
  • Various entities from different data types in the spatio-temporal space, capturing what is happening across the entire operations stack (app layer, web layer, and database layer), are also important to look at.
  • These entities give symptoms validated by metadata and log files. Relationships across them are generally captured by temporal dependencies (co-occurs, happens before, happens after, etc.).
  • Topological dependencies, such as processes running on the same host or VM, or microservices calling each other via API calls, are extremely important.

Other considerations: Linking unstructured data is an old problem made hard by the nuances of human language. While IT data is not as open-ended, complex, and nuanced as day-to-day human language, it still contains non-standard ways of expressing the same content. For example, similar entities can be expressed in different ways, and different entities can seem similar but are not. Because of the number of words expressed in a log message, the feature space can be very high-dimensional; smart linking of entities and reducing the feature space is an essential piece of this puzzle. While structured-data correlation gives a good view of when and where things are happening, it does not give enough clues about the root cause of the issue. AIOps must always expose possible root causes of an anomaly, whether due to temporal or topological dependencies or to the architectural limitations of a given data format, object file, file type, or directory type. In the case of unstructured data, anomalies that occur within the same time window can be related using temporal associations.
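One inexpensive way to tame the "similar entities expressed in different ways" problem is to canonicalize entity strings before linking them, which also shrinks the feature space. The normalization rules below are illustrative assumptions, not a prescribed scheme:

```python
import re

# Sketch: collapse near-duplicate entity names from logs into one canonical
# form so that the downstream feature space stays small.

def canonicalize(entity):
    e = entity.strip().lower()
    e = re.sub(r"[-_]", "", e)   # web-srv-01 ~ web_srv_01 ~ websrv01
    e = re.sub(r"\d+$", "N", e)  # blur trailing instance numbers
    return e

raw = ["Web-Srv-01", "web_srv_02", "WEBSRV03", "db-primary"]
canonical = {canonicalize(x) for x in raw}
print(canonical)  # four raw names collapse to two canonical entities
```

Real systems add domain dictionaries and fuzzy matching on top, but even these three rules turn four raw server names into two linkable entities.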

Issue Context Graph for Unstructured Data: Typically the source anomalies come from multiple systems specialized in detecting anomalies on metrics, logs, traces, etc. Representing all the data from these anomalies as a graph creates a path for different learning algorithms to be applied. Unstructured data brings a richness to the graph that would not have been possible with structured metadata alone. The issue context graph has the advantage of being the base for many graph algorithms used today as downstream analytics. Some use cases include, but are not limited to, the following.

  • Community detection/Graph partitioning algorithms to analyze several patterns of issues.
  • Centrality algorithms that point to the probable root cause
  • Link prediction and path finding techniques that help in outage prediction
  • Similarity: useful to find similar issues
  • Graph embeddings that create learned representations of these issues
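A toy version of the issue context graph, with degree centrality as a crude stand-in for the centrality algorithms above (the anomaly node names are invented for illustration):

```python
from collections import defaultdict

# Sketch: build an undirected issue context graph from anomaly pairs that
# co-occurred in the same time window, then rank nodes by degree centrality
# as a crude pointer to the probable root cause.

edges = [
    ("db-latency", "app-errors"),
    ("db-latency", "disk-full"),
    ("db-latency", "slow-queries"),
    ("app-errors", "http-500s"),
]

graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

# Degree centrality: fraction of the other nodes each node touches.
n = len(graph)
centrality = {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}
root_cause = max(centrality, key=centrality.get)
print(root_cause, centrality[root_cause])
```

Libraries such as networkx supply the heavier algorithms (community detection, link prediction, embeddings) over the same graph structure.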

In short, unstructured data has immense context and metadata which, when harnessed in a disciplined way, can lead to more insights while troubleshooting and preventing failures of full-stack applications. Based on Fig 1, which outlines the platform technology, early successes were made using columnar, transformer-based algorithms in Python. But executing ML models on context-rich (buffer-delay-sensitive) images and video still suffers from dependencies.

Ground Rules for AI/ML Models: As a first rule, automate and enhance the entire incident-response process.

  • Proactive anomaly detection: AIOps tools should automatically detect anomalies in our environment and trigger notifications to our monitoring solution and to the other tools where our teams collaborate, such as Slack.
  • Event correlation and enrichment: Navigate teams to the root cause faster by prioritizing issues, correlating related alerts, events, and incidents, and enriching them with context from historical data or other tools.
  • Use advanced tools for both machine-generated decisions (i.e., time-based clustering, similarity algorithms, and other ML models) and human-generated decisions to power the correlation logic, automate flapping detection, and suppress noisy or low-priority alerts.
  • Intelligent alerting and escalation: Route incident data speedily to the right individuals or response teams.
  • AIOps tools should use ML models to evaluate data from incident-management and monitoring tools and suggest the team that can resolve a particular problem fastest, because either they have already seen something similar in the past or they are experts at the specific components that are failing.
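To illustrate the noise-suppression idea above, here is a minimal sketch of flapping detection. The threshold, alert keys, and event shape are assumptions made for the example, not any tool's API:

```python
# Sketch: flapping detection. An alert key that changes state more than
# FLAP_THRESHOLD times inside one time window is suppressed as noise.

FLAP_THRESHOLD = 3

def flapping_keys(events):
    """events: list of (alert_key, state) observed within one time window."""
    transitions = {}
    last_state = {}
    for key, state in events:
        if key in last_state and last_state[key] != state:
            transitions[key] = transitions.get(key, 0) + 1
        last_state[key] = state
    return {k for k, n in transitions.items() if n > FLAP_THRESHOLD}

window = [
    ("disk-alert", "firing"), ("disk-alert", "ok"),
    ("disk-alert", "firing"), ("disk-alert", "ok"),
    ("disk-alert", "firing"),
    ("cpu-alert", "firing"),
]
print(flapping_keys(window))  # disk-alert flipped 4 times -> suppressed
```

The stable cpu-alert is left alone; only the alert that keeps flipping state is flagged for suppression.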

Fig 1 below outlines the various components of a tiered AIOps. Fig 2 below shows what an AIOps pipeline contains.


Fig 1: AIOps Platform Technology Components


Fig 2: An overview of our AIOps pipeline

Fig 2 above shows the steps taken in an AIOps pipeline and how model training and model evaluation are executed.

Implementation Details of MLOps

Today MLOps is very capable of handling unstructured datasets in cloud-ready deployments, thanks to the following key features.

  • As with regular structured data, MLOps gives the practice a foundation to access the data lake, set up intermediate components for transformation and tagging, use the model code to generate a trained model, and deploy the trained model to a web service.
  • MLOps automates various tasks, like ingesting data through the streaming API, scheduling the training, deploying the latest trained models, or sending alerts to the relevant stakeholders for an item that needs immediate attention.
  • MLOps creates regular reports for stakeholders' consumption and gives a baseline for upcoming models.
  • Edge computing: This industry is a big consumer of trained models and sometimes follows a hierarchical model. The parent model may be trained on servers and broadcast to IoT devices, reducing prediction latency and updated-data requirements. This is quite relevant for face/object detection on video cameras.

Leveraging the Python Libraries

  • Python libraries like Librosa and PyAudio are widely used for the analysis of audio data. This analysis powers music-genre detection, voice commands, and language/voice generation for voice-based assistants.
  • Pyo, pyAudioAnalysis, Dejavu, Mingus, hYPerSonic, Pydub, and Loris are a few Python libraries that give users out-of-the-box features for immediate use by tuning hyper-parameters.
  • These libraries provide features like sound granulation, audio manipulation, classifying unknown sounds, applying dimensionality reduction to visualize audio data and content similarities, performing supervised and unsupervised segmentation, detecting audio events, and excluding silence periods from long recordings.
  • Libraries like Mingus work on music data, while hYPerSonic, Pydub, and Loris work on low-level analysis of sound data for time- and frequency-scale modification and sound morphing.
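As a flavor of what those audio libraries automate, the following pure-Python sketch excludes silence periods from a recording using a simple frame-energy threshold. The frame size, threshold, and sample values are arbitrary illustrative choices, not any library's API:

```python
# Sketch: energy-based silence removal. Libraries such as pyAudioAnalysis
# do this far more robustly; this only shows the underlying idea.

FRAME = 4          # samples per frame (tiny, for illustration)
THRESHOLD = 0.01   # mean-square energy below this counts as silence

def remove_silence(samples):
    voiced = []
    for i in range(0, len(samples), FRAME):
        frame = samples[i:i + FRAME]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= THRESHOLD:
            voiced.extend(frame)
    return voiced

signal = [0.0, 0.0, 0.0, 0.0,       # silence
          0.5, -0.4, 0.6, -0.5,     # speech
          0.001, -0.001, 0.0, 0.0]  # silence again
print(len(remove_silence(signal)))  # only the 4 speech samples survive
```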

Source of data

  • AI/ML practices retrieve unstructured data from various sources like business documents, emails, social media, customer feedback, webpages, survey responses, images, audio, and videos.
  • Data can be retrieved from a central repository containing emails, business documents, or feedback.
  • At the same time, scraping can fetch data from different websites based on indexing keywords, which in turn are aligned to the business's end goals.
  • Social media platforms provide APIs to obtain data on users, pages, or hashtags highlighting the focus area.

Operation on the unstructured data

To train models with unstructured data, the data requires cleansing and sorting. However, this must be done differently from structured data, depending on the nature of the data: text, audio, image, or video. The cleansing will also depend on the end goal of the analysis. The following are good tips.

  • The text, when analyzed, will give associations, sentiments, translations, or trends.
  • NLTK, Gensim, Polyglot, TextBlob, CoreNLP, spaCy, Pattern, Vocabulary, PyNLPl, and Quepy are a few libraries that give users out-of-the-box features for immediate use by tuning hyper-parameters.
  • Most libraries provide features like tokenization, language detection, named-entity recognition, part-of-speech tagging, sentiment analysis, word embeddings, classification, translation, WordNet integration, parsing, word inflection, and adding new models or languages through extensions.
  • Specialized libraries like Pattern, Vocabulary, PyNLPl, and Quepy give additional features like crawling text from websites, network analysis, graph centrality, visualization, translation, and a question-style interface.
  • With audio, one also has to analyze sentiment based on tone. The analysis is done by converting the audio data to text and then analyzing that text; tone and accent are also essential aspects of audio.
  • Image processing is another extension of data analysis. Image analysis can help identify or count people and objects, detect faults, and extract various other features.
  • Scikit-image, OpenCV, Mahotas, SimpleITK, SciPy, Pillow, and Matplotlib are a few Python libraries for image data, giving users out-of-the-box features for immediate use with hyper-parameter tuning.
  • These libraries provide image processing, face detection, object detection, watershed transformation, morphological processing, and image convolution. They also support multiple image formats, manipulate images to extract information, and conduct analysis for measurements. Video-data processing is an extension of image- and audio-data processing: the audio and image data are analyzed and collated to give the results.
  • Since the models that analyze and predict on unstructured data may require the data in different forms, the captured data is pushed to the data lake and retrieved, with the appropriate transformations, for training the models.
  • To predict on streaming data, the trained models are further deployed on the MLOps workflow as web services. The streaming data then continues to train the model as each forecast is accepted or rejected, and the retrained model may be deployed again as a web service. The frequency of deployment may vary from a few minutes to a few days.
  • General techniques used in handling structured data can be applied to unstructured data for ease of operations later. The units of unstructured data are tagged with the findings for use with further models.
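Tying the tips together, here is a minimal sketch of tagging units of unstructured text with findings so later models can reuse them. The keyword rules and tag vocabulary are invented stand-ins for the real NLP models the libraries above provide:

```python
# Sketch: tag each unit of unstructured text with simple findings.
# Tiny keyword rules stand in for real sentiment/NER models.

RULES = {
    "negative": {"fail", "error", "slow"},
    "positive": {"great", "fast", "love"},
}

def tag_unit(text):
    tokens = set(text.lower().split())
    tags = [label for label, words in RULES.items() if tokens & words]
    return {"text": text, "tags": sorted(tags)}

units = ["Login is slow and throws an error", "Love the fast checkout"]
tagged = [tag_unit(u) for u in units]
for t in tagged:
    print(t)
```

Each unit now carries structured findings alongside the raw text, which is exactly the form later models and reports can consume.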

Please remember that NoSQL document stores like MongoDB, along with big-data platforms like Hadoop, can help keep the data in JSON format.
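Keeping such tagged units in JSON, as the note above suggests, is a round-trip with the standard library; the document shape here is illustrative:

```python
import json

# Sketch: serialize a unit of unstructured data plus its metadata to JSON,
# the format NoSQL document stores such as MongoDB work with natively.

doc = {
    "source": "email",
    "text": "file system is full on web-srv-01",
    "tags": ["storage", "capacity"],
}

payload = json.dumps(doc)       # what would be written to the store
restored = json.loads(payload)  # what a later pipeline stage reads back
print(restored["tags"])
```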

Features Required to Handle Unstructured Data in Clouds

  • Some features handle connecting to multiple platforms on a regular basis to fetch data with specific parameters. The data can come with various extensions, and the API can give back various statistical figures for the data scientist to explore before deciding their further course of action.
  • MLOps components must be customizable for text, audio, and image data in streaming or batch format.
  • They must offer universal connectors to fetch data from various data sources, which for unstructured data is typically a file system. Some examples are a local FS, HDFS, or AWS S3. A dict parameter in the class's import-dataset method gives the repository specifications.
  • For an unstructured dataset, the data connector returns a pandas DataFrame with metadata information for the file(s): file size, name, and path.
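A hand-rolled version of such a connector, returning plain dicts rather than a pandas DataFrame, might look like the sketch below; the demo files are created in a throwaway temp directory purely for illustration:

```python
import os
import tempfile

# Sketch: a minimal "data connector" for unstructured files that returns
# one metadata record (name, path, size) per file, mimicking the DataFrame
# of file metadata described above.

def connect(root):
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            records.append({"name": name,
                            "path": path,
                            "size": os.path.getsize(path)})
    return records

# Demo against a throwaway local file system.
with tempfile.TemporaryDirectory() as root:
    for fname, body in [("a.txt", b"hello"), ("b.log", b"hi")]:
        with open(os.path.join(root, fname), "wb") as f:
            f.write(body)
    meta = connect(root)
    print([(m["name"], m["size"]) for m in meta])
```

An HDFS or S3 connector would swap `os.walk` for the remote listing API but return the same metadata shape.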


Fig 3: A typical MLOps production pipeline where ML models are used (a reference Big Data approach)

Fig 3 shows how a composite production pipeline looks under a Big Data architectural model, and how ML models are trained with data and several AI/ML algorithms.

  • The architect must provide a basic explore method to perform numeric analysis on the file sizes of all the files collected. MLOps must allow data to be reused in the models for predictions and analysis.
  • Some features accurately calculate different numeric metrics (e.g. min, max, pdf, quartiles, mean, median, variance) on file sizes. Since the data is in a pandas data frame, one is free to use other libraries for more advanced analysis. It is good practice to version the data at every step of transformation.
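The numeric metrics mentioned above can be reproduced with the standard library alone; the sample file sizes are made up for the demonstration:

```python
import statistics

# Sketch: basic numeric analysis over collected file sizes (bytes),
# mirroring the explore step described above.

sizes = [120, 340, 560, 560, 780, 1200]

summary = {
    "min": min(sizes),
    "max": max(sizes),
    "mean": statistics.mean(sizes),
    "median": statistics.median(sizes),
    "variance": statistics.pvariance(sizes),
    "quartiles": statistics.quantiles(sizes, n=4),
}
print(summary)
```

With the data in a pandas DataFrame the same summary is one `describe()` call, but the statistics module shows there is no magic involved.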

Fig 4a below shows the Kubeflow Pipelines architecture in a typical cloud using Prometheus, metadata management, and logging/syslogs. Fig 4b depicts the ML lifecycle using Kubeflow and K8s with YAML.

Fig 4a: Kubeflow Pipelines architecture in a typical cloud




Fig 4b: ML lifecycle using Kubeflow and K8s with the YAML data serializer

Conclusion

AIOps leverages ML models to make sense of the massive data of large-scale systems, covering structured and unstructured data. However, due to the nature of unstructured operational data, AIOps modeling faces several challenges: data-splitting issues, data anomalies, imbalanced data, data leakage, and erroneous correlation logic leading to concept drift. Inefficient data pipelines and deficient ML and CI/CD pipelines become barriers to producing correct modeling decisions on unstructured datasets. Such issues also threaten the scaling of live migration in a cloud topology, despite the best automation efforts at the VM-cluster and container-cluster level; predicting job failures based on trace data from a large-scale cluster environment will not be possible. An ideal MLOps is one where machine-learning assets are treated consistently with all other software assets within a CI/CD environment. ML models must be deployed alongside the services that wrap them and the services that consume them, as part of a unified release process. By codifying these practices, we can accelerate the adoption of ML/AI in software systems and the fast delivery of intelligent software. Important concepts in MLOps, such as iterative-incremental development, automation, continuous delivery, versioning, testing, reproducibility, and monitoring, are the most vital tasks. With sloppy MLOps, predicting real-time buffer underflows and overflows for HTML5 video and images, or predicting disk failures based on disk-monitoring data from a large-scale cloud storage environment, will not materialize. Data leakage and transmission delay will create severe performance issues like poor MOS scores for voice and video traffic. While time-based splitting of training and validation datasets can significantly reduce such data leakage, counting solely on ML models to fix concept drift will be disappointing and will not meet SLA requirements.
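The time-based split just mentioned can be sketched as follows; the timestamps and cutoff are illustrative. Everything up to the cutoff trains, everything after it validates, so no future information leaks into training:

```python
# Sketch: time-based train/validation split to reduce data leakage.
# Records are (timestamp, payload); nothing after the cutoff is used
# for training, so the model never sees the "future".

def time_split(records, cutoff):
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] <= cutoff]
    valid = [r for r in ordered if r[0] > cutoff]
    return train, valid

records = [(3, "c"), (1, "a"), (5, "e"), (2, "b"), (4, "d")]
train, valid = time_split(records, cutoff=3)
print(train)  # records at t <= 3
print(valid)  # records at t > 3
```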
Going forward, business continuity and digital enterprise transformations will continue to see an explosion of traffic from streaming video, HTML5 video, 3D images, media clips, and audio files, hence the acute urgency of a holistic approach that drives continuous efficiency enhancement of AI/ML and CI/CD pipelines. The analysis of unstructured data is of utmost importance when it derives intelligence together with structured metadata and identifies the customer's KPIs and action points to maximize the customer's business resiliency and success.
