Exploring Data-Centric AI
Patrick Nicolas
Director of Data Engineering @ aidéo technologies | software & data engineering, operations, and machine learning.
Data-Centric AI (DCAI) is a burgeoning domain focused on methods and models that prioritize the selection, monitoring, and enhancement of datasets for the training, validation, and testing of established machine learning models [ref 1].
What you will learn: how to select, analyze, and re-balance training datasets, and how to create or correct labels for building robust ML models.
Introduction
Historically, the data science community has placed undue emphasis on refining models, often overlooking the paramount importance of data quality during both training and inference phases. As the adage goes, "garbage in, garbage out" [ref 2].
The management of input data and label quality has been regarded as an afterthought, driven more by intuition than by a strict engineering process. However, researchers have recently introduced techniques that turn data improvement into an engineering discipline [ref 3].
This post lists some of the important challenges in understanding and managing the quality of feedback, annotations, and input data once a trained and validated model is deployed in production.
Challenges
In this post we assume that the nature or distribution of the data used to train or evaluate a machine learning model may change over time. It is not uncommon for the quality of a trained and validated model to degrade during inference because the distribution of the production data shifts over time.
Let's consider a simple data pipeline for a multi-classification model.
Data scientists and engineers go through the process of training, fine-tuning, and validating a model using data that has undergone selection, sampling, cleaning, preprocessing, and annotation, accomplished through various means including crowdsourcing and the expertise of domain specialists.
When the model meets the established quality metrics and fits within resource limitations, it is then deployed into a production environment, where it encounters numerous challenges.
Let's review some of the key issues that may arise once users exercise the model.
Data distribution shift
Detecting data inconsistency
The different types of data shift can be derived from Bayes' formula [ref 4].
Let's consider production or test data x with label y, and a trained discriminative model that predicts class y with probability p(y|x).
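Writing the joint distribution both ways, p(x, y) = p(y|x) * p(x) = p(x|y) * p(y), yields the usual taxonomy of shifts:
- Covariate shift: p(x) changes while p(y|x) remains stable.
- Label (prior probability) shift: p(y) changes while p(x|y) remains stable.
- Concept drift: the relationship p(y|x) itself changes.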
If the data distribution in the test or production environment differs from that of the training and validation sets, the mismatch can be quantified by comparing the model's error rate on a held-out split of the training data with its error rate on production data.
A notable difference between those two error rates clearly indicates a distribution shift, that is, a mismatch between the data used to train the model and the data encountered in production. Active learning is a widely used method to tackle this issue.
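As an illustration, here is a minimal sketch, assuming numerical features, that flags per-feature covariate shift with SciPy's two-sample Kolmogorov-Smirnov test (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_shift(train_X: np.ndarray,
                           prod_X: np.ndarray,
                           alpha: float = 0.01) -> list:
    """Return indices of features whose train vs. production
    distributions differ according to a two-sample KS test."""
    shifted = []
    for j in range(train_X.shape[1]):
        p_value = ks_2samp(train_X[:, j], prod_X[:, j]).pvalue
        if p_value < alpha:
            shifted.append(j)
    return shifted

# Simulated example: feature 1 drifts in production.
rng = np.random.default_rng(42)
train_X = rng.normal(0.0, 1.0, size=(2000, 3))
prod_X = train_X.copy()
prod_X[:, 1] += 0.8            # mean shift on a single feature
print(detect_covariate_shift(train_X, prod_X))   # -> [1]
```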
Active learning for data distribution shift
Inadequate model predictions could stem from the use of training labels that aren't applicable to the current data distribution. For example, a self-driving vehicle model trained predominantly in urban settings would require newly labeled data from rural and less dense environments.
The expense of acquiring new labels and retraining the existing model can be substantial. Active learning, also known as optimal experimental design in statistics, is a semi-supervised approach that can lessen the need for labeled data when training a model [ref 5].
There are three common architectures for generating or updating labels: membership query synthesis, stream-based selective sampling, and pool-based sampling.
The objective of any sampling method is to identify unlabeled data points that are informative (the current model is uncertain about them), representative of the underlying distribution, and diverse enough to avoid redundant annotations.
There are several algorithms for selecting candidate data points for annotation from a pool of unlabeled data, among them uncertainty sampling (least confidence, margin, or entropy), query-by-committee, and expected model change. A minimal sketch of entropy-based uncertainty sampling is shown below.
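The function name and the budget parameter in this sketch are illustrative, not taken from a specific library:

```python
import numpy as np

def entropy_sampling(pred_probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` unlabeled points the model is least sure about.

    pred_probs: (num_samples, num_classes) class probabilities produced
    by the current model on the unlabeled pool. Returns the indices of
    the points with the highest predictive entropy.
    """
    eps = 1e-12                                   # avoid log(0)
    entropy = -(pred_probs * np.log(pred_probs + eps)).sum(axis=1)
    return np.argsort(entropy)[::-1][:budget]

pool_probs = np.array([[0.98, 0.01, 0.01],        # confident -> skip
                       [0.34, 0.33, 0.33],        # uncertain -> annotate
                       [0.60, 0.30, 0.10]])
print(entropy_sampling(pool_probs, budget=2))     # -> [1 2]
```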
Relabeling
Techniques to address incorrect labeling or annotation of data abound. Preventing, finding, and correcting labeling errors in a large dataset is not a trivial task, even for experts. Here are some of the most interesting approaches to either produce valid labels or correct wrong ones.
Consensus labeling
Consensus labeling is the simplest technique to validate labels produced by a group of annotators or domain experts. With multiple annotators, we need to estimate the following: the consensus label for each item, a confidence (agreement) score for that label, and a quality score for each annotator.
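A minimal sketch of consensus by majority vote, with the agreement ratio serving as a crude confidence score (names are illustrative):

```python
from collections import Counter

def consensus_label(annotations: list) -> tuple:
    """Majority-vote consensus for one item.

    annotations: labels assigned by each annotator, e.g. ['cat', 'cat', 'dog'].
    Returns (consensus label, agreement ratio).
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

print(consensus_label(['cat', 'cat', 'dog']))   # -> ('cat', 0.666...)
```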
Dawid-Skene aggregator
The Dawid-Skene aggregation model is a statistical model that uses confusion matrices to characterize the skill level of annotators. It employs an Expectation-Maximization (EM) algorithm to determine the probability of errors made by annotators, based on their annotations and the probabilities of the true (correct) labels [ref 6, 7]. Utilizing this method necessitates a solid understanding of statistical inference.
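For intuition, here is a compact, didactic EM implementation of the Dawid-Skene model, not a production aggregator; it assumes a dense vote matrix with -1 marking missing votes:

```python
import numpy as np

def dawid_skene(votes: np.ndarray, num_classes: int,
                num_iters: int = 50) -> np.ndarray:
    """Minimal Dawid-Skene EM.

    votes: (num_items, num_annotators) integer labels, -1 for "no vote".
    Returns (num_items, num_classes) posterior probabilities of the
    true label for each item.
    """
    num_items, num_annot = votes.shape

    # Initialize the posterior T with soft majority votes.
    T = np.zeros((num_items, num_classes))
    for i in range(num_items):
        for a in range(num_annot):
            if votes[i, a] >= 0:
                T[i, votes[i, a]] += 1.0
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(num_iters):
        # M-step: class priors and one confusion matrix per annotator.
        priors = T.mean(axis=0)
        conf = np.full((num_annot, num_classes, num_classes), 1e-6)
        for a in range(num_annot):
            for i in range(num_items):
                if votes[i, a] >= 0:
                    conf[a, :, votes[i, a]] += T[i]
            conf[a] /= conf[a].sum(axis=1, keepdims=True)

        # E-step: posterior over true labels given the votes.
        log_T = np.log(priors) + np.zeros((num_items, num_classes))
        for i in range(num_items):
            for a in range(num_annot):
                if votes[i, a] >= 0:
                    log_T[i] += np.log(conf[a, :, votes[i, a]])
        T = np.exp(log_T - log_T.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T

# Three annotators label four items; annotator 2 is unreliable.
votes = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 0, 0],
                  [1, 1, 1]])
print(dawid_skene(votes, num_classes=2).argmax(axis=1))   # -> [0 1 0 1]
```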
Confident learning
Confident learning is a data-centric approach that focuses on label quality by identifying label errors in datasets. This technique enables data scientists to identify mislabeled examples, rank data points by label quality, and train models that are robust to label noise.
The objective is to estimate the joint distribution between the noisy, observed labels and the true latent labels [ref 8]. The method estimates the rates of wrong positive and negative labels; the noise in the observed labels is estimated by learning from uncorrupted labels under ideal conditions.
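A simplified sketch of the idea, estimating the "confident joint" count matrix from out-of-sample predicted probabilities; per-class thresholds are set to the mean self-confidence, and refinements such as probability calibration are omitted:

```python
import numpy as np

def confident_joint(noisy_labels: np.ndarray,
                    pred_probs: np.ndarray) -> np.ndarray:
    """Simplified confident-joint count matrix.

    noisy_labels: (n,) observed (possibly wrong) labels.
    pred_probs:   (n, k) out-of-sample predicted probabilities.
    Entry [i, j] counts examples labeled i that the model confidently
    assigns to class j; large off-diagonal counts suggest label errors.
    """
    n, k = pred_probs.shape
    # Per-class threshold: mean self-confidence of examples labeled j.
    thresholds = np.array(
        [pred_probs[noisy_labels == j, j].mean() for j in range(k)])
    joint = np.zeros((k, k), dtype=int)
    for x in range(n):
        above = np.where(pred_probs[x] >= thresholds)[0]
        if above.size > 0:
            j = above[np.argmax(pred_probs[x, above])]
            joint[noisy_labels[x], j] += 1
    return joint

labels = np.array([0, 0, 1, 1])
probs = np.array([[0.9, 0.1],
                  [0.9, 0.1],
                  [0.9, 0.1],   # labeled 1 but confidently class 0
                  [0.2, 0.8]])
print(confident_joint(labels, probs))   # off-diagonal [1,0] flags the error
```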
Encoding human priors
Incorporating human priors, either domain knowledge or data scientist assumptions, into a model goes well beyond the Bayes prior (class probability) p(C). It may encompass the selection of the neural architecture (convolution, recurrence, ...), domain-related facts, and first-order logic.
The most commonly applied method to inject domain knowledge into a model is knowledge distillation (teacher-student networks in deep learning).
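A common form of the distillation loss blends hard-label cross-entropy with the KL divergence to temperature-softened teacher outputs. A PyTorch sketch, with illustrative hyperparameter values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend hard-label cross-entropy with soft teacher targets.

    The KL term transfers the teacher's "dark knowledge"; the T^2
    factor keeps its gradient scale comparable to the hard loss.
    """
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean") * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

student_logits = torch.randn(8, 5)        # batch of 8, 5 classes
teacher_logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
print(distillation_loss(student_logits, teacher_logits, targets))
```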
Class rebalancing
Training classification models often runs into a well-known challenge: class imbalance. This happens when there is a disproportionate distribution of classes in the dataset, potentially biasing the model's outcome. The phenomenon is particularly evident in binary classification. For instance, a diagnostic model analyzing doctor visit summaries could lean heavily toward detecting prevalent diseases.
Interestingly, many data professionals might overlook that class imbalances can also emerge in a live production setting.
Let's explore methods to tackle this imbalance issue.
Data augmentation
This technique involves creating artificial data from the existing dataset. This could be as simple as creating copies with small alterations. For text data, common methods include synonym replacement, random insertion, random deletion, and sentence shuffling.
Feature- vs. data-space augmentation: augmentation can be applied to the raw data itself (tokens, pixels) or in the feature space, by perturbing embeddings or other latent representations.
Example of augmentation using BERT: the objective is to duplicate a record (a sequence of tokens) and randomly replace a token with '[UNK]'. The original and duplicated (augmented) records, associated with the same label, are fine-tuned (or optionally pre-trained) using the same loss function and MLM model. A sketch of the token replacement step is shown below.
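A minimal sketch of this replacement step; the mask ratio and function name are illustrative:

```python
import random

def unk_augment(tokens: list, mask_ratio: float = 0.15,
                unk_token: str = "[UNK]") -> list:
    """Duplicate a token sequence and replace a random subset of
    tokens with [UNK], mimicking the BERT-style augmentation above."""
    augmented = tokens.copy()
    num_to_mask = max(1, int(len(tokens) * mask_ratio))
    for idx in random.sample(range(len(tokens)), num_to_mask):
        augmented[idx] = unk_token
    return augmented

random.seed(7)
record = "the patient reported mild chest pain".split()
print(unk_augment(record))   # original and augmented share the same label
```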
Under-sampling and over-sampling
Under-sampling involves randomly discarding examples from the majority class, while over-sampling involves duplicating examples from the minority class. However, under-sampling may lead to information loss, while over-sampling can lead to overfitting. A more sophisticated over-sampling method is the Synthetic Minority Over-sampling Technique (SMOTE), but it is mostly applied to numerical data and might not be directly applicable to text [ref 9].
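A short example using the imbalanced-learn package, assuming numerical features; the synthetic data is for illustration only:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(110, 4))              # numerical features only
y = np.array([0] * 100 + [1] * 10)         # 10:1 imbalance

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y))       # Counter({0: 100, 1: 10})
print(Counter(y_res))   # Counter({0: 100, 1: 100})
```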
Class weighting
Class weighting in supervised learning is a technique used to address imbalances in the distribution of different classes within a training dataset. This imbalance can lead to biased models that perform well on the majority class but poorly on the minority class.
Here's a deeper dive into the concept:
The primary goal of class weighting is to make the model pay more attention to the minority class during the training process by assigning a higher weight to the minority class and a lower weight to the majority class. These weights are used during the calculation of the loss function in the training process.
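With scikit-learn, "balanced" weights inversely proportional to class frequencies can be computed explicitly or requested directly from the estimator. A brief sketch:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

y = np.array([0] * 90 + [1] * 10)           # 9:1 imbalance
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))           # {0: ~0.56, 1: 5.0}

# Equivalent shortcut: let the estimator derive the weights itself,
# so the loss penalizes minority-class errors more heavily.
clf = LogisticRegression(class_weight="balanced")
```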
One noticeable drawback is the increase in complexity in model tuning.
Ensemble methods
This approach involves training multiple models and having them vote on the output. This can often lead to improved results, as different models might make different errors, and the ensemble can often make better decisions.
Ensemble learning combines multiple models to improve the overall performance of a predictive task, especially in the presence of imbalanced classes. It rests on the principle that a group of diverse weak learners, when properly combined, can outperform a single strong learner.
The most common techniques in ensemble learning are bagging, boosting, and stacking. A soft-voting ensemble of class-weighted learners is sketched below.
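This scikit-learn sketch averages the predicted probabilities of three class-weighted base learners; the estimator mix and the synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Roughly 5% minority class to mimic an imbalanced problem.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
        ("rf", RandomForestClassifier(class_weight="balanced", random_state=0)),
        ("dt", DecisionTreeClassifier(class_weight="balanced", random_state=0))],
    voting="soft")               # average predicted probabilities
vote.fit(X, y)
print(vote.predict(X[:5]))
```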
Cost-sensitive learning
In cost-sensitive learning, misclassification costs are incorporated into the decision process. The costs are usually set inversely proportional to the class frequencies, which helps the model focus on minority classes.
In many real-world scenarios, the cost of misclassifying one class of data may be much higher than misclassifying another. Cost-sensitive learning involves modifying the learning algorithm so that it minimizes a weighted sum of errors associated with each class [ref 10].
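A minimal sketch of cost-sensitive decision making: given a cost matrix (the values here are illustrative), predict the class that minimizes expected cost instead of taking the argmax:

```python
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i.
# Missing a positive (row 1, col 0) is 10x worse than a false alarm.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

def cost_sensitive_predict(pred_probs: np.ndarray,
                           cost: np.ndarray) -> np.ndarray:
    """Pick the class with the lowest expected misclassification cost."""
    expected = pred_probs @ cost      # (n, k) expected cost per decision
    return expected.argmin(axis=1)

probs = np.array([[0.7, 0.3],         # argmax would say class 0 ...
                  [0.95, 0.05]])
print(cost_sensitive_predict(probs, cost))   # -> [1 0]: cost flips case 1
```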
One-vs-all (OvA) strategy
The one-vs-all strategy is a technique used in machine learning for multi-class classification, particularly for datasets where one class is significantly underrepresented compared to the others.
The concept consists of breaking down a multi-class classification problem into multiple binary classification problems. For each class in the dataset, a separate binary classifier is trained. This classifier distinguishes between the class under consideration (positive case) and all other classes (negative case).
By treating the minority class as the 'positive' case in one of these binary problems, the techniques described in previous sections, such as class weighting or over-sampling, can be applied more effectively [ref 11].
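A brief scikit-learn sketch on synthetic data; the class proportions are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three classes, one of them rare.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           weights=[0.45, 0.45, 0.10], random_state=0)

# One weighted binary classifier per class: each rare class becomes
# the "positive" case of its own binary problem.
ova = OneVsRestClassifier(LogisticRegression(class_weight="balanced",
                                             max_iter=1000))
ova.fit(X, y)
print(len(ova.estimators_))   # -> 3, one binary model per class
```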
Transfer learning
Transfer learning is a technique where a model developed for one task is reused as the starting point for a model on a second task. Transfer learning typically involves using a model pre-trained on a large and diverse dataset [ref 12].
This is a sequential process, typically used when labeled data for the new task is scarce and the new task is closely related to the task the model was originally trained on.
These models have learned rich feature representations that can be beneficial for a wide range of tasks, including those with class imbalance. Since the pre-trained model has already learned a considerable amount of information, the need for a large and balanced dataset for the new task is reduced. This is particularly helpful when the available dataset for the new task is imbalanced.
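A minimal PyTorch/torchvision sketch (torchvision >= 0.13 weights API; the number of target classes is illustrative):

```python
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone and retrain only the head.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False        # freeze the learned representations

num_classes = 4                        # the (possibly imbalanced) new task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
# Only backbone.fc is trained; combine with class weighting if the
# new dataset is imbalanced.
```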
Data privacy
Data privacy may not be at the forefront of data scientists' concerns during the training and tuning of models. However, these issues become crucial for productization. For instance, HIPAA requires that medical records be fully and properly de-identified: any violation incurs significant financial penalties.
Data observability
Data observability is defined by 5 key attributes [ref 13]: freshness, distribution, volume, schema, and lineage.
The 3 pillars of observability data are metrics, traces, and logs.
Data downtime = Number of incidents * (Time to detection + Time to resolution)
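For example, with illustrative numbers, 5 incidents per month, each taking 4 hours to detect and 6 hours to resolve, amount to 5 * (4 + 6) = 50 hours of monthly data downtime.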
Automated data pipeline
Finally, the variety and breadth of techniques used to address data quality issues is a critical impediment to deploying and orchestrating these solutions. Quite often, these techniques are evaluated and deployed in an ad hoc fashion.
One effective solution is to create a configurable streaming pipeline of various data manipulation and correction techniques that execute concurrently.
Open-source frameworks such as Apache Spark and Kafka can be a starting point for building a data-centric AI platform for test and production data, as sketched below.
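As a sketch only, here is a PySpark Structured Streaming stage that reads raw records from a Kafka topic, applies one simple quality rule, and republishes the cleaned stream; the topic names, servers, and the rule itself are hypothetical, and the Spark-Kafka connector package must be available at runtime:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("data-quality-pipeline")
         .getOrCreate())

# Source: raw inference inputs arriving on a (hypothetical) Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "inference-inputs")
       .load())

# One stage of the pipeline: drop records with empty payloads before
# they reach downstream augmentation / relabeling stages.
clean = raw.filter(col("value").isNotNull())

query = (clean.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "clean-inputs")
         .option("checkpointLocation", "/tmp/chk")
         .start())
```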
The design of the annotator interface is a critical element of a successful active learning or relabeling strategy.
Thank you for reading this article. For more information ...
References
[3] What is data centric AI?
---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design, and end-to-end deployment and support, with extensive knowledge in machine learning. He has been director of data engineering at Aideo Technologies since 2017 and is the author of "Scala for Machine Learning" (Packt Publishing, ISBN 978-1-78712-238-3).