Data-Centric ML
Data-centric ML is the discipline of systematically engineering the data used to build ML and artificial intelligence (AI) systems. The data-centric AI and ML movement is grounded in the philosophy that data quality is more important than data volume when it comes to building highly informative models. Put another way, it is possible to achieve more with a small but high-quality dataset than with a large but noisy dataset. For most ML use cases, it is not feasible to build models based on very large datasets, say millions of observations, simply because the volume of data doesn’t exist. In other words, the potential use of ML as a tool to solve certain problems is often ignored on the basis that the available dataset is too small.
But what if we can use ML to solve problems based on much smaller datasets, even down to less than 100 observations? This is one challenge the data-centric movement is attempting to solve through systematic data collection and engineering.
Andrew Ng is spearheading the data-centric ML movement. He and his team at Landing AI have demonstrated how data-centric ML has led to significant improvements in results on computer vision problems.
Data is the key ingredient of ML systems, and it yields the greatest benefit when data of the right quality can be sourced for the problem at hand.
In today's competitive world, all companies have access to the same algorithms and infrastructure; it is the data that gives them an edge over competitors.
There are a few core principles of data-centric ML, as outlined below.
Principle 1: Data should be the center of ML development
Data is unique to every company, problem, and situation, and the data-centric paradigm recognizes this by putting the spotlight and development efforts on the data before the model. Data is no longer a static asset that can be collected at the beginning and forgotten about; it is now a unique commodity that needs to be leveraged to its full potential to make better predictions. We will argue that in many cases, a company’s proprietary data is its only truly unique competitive advantage – so long as it’s leveraged.
Data-centricity requires a mindset shift from “I’ll build the best model with this data” to “How can we make the best dataset to solve this particular problem?” To do that, we need the whole organization involved in a coordinated effort. This brings us to the second principle of data-centric ML.
Principle 2: Leverage data labelers and SMEs effectively
Even in the ChatGPT age, labeled data remains unavoidable in AI/ML. Labeling can be done either manually or programmatically. Manual labeling with the help of subject matter experts (SMEs) can be critical in niche domain use cases. It’s important to note that SMEs can be incredibly valuable contributors to almost any ML exercise. For example, we often use SMEs to help us review the outputs of our models, because doing so allows us to discover new contexts in the problem space that should become features in the training data.
Programmatic labeling techniques are often all that’s required to lift the quality of your data. However, in some cases, relationships between features are too complex for rules-based algorithms to do the job.
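As a minimal sketch of what programmatic labeling can look like, the example below applies a few simple rule-based labeling functions to raw text and resolves their votes by majority. The labeling functions, label names, and ticket texts are hypothetical illustrations, not a reference to any specific tool or dataset.

```python
from collections import Counter

ABSTAIN = None  # returned when a rule does not apply to an example

# Hypothetical labeling functions for a support-ticket classifier.
# Each encodes one simple heuristic and abstains otherwise.
def lf_refund(text):
    return "billing" if "refund" in text.lower() else ABSTAIN

def lf_crash(text):
    return "bug" if "crash" in text.lower() else ABSTAIN

def lf_password(text):
    return "account" if "password" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_refund, lf_crash, lf_password]

def label(text):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no rule fired; leave unlabeled for manual review
    return Counter(votes).most_common(1)[0][0]

print(label("Please issue a refund for my last invoice"))  # -> "billing"
print(label("Hello, just saying thanks!"))                 # -> None
```

Examples where every rule abstains (or where rules conflict) are exactly the cases worth routing to SMEs for manual labeling, which is how the two approaches complement each other.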
Principle 3: Use ML to improve data
Just as we can use a programmatic or algorithmic approach to label our data, we can also use ML to identify data points that may be wrong or ambiguous. By leveraging developments in explainability, error analysis, and semi-supervised approaches, we can create new labels and find data points to improve or discard.
The use of ML to improve input data quality is a fundamental shift in the traditional approach to ML. It requires a mindset shift from using ML models to make the best prediction to using ML to identify the data points that are not helping model performance. After all, the goal of data-centric ML is to increase signal and reduce noise in our input data.
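One simple way to use a model to surface suspect data points is a nearest-neighbor consistency check: flag any example whose label disagrees with most of its neighbors. This is just one illustrative technique among the explainability and semi-supervised approaches mentioned above; the threshold and synthetic data below are assumptions for the sketch.

```python
import numpy as np

def flag_suspect_labels(X, y, k=5, agreement_threshold=0.4):
    """Flag indices whose label disagrees with most of their k nearest neighbors."""
    suspects = []
    for i in range(len(X)):
        distances = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(distances)[1:k + 1]  # skip the point itself
        agreement = np.mean(y[neighbors] == y[i])
        if agreement < agreement_threshold:
            suspects.append(i)
    return suspects

# Tiny synthetic demo: two well-separated clusters with one flipped label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[5] = 1  # inject a label error into the first cluster

print(flag_suspect_labels(X, y))  # the flipped example (index 5) is flagged
```

Flagged points are candidates for relabeling or removal rather than automatic deletion; reviewing them (ideally with SMEs) is what actually increases signal and reduces noise.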
Principle 4: Follow ethical, responsible, and well-governed ML practices
Ethical and responsible ML practices become increasingly important as data-centricity allows us to tackle more high-stakes challenges. This requires you to consider factors such as transparency, fairness, and accountability when designing algorithms so that they do not discriminate against certain groups or individuals. Additionally, those responsible for implementing these systems must be aware of how they work and understand their limitations so that they can make informed decisions about their use.
AI ethics and responsibility is not just a tick-box exercise, but a potential source of differentiation. Organizations that pay attention to AI ethics are more likely to be trusted by their customers, while organizations that overlook it are likely to suffer customer backlash and reputational damage.