Model-Centric to Data-Centric AI: Challenges, trends, and opportunities underlying this shift

“In the last decade, the biggest shift in AI was in embracing deep learning. In this decade, I think the biggest shift will be to Data-Centric AI” – Andrew Ng (link)

Over the last decade, as AI systems were adopted commercially, efforts to improve their performance focused almost entirely on the model: the algorithm, the architecture, the hyperparameters. But after the major advances of recent years (particularly in deep learning), there is a growing realization that many AI systems have reached a point where model improvements alone bring diminishing returns. This has prompted a shift towards viewing an AI system as a combination of model and data, where data is not just an input or a constant but an integral part of the system that can be engineered to improve performance. Machine learning luminary Andrew Ng is a major advocate of data-centric AI, which he describes as “the discipline of systematically engineering the data used to build an AI system” (link).

Its advocates also argue that for AI to penetrate beyond consumer-facing or tech companies, data-centric AI is the way forward. Companies outside these sectors often do not possess big data sets, but they can still realize the benefits of AI by preparing small, high-quality training data sets with the help of subject matter experts. At the same time, many business leaders in traditional industries have realized that AI can become a strategic asset and are investing heavily in it. Against this backdrop, data-centric AI can act as an important enabler in driving AI adoption.

Data-centric AI is still at an early, largely in-principle stage, and many challenges need to be overcome to enable this shift. A few of the major ones are discussed below:

  1. Scalability: As per a report by Cognilytica (link), almost 80% of the time in an ML project is spent on data preparation. Within data preparation, data labeling is the major bottleneck for scaling, since human labelers must go through each data point to complete the task. Employing or contracting human labelers is also expensive.
  2. Adaptability: Updating the data, especially data labels, in response to model performance is again a time-consuming and non-scalable process. Identifying the problem, or the required change in the data set, is not easy either.
  3. Explainability: As AI models are now expected to be more responsible and explainable, any biases they exhibit should be traceable back to the data. This capability of tracing back to the data is a challenge in most scenarios.
  4. Unstructured Data: More and more unstructured data – text, images, and videos – is becoming part of the mainstream software stack, often to improve customer experience. Such data frequently requires domain expertise to make sense of, so the labeling process becomes even more challenging, time-consuming, and subjective, which affects overall data quality. As per CloudFactory (link), it takes around 800 human hours to annotate an hour-long video. Unstructured data sets are also difficult to visualize and manage.
  5. Data Privacy: With stringent privacy regulations, getting access to customer data has become challenging for data scientists. Many privacy-preserving machine learning (PPML) techniques still have practical shortcomings, such as very high processing time.
  6. Edge cases: Training data based on real events may fail to capture rare or unexpected events. Solving this long-tail problem by capturing real data is often prohibitively expensive – for instance, training an autonomous vehicle for the rare scenario of a broken bridge.


Many of these challenges have existed in ML development for a long time, but they become critical when it comes to data-centric AI. One sector has been focused on solving these problems since before data-centric AI came to the forefront: the autonomous vehicle sector, which has received massive investments to achieve a formidable task and is now leading the frontier. As mentioned before, data-centric AI has not yet reached the stage of standardization, so there are many emerging solutions in this space. These early trends can be grouped into three categories – data labeling, data curation, and synthetic data – as discussed below.


Data Labeling

  • Model-Assisted Labeling: Also called hybrid labeling, this approach uses an already trained model to pre-label the data; human labelers then check either the complete data set or only the data points where the model showed low confidence (the first sketch after this list illustrates the workflow). Depending on the data set and model used, it can considerably bring down labeling time. An MIT study (link) on this type of human interaction suggests that human labelers may be biased by pre-labels into accepting them, noting that “…when presented with fully pre-populated suggestions, these expert users exhibit less agency: accepting improper mentions, and taking less initiative in creating additional annotations.”
  • Programmatic Labeling: A relatively new technique that aims to largely automate the labeling process. Subject matter experts define multiple labeling functions based on rules or heuristics; data points that can be labeled by these functions are labeled first, and weak supervision models are then used to label the complete data set (see the second sketch after this list). The technique drastically reduces labeling time and makes labels easy to adapt by changing the labeling functions. Biases in the output can also be traced back to specific labeling functions. Snorkel AI, a start-up born out of the Stanford AI Lab, is leading developments in this technology and counts Google, Intel, Microsoft, and many other Fortune 500 companies among its customers.
  • Scale, Hive, Snorkel, and Labelbox are start-ups that have achieved unicorn status by providing data labeling solutions. Scale was valued at more than $7 billion in its last funding round in 2021. It started with a narrow focus on labeling solutions for the autonomous vehicle industry, where it gained strong momentum.
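Below is a minimal sketch of the model-assisted labeling workflow, assuming a generic scikit-learn classifier; the function name, the seed/unlabeled split, and the 0.9 confidence threshold are illustrative assumptions rather than any vendor's actual API.

```python
# Minimal sketch of model-assisted (hybrid) labeling: a model trained on a small
# seed set pre-labels the remaining data, and only low-confidence predictions
# are routed to human labelers for review.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pre_label(X_seed, y_seed, X_unlabeled, confidence_threshold=0.9):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_seed, y_seed)                          # train on the labeled seed set

    probabilities = model.predict_proba(X_unlabeled)   # pre-label the rest
    confidence = probabilities.max(axis=1)
    predictions = model.classes_[probabilities.argmax(axis=1)]

    auto_accepted = np.where(confidence >= confidence_threshold)[0]
    needs_human_review = np.where(confidence < confidence_threshold)[0]
    return predictions, auto_accepted, needs_human_review
```

Routing only the low-confidence slice to humans is where the time savings come from, though, as the MIT study above cautions, reviewers should still audit a sample of the auto-accepted labels.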
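And here is a minimal sketch of programmatic labeling on a hypothetical spam-detection task; the labeling functions and the simple majority vote are illustrative, whereas production tools such as Snorkel replace the vote with a learned label model that weights each function by its estimated accuracy.

```python
# Minimal sketch of programmatic labeling: subject matter experts write labeling
# functions (rules/heuristics), whose votes are aggregated into a single label.
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_limited_offer(text):        # heuristic: promotional language
    return SPAM if "limited offer" in text.lower() else ABSTAIN

def lf_unsubscribe_footer(text):   # heuristic: bulk-mail footer
    return SPAM if "unsubscribe" in text.lower() else ABSTAIN

def lf_short_personal_note(text):  # heuristic: very short messages are usually ham
    return HAM if len(text.split()) < 8 else ABSTAIN

LABELING_FUNCTIONS = [lf_limited_offer, lf_unsubscribe_footer, lf_short_personal_note]

def label(text):
    votes = [vote for vote in (lf(text) for lf in LABELING_FUNCTIONS) if vote != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(label("LIMITED OFFER!!! Click now, or unsubscribe below."))  # -> 1 (SPAM)
```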

Data Curation

  • Active Learning: In this technique, a machine learning model iteratively points out which data should be labeled next, starting from a small set of high-quality labeled data points (a minimal sketch follows this list). Both Scale AI and Labelbox provide active learning tools to reduce data labeling time.
  • Self-Supervision: Generally used when labeled data is limited; the available data is augmented to create new training examples. In contrastive learning, for instance, the model is trained on positive (similar) and negative (dissimilar) pairs of data points so that it learns features that bring similar points together and push dissimilar ones apart.
  • De-biasing: Bias in an AI system often originates from spurious correlations in the training data set, and such biases can cost businesses heavily. Themis AI, a very early-stage start-up spun off from the MIT Computer Science and Artificial Intelligence Laboratory, is trying to solve this problem. It has created an end-to-end platform for de-biasing both the training data and the model, and also issues de-biasing certificates.
  • Data Management Platforms: Many end-to-end ML training platforms have emerged that treat data as the main asset. Many of them also provide advanced capabilities to help better understand and improve data quality. For instance, Scale AI has a platform named Nucleus which provides capabilities to identify mislabeled data, mine edge cases, debug models, and explore or query even unstructured data based on annotations, metadata, model predictions, and more.
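Below is a minimal sketch of the active learning loop mentioned above (pool-based uncertainty sampling); the `ask_human` callback, the number of rounds, and the batch size are stand-ins for whatever labeling interface and budget a team actually uses.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling: in each
# round, the model is retrained and the pool examples it is least confident about
# are sent to a human oracle for labeling.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, ask_human, rounds=5, batch_size=10):
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

        uncertainty = 1.0 - model.predict_proba(X_pool).max(axis=1)   # least-confident sampling
        query_idx = np.argsort(uncertainty)[-batch_size:]             # most uncertain examples

        new_labels = ask_human(X_pool[query_idx])                     # human-in-the-loop step
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, query_idx, axis=0)

    # Retrain once more so the final model also benefits from the last labeled batch.
    return LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
```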

Synthetic Data


  • While synthetic data has been discussed for a long time, it is only in the last few years that it has started seeing major adoption. One of the main drivers of this recent growth is stringent privacy regulation; the healthcare and finance industries in particular have seen immense adoption because of it. Apart from data privacy, synthetic data is also helping solve the problem of rare edge cases missing from real data, through simulation.
  • According to a widely reported (link) Gartner study, synthetic data will account for 60% of the data used for AI development by 2024. Synthetic data also helps address the challenge of manual labeling, as the data is already labeled when it is generated. Paul Walborsky, co-founder of one of the first synthetic data services, told NVIDIA (link), “a single image that could cost $6 from a labeling service can be artificially generated for six cents”. It is also believed that synthetic data could help democratize AI by taking away the hegemony of tech giants like Meta and Google, which currently sit on massive stores of consumer data.
  • Another major driver has been innovation on the technology front. Generative Adversarial Networks (GANs), introduced by Ian Goodfellow et al. in 2014, are the technology that has enabled the controlled creation of synthetic data, especially in the domain of computer vision. At the heart of a GAN are two neural-network sub-models – a generator and a discriminator. The generator creates synthetic data from random input, which is fed to the discriminator; the discriminator tries to distinguish the generated data from real data, and its output acts as feedback to the generator. The two sub-models keep learning and generating data until the discriminator can no longer distinguish generated data from real data (a toy training loop is sketched after this list).
  • At present, the market is filled with startups providing synthetic data solutions. A few look promising and are now leading the front, viz. Datagen (Series B), Gretel (Series B), Mostly AI (Series B), Tonic AI (Series B), and Synthesis AI (Series A). Many of them are currently aligned to a particular industry or use case. For instance, Datagen and Synthesis AI specialize in generating human data (such as human faces) and cater mainly to the AR/VR industry. Seeing the potential, Meta also acquired a budding start-up from this space, AI.Reverie.
  • At the same time, many data scientists remain skeptical about the “reality gap” that exists between synthetic and real data. Continued innovation on the technology front should help reduce or overcome this gap, and startups are addressing the issue by coming up with various metrics to measure the fidelity of synthetic data. Gretel, for example, calculates an overall synthetic data quality score (link) from three different statistical metrics – field correlation stability, deep structure stability, and field distribution stability (a generic sketch of such a distribution check also follows this list). But standardization in the way synthetic data quality is measured is still needed to assuage the concerns of skeptics.
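The following is a toy PyTorch sketch of the generator/discriminator loop described above, fitting a 1-D Gaussian instead of images; the network sizes, learning rates, and target distribution are illustrative assumptions, not a production GAN.

```python
# Toy GAN training loop: the generator learns to mimic samples from N(5, 2),
# while the discriminator learns to tell real samples from generated ones.
import torch
import torch.nn as nn

latent_dim = 8
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()
real_label, fake_label = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(2000):
    real = torch.randn(64, 1) * 2.0 + 5.0             # "real" data: N(5, 2)
    fake = generator(torch.randn(64, latent_dim))     # synthetic samples

    # Discriminator step: label real samples 1 and generated samples 0.
    opt_d.zero_grad()
    d_loss = bce(discriminator(real), real_label) + bce(discriminator(fake.detach()), fake_label)
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label generated samples as real.
    opt_g.zero_grad()
    g_loss = bce(discriminator(fake), real_label)
    g_loss.backward()
    opt_g.step()
```

In principle, training converges when the discriminator's accuracy on generated samples drops to chance, which is the equilibrium the bullet above describes.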
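And below is a generic sketch of one kind of fidelity check for tabular synthetic data, comparing per-column value distributions of the real and synthetic sets with the Jensen-Shannon distance; this only illustrates the idea behind metrics such as field distribution stability and is not Gretel's actual implementation.

```python
# Generic per-column fidelity check for tabular synthetic data: a score close to
# 1.0 means the synthetic column's distribution closely matches the real one.
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon

def field_distribution_similarity(real: pd.DataFrame, synthetic: pd.DataFrame, bins=20):
    scores = {}
    for column in real.columns:
        r, s = real[column], synthetic[column]
        if pd.api.types.is_numeric_dtype(r):
            # Shared bin edges so both histograms are directly comparable.
            edges = np.histogram_bin_edges(pd.concat([r, s]), bins=bins)
            p, _ = np.histogram(r, bins=edges)
            q, _ = np.histogram(s, bins=edges)
        else:
            categories = sorted(set(r) | set(s))
            p = r.value_counts(normalize=True).reindex(categories, fill_value=0).values
            q = s.value_counts(normalize=True).reindex(categories, fill_value=0).values
        # Jensen-Shannon distance is 0 for identical distributions, 1 for disjoint ones.
        scores[column] = 1.0 - jensenshannon(p, q, base=2)
    return scores
```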

Data-centric AI has the potential to drive the next wave of AI adoption, but this is still just the beginning. Preparing quality training data at scale and at low cost remains an unsolved problem, and there is room for innovation on many fronts. Many industry-specific or use-case-specific solutions have emerged, but the processes and metrics are not yet standardized. As such, there could be no better time than now to build products and solutions that help enable the shift from model-centric to data-centric AI.
