The best guide for your AI applications and projects
High-quality datasets for AI training make your AI applications robust and highly intelligent.

Data-Centric, Code/Model-Centric, and Computation-Centric Approaches in AI Projects

The foundations of an AI system are code, computation, and data. All of these components play an important role in developing a robust model, but which one should you focus on most? In this article, we will highlight the pros and cons of the model-, computation-, and data-centric approaches, discuss which one is better, and talk about how to adopt a data-centric infrastructure.

Model-centric approach 

The model-centric approach means conducting experimental research to improve ML model performance. This involves selecting the best model architecture and training process from a wide range of possibilities (a minimal sketch follows the list below).

  • In this approach you keep the data the same and improve the code or model architecture.
  • Working on code is the central objective of this approach.
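As a rough illustration, here is what a model-centric iteration loop can look like: the dataset never changes, and only the model architecture and hyperparameters are swapped out. This is a minimal sketch using scikit-learn; the candidate models and their parameters are purely illustrative, not recommendations.

```python
# Model-centric loop: the dataset stays fixed, only the model changes.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # the data never changes

# Illustrative candidates only; in practice this search is much larger.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest_100": RandomForestClassifier(n_estimators=100, random_state=0),
    "random_forest_500": RandomForestClassifier(n_estimators=500, random_state=0),
}

for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean cross-validation accuracy = {score:.3f}")
```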

Model-centric trends in the AI world

Currently, most AI applications are model-centric. One possible reason is that the AI sector pays careful attention to academic research on models; according to Andrew Ng, more than 90% of research papers in this domain are model-centric. This is because it is difficult to create large datasets that can become generally recognized standards, so the AI community believes that model-centric machine learning is more promising. While the focus is on the code, data is frequently overlooked, and data collection is treated as a one-time event.

Data-centric approach 

In an age where data is at the core of every decision-making process, a data-centric company can better align its strategy with the interests of its stakeholders by using the information generated from its operations. This way the results can be more accurate, organized, and transparent, which helps an organization run more smoothly.

  • This approach involves systematically altering/improving datasets in order to increase the accuracy of your ML applications (see the sketch after this list).
  • Working on data is the central objective of this approach.
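To make the contrast with the model-centric loop concrete, here is a minimal sketch of a data-centric iteration: the model stays fixed while the dataset is cleaned between evaluations. The file name, column names, and cleaning rules below are hypothetical.

```python
# Data-centric loop: the model stays fixed, only the data changes.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate(data: pd.DataFrame) -> float:
    """Score the same fixed model on a given version of the dataset."""
    model = LogisticRegression(max_iter=1000)  # never changes between iterations
    X = data.drop(columns=["label"])
    y = data["label"]
    return cross_val_score(model, X, y, cv=5).mean()

df = pd.read_csv("defects.csv")  # hypothetical dataset
print("baseline accuracy:", evaluate(df))

# Iteration 1: drop exact duplicate rows that inflate some classes.
df = df.drop_duplicates()

# Iteration 2: keep only rows whose label is in the allowed set
# (a stand-in for fixing inconsistent annotations).
df = df[df["label"].isin({0, 1})]

print("after data clean-up:", evaluate(df))
```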

The data-driven/data-centric conundrum

Data-driven vs data-centric

Many people confuse the data-centric and data-driven approaches. A data-driven approach is a methodology for gathering, analyzing, and extracting insights from your data; it is sometimes referred to as “analytics.” The data-centric approach, on the other hand, is focused on using data to define what you should create in the first place.

  • Data-centric architecture refers to a system in which data is the primary and permanent asset, whereas applications change.
  • Data-driven architecture means the creation of technologies, skills, and an environment by ingesting a large amount of data.

Let’s now talk about how a data-centric approach differs from a model-centric approach and the need for it in the first place.

Data-centric approach vs model-centric approach

To data scientists and machine learning engineers, the model-centric approach may seem more appealing. This is understandable, since practitioners get to apply their expertise to a specific problem. Labeling data, on the other hand, is viewed as a one-time job that no one wants to spend the entire day on.

However, in today’s machine learning, data is crucial, yet it’s often overlooked and mishandled in AI initiatives. As a result, hundreds of hours are wasted fine-tuning a model on faulty data. That could very well be the fundamental cause of your model’s lower accuracy, and it has nothing to do with model optimization.

Model-centric ML:

  • Working on code is the central objective.
  • The model is optimized so it can deal with the noise in the data.
  • Data labels are often inconsistent.
  • The data is fixed after standard preprocessing.
  • The model is improved iteratively.

Data-centric ML:

  • Working on data is the central objective.
  • Rather than gathering more data, investment goes into data quality tools to work on noisy data.
  • Data consistency is key.
  • The code/algorithms are fixed.
  • Data quality is improved iteratively.

You don’t have to become completely data-centric; sometimes it’s important to focus on the model and code. It’s great to do research and improve models, but data is also important, and we tend to overlook it while focusing on the model. The best way is to adopt a hybrid approach that considers both data and models: depending on your application, you can focus more on the data and less on the model, but both should be taken into account.

The need for a data-centric infrastructure

Model-centric ML refers to machine learning systems that are primarily concerned with optimizing model architectures and their parameters.

Model-centric ML application

The model-centric workflow depicted in the graphic above is suitable for a few industries, such as media and advertising, but consider healthcare or manufacturing. They may face challenges such as:

1. High-level customization is required

Unlike media and advertising industries, a manufacturing business with several goods cannot use a single machine learning system to detect production faults across all of its products. Instead, each manufactured product would require a distinctly trained ML model.

While media companies can afford to have an entire ML department working on every little optimization problem, a manufacturing business that requires several ML solutions cannot scale such a template.

2. Importance of large datasets

In most cases, companies do not have a large number of data points to work with. Instead, they are often forced to deal with tiny datasets, which are prone to disappointing outcomes if their approach is model-centric.

Andrew Ng explains in his AI talk why he believes data-centric ML is more rewarding and advocates for a shift in the community toward data-centrism. He gives the example of a steel defect detection problem in which the model-centric approach fails to improve the model’s accuracy, while the data-centric approach boosts it by 16%.

Data is extremely important in AI research, and adopting a strategy that prioritizes obtaining high-quality data is critical – after all, relevant data is not just rare and noisy, but also extremely expensive to get. The idea is that data in AI should be treated the way we would treat the best materials when building a house: evaluated at every stage rather than inspected as a one-time event.

Data-centric ML application

Adopting a data-centric infrastructure 

Treat data as a fundamental asset that will outlast applications and infrastructure when implementing a data-centric architecture. This approach does not need a single database or data repository, but rather a shared understanding of the data with a uniform description. Data-centric ML makes data sharing and movement simple. 

So, what exactly does data-centric machine learning involve? What essential factors should you consider while implementing a data-centric approach?

1. Data label quality

Data labeling is the process of assigning one or more labels to data; the labels carry specific values that are applied to the data. When a significant number of images are incorrectly labeled, the results are worse than when fewer, but accurately labeled, images are used.

The labels provide detailed information about the content and structure of a dataset, such as which data types, measurement units, and time periods it represents. The best way to improve label quality is to find the inconsistencies in labels and refine the labeling instructions. Later in this article, we’ll learn more about the importance of data quality.
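One simple way to surface label inconsistencies is to collect every annotation per item and flag the items whose annotators disagree. The sketch below assumes a hypothetical list of (item, annotator, label) records:

```python
# Surface label inconsistencies: collect every label given to each item and
# flag items whose annotators disagree. The annotation records are hypothetical.
from collections import defaultdict

annotations = [  # (item_id, annotator, label)
    ("img_001", "annotator_a", "defect"),
    ("img_001", "annotator_b", "defect"),
    ("img_002", "annotator_a", "defect"),
    ("img_002", "annotator_b", "ok"),  # disagreement -> needs review
]

labels_per_item = defaultdict(set)
for item_id, _annotator, label in annotations:
    labels_per_item[item_id].add(label)

conflicts = [item for item, labels in labels_per_item.items() if len(labels) > 1]
print("items to re-check against the labeling instructions:", conflicts)
```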

2. Data augmentation

Data augmentation is a data analysis technique that creates new data points through interpolation, extrapolation, or other means. It can be used to introduce more training data for machine learning, or to produce synthetic images or video frames with varying degrees of realism. It helps increase the number of relevant data points (such as the number of faulty production components) by creating data that your model hasn’t yet seen during training.
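For image data, even simple geometric transformations produce new training samples. The following is a minimal NumPy sketch; production pipelines usually rely on dedicated augmentation libraries such as torchvision or albumentations.

```python
# Simple geometric augmentation with NumPy: flips and 90-degree rotations
# create training samples the model has not seen yet.
import numpy as np

def augment(image: np.ndarray) -> list:
    """Return a few augmented copies of an (H, W, C) image."""
    return [
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # rotate 90 degrees
        np.rot90(image, k=2),  # rotate 180 degrees
    ]

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)  # dummy image
augmented = augment(image)
print(f"generated {len(augmented)} augmented samples from one image")
```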

However, adding data isn’t always the best option. Getting rid of the noisy observations that cause the high variance improves the model’s capacity to generalize to new data.

3. Feature engineering

Feature engineering is the process of creating new input features by transforming raw data, applying prior knowledge, or using algorithms. It is used in machine learning to help increase the accuracy of a predictive model.

Improving data quality involves improving both the input data and the target/labels. Feature engineering is crucial for adding features that may not exist in their raw form but can make a significant difference.
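As a small illustration, the pandas sketch below derives a ratio feature and two temporal features from raw columns; the column names and values are hypothetical.

```python
# Derive features that are not present in the raw data: a ratio feature and
# two temporal features.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-02 08:15", "2023-01-07 22:40"]),
    "units_produced": [120, 95],
    "defects": [3, 9],
})

df["defect_rate"] = df["defects"] / df["units_produced"]   # ratio feature
df["hour"] = df["timestamp"].dt.hour                       # time-of-day feature
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5       # weekend indicator

print(df)
```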

4. Data versioning

Data versioning plays an important role in any software application. As a developer, you may want to track down a bug by comparing two versions, or prevent one by redeploying an earlier version. Managing dataset access, as well as the many versions of each dataset over time, is difficult and error-prone. Data versioning is one of the most integral steps in maintaining your data: it’s what helps you keep track of changes (both additions and deletions) to your dataset, and it makes it easy to collaborate on code and manage datasets.

Versioning also makes it easy to manage the ML pipeline from proof of concept to production, which is where MLOps tools come to the rescue. You might be wondering why MLOps tools are discussed in the context of data versioning: it’s because managing data pipelines is a significantly difficult task in the development of machine learning applications, and versioning ensures reproducibility and reliability.
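If you are not yet using a dedicated tool, a minimal way to version a dataset is to fingerprint its contents and store that fingerprint with each training run. The sketch below assumes a hypothetical data/train directory; purpose-built tools such as DVC handle this far more robustly.

```python
# A minimal way to "version" a dataset without special tooling: hash the
# file contents and record the digest alongside each training run.
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(root: str) -> str:
    """Return a short, deterministic fingerprint of every file under root."""
    digest = hashlib.sha256()
    for file in sorted(Path(root).rglob("*")):
        if file.is_file():
            digest.update(str(file.relative_to(root)).encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()[:12]

version = dataset_fingerprint("data/train")  # hypothetical path
Path("run_metadata.json").write_text(json.dumps({"data_version": version}))
print("training data version:", version)
```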

WOW AI LLC is one of the trusted companies when it comes to providing high-quality datasets for AI training and applications at a larger scale. They support a number of industries, including:

  1. Government
  2. Manufacturing
  3. Banking / Finance
  4. Healthcare
  5. E-commerce

Government

AI helps the government to ease the burden

AI has the potential to help ease the burden through resource optimization, smarter cities, traffic congestion relief, and improved citizen well-being and education. Here are some use cases:

  • Conversational AI: chatbot, call center assistant to answer all citizens' questions
  • Compliance tracking
  • Gather traffic information and make traffic systems smarter

Manufacturing

How can AI help with manufacturing processes?

AI is vital from product engineering and supply chain management to logistics and circular management:

  • Self-driving vehicles development
  • Maintenance and quality inspection
  • Product design

Banking / Finance

How is AI used in the banking and finance industry?

Overall, AI can identify potentially fraudulent transactions more swiftly and accurately than traditional methods relying on human staff. Here are some use cases:

  • Conversational AI: chatbot, call center assistant
  • Fraud detection, risk management
  • KYC support: AI makes confirming customer identity easier

Healthcare

AI is increasingly a part of our healthcare ecosystem.

Some use cases in the healthcare industry: 

  • AI helps to detect diseases at an early stage
  • Cognitive technology unlocks vast amounts of health data and powers diagnosis
  • AI improves patient outcomes

E-commerce

How will AI change e-commerce?

Here are the top use cases:

  • Personalized Services such as website personalization, recommendation systems
  • Conversational AI for chatbot, call center assistants
  • Supply Chain management: damage detection, predictive maintenance, inventory planning

Which one to prioritize: data quantity or data quality?

Before going any further, I’d like to emphasize that more data does not automatically mean better data. Sure, a neural network can’t be trained with a handful of images, but the emphasis is now on quality rather than quantity.

Data quantity

It refers to the amount of data accessible. The main goal is to gather as much data as possible and then train a neural network to learn the mappings.

Sizes of Kaggle datasets (figure)

As seen in the above graphic, the majority of Kaggle datasets aren’t that large. In a data-centric approach, the size of the dataset isn’t that important, and a lot can be done with a small, high-quality dataset.

Data quality

Data quality, as the name suggests, is all about quality. It makes no difference if you don’t have millions of samples; what matters is that the data you have is of high quality and properly labeled.

Different approaches to drawing bounding boxes | Source: inspired by Andrew Ng

The graphic above shows different ways to label the data; there’s nothing wrong with labeling the items independently or as one combined box. However, if data scientist 1 labels the pineapples separately while data scientist 2 labels them combined, the data will be inconsistent and the learning algorithm will get confused. The main goal is to maintain consistency in labels: if you’re labeling items independently, make sure all labels are drawn the same way.

Data annotation consistency is critical, since any discrepancy might derail the model and make your evaluation inaccurate. As a result, you’ll need to carefully define annotation guidelines to ensure that ML engineers and data scientists label data consistently. According to research, roughly 3.4 percent of samples in frequently used datasets are mislabeled, with larger models being the most affected.
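A quick sanity check for annotation consistency is to measure how often two annotators assign the same label to the same item. The sketch below uses hypothetical labels and reports raw percentage agreement; real studies often use chance-corrected metrics such as Cohen's kappa.

```python
# Quick annotation-consistency check: raw percentage agreement between two
# annotators on the same items. The labels here are hypothetical.
annotator_1 = {"img_1": "pineapple", "img_2": "pineapple", "img_3": "background"}
annotator_2 = {"img_1": "pineapple", "img_2": "background", "img_3": "background"}

shared_items = annotator_1.keys() & annotator_2.keys()
agreement = sum(annotator_1[k] == annotator_2[k] for k in shared_items) / len(shared_items)
print(f"raw agreement: {agreement:.0%}")  # low agreement -> tighten the labeling guidelines
```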

Importance of consistency in small datasets

In the image above, Andrew Ng explains the importance of consistency in small datasets. The graph illustrates the relationship between the voltage and speed of a drone: you can confidently fit the curve and get higher accuracy even with a small dataset, as long as the labels are consistent.

Low-quality data means that flaws and inaccuracies can go undetected indefinitely. The accuracy of your models depends on the quality of your data; if you want to make good decisions, you need accurate information. Data with poor attributes is at risk of containing errors and anomalies, which can be very costly when using predictive analytics and modeling techniques.

When it comes to data, how much is too much?

The amount of data you have is critical; you must have enough data to solve your problem. Deep networks are low-bias, high-variance machines, and the usual answer to the variance problem is more data. But how much data is enough? That’s a more difficult question to answer than you might expect. YOLOv5, for example, suggests:

  • Minimum 1.5k images per class
  • Minimum 10k instances (labeled objects) per class total

Having a large amount of data is a benefit, not a must.
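To check your own dataset against rules of thumb like the YOLOv5 guidance above, you can simply count images and labeled instances per class. The annotation record format in the sketch below is hypothetical.

```python
# Count images and labeled instances per class and compare against
# rule-of-thumb minimums such as the YOLOv5 guidance quoted above.
from collections import Counter

annotations = [
    {"image": "img_0001.jpg", "class": "scratch"},
    {"image": "img_0001.jpg", "class": "dent"},
    {"image": "img_0002.jpg", "class": "scratch"},
    # ... thousands more records in a real dataset
]

MIN_IMAGES_PER_CLASS = 1_500
MIN_INSTANCES_PER_CLASS = 10_000

instances = Counter(record["class"] for record in annotations)
images = {
    cls: len({r["image"] for r in annotations if r["class"] == cls})
    for cls in instances
}

for cls in sorted(instances):
    if images[cls] < MIN_IMAGES_PER_CLASS or instances[cls] < MIN_INSTANCES_PER_CLASS:
        print(f"{cls}: {images[cls]} images / {instances[cls]} instances - below the rule of thumb")
```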

Best practices for a data-centric approach

Keep these things in mind if you’re adopting a data-centric approach:

  • Ensure high-quality data consistency across the ML project lifecycle.
  • Make the labels consistent.
  • Use production data to get timely feedback.
  • Use error analysis to focus on a subset of data.
  • Eliminate noisy samples; as discussed above, more data is not always better.

Where to find high-quality datasets for AI training on a larger scale?

Wow AI offers libraries of high-quality datasets for your AI projects and applications. They support enterprises, private companies, and IT outsourcing agencies with their dataset requirements, whether off-the-shelf or custom training datasets.
