Fixing (parts of) your Labeled Dataset

Intro that you'll probably skip

Supervised learning, i.e. training machine learning algorithms with annotated data, dominates commercial AI applications. This has led to tremendous pain for developers, who struggle to get their data collected, cleaned and labeled before they can even start working on actual development. And it has led to the rise of the labeling industry, consisting of hundreds of companies offering data annotation services; sometimes employing AI-powered automation ("pre-labeling"), usually relying on hundreds or even thousands of manual labelers.

As more and more machine learning applications move out of the research world and into productization, it becomes apparent that quality has been ... a bit neglected so far, compared to more mature fields of engineering. At Incenda AI, we have been working with TÜV SÜD on a whitepaper that discusses what the data labeling process should look like.

But as of today, the situation is what it is: many of you already have large volumes of data at hand, on your servers, and it may not be good enough. The actual metrics for assessing data quality vary from application to application (mAP, IoU, ...), and measuring them accurately is a challenge of its own - a topic for another article, so let's remain agnostic about the metric for now.

You may have pre-labeled data that shows systematic errors, or you may have received flawed labels from a supplier - either way, you may be at a point where you have assessed the quality of your labeled dataset and it is simply not good enough. Now you could have all of it re-labeled manually, 100% of the data points, maybe even by multiple parties for redundancy. But often that is not feasible commercially (cost) or practically (time, resources).

In this situation, you will find yourself asking "how much do I have to fix so that the dataset is good enough?". I have seen this situation a couple of times and wanted to share some thoughts on it. It boils down to a mixture problem.

Fixing a part of your dataset - a Mixture Problem

Let’s look at Mixture Problems:

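The illustration from the original post is not preserved here; as a minimal sketch of the general relationship (generic symbols A, B, C, not taken from the lost figure), the classic mixture balance reads:

```latex
% Mixing an amount n_A at grade r_A with an amount n_B at grade r_B
% yields a total of n_A + n_B at grade r_C:
n_A \, r_A + n_B \, r_B = (n_A + n_B) \, r_C
\qquad\Longleftrightarrow\qquad
r_C = \frac{n_A \, r_A + n_B \, r_B}{n_A + n_B}
```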

If we know the quality level of our original data, r_O, and the quality level we can reach by fixing data, r_B, it is not hard to derive the amount of data we have to fix, n_B. However, we must take into account that we are fixing a part of the original data, so the more we fix, the less remains at the original quality:

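The corresponding equation was embedded as an image; reconstructed from the definitions in the text (original quality r_O, refined quality r_B, n_B fixed data points out of n_O in total, overall quality r_C), the quality balance is presumably:

```latex
% The untouched part (n_O - n_B) stays at grade r_O,
% the fixed part n_B is raised to grade r_B:
r_C \, n_O = r_O \, (n_O - n_B) + r_B \, n_B
```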

The amount of data that needs to be fixed depends on the original amount of data, n_O. But we are mainly interested in the share of the original data we have to refine: s = n_B/n_O.

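Again, the formula itself was an image; dividing the balance above by n_O and solving for s = n_B/n_O should give:

```latex
% Share of the dataset that has to be refined to reach target quality r_C:
s = \frac{n_B}{n_O} = \frac{r_C - r_O}{r_B - r_O},
\qquad r_O \le r_C \le r_B .
```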

If the quality level of refined data r_B is known (e.g. 99%, which is already challenging to achieve), the share of data s that must be refined to reach an overall quality r_C can be tabulated for convenience. Such a table illustrates intuitively how quickly the effort scales with a low original grade r_O and a high target grade r_C.
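The table in the original post was an image and is not preserved; the short Python sketch below rebuilds an equivalent table from the formula above, assuming r_B = 99% and an illustrative grid of r_O and r_C values (not the author's original ones):

```python
# Share s of the dataset that must be refined to reach target quality r_C,
# given original quality r_O and refined quality r_B (here assumed 99%).

R_B = 0.99  # quality level reached by refining (assumption, see text)

def share_to_fix(r_o: float, r_c: float, r_b: float = R_B) -> float:
    """s = (r_C - r_O) / (r_B - r_O), from the mixture balance above."""
    if not (r_o <= r_c <= r_b):
        raise ValueError("requires r_O <= r_C <= r_B")
    return (r_c - r_o) / (r_b - r_o)

originals = [0.70, 0.80, 0.90, 0.95]  # illustrative original grades r_O
targets   = [0.90, 0.95, 0.98]        # illustrative target grades r_C

print("r_O \\ r_C" + "".join(f"{t:>8.0%}" for t in targets))
for r_o in originals:
    cells = [
        f"{share_to_fix(r_o, r_c):>8.0%}" if r_c >= r_o else f"{'-':>8}"
        for r_c in targets
    ]
    print(f"{r_o:>9.0%}" + "".join(cells))
```

For example, with an original quality of 80% and a target of 95%, roughly 79% of the dataset would have to be refined, which shows how steeply the effort grows as the target approaches the achievable refinement quality.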

Summary

Once again, high school math provides us with an interesting perspective, this time by showing how much effort it takes to raise the quality level of our dataset. Using some simple formulas, we can assess the share of the data we have to fix in order to reach a target quality level, and beyond that we can also quickly see how the original quality level (pre-labeling accuracy!), the level achievable by fixing, etc. change the game.

Raising the disclaimer from the intro again - this was intentionally agnostic of the actual quality metric and the approach for measuring it; upcoming articles will elaborate on that.

And I can imagine that this article will immediately raise the question of how applications can be released at all, knowing that there are still errors in the data used to build them. A valid question that should be discussed in greater detail, but to give an initial reply already: because it is better to fix at least a part of the data than to leave it broken as it is, and because the result may suffice for the quality/maturity target of the individual application.

---

Want to discuss Data Quality? At Incenda AI we obsess about it - reach out!

Sébastien ESKENAZI

Director of R&D at Pixelz Inc

4y

If I may, you forgot another very useful and a lot cheaper way to improve the quality of a dataset: filter out the bad data. Obviously one should pay attention not to create a bias, and this requires having enough data in the first place. But if the data collection is automated so that one has enough images and the errors are somewhat random (or the resulting bias is acceptable), then filtering data is a lot faster, hence cheaper, than fixing it. Sometimes it can even be automated. For example, at Pixelz Inc we can collect data (images) straight from our production at a rate above 100k samples per week, and with adequate logic we can automatically filter out the bad samples, resulting in roughly 50-60k perfect samples per week at practically no cost. Of course, in the case of multi-class image segmentation one would need to be smart about the filtering in order to remove only the bad labels and keep the good ones in a given image without penalising "false" false positives during the training. But I think it should be doable.

Werner Streich

Senior Business Advisor

4y

This is very pragmatic thinking. How good is good enough is always a central question, not only in terms of functionality but also in terms of project-time and -resource planning. Having some metrics to define a desired target quality level is an interesting approach.

Vinayak Kamath

Enterprise IT | Future Mobility | DevOps | MBA

4y

An interesting read...
