Fixing (parts of) your Labeled Dataset

Intro that you'll probably skip

Supervised learning, i.e. training machine learning algorithms with annotated data, dominates commercial AI applications. This has led to tremendous pain for developers, who struggle to get their data collected, cleaned and labeled before they can even start working on actual development. And it has led to the rise of the labeling industry, consisting of hundreds of companies offering data annotation services; sometimes employing AI-powered automation ("pre-labeling"), usually relying on hundreds or even thousands of manual labelers.

As more and more machine learning applications move out of the research world and into productization, it becomes apparent that quality has been ... a bit neglected so far, compared to more mature fields of engineering. At Incenda AI, we have been working with TÜV SÜD on a whitepaper that discusses what the data labeling process should look like.

But as of today, the situation is what it is: many of you already have large volumes of data at hand, on your servers, and it may not be good enough. The actual metrics for assessing data quality vary from application to application (mAP, IoU, ...), and measuring them accurately is a challenge of its own - a topic for another article, so let's remain agnostic about the metric for now.

You may have pre-labeled data that shows systematic errors, or you may have received flawed labels from a supplier - either way, you may be at a point where you have assessed the quality of your labeled dataset and it is simply not good enough. Now you could have all of it re-labeled manually, 100% of the data points, maybe even by multiple parties for redundancy. But often that is not feasible commercially (cost) or practically (time, resources).

In this situation, you will find yourself asking "how much do I have to fix so that the dataset is good enough?". I have seen this situation a couple of times and wanted to share some thoughts on it. It boils down to a mixture problem.

Fixing a part of your dataset - a Mixture Problem

Let’s look at Mixture Problems:

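The illustration from the original post is not preserved here; as a minimal sketch of the general relationship (generic symbols A, B, C, not taken from the lost figure), the classic mixture balance reads:

```latex
% Mixing an amount n_A at grade r_A with an amount n_B at grade r_B
% yields a total of n_A + n_B at grade r_C:
n_A \, r_A + n_B \, r_B = (n_A + n_B) \, r_C
\qquad\Longleftrightarrow\qquad
r_C = \frac{n_A \, r_A + n_B \, r_B}{n_A + n_B}
```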

If we know the quality level of our original data, r_O, and the quality level we can reach by fixing data, r_B, it is not hard to derive the amount of data we have to fix, n_B. However, we must take into account that we are fixing a part of the original data, so the more we fix, the less remains at the original quality:

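The corresponding equation was embedded as an image; reconstructed from the definitions in the text (original quality r_O, refined quality r_B, n_B fixed data points out of n_O in total, overall quality r_C), the quality balance is presumably:

```latex
% The untouched part (n_O - n_B) stays at grade r_O,
% the fixed part n_B is raised to grade r_B:
r_C \, n_O = r_O \, (n_O - n_B) + r_B \, n_B
```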

The amount of data that needs to be fixed depends on the original amount of data, n_O. But we are mainly interested in the share of the original data we have to refine: s = n_B/n_O.

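Again, the formula itself was an image; dividing the balance above by n_O and solving for s = n_B/n_O should give:

```latex
% Share of the dataset that has to be refined to reach target quality r_C:
s = \frac{n_B}{n_O} = \frac{r_C - r_O}{r_B - r_O},
\qquad r_O \le r_C \le r_B .
```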

If the quality level of refined data r_B is known (e.g. 99%, which is already challenging to achieve), the share of data s that must be refined to reach an overall quality r_C can be tabulated for convenience. Such a table illustrates intuitively how quickly the effort scales with a low original grade r_O and a high target grade r_C.
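The table in the original post was an image and is not preserved; the short Python sketch below rebuilds an equivalent table from the formula above, assuming r_B = 99% and an illustrative grid of r_O and r_C values (not the author's original ones):

```python
# Share s of the dataset that must be refined to reach target quality r_C,
# given original quality r_O and refined quality r_B (here assumed 99%).

R_B = 0.99  # quality level reached by refining (assumption, see text)

def share_to_fix(r_o: float, r_c: float, r_b: float = R_B) -> float:
    """s = (r_C - r_O) / (r_B - r_O), from the mixture balance above."""
    if not (r_o <= r_c <= r_b):
        raise ValueError("requires r_O <= r_C <= r_B")
    return (r_c - r_o) / (r_b - r_o)

originals = [0.70, 0.80, 0.90, 0.95]  # illustrative original grades r_O
targets   = [0.90, 0.95, 0.98]        # illustrative target grades r_C

print("r_O \\ r_C" + "".join(f"{t:>8.0%}" for t in targets))
for r_o in originals:
    cells = [
        f"{share_to_fix(r_o, r_c):>8.0%}" if r_c >= r_o else f"{'-':>8}"
        for r_c in targets
    ]
    print(f"{r_o:>9.0%}" + "".join(cells))
```

For example, with an original quality of 80% and a target of 95%, roughly 79% of the dataset would have to be refined, which shows how steeply the effort grows as the target approaches the achievable refinement quality.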

Summary

Once again, high school math provides us with an interesting perspective, this time by showing how much effort it takes to raise the quality level of our dataset. Using some simple formulas, we can assess the share of the data we have to fix in order to reach a target quality level, and beyond that we can also quickly see how the original quality level (pre-labeling accuracy!), the level achievable by fixing, etc. change the game.

Raising the disclaimer from the intro again - this was intentionally agnostic of the actual quality metric and the approach for measuring it; upcoming articles will elaborate on that.

And I can imagine that this article will immediately raise the question of how applications can be released at all, knowing that there are still errors in the data used to build them. A valid question that should be discussed in greater detail, but to give an initial reply already: because it is better to fix at least a part of the data than to leave it broken as it is, and because the result may suffice for the quality/maturity target of the individual application.

---

Want to discuss Data Quality? At Incenda AI we obsess about it - reach out!

Sébastien ESKENAZI

Director of R&D at Pixelz Inc

4y

If I may, you forgot another very useful and a lot cheaper way to improve the quality of a dataset: filter out the bad data. Obviously one should pay attention not to create a bias, and this requires having enough data in the first place. But if the data collection is automated so that one has enough images and the errors are somewhat random (or the resulting bias is acceptable), then filtering data is a lot faster, hence cheaper, than fixing it. Sometimes it can even be automated. For example, at Pixelz Inc we can collect data (images) straight from our production at a rate above 100k samples per week, and with adequate logic we can automatically filter out the bad samples, resulting in roughly 50-60k perfect samples per week at practically no cost. Of course, in the case of multi-class image segmentation one would need to be smart about the filtering in order to remove only the bad labels and keep the good ones in a given image without penalising "false" false positives during the training. But I think it should be doable.

Werner Streich

Senior Business Advisor

4y

This is very pragmatic thinking. How good is good enough is always a central question, not only in terms of functionality but also in terms of project-time and -resource planning. Having some metrics to define a desired target quality level is an interesting approach.

Vinayak Kamath

Enterprise IT | Future Mobility | DevOps | MBA

4y

An interesting read...
