Training data: the more the better?
In Machine Learning the mantra for many years has been: "more data equals better results!" So far, this has held true in most cases. Datasets have grown from several thousand data points to billions and billions, with a tendency to grow even larger.
However, acquiring ever more data brings its own set of challenges. Training time increases, and data collection can be tedious and expensive, especially for very specific domains. Moreover, data is complex and not all of it is equal: redundancies and quality issues tend to increase when amassing large amounts of data. Hence, the question arises: does it really always have to be more data? Could better data improve performance just as well as more data?
This article examines how more data and higher-quality data influence the performance of a machine learning system.
More data or better quality?
To evaluate this, experiments are run on the KITTI dataset with RetinaNet, a bounding box detector, in order to investigate data quality. The scenario has two variables: data quality and dataset size. For the data quality aspect, labels of the dataset are randomly damaged to varying degrees, e.g. 5% of the labels have issues with the size of the bounding box. The flawed data is then used to train the neural net on different fractions of the total dataset (25% to 100%). The mAP value serves as the evaluation criterion.
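The label-damaging step can be sketched as follows. This is a minimal illustration, not the exact procedure used in the experiments: the function name, the perturbation range `max_scale`, and the box format are assumptions.

```python
import random

def corrupt_labels(boxes, error_rate, max_scale=0.3, seed=0):
    """Randomly damage a fraction of bounding box labels by rescaling
    their width/height, simulating the label-quality issues described
    above. `boxes` is a list of (x_min, y_min, x_max, y_max) tuples.

    Note: illustrative sketch only; the real experiment may perturb
    boxes differently.
    """
    rng = random.Random(seed)  # seeded so each run is reproducible
    damaged = []
    for box in boxes:
        if rng.random() < error_rate:  # e.g. error_rate=0.05 -> ~5% damaged
            x_min, y_min, x_max, y_max = box
            w, h = x_max - x_min, y_max - y_min
            # shrink or grow the box by up to max_scale of its size
            dw = w * rng.uniform(-max_scale, max_scale)
            dh = h * rng.uniform(-max_scale, max_scale)
            box = (x_min, y_min, x_max + dw, y_max + dh)
        damaged.append(box)
    return damaged
```

With `error_rate=0.0` the labels pass through untouched; raising the rate toward 0.35 reproduces the quality grades compared in the experiment.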
mAP stands for mean Average Precision and is a popular metric in object detection. As the name suggests, it is the mean of the Average Precision (AP) of each object class. Average Precision is the area under the Precision-Recall curve of an object detector for one class.
This is done for 10 random seeds and yields the following results:
It is clearly visible that the best performance can only be reached if the data quality is sufficiently high. Furthermore, the positive effect of additional data diminishes, especially for data with a very high (0% error rate) or very low (35% error rate) quality grade. This becomes easier to see by looking at the mAP improvement for each increase in dataset size:
With each step, the performance gains from more data become smaller. A different behavior is observed for increasing data quality. Below, the average improvement for every decrease in error rate is given:
Here, a more consistent improvement per step is seen across all quality levels and dataset sizes, although it too diminishes once a high quality level is reached.
What to take away?
The results here at least indicate that more data is not the only way to reach better model performance.
For safety-critical applications such as those in the ADAS/AD area, quality assurance is essential. It is not the quantity but the required data quality that must be ensured in order to analyze and solve emerging problems in ADAS function development.
An increase in data quality by fixing label issues can have an equal or even greater positive impact. This is especially useful when data is rare or hard to acquire. Furthermore, the best levels of performance cannot be reached with bad or false labels.
Admittedly, this represents only one use case, and results will likely vary depending on the domain, the amount of data, and the model. Nonetheless, the results are consistent with our experience from working with customers across a range of industries. If you are interested in the topic or want to know more about the quality of your data, just get in touch with us.