Training data: the more the better?
In Machine Learning the mantra for many years has been: "more data equals better results!" So far, this has held true in most cases. Datasets have grown from several thousand data points to billions and billions, with a tendency to grow even larger.
However, acquiring ever more data brings its own set of challenges. Training time increases, and data collection can be tedious and expensive, especially for very specific domains. Moreover, data is complex and not all of it is equal: redundancies and quality issues tend to increase when amassing large amounts of data. Hence, the question arises: does it really always have to be more data? Could better data improve performance just as well as more data?
This article examines how more data and higher-quality data influence the performance of a machine learning system.
More data or better quality?
To evaluate this, experiments are run on the KITTI dataset with RetinaNet, a bounding box detector, in order to investigate data quality. The scenario has two variables: data quality and dataset size. For the data quality aspect, labels of the dataset are randomly damaged to varying degrees, e.g. 5% of the labels have issues with the size of the bounding box. The flawed data is then used to train the neural net on different fractions of the total dataset (25% to 100%). The mAP value serves as the evaluation criterion.
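The label-damaging step can be sketched as follows. This is a minimal illustration, not the exact procedure used in the experiments: the function name, the perturbation range `max_scale`, and the box format are assumptions.

```python
import random

def corrupt_labels(boxes, error_rate, max_scale=0.3, seed=0):
    """Randomly damage a fraction of bounding box labels by rescaling
    their width/height, simulating the label-quality issues described
    above. `boxes` is a list of (x_min, y_min, x_max, y_max) tuples.

    Note: illustrative sketch only; the real experiment may perturb
    boxes differently.
    """
    rng = random.Random(seed)  # seeded so each run is reproducible
    damaged = []
    for box in boxes:
        if rng.random() < error_rate:  # e.g. error_rate=0.05 -> ~5% damaged
            x_min, y_min, x_max, y_max = box
            w, h = x_max - x_min, y_max - y_min
            # shrink or grow the box by up to max_scale of its size
            dw = w * rng.uniform(-max_scale, max_scale)
            dh = h * rng.uniform(-max_scale, max_scale)
            box = (x_min, y_min, x_max + dw, y_max + dh)
        damaged.append(box)
    return damaged
```

With `error_rate=0.0` the labels pass through untouched; raising the rate toward 0.35 reproduces the quality grades compared in the experiment.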
mAP stands for mean Average Precision and is a popular metric in object detection. As the name suggests, it is the mean of the Average Precision (AP) of each object class. Average Precision is the area under the Precision-Recall curve of an object detector for one class.
This is done for 10 random seeds and yields the following results:
It is clearly visible that the best performance can only be reached if the data quality is sufficiently high. Furthermore, the positive effect of additional data diminishes, especially for data with a very high (0% error rate) or very low (35% error rate) quality grade. This becomes easier to see by looking at the mAP improvement for each increase in dataset size:
With each step, the performance gains from more data become smaller. A different behavior is observed for increasing data quality. Below, the average improvement for every decrease in error rate is given:
Here, a more consistent improvement per step is seen across all quality levels and dataset sizes, although it too diminishes once a high quality level is reached.
What to take away?
The results here at least indicate that more data is not the only way to reach better model performance.
For safety-critical applications such as those in the ADAS/AD area, quality assurance is essential. It is not the quantity but the required data quality that must be ensured in order to analyze and solve emerging problems in ADAS function development.
An increase in data quality by fixing label issues can have an equal or even greater positive impact. This is especially useful when data is rare or hard to acquire. Furthermore, the best levels of performance cannot be reached with bad or false labels.
Admittedly, this represents only one use case, and results will likely vary depending on the domain, the amount of data, and the model. Nonetheless, the results are consistent with our experience from working with customers across a range of industries. If you are interested in the topic or want to know more about the quality of your data, just get in touch with us.