Sample Size Determination for Data Quality Checks

Intro you will probably skip

After recently discussing how to Fix (parts of) your Labeled Dataset, let's now look at how to assess the quality level of a dataset. For the metric we can assume a pass/fail outcome per data point, based e.g. on thresholds on IoU, mAP, F1, etc. - how to define such metrics will be covered in a separate article. We could go through the whole dataset, assess every single data point, and take the average. For large datasets this is hardly feasible; instead, we'd take a sample and estimate the overall quality level.

This article discusses how to assess dataset quality by sampling, how to determine the size of the sample, and how to interpret the results.

ISO 2859?

ISO 2859 "Sampling procedures for inspection by attributes" addresses quality inspections, but relies heavily on lookup tables. Instead of pointing to those, let's employ some statistics ourselves in the sections below. The ISO standard sits behind a paywall, but some resources (lecture notes etc.) can be found online.

Perhaps most interesting about ISO 2859-1 are the proposed switching rules, which iteratively tighten or relax inspections depending on the measured quality level. A mechanism like this makes sense for labeling processes as well.
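
Since the standard itself is paywalled, here is a minimal sketch of what such switching rules can look like in code. The three state names follow ISO 2859-1, but the concrete thresholds below are illustrative assumptions, not the standard's actual criteria:

```python
# Illustrative sketch of ISO 2859-1 style switching rules.
# The thresholds (2 of 5 failures, 5 consecutive passes) are assumptions
# for demonstration, not the standard's actual switching criteria.

def next_inspection_level(level, recent_results):
    """Tighten or relax inspection based on recent batch outcomes.

    level: one of "reduced", "normal", "tightened"
    recent_results: list of booleans, True = batch passed inspection
    """
    last5 = recent_results[-5:]
    if level == "normal":
        # Tighten if 2 of the last 5 batches failed; relax after 5 passes.
        if last5.count(False) >= 2:
            return "tightened"
        if len(last5) == 5 and all(last5):
            return "reduced"
    elif level == "tightened":
        # Return to normal inspection after 5 consecutive passes.
        if len(last5) == 5 and all(last5):
            return "normal"
    elif level == "reduced":
        # Any failure immediately sends us back to normal inspection.
        if last5 and not last5[-1]:
            return "normal"
    return level
```

Applied to a labeling pipeline, "batch" could be a labeling delivery and "pass" a sampled quality check above threshold; the inspection level would then control how large the next sample has to be.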

Sample Size Determination

If we take a sample of the dataset to estimate its overall quality, how should the parameters be set? How big does the sample need to be? It depends on a couple of factors, especially the confidence we want to have in our estimate. We'll rely on the normal approximation of the sample proportion's distribution.

The parameters we'll need to set are the following:

  • Confidence Level: Reflects the confidence in the result, i.e. how often the true value would lie within the Confidence Interval.
  • Confidence Interval: Reflects the accuracy of the estimate - the interval within which the true value is expected to lie. Defined by the epsilon (= margin of error) parameter: the true value is expected to lie between the measured value +- epsilon.
  • Population Proportion: The share of the population expected to be positive. The sample size reaches its highest value if this proportion is at 50%, so 50% can be used as a worst-case estimate in case the proportion is not known upfront (usually the case when it comes to data quality).
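
The worst-case claim about the 50% proportion can be checked numerically: the unlimited-population sample size is proportional to p * (1 - p), which peaks at p = 0.5. A quick sketch, assuming a 95% confidence level and a 2% margin of error:

```python
from scipy.stats import norm

z = float(norm.ppf(0.975))   # z-score for a 95% two-sided confidence level
epsilon = 0.02               # margin of error

def sample_size(p):
    # unlimited-population sample size; proportional to p * (1 - p)
    return z ** 2 * p * (1 - p) / epsilon ** 2

for p in (0.5, 0.7, 0.9, 0.95):
    print("p = %.2f -> sample size %d" % (p, round(sample_size(p))))
```

The sample size drops sharply as the proportion moves away from 50%, which is why a rough prior on the quality level is already valuable.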

Now that the basics are covered, here is how we can determine the sample size:

z = invnorm(confidence_level + (1 - confidence_level) / 2)

n_unlimited = z^2 * p * (1 - p) / epsilon^2

n = n_unlimited / (1 + (n_unlimited - 1) / population_size)

with p = Population Proportion and population_size = size of the dataset.

If you want to avoid invnorm, you can grab z-score values from one of the many Confidence Level → Z-Score tables. I was surprised to see how often they are used, but don't want to rely on lookup tables. Reference, in case you want to get into the details: Penn State Eberly College of Science - Estimating a Proportion for a Small, Finite Population.

A closer look at some examples

Let's get a feeling about what this means in practice.

For a Population Size of 100000, an epsilon of 2%, and Population Proportion and Confidence Level each ranging from 70% to 97%, we'd have to check between 78 and 2412 data points ... almost two orders of magnitude. Overall, this would still be a fraction (0.08% - 2.41%) of the overall dataset - way less effort than checking the full dataset.

Looking at a couple of parameter settings, we can see that the Sample Size quickly saturates when the Population Size becomes large.
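
The saturation effect can be reproduced with the finite population correction from the formula above. A short sketch, assuming the worst-case proportion of 50%, a 95% confidence level and a 2% margin of error:

```python
from scipy.stats import norm

z = float(norm.ppf(0.975))                  # z-score for a 95% confidence level
p, epsilon = 0.5, 0.02                      # worst-case proportion, 2% margin
n0 = z ** 2 * p * (1 - p) / epsilon ** 2    # unlimited-population sample size

sizes = {}
for N in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    # the finite population correction shrinks the required sample size
    sizes[N] = int(n0 / (1 + (n0 - 1) / N))
    print("Population %10d: sample size %d" % (N, sizes[N]))
```

Beyond a population of roughly a million, the sample size barely moves anymore: it converges toward the unlimited-population value of about 2401 for these parameters.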

Let's Simulate it - Python Code

Formulas are nice, but let's conduct a reality check by simulating this problem in Python:

1) Define vars

population_size = 100000
surveys = 100000
true_population_proportion = 0.95
use_worst_case_estimate = False
population_proportion = 0.5 if use_worst_case_estimate else true_population_proportion
epsilon = 0.02
confidence_level = 0.95

2) Calculate Z-Score

Calculate Z-Score from Confidence Level via invnorm according to the formulas above:

from scipy.stats import norm

z_score = float(norm.ppf(confidence_level + (1 - confidence_level) / 2))
print("Confidence level = %f => z-score = %f" % (confidence_level, z_score))

Result

Confidence level = 0.950000 => z-score = 1.959964

The result matches well with the Confidence Level → Z-Score tables one can find online.

3) Calculate Sample Size

Calculate the Sample Size for unlimited Population Size and for the Population Size defined above:

sample_size_unlimited = z_score ** 2 * population_proportion * (1 - population_proportion) / epsilon ** 2
sample_size_limited = int(sample_size_unlimited / (1 + (sample_size_unlimited - 1) / population_size))
print("Sample size for estimated positive population proportion %f should be %i" % (population_proportion, sample_size_limited))

Result

Sample size for estimated positive population proportion 0.950000 should be 454

454 - not much for a Population Size of 100000.

4) Simulation

Run <surveys=100000> experiments where <sample_size=sample size calculated above> random numbers are generated; if they are below or equal to <true_population_proportion=0.95>, consider them positives, otherwise negatives. If the proportion of positives vs. <sample_size> is within the confidence interval, count it as a positive survey.

import numpy as np
from tqdm import tqdm

results = np.zeros((surveys,))
for i in tqdm(range(surveys)):
    # proportion of positives in one simulated survey
    results[i] = np.mean(np.random.rand(sample_size_limited) <= true_population_proportion)
# share of surveys whose measured proportion lies within +- epsilon of the mean
positive_survey_ratio = np.mean(abs(results - results.mean()) <= epsilon)
print("%f Ratio of surveys in confidence interval" % positive_survey_ratio)
print("%f Confidence level" % confidence_level)

Result

0.948040 Ratio of surveys in confidence interval
0.950000 Confidence level

Nice to see that a good share of the surveys is indeed within the confidence interval, and that this share is close to the Confidence Level. If we rerun the experiment with other parameters, it becomes apparent that the latter only holds for large numbers (for <population_size>, surveys, etc.) due to variance.
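
The variance effect can be made visible directly. The sketch below repeats the survey experiment with few (100) and many (100000) surveys and compares how much the measured ratio scatters; for simplicity it centers the interval on the known true proportion rather than the sample mean, and draws the number of positives per survey from a binomial distribution instead of individual random numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p, epsilon, sample_size = 0.95, 0.02, 454   # values from the example above

def survey_ratio(n_surveys):
    # number of positives per survey is Binomial(sample_size, true_p)
    proportions = rng.binomial(sample_size, true_p, n_surveys) / sample_size
    # share of surveys whose measured proportion lies within +- epsilon of true_p
    return np.mean(np.abs(proportions - true_p) <= epsilon)

small = [survey_ratio(100) for _ in range(20)]
large = [survey_ratio(100_000) for _ in range(20)]
print("std of ratio over 20 runs,    100 surveys each: %.4f" % np.std(small))
print("std of ratio over 20 runs, 100000 surveys each: %.4f" % np.std(large))
```

With only 100 surveys the measured ratio fluctuates by a couple of percentage points from run to run; with 100000 surveys it is stable around the Confidence Level.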

Relevance? What's the point?

The sample size determination approach described above allows us to reliably estimate the quality level of any product, including labeled data, by assessing a sample. This can save a lot of work.

---

See also this article on "Fixing (parts of) your Labeled Dataset" to understand the impact it would have on the overall quality level to fix the flawed data points found in the sample.

Want to discuss Data Quality? At Incenda AI we obsess about it - reach out!
