Sample Size Determination for Data Quality Checks

Intro you will probably skip

After recently discussing how to Fix (parts of) your Labeled Dataset, let's now look at how to assess the quality level of a dataset. For the metric we can assume a pass/fail outcome per data point, based e.g. on thresholds on IoU, mAP, F1, etc. - how to define such metrics will be covered in a separate article. We could go through the whole dataset, assess every single data point, and take the average. For large datasets this is hardly feasible; instead, we'd take a sample and estimate the overall quality level.

This article discusses how to assess dataset quality by sampling, how to determine the size of the sample, and how to interpret the results.

ISO 2859?

ISO 2859 "Sampling procedures for inspection by attributes" addresses quality inspections, but relies heavily on lookup tables. Instead of pointing to those, let's employ some statistics ourselves in the sections below. The ISO standard sits behind a paywall, but some resources (lecture notes etc.) can be found online.

Perhaps most interesting about ISO 2859-1 are the proposed switching rules, which iteratively tighten or relax inspections depending on the measured quality level. A mechanism like this makes sense for labeling processes as well.
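
Since the standard itself is paywalled, here is a minimal sketch of what such switching rules can look like in code. The three state names follow ISO 2859-1, but the concrete thresholds below are illustrative assumptions, not the standard's actual criteria:

```python
# Illustrative sketch of ISO 2859-1 style switching rules.
# The thresholds (2 of 5 failures, 5 consecutive passes) are assumptions
# for demonstration, not the standard's actual switching criteria.

def next_inspection_level(level, recent_results):
    """Tighten or relax inspection based on recent batch outcomes.

    level: one of "reduced", "normal", "tightened"
    recent_results: list of booleans, True = batch passed inspection
    """
    last5 = recent_results[-5:]
    if level == "normal":
        # Tighten if 2 of the last 5 batches failed; relax after 5 passes.
        if last5.count(False) >= 2:
            return "tightened"
        if len(last5) == 5 and all(last5):
            return "reduced"
    elif level == "tightened":
        # Return to normal inspection after 5 consecutive passes.
        if len(last5) == 5 and all(last5):
            return "normal"
    elif level == "reduced":
        # Any failure immediately sends us back to normal inspection.
        if last5 and not last5[-1]:
            return "normal"
    return level
```

Applied to a labeling pipeline, "batch" could be a labeling delivery and "pass" a sampled quality check above threshold; the inspection level would then control how large the next sample has to be.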

Sample Size Determination

If we take a sample of the dataset to estimate its overall quality, how should the parameters be set? How big does the sample need to be? It depends on a couple of factors, especially the confidence we want to have in our estimate. We'll rely on the normal approximation of the sample proportion's distribution.

The parameters we'll need to set are the following:

  • Confidence Level: Reflects the confidence in the result, i.e. how often the true value would lie within the Confidence Interval.
  • Confidence Interval: Reflects the accuracy of the estimate - the interval within which the true value is expected to lie. Defined by the epsilon (= margin of error) parameter: the true value is expected to lie between the measured value +- epsilon.
  • Population Proportion: The share of the population expected to be positive. The sample size reaches its highest value if this proportion is at 50%, so 50% can be used as a worst-case estimate in case the proportion is not known upfront (usually the case when it comes to data quality).
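
The worst-case claim about the 50% proportion can be checked numerically: the unlimited-population sample size is proportional to p * (1 - p), which peaks at p = 0.5. A quick sketch, assuming a 95% confidence level and a 2% margin of error:

```python
from scipy.stats import norm

z = float(norm.ppf(0.975))   # z-score for a 95% two-sided confidence level
epsilon = 0.02               # margin of error

def sample_size(p):
    # unlimited-population sample size; proportional to p * (1 - p)
    return z ** 2 * p * (1 - p) / epsilon ** 2

for p in (0.5, 0.7, 0.9, 0.95):
    print("p = %.2f -> sample size %d" % (p, round(sample_size(p))))
```

The sample size drops sharply as the proportion moves away from 50%, which is why a rough prior on the quality level is already valuable.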

Now that the basics are covered, here is how we can determine the sample size:

z = invnorm(confidence_level + (1 - confidence_level) / 2)

n_unlimited = z^2 * p * (1 - p) / epsilon^2

n = n_unlimited / (1 + (n_unlimited - 1) / population_size)

with p = Population Proportion and population_size = size of the dataset.

If you want to avoid invnorm, you can grab z-score values from one of the many Confidence Level → Z-Score tables. I was surprised to see how often they are used, but don't want to rely on lookup tables. Reference, in case you want to get into the details: Penn State Eberly College of Science - Estimating a Proportion for a Small, Finite Population.

A closer look at some examples

Let's get a feeling about what this means in practice.

For a Population Size of 100000, an epsilon of 2%, and Population Proportion and Confidence Level each ranging from 70% to 97%, we'd have to check between 78 and 2412 data points ... almost two orders of magnitude. Overall, this would still be a fraction (0.08% - 2.41%) of the overall dataset - way less effort than checking the full dataset.

Looking at a couple of parameter settings, we can see that the Sample Size quickly saturates when the Population Size becomes large.
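
The saturation effect can be reproduced with the finite population correction from the formula above. A short sketch, assuming the worst-case proportion of 50%, a 95% confidence level and a 2% margin of error:

```python
from scipy.stats import norm

z = float(norm.ppf(0.975))                  # z-score for a 95% confidence level
p, epsilon = 0.5, 0.02                      # worst-case proportion, 2% margin
n0 = z ** 2 * p * (1 - p) / epsilon ** 2    # unlimited-population sample size

sizes = {}
for N in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    # the finite population correction shrinks the required sample size
    sizes[N] = int(n0 / (1 + (n0 - 1) / N))
    print("Population %10d: sample size %d" % (N, sizes[N]))
```

Beyond a population of roughly a million, the sample size barely moves anymore: it converges toward the unlimited-population value of about 2401 for these parameters.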

Let's Simulate it - Python Code

Formulas are nice, but let's conduct a reality check by simulating this problem in Python:

1) Define vars

population_size = 100000
surveys = 100000
true_population_proportion = 0.95
use_worst_case_estimate = False
population_proportion = 0.5 if use_worst_case_estimate else true_population_proportion
epsilon = 0.02
confidence_level = 0.95

2) Calculate Z-Score

Calculate Z-Score from Confidence Level via invnorm according to the formulas above:

from scipy.stats import norm

z_score = float(norm.ppf(confidence_level + (1 - confidence_level) / 2))
print("Confidence level = %f => z-score = %f" % (confidence_level, z_score))

Result

Confidence level = 0.950000 => z-score = 1.959964

The result matches well with the Confidence Level → Z-Score tables one can find online.

3) Calculate Sample Size

Calculate the Sample Size for unlimited Population Size and for the Population Size defined above:

sample_size_unlimited = z_score ** 2 * population_proportion * (1 - population_proportion) / epsilon ** 2
sample_size_limited = int(sample_size_unlimited / (1 + (sample_size_unlimited - 1) / population_size))
print("Sample size for estimated positive population proportion %f should be %i" % (population_proportion, sample_size_limited))

Result

Sample size for estimated positive population proportion 0.950000 should be 454

454 - not much for a Population Size of 100000.

4) Simulation

Run <surveys=100000> experiments where <sample_size=sample size calculated above> random numbers are generated; if they are below or equal to <true_population_proportion=0.95>, consider them positives, otherwise negatives. If the proportion of positives vs. <sample_size> is within the confidence interval, count it as a positive survey.

import numpy as np
from tqdm import tqdm

results = np.zeros((surveys,))
for i in tqdm(range(surveys)):
    # proportion of positives in one simulated survey
    results[i] = np.mean(np.random.rand(sample_size_limited) <= true_population_proportion)
# share of surveys whose measured proportion lies within +- epsilon of the mean
positive_survey_ratio = np.mean(abs(results - results.mean()) <= epsilon)
print("%f Ratio of surveys in confidence interval" % positive_survey_ratio)
print("%f Confidence level" % confidence_level)

Result

0.948040 Ratio of surveys in confidence interval
0.950000 Confidence level

Nice to see that a good share of the surveys is indeed within the confidence interval, and that this share is close to the Confidence Level. If we rerun the experiment with other parameters, it becomes apparent that the latter only holds for large numbers (for <population_size>, surveys, etc.) due to variance.
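
The variance effect can be made visible directly. The sketch below repeats the survey experiment with few (100) and many (100000) surveys and compares how much the measured ratio scatters; for simplicity it centers the interval on the known true proportion rather than the sample mean, and draws the number of positives per survey from a binomial distribution instead of individual random numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p, epsilon, sample_size = 0.95, 0.02, 454   # values from the example above

def survey_ratio(n_surveys):
    # number of positives per survey is Binomial(sample_size, true_p)
    proportions = rng.binomial(sample_size, true_p, n_surveys) / sample_size
    # share of surveys whose measured proportion lies within +- epsilon of true_p
    return np.mean(np.abs(proportions - true_p) <= epsilon)

small = [survey_ratio(100) for _ in range(20)]
large = [survey_ratio(100_000) for _ in range(20)]
print("std of ratio over 20 runs,    100 surveys each: %.4f" % np.std(small))
print("std of ratio over 20 runs, 100000 surveys each: %.4f" % np.std(large))
```

With only 100 surveys the measured ratio fluctuates by a couple of percentage points from run to run; with 100000 surveys it is stable around the Confidence Level.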

Relevance? What's the point?

The sample size determination approach described above allows us to reliably estimate the quality level of any product, including labeled data, by assessing a sample. This can save a lot of work.

---

See also this article on "Fixing (parts of) your Labeled Dataset" to understand the impact it would have on the overall quality level to fix the flawed data points found in the sample.

Want to discuss Data Quality? At Incenda AI we obsess about it - reach out!
