Dataset QA — how to automatically review a new set of images?
Alexander Berkovich
Principal AI/ML Engineer @ Akridata | Computer Vision Expert | Top AI Voice '24
A dataset of images or videos used for computer vision tasks can be the key to success or failure. A clean dataset paves the way to a great algorithm, model and ultimately system, while a dirty one undermines even the best model: garbage in, garbage out.
The rule of thumb is that more examples are better, right? Not always.
Maintaining dataset quality is tricky. New images could be corrupted, come from undesired sources, or contain irrelevant objects or scenes, so instead of improving model accuracy they degrade it.
How can you tell? How do you check that you aren't unintentionally doing harm?
Manual QA
A manual sanity check is easy: review a handful of images to confirm you got the relevant scene. Maybe write a script to randomly select a few images and check their quality.
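A minimal sketch of such a script, assuming the new images sit in a local folder (the path and sample size below are illustrative), could randomly sample a few files and flag corrupted ones:

```python
import random
from pathlib import Path
from PIL import Image

IMAGE_DIR = Path("data/new_batch")   # hypothetical location of the new batch
SAMPLE_SIZE = 20

paths = sorted(IMAGE_DIR.glob("*.jpg"))
sample = random.sample(paths, min(SAMPLE_SIZE, len(paths)))

for path in sample:
    try:
        with Image.open(path) as img:
            img.verify()             # raises if the file is truncated or corrupted
        with Image.open(path) as img:  # verify() invalidates the handle, so reopen
            print(f"{path.name}: {img.size[0]}x{img.size[1]}, mode={img.mode}")
    except Exception as exc:
        print(f"{path.name}: FAILED ({exc})")
```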
Perhaps run a statistical analysis to confirm the color distribution? Trickier, but doable.
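As one possible version of that check, the sketch below compares average per-channel color histograms of the existing set against the new batch; the folder names are placeholders and histogram overlap is just one reasonable similarity measure:

```python
import numpy as np
from pathlib import Path
from PIL import Image

def channel_histograms(image_dir, bins=32, limit=200):
    """Average normalized per-channel RGB histograms over up to `limit` images."""
    hists = []
    for path in sorted(Path(image_dir).glob("*.jpg"))[:limit]:
        arr = np.asarray(Image.open(path).convert("RGB"))
        counts = [np.histogram(arr[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
        hist = np.stack(counts).astype(float)
        hist /= hist.sum(axis=1, keepdims=True)
        hists.append(hist)
    return np.mean(hists, axis=0)        # shape (3, bins)

base = channel_histograms("data/base")       # hypothetical folders
delta = channel_histograms("data/new_batch")

# Histogram intersection per channel: 1.0 means identical color distributions.
overlap = np.minimum(base, delta).sum(axis=1)
for name, score in zip("RGB", overlap):
    print(f"{name} channel overlap: {score:.2f}")
```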
The bigger and more diverse a dataset becomes, the more complicated the required QA steps get, and manual inspection becomes impractical.
There is a better way!
Data Explorer’s Compare
Akridata’s Data Explorer provides a comprehensive solution. Data Explorer is a platform built to let us focus on the data: curate it, clean it, and make sure we start with, and keep working on, high-quality data when developing computer vision solutions.
In previous posts we saw Data Explorer visualize a dataset, support exploration, image-based search and much more.
In this post, we will see how Data Explorer compares datasets. Within the Compare functionality, we define the existing set of images as the Base and the new set as the Delta.
Automated QA
The new batch, the Delta, is visualized, and users can review and explore the Base, the Delta, or both combined. As a result, basic QA is completed rapidly:
- Checking the scene relevance and quality
- Finding outliers or unrelated images
The example below shows a comparison between an existing batch of satellite images and a new batch of flower images. Such an error could be caused by incorrect marking, human error in sending the wrong batch, or a bug somewhere in the pipeline.
If it goes undetected, it results in wasted labeling costs, time and effort spent on training cycles, and most likely lower model accuracy.
With Data Explorer you can immediately identify the problem, as the two sets are clearly separated:
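If you want to approximate this check in a standalone script, one rough approach (not Data Explorer's internal method) is to embed images with an off-the-shelf backbone and flag new images whose embedding sits far from the Base centroid. The folders, model choice and threshold below are assumptions for illustration:

```python
import torch
from pathlib import Path
from PIL import Image
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights)
model.fc = torch.nn.Identity()           # keep the 512-d pooled features
model.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(image_dir, limit=200):
    feats = []
    for path in sorted(Path(image_dir).glob("*.jpg"))[:limit]:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(model(img).squeeze(0))
    return torch.stack(feats)

base = embed("data/base")                # hypothetical folders
delta = embed("data/new_batch")
centroid = base.mean(dim=0)

# Cosine similarity of each new image to the Base centroid; low values are suspicious.
sims = torch.nn.functional.cosine_similarity(delta, centroid.unsqueeze(0))
THRESHOLD = 0.5                          # illustrative; tune on your own data
for i, s in enumerate(sims):
    if s.item() < THRESHOLD:
        print(f"image {i}: similarity {s.item():.2f} -> possible outlier / wrong batch")
```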
Another level of QA is just as easily completed:
- Confirm image quality
- Confirm distribution of images
The example below demonstrates a Base and Delta coming from the same image source and the same scene type, but the Delta should be discarded and maintenance teams alerted, as the images in the new set are completely blurred. Notice the clear separation between the original, high-quality set and the new, blurry images:
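As a quick scripted proxy for this kind of blur check, the variance of the Laplacian is a common focus indicator: it drops sharply on out-of-focus images. The folder and threshold below are illustrative and should be calibrated against known-sharp Base images:

```python
import cv2
from pathlib import Path

BLUR_THRESHOLD = 100.0                   # illustrative; calibrate on sharp Base images

for path in sorted(Path("data/new_batch").glob("*.jpg")):
    gray = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    if gray is None:
        print(f"{path.name}: unreadable")
        continue
    score = cv2.Laplacian(gray, cv2.CV_64F).var()   # low variance -> few sharp edges
    label = "blurry" if score < BLUR_THRESHOLD else "sharp"
    print(f"{path.name}: focus score {score:.1f} ({label})")
```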
How to pass QA?
Data Explorer allows you to inspect the distribution within each cluster and between clusters, indicating where examples are missing and should be added, and where they are redundant and should be removed.
If the image quality remains high and the distribution of images between clusters and within each cluster is similar to that observed in the Base, the new set can most likely be used and overall dataset quality is preserved.
In the example below, the Base and Delta both come from the same source, the VOC dataset, and the comparison indicates that the new set is of similar quality to the existing one:
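For readers who want a scripted analogue of this cluster-level comparison (again, not Data Explorer's internal method), one rough approach is to cluster the Base embeddings, assign the Delta images to the same clusters, and compare the per-cluster proportions. The embeddings could come from the embed() helper in the earlier sketch; the cluster count and distance measure are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def compare_cluster_distributions(base_emb, delta_emb, n_clusters=8):
    """Return the total variation distance between Base and Delta cluster shares."""
    base_emb = np.asarray(base_emb)
    delta_emb = np.asarray(delta_emb)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(base_emb)
    base_share = np.bincount(kmeans.labels_, minlength=n_clusters) / len(base_emb)
    delta_share = np.bincount(kmeans.predict(delta_emb), minlength=n_clusters) / len(delta_emb)

    for c in range(n_clusters):
        print(f"cluster {c}: base {base_share[c]:.2f} vs delta {delta_share[c]:.2f}")

    # 0.0 means identical cluster distributions, 1.0 means completely disjoint.
    return 0.5 * np.abs(base_share - delta_share).sum()

# Example usage with embeddings from the earlier sketch:
# tvd = compare_cluster_distributions(base.numpy(), delta.numpy())
```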
Summary
In this blog, we saw how to apply QA to new batches of images using Data Explorer’s Compare in order to preserve overall dataset quality. Compare allows us to confirm the relevance and quality of new images, and in doing so to save annotation costs and training cycles and keep model accuracy high.
To learn more, visit us at akridata.ai or click here to register for a free account.