Dataset QA — how to automatically review a new set of images?
New images (green) are blurry, and Data Explorer successfully separates them from the existing batch of sharp images (red, purple)


A dataset of images or videos used for computer vision tasks can be the key to success or failure. A clean dataset can pave the way to a great algorithm, model, and ultimately system, while no matter how good the model or algorithm is, garbage in, garbage out.

The rule of thumb is that more examples are better, right? Not always.

Maintaining dataset quality is tricky. New images could be corrupted, come from undesired sources, or contain irrelevant objects or scenes, so instead of improving model accuracy they degrade it.

How can you tell? How do you check that you aren’t unintentionally doing harm?

Manual QA

A manual sanity check is easy: review a handful of images to confirm you got the relevant scene. Maybe write a script to randomly select a few images and check image quality.

Perhaps run statistical analysis to confirm color distribution? Trickier, but doable.
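
For a small batch, this kind of spot check can be scripted. Below is a minimal sketch, independent of Data Explorer, assuming a folder of JPEGs, OpenCV installed, and an illustrative blur threshold; it samples a few images, flags likely-blurry ones using the variance of the Laplacian, and prints per-channel color statistics:

```python
# Minimal manual-QA sketch: sample random images, flag blurry ones, print color stats.
# The folder name, sample size, and blur threshold are illustrative assumptions.
import random
from pathlib import Path

import cv2  # OpenCV

IMAGE_DIR = Path("new_batch")   # hypothetical folder holding the new images
SAMPLE_SIZE = 20
BLUR_THRESHOLD = 100.0          # variance-of-Laplacian below this is likely blurry; tune per dataset

files = sorted(IMAGE_DIR.glob("*.jpg"))
for path in random.sample(files, min(SAMPLE_SIZE, len(files))):
    img = cv2.imread(str(path))
    if img is None:
        print(f"{path.name}: unreadable or corrupted")
        continue

    # Sharpness proxy: variance of the Laplacian on the grayscale image.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Crude color-distribution check: per-channel (BGR) mean and standard deviation.
    means, stds = cv2.meanStdDev(img)

    flag = "BLURRY?" if sharpness < BLUR_THRESHOLD else "ok"
    print(f"{path.name}: sharpness={sharpness:.1f} [{flag}], "
          f"means={means.ravel().round(1)}, stds={stds.ravel().round(1)}")
```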

The bigger and more diverse a dataset becomes, the more complicated the QA steps required, and manual inspection quickly becomes impractical.

There is a better way!

Data Explorer’s Compare

Akridata’s Data Explorer provides a comprehensive solution. Data Explorer is a platform built to let us focus on the data: curate it, clean it, and make sure we start with, and keep working on, high-quality data when developing computer vision solutions.

In previous posts we saw Data Explorer visualize a dataset, support exploration, image-based search, and much more.

In this post, we will see how Data Explorer compares datasets. Within the Compare functionality, we define the existing set of images as the Base, and the new set as Delta.

Automated QA

The new batch, Delta, is visualized, and users can review and explore Base, Delta, or both Combined. As a result, basic QA is completed rapidly:

  1. Checking the scene relevance and quality
  2. Finding outliers or unrelated images

The example below shows a comparison between an existing batch of satellite images and a new batch of flower images. A similar error could be caused by incorrect marking, human error in sending the wrong batch, or a bug somewhere in the pipeline.

If it goes undetected, it results in wasted labeling costs, time and effort spent on training cycles, and probably lower model accuracy.

With Data Explorer you can immediately identify the problem as the sets are separated:

Comparing a Base set of satellite images (top left) and a Delta set of flowers (bottom right): the Delta is separated from the Base, indicating a completely different image source or scene
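
The same kind of separation can be reproduced conceptually outside the platform: embed every image with a pretrained backbone, project the embeddings to 2D, and color the points by set. The sketch below is an illustrative stand-in, not Data Explorer’s implementation; the ResNet-18 backbone, the PCA projection, and the base_set and delta_set folders are all assumptions:

```python
# Conceptual Base-vs-Delta comparison: embed each image with a pretrained backbone,
# project to 2D, and color by set. Folder names are hypothetical.
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import torch
from PIL import Image
from sklearn.decomposition import PCA
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ResNet-18 with the classification head removed gives 512-d embeddings.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

def embed_folder(folder):
    vecs = []
    for path in sorted(Path(folder).glob("*.jpg")):
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            vecs.append(backbone(x).squeeze(0).numpy())
    return np.stack(vecs)

base = embed_folder("base_set")    # hypothetical existing batch
delta = embed_folder("delta_set")  # hypothetical new batch

# Fit the 2D projection on the Base, apply it to both sets, and plot.
pca = PCA(n_components=2).fit(base)
b2d, d2d = pca.transform(base), pca.transform(delta)

plt.scatter(b2d[:, 0], b2d[:, 1], c="tab:red", label="Base", alpha=0.6)
plt.scatter(d2d[:, 0], d2d[:, 1], c="tab:green", label="Delta", alpha=0.6)
plt.legend()
plt.show()  # a Delta cluster far from every Base cluster points to a different source or scene
```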

Another level of QA is just as easily completed:

  1. Confirm image quality
  2. Confirm distribution of images

The example below demonstrates a Base and a Delta coming from the same image source and the same scene type, but the Delta should be ignored and the maintenance team alerted, because the images in the new set are completely blurred. Notice the clear separation between the original, high-quality set and the new, blurry images:

Comparing an existing set of flowers (red and purple clusters) with a new set of flower images (green). The new set should be ignored and the acquisition flow reviewed, since its images are very blurry
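
This blur gate can also be expressed as a simple automated check: score every image for sharpness and raise an alert when the new batch’s scores fall far below the Base’s. The sketch below is illustrative rather than Data Explorer’s method; the folder names, the variance-of-Laplacian score, and the 0.5 factor are assumptions:

```python
# Sketch of an automated blur gate: compare sharpness distributions of Base vs. Delta
# and alert when the new batch is much blurrier. Folder names and factor are illustrative.
from pathlib import Path

import cv2
import numpy as np

def sharpness_scores(folder):
    scores = []
    for path in sorted(Path(folder).glob("*.jpg")):
        gray = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
        if gray is not None:
            # Variance of the Laplacian: low values mean few edges, i.e. a blurry image.
            scores.append(cv2.Laplacian(gray, cv2.CV_64F).var())
    return np.array(scores)

base_scores = sharpness_scores("base_set")
delta_scores = sharpness_scores("delta_set")

# Alert if the median sharpness of the new set drops well below the Base median.
if np.median(delta_scores) < 0.5 * np.median(base_scores):
    print("ALERT: new batch looks blurred, review the acquisition pipeline "
          f"(Base median={np.median(base_scores):.0f}, Delta median={np.median(delta_scores):.0f})")
else:
    print("Delta sharpness is in line with Base")
```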

How to pass QA?

Data Explorer allows you to inspect the distribution within each cluster and between clusters, indicating where to add examples, where they are missing, and where they should be removed.

If image quality remains high, and the image distribution between clusters and within each cluster is similar to that observed in the Base, the new set can most likely be used and overall dataset quality is preserved.
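
One way to make the distribution criterion concrete is to cluster the Base embeddings, assign each Delta image to its nearest cluster, and compare the per-cluster proportions. The sketch below is a hypothetical check that reuses the base and delta embedding arrays from the comparison sketch above; the cluster count and the 0.5x/2x thresholds are assumptions:

```python
# Sketch of a distribution check: cluster the Base embeddings, assign the Delta
# images to the nearest cluster, and compare per-cluster proportions.
# `base` and `delta` are the embedding arrays from the comparison sketch above.
import numpy as np
from sklearn.cluster import KMeans

def compare_cluster_shares(base, delta, n_clusters=8):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(base)
    base_share = np.bincount(kmeans.labels_, minlength=n_clusters) / len(base)
    delta_share = np.bincount(kmeans.predict(delta), minlength=n_clusters) / len(delta)
    for c, (b, d) in enumerate(zip(base_share, delta_share)):
        note = ""
        if d < 0.5 * b:
            note = "  <- under-represented in the new set"
        elif d > 2.0 * b:
            note = "  <- over-represented, possibly new content"
        print(f"cluster {c}: base {b:.1%} vs delta {d:.1%}{note}")

compare_cluster_shares(base, delta)
```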

In the example below, Base and Delta come from the same source, the VOC dataset, and the comparison indicates the new set is of similar quality to the existing one:

Visualization of the Base and Delta datasets, and both combined. The comparison indicates the new set is of similar quality to the existing one.

Summary

In this post, we saw how to apply QA to new batches of images using Data Explorer’s Compare in order to preserve overall dataset quality. It lets us confirm the relevance and quality of new images, and in doing so save annotation costs and training cycles and keep model accuracy high.

To learn more, visit us at akridata.ai or click here to register for a free account.
