Dataset QA — how to automatically review a new set of images?
Alexander Berkovich
Principal AI/ML Engineer @ Akridata | Computer Vision Expert | Top AI Voice '24
A dataset of images or videos used for computer vision tasks can be the key to success or failure. A clean dataset paves the way to a great algorithm, model and ultimately system, while a dirty one undermines even the best model: garbage in, garbage out.
The rule of thumb is that more examples are better, right? Not always.
Maintaining dataset quality is tricky. New images could be corrupted, come from undesired sources, or contain irrelevant objects or scenes, so instead of improving model accuracy they degrade it.
How can you tell? How do you check that you aren't unintentionally doing harm?
Manual QA
A manual sanity check is easy: review a handful of images to confirm you got the relevant scene. Maybe write a script to randomly select a few images and check their quality.
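A minimal sketch of such a script, assuming the new images sit in a local folder (the path and sample size below are illustrative), could randomly sample a few files and flag corrupted ones:

```python
import random
from pathlib import Path
from PIL import Image

IMAGE_DIR = Path("data/new_batch")   # hypothetical location of the new batch
SAMPLE_SIZE = 20

paths = sorted(IMAGE_DIR.glob("*.jpg"))
sample = random.sample(paths, min(SAMPLE_SIZE, len(paths)))

for path in sample:
    try:
        with Image.open(path) as img:
            img.verify()             # raises if the file is truncated or corrupted
        with Image.open(path) as img:  # verify() invalidates the handle, so reopen
            print(f"{path.name}: {img.size[0]}x{img.size[1]}, mode={img.mode}")
    except Exception as exc:
        print(f"{path.name}: FAILED ({exc})")
```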
Perhaps run a statistical analysis to confirm the color distribution? Trickier, but doable.
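As one possible version of that check, the sketch below compares average per-channel color histograms of the existing set against the new batch; the folder names are placeholders and histogram overlap is just one reasonable similarity measure:

```python
import numpy as np
from pathlib import Path
from PIL import Image

def channel_histograms(image_dir, bins=32, limit=200):
    """Average normalized per-channel RGB histograms over up to `limit` images."""
    hists = []
    for path in sorted(Path(image_dir).glob("*.jpg"))[:limit]:
        arr = np.asarray(Image.open(path).convert("RGB"))
        counts = [np.histogram(arr[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
        hist = np.stack(counts).astype(float)
        hist /= hist.sum(axis=1, keepdims=True)
        hists.append(hist)
    return np.mean(hists, axis=0)        # shape (3, bins)

base = channel_histograms("data/base")       # hypothetical folders
delta = channel_histograms("data/new_batch")

# Histogram intersection per channel: 1.0 means identical color distributions.
overlap = np.minimum(base, delta).sum(axis=1)
for name, score in zip("RGB", overlap):
    print(f"{name} channel overlap: {score:.2f}")
```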
The bigger and more diverse a dataset becomes, the more complicated the required QA steps get, and manual inspection becomes impractical.
There is a better way!
Data Explorer’s Compare
Akridata’s Data Explorer provides a comprehensive solution. Data Explorer is a platform built to let us focus on the data: curate it, clean it, and make sure we start with, and keep working on, high-quality data when developing computer vision solutions.
In previous posts we saw Data Explorer visualize a dataset, support exploration, image-based search and much more.
In this post, we will see how Data Explorer compares datasets. Within the Compare functionality, we define the existing set of images as the Base and the new set as the Delta.
Automated QA
The new batch, the Delta, is visualized, and users can review and explore the Base, the Delta, or both combined. As a result, basic QA is completed rapidly:
- Checking the scene relevance and quality
- Finding outliers or unrelated images
The example below shows a comparison between an existing batch of satellite images and a new batch of flower images. Such an error could be caused by incorrect marking, human error in sending the wrong batch, or a bug somewhere in the pipeline.
If it goes undetected, it results in wasted labeling costs, time and effort spent on training cycles, and most likely lower model accuracy.
With Data Explorer you can immediately identify the problem, as the two sets are clearly separated:
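If you want to approximate this check in a standalone script, one rough approach (not Data Explorer's internal method) is to embed images with an off-the-shelf backbone and flag new images whose embedding sits far from the Base centroid. The folders, model choice and threshold below are assumptions for illustration:

```python
import torch
from pathlib import Path
from PIL import Image
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights)
model.fc = torch.nn.Identity()           # keep the 512-d pooled features
model.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(image_dir, limit=200):
    feats = []
    for path in sorted(Path(image_dir).glob("*.jpg"))[:limit]:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(model(img).squeeze(0))
    return torch.stack(feats)

base = embed("data/base")                # hypothetical folders
delta = embed("data/new_batch")
centroid = base.mean(dim=0)

# Cosine similarity of each new image to the Base centroid; low values are suspicious.
sims = torch.nn.functional.cosine_similarity(delta, centroid.unsqueeze(0))
THRESHOLD = 0.5                          # illustrative; tune on your own data
for i, s in enumerate(sims):
    if s.item() < THRESHOLD:
        print(f"image {i}: similarity {s.item():.2f} -> possible outlier / wrong batch")
```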
Another level of QA is just as easily completed:
- Confirm image quality
- Confirm distribution of images
The example below demonstrates a Base and Delta coming from the same image source and the same scene type, but the Delta should be discarded and maintenance teams alerted, as the images in the new set are completely blurred. Notice the clear separation between the original, high-quality set and the new, blurry images:
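As a quick scripted proxy for this kind of blur check, the variance of the Laplacian is a common focus indicator: it drops sharply on out-of-focus images. The folder and threshold below are illustrative and should be calibrated against known-sharp Base images:

```python
import cv2
from pathlib import Path

BLUR_THRESHOLD = 100.0                   # illustrative; calibrate on sharp Base images

for path in sorted(Path("data/new_batch").glob("*.jpg")):
    gray = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    if gray is None:
        print(f"{path.name}: unreadable")
        continue
    score = cv2.Laplacian(gray, cv2.CV_64F).var()   # low variance -> few sharp edges
    label = "blurry" if score < BLUR_THRESHOLD else "sharp"
    print(f"{path.name}: focus score {score:.1f} ({label})")
```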
How to pass QA?
Data Explorer allows you to inspect the distribution within each cluster and between clusters, indicating where examples are missing and should be added, and where they are redundant and should be removed.
If the image quality remains high and the distribution of images between clusters and within each cluster is similar to that observed in the Base, the new set can most likely be used and overall dataset quality is preserved.
In the example below, the Base and Delta both come from the same source, the VOC dataset, and the comparison indicates that the new set is of similar quality to the existing one:
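For readers who want a scripted analogue of this cluster-level comparison (again, not Data Explorer's internal method), one rough approach is to cluster the Base embeddings, assign the Delta images to the same clusters, and compare the per-cluster proportions. The embeddings could come from the embed() helper in the earlier sketch; the cluster count and distance measure are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def compare_cluster_distributions(base_emb, delta_emb, n_clusters=8):
    """Return the total variation distance between Base and Delta cluster shares."""
    base_emb = np.asarray(base_emb)
    delta_emb = np.asarray(delta_emb)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(base_emb)
    base_share = np.bincount(kmeans.labels_, minlength=n_clusters) / len(base_emb)
    delta_share = np.bincount(kmeans.predict(delta_emb), minlength=n_clusters) / len(delta_emb)

    for c in range(n_clusters):
        print(f"cluster {c}: base {base_share[c]:.2f} vs delta {delta_share[c]:.2f}")

    # 0.0 means identical cluster distributions, 1.0 means completely disjoint.
    return 0.5 * np.abs(base_share - delta_share).sum()

# Example usage with embeddings from the earlier sketch:
# tvd = compare_cluster_distributions(base.numpy(), delta.numpy())
```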
Summary
In this blog, we saw how to apply QA to new batches of images using Data Explorer’s Compare in order to preserve overall dataset quality. Compare allows us to confirm the relevance and quality of new images, and in doing so to save annotation costs and training cycles and keep model accuracy high.
To learn more, visit us at akridata.ai or click here to register for a free account.