From Zero to Labeled: How Hixloop Transforms Dataset Creation Timelines

BionomeeX

BionomeeX provides innovative analysis tools to support your biological, medical and environmental R&D projects.

Published: April 9, 2024

In this blog post, we will detail the process of using Hitloop to speed up the generation of a labeled dataset.

Hixloop Introduction

Hixloop offers users a seamless data labeling workflow. Commencing with data importation, users initiate the labeling process by annotating a small segment of their dataset. This initial input serves as the foundation for training the first AI model. This model then generates predictions for the remaining data.

Human verification of these predictions follows, ensuring accuracy and reliability. Subsequently, a new model is trained, incorporating human-verified predictions to enhance precision.
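
The workflow described above can be sketched as a small generic loop. This is only an illustration, not Hixloop's actual implementation: the function name `human_in_the_loop` and the four callables (`annotate`, `train`, `predict`, `review`) are hypothetical placeholders, and the demo at the bottom uses trivial stand-ins in place of a real model and a real annotator.

```python
import random

def human_in_the_loop(items, seed_size, batch_size, n_loops,
                      annotate, train, predict, review):
    """Annotate a seed set, then alternate predict / review / retrain.

    All four callables are hypothetical hooks supplied by the caller.
    """
    unlabeled = list(items)
    random.Random(0).shuffle(unlabeled)

    # A human annotates a small seed set from scratch.
    labeled = {item: annotate(item) for item in unlabeled[:seed_size]}
    unlabeled = unlabeled[seed_size:]

    # The first model is trained on the seed labels only.
    model = train(labeled)

    for _ in range(n_loops):
        # The current model proposes labels for the next batch.
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        for item in batch:
            # A human verifies or corrects each prediction.
            labeled[item] = review(item, predict(model, item))
        # A new model is trained on all human-verified labels.
        model = train(labeled)

    return model, labeled

# Toy demo: the "true" label of an integer is its parity.
model, labeled = human_in_the_loop(
    range(20), seed_size=4, batch_size=5, n_loops=2,
    annotate=lambda i: i % 2,      # manual labeling
    train=lambda lab: dict(lab),   # stand-in "model": a copy of the labels
    predict=lambda m, i: 0,        # stand-in prediction: always 0
    review=lambda i, pred: i % 2,  # reviewer fixes wrong predictions
)
print(len(labeled))  # 14 (4 seed items + 2 loops of 5 reviewed items)
```

The point of the structure is that human effort moves from annotating from scratch (the seed) to reviewing predictions (every later batch).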

Labeled Dataset Creation

The Base Dataset

The dataset contains around 3000 images featuring bees and hives, with the following specifications:

RGB images (24-bit)
Varied image sizes, including 4K, 1080p, and 720p.
Approximately 20 different points of view.

The dataset can be downloaded here:

https://www.kaggle.com/dsv/4738309

Credit:

 @misc{andrewl_2022,
	title={Bee Image Object Detection},
	url={https://www.kaggle.com/dsv/4738309},
	DOI={10.34740/KAGGLE/DSV/4738309},
	publisher={Kaggle},
	author={AndrewL},
	year={2022}
}

Dataset Preview

The Task / What we want to solve

Find and count bees in images.

Labelling Project Creation

For labeling tasks, Hitloop utilizes Label Studio, a widely recognized open-source labeling interface. Let's begin by establishing a project to label bounding boxes on bee images.

First Loop

Labelling Data for the first Loop.

In the initial loop, we labeled 28 images, capturing all the bees within each image. We deliberately selected various backgrounds to enhance the impact of each image during the training process.

The labeling process took approximately 20 minutes.

Training the first Loop

We randomly divided the initial dataset of 28 images into groups for training and validation purposes.

Training Set: 233 bees
Validation Set: 43 bees
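
A random split of this kind can be sketched in a few lines. The post does not state the split ratio, so the 20% validation fraction, the fixed seed, and the function name `split_train_val` below are assumptions made for illustration.

```python
import random

def split_train_val(image_ids, val_fraction=0.2, seed=0):
    """Randomly split image ids into training and validation lists.

    val_fraction and seed are illustrative assumptions, not values
    taken from the post.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]

# 28 labeled images, identified here simply by their index.
train_ids, val_ids = split_train_val(range(28))
print(len(train_ids), len(val_ids))  # 22 6
```

Because the split is made per image while the counts above are reported per bee, the share of bees in each set depends on how crowded the selected images are.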

The trained model is a Faster R-CNN, but we won't delve into further details as that is not the focus of this blog post.

Result of the First Loop

After the training, predictions were generated for all images imported into the project. Initial results are promising, capturing the majority of bees in the simplest images. However, in cases with significant overlap, the model's performance is not optimal.

Evaluate the first Model

To evaluate the model, we calculate the average recall and precision for various Intersection over Union (IoU) thresholds and box sizes. The evaluation is conducted on a distinct set of images that have never been presented to the model.
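
For readers unfamiliar with the metric, IoU is the area shared by a predicted box and a ground-truth box divided by the area of their union. A minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples (the box format and the function name `iou` are choices made here, not taken from the post):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes shifted by 5 pixels: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```

A prediction is counted as correct only when its IoU with a ground-truth box reaches the chosen threshold, so stricter thresholds demand tighter boxes.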

Here are the results:

On the test dataset:

The average precision is 0.321, meaning that roughly 32% of the boxes proposed by the model are correct.
The average recall is 0.222, indicating that roughly 22% of the ground-truth boxes are found by the model.
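
To make the two metrics concrete, here is a simplified sketch of precision and recall at a single IoU threshold, using greedy matching of predictions to ground-truth boxes in order of confidence. It is not the evaluation code used for the figures above, which are averaged over several IoU thresholds and box sizes. The function names, the `(score, box)` input format, and the toy boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """predictions: list of (score, box); ground_truth: list of boxes."""
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    # Most confident predictions get the first pick of ground-truth boxes.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Three bees in the image; the model proposes one good box and one stray box.
bees = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
proposals = [(0.9, (1, 1, 10, 10)), (0.8, (100, 100, 110, 110))]
precision, recall = precision_recall(proposals, bees)
print(precision, recall)  # 0.5 0.333...
```

In this toy case half of the proposed boxes are correct (precision 0.5) and one bee out of three is found (recall 1/3), which mirrors how the two figures above should be read.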

The results indicate that the model is currently not performing well, even though it can provide predictions for simpler images.

Expanding the Dataset Size with First Loop Predictions.

Utilizing the predictions from the first loop allows us to quickly acquire additional labels through review.

This method yields 72 new images in less than 5 minutes, bringing the total number of images to 100.

Although we initially imported 2000 images into the dataset, we chose to conclude loop 2 with 100 images.

Reviewing predictions instead of labeling from scratch made the task notably easier and faster.

Depending on the image and on the accuracy of the model, we estimate a speed-up factor between 5 and 10.
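
As a rough sanity check, the per-image timings reported in this post (28 images labeled manually in about 20 minutes, 72 images reviewed in under 5 minutes) can be compared directly. The variable names are illustrative, and the calculation ignores differences in the number of bees per image.

```python
# Figures reported in the post.
manual_images, manual_minutes = 28, 20   # first loop, labeled from scratch
review_images, review_minutes = 72, 5    # reviewed predictions (upper bound on time)

manual_rate = manual_minutes / manual_images   # about 0.71 minutes per image
review_rate = review_minutes / review_images   # about 0.07 minutes per image

speedup = manual_rate / review_rate
print(round(speedup, 1))  # 10.3
```

On these numbers the per-image speed-up is around 10, at the upper end of the estimated range.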

After doing the same splitting process we get:

- 614 bees in the training set

- 170 bees in the validation set.

Second Loop

Training the Second Loop

We opted to extend the training of the same model.

As illustrated in the graph, the purple box represents the training of the first loop, which plateaus after 200 epochs and begins to overfit.

The addition of these 72 images results in a loss reduction to 0.6 in the second loop (red box).

Regarding the results on the test dataset, we achieved an average precision of 0.43 and an average recall of 0.298. While not a substantial improvement, these metrics indicate that the model is still learning and improving on our test dataset (ground truth).

Training a new model using predictions from prior loops.

Following loop 2, we doubled the labeled images in our dataset:

Training Set: 1218 bees
Validation Set: 276 bees

We can now train a new model with the same structure (Faster R-CNN) but with a substantial number of labels.

Final Results

After three loops, the scores reached 0.91 (was 0.43) for average precision and 0.56 (was 0.298) for average recall. The model is not perfect: it still needs to learn to detect overlapping bees, which can be addressed with Hitloop by correcting errors in the predictions.

Next Steps

There are several next steps ahead. We can further utilize Hitloop to label more images. Currently, we've only incorporated ~200 images (1218 bees), which is a fraction of the 3000 images in the dataset.

A more intricate extension of the task would be to determine whether a bee is entering or exiting the hive. This could address the question: "What is the flux of bees in and out of my hive?"
