
Chapter 2: Data Preparation Phase in Machine Learning Projects

Mahade Hasan Mridul

AI Project Manager @ Altersense | AI/ML | Product Manager | 6+ Years of Experience in B2B Product & Project Management | Building Industrial AI Solutions

Published: November 15, 2024

Data is the backbone of any machine learning (ML) project. It’s often said that "garbage in, garbage out," emphasizing the critical role that high-quality data plays in the success of an ML model. Following up on our previous discussions of the ML project lifecycle and the scoping phase, this article delves into the Data Preparation Phase, which is arguably the most labor-intensive and significant part of any ML project.

In this article, we’ll explore the steps involved in data preparation, using as a practical reference a smartphone manufacturing company that leverages computer vision to enhance productivity, quality, and efficiency.

Data is the Gem of any Machine Learning Project

The quality of the data directly impacts the performance and reliability of the ML model. Properly prepared data ensures that the model learns the correct patterns, minimizes biases, and generalizes well to unseen data. This phase is where raw data is transformed into a form that the ML algorithm can use effectively.

Key Steps in the Data Preparation Phase

1. Data Collection

The first step is gathering the required data. Depending on the project’s scope, data collection methods may vary:

Client-Supplied Data: In our example, the manufacturing company might provide historical production data, images of products, or defect reports.
Source-Based Collection: If the project involves real-time analysis, the team may install cameras on the production line to capture live video streams for computer vision tasks.
Synthetic Data: In cases where real-world data is scarce, synthetic data can be generated using tools like GANs or 3D modeling to create realistic images of smartphones.
Hybrid Approach: Combining real and synthetic data to balance cost, effort, and diversity in training data.

2. Data Filtering and Cleaning

Collected data is rarely perfect. This step involves:

Removing Noisy Data: For instance, filtering out blurry or irrelevant images from production line footage.
Handling Missing Data: Filling gaps in datasets using techniques like interpolation or excluding incomplete records.
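Balancing Classes: If the dataset is imbalanced (e.g., far more defect-free products than defective ones), resampling techniques can be used so that the model does not simply learn to predict the majority class. Examples include oversampling the minority class, or SMOTE when working with tabular feature data.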
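As a rough illustration of class balancing, here is a minimal sketch of plain random oversampling. Note that this is not SMOTE: it simply duplicates minority-class samples instead of synthesizing new ones. The function name `random_oversample` and the file names are hypothetical and used only for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until every class matches the largest one.

    Illustrative helper (not from any library). This is simple random
    oversampling, not SMOTE: no new samples are synthesized.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical file names: 8 defect-free images and 2 defective ones.
images = [f"ok_{i}.png" for i in range(8)] + ["defect_0.png", "defect_1.png"]
classes = ["ok"] * 8 + ["defect"] * 2

balanced_images, balanced_classes = random_oversample(images, classes)
print(Counter(balanced_classes))
```

In practice, resampling like this is usually applied to the training data only, so that duplicated samples do not end up in the sets used for evaluation.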

Figure: Data Preparation Phase in a Machine Learning Project

3. Data Annotation

Annotation is critical for supervised learning tasks, as it provides the labeled data that the model learns from.

Text Annotation: Not applicable in this example but often used for NLP tasks.
Image Annotation: Labeling images to indicate specific features, such as marking defects like scratches or misaligned components on smartphones.
Video Annotation: Identifying and tagging specific events or objects in video streams, such as bottlenecks in the assembly line.
Data Augmentation: To enhance the diversity of the dataset, especially in computer vision projects, augmentation techniques like flipping, rotating, cropping, and color adjustments are applied. For example, varying the angles of smartphone images can improve the model’s ability to detect defects in real-world scenarios.

Tools such as Supervisely, Roboflow, and Label Studio can be used for data annotation.
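To make the augmentation techniques mentioned above more concrete, here is a minimal sketch that treats a grayscale image as a nested list of pixel values (0 to 255). The function names are hypothetical. A real project would more likely rely on an image library, so this only shows what each transformation does.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def adjust_brightness(image, delta):
    """Add delta to every pixel, clamping the result to the 0..255 range."""
    return [[min(255, max(0, pixel + delta)) for pixel in row] for row in image]


def augment(image):
    """Return the original image plus a few augmented variants."""
    return [
        image,
        flip_horizontal(image),
        rotate_90_clockwise(image),
        adjust_brightness(image, 40),   # brighter lighting
        adjust_brightness(image, -40),  # dimmer lighting
    ]


# A tiny 2x3 "image" used only for illustration.
sample = [[10, 20, 30],
          [40, 50, 60]]

for variant in augment(sample):
    print(variant)
```

One thing to keep in mind: when images carry annotations such as bounding boxes around defects, geometric transformations (flips, rotations, crops) have to be applied to the annotations as well, otherwise the labels no longer match the pixels.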

4. Data Supervision and Quality Assurance

Before feeding the data into the ML pipeline, rigorous quality checks are essential. This involves:

Data Validation: Ensuring the annotations are accurate and consistent.
Sample Review: Manually reviewing subsets of data to spot errors or biases.
Feedback Loop: Incorporating stakeholder feedback to refine data quality and relevance.
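One way to put a number on annotation consistency is to have two annotators label the same sample of images and measure how often they agree beyond chance. The sketch below computes Cohen's kappa for that purpose. The choice of metric and the labels are assumptions made for illustration, not something prescribed by the workflow above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on the same six images.
annotator_1 = ["scratch", "ok", "ok", "dent", "ok", "scratch"]
annotator_2 = ["scratch", "ok", "dent", "dent", "ok", "ok"]

print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

A low score on a review sample is a signal to revisit the labeling guidelines before annotating the rest of the dataset.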

5. Dataset Splitting

Finally, the prepared data is split into training, validation, and test sets:

Training Set: Used to train the model.
Validation Set: Used to tune hyperparameters and compare candidate models during development.
Test Set: Used to evaluate the final model’s performance on unseen data.
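The split described above can be sketched as follows. This is a minimal, assumption-laden example: the helper `stratified_split` is hypothetical, the 70/20/10 ratios are just one common choice, and the split is stratified so that each subset keeps roughly the same share of defective and defect-free samples.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (item, label) pairs into train/validation/test sets per class.

    Illustrative helper. Each class is shuffled and split separately so the
    class proportions stay similar across the three sets.
    """
    rng = random.Random(seed)

    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += [(item, label) for item in group[:n_train]]
        val += [(item, label) for item in group[n_train:n_train + n_val]]
        # Whatever remains goes to the test set.
        test += [(item, label) for item in group[n_train + n_val:]]
    return train, val, test


# Hypothetical dataset: 90 defect-free images and 10 defective ones.
image_ids = [f"img_{i:03d}" for i in range(100)]
image_labels = ["ok"] * 90 + ["defect"] * 10

train_set, val_set, test_set = stratified_split(image_ids, image_labels)
print(len(train_set), len(val_set), len(test_set))
```

For production-line footage, it may also be worth keeping all frames of the same device or video clip in a single subset, since near-identical frames spread across training and test sets can make the test results look better than they really are.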

Real-World Example: Smartphone Manufacturing

Imagine a smartphone manufacturer implementing computer vision to inspect production lines.

Data Collection: High-resolution cameras installed on the assembly line capture video streams of the smartphones being assembled.
Filtering and Cleaning: Frames where the smartphone is partially visible or obscured by tools are removed.
Annotation: Experts tag areas of interest, such as scratches, dents, or misaligned components, on thousands of images.
Augmentation: Images are flipped, rotated, and adjusted for brightness to simulate different lighting conditions on the factory floor.
Supervision: Annotated samples are reviewed by quality engineers to ensure correctness.
Dataset Splitting: 70% of images are used for training, 20% for validation, and 10% for testing the model’s performance.
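For the Filtering and Cleaning step in this example, blurry frames could be flagged automatically before a human review. A common heuristic is the variance of the Laplacian: sharp images have strong edges and therefore a high variance, while blurry ones do not. The sketch below is a plain-Python illustration on a nested list of grayscale values. The threshold of 100.0 is an arbitrary assumption that would need tuning on real footage.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbor Laplacian over the interior pixels.

    gray is a 2D list of pixel intensities with at least 3 rows and 3 columns.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            laplacian = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(laplacian)
    return pvariance(responses)


def is_sharp(gray, threshold=100.0):
    """Return True when the frame looks sharp enough to keep (assumed threshold)."""
    return laplacian_variance(gray) >= threshold


# Two synthetic 5x5 frames: a featureless one and a high-contrast checkerboard.
flat_frame = [[128] * 5 for _ in range(5)]
checker_frame = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]

print(is_sharp(flat_frame), is_sharp(checker_frame))
```

A score like this only catches blur. Frames where the smartphone is partially visible or obscured by tools would need a different check or a manual pass.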

What's Next?

The Data Preparation Phase is the cornerstone of every successful ML project. While it requires significant effort, the payoff is immense, as it directly impacts the quality of the final model. By investing time and resources in proper data preparation, teams can build robust, reliable ML systems that deliver meaningful results.

If you’ve found this guide helpful, stay tuned for the next article, where we’ll dive into Model Development and Training for machine learning projects. Let me know in the comments if you’d like any adjustments or additional details.

