
10 steps to build and optimize an ML model

Mage

Data engineers use Mage to build, run, and manage data and AI/ML pipelines, and LLM orchestration (e.g. RAG).

Published: December 10, 2021

TLDR

Let’s take a look at the different steps to build a prediction model and go over the what, when, why, and how of each step.

Steps

Below are the steps required to solve a machine learning use case and to build a model.

Define the Objective
Data Gathering
Data Cleaning
Exploratory Data Analysis (EDA)
Feature Engineering
Feature Selection
Model Building
Model Evaluation
Model Optimization
Conclusion

Step 1: Define the objective

Source: Pixabay

What’s the objective?

The objective is the use case you want to predict or learn more about.

When is the objective defined?

Defining the objective is the first step, and it is driven by business requirements.

Why is it necessary to set an objective?

Defining the objective sheds light on what kind of data should be gathered. It also helps us judge which observations are important during exploratory data analysis.

How to define an objective?

The objective should be clear and precise. To define it, we can follow a few steps:

Understand the business (e.g., a grocery store)
Identify the problem (e.g., low profits)
List all the possible solutions to the problem (e.g., increasing sales, reducing manufacturing costs, or managing inventory)
Decide on one solution (e.g., managing inventory; we can reach this conclusion by talking back and forth with the relevant business people)

By following the above steps, we’ve clearly defined that the objective is to build a model to manage inventory in order to increase store profits.

Step 2: Data Gathering

Source: Pexels

What’s Data Gathering?

Data gathering means collecting the data that the defined objective requires.

When do we gather data?

Once the objective is defined, we will collect data.

Why is Data Gathering necessary?

Without past data, we cannot predict the future, hence data gathering is necessary. In general, a dataset is created by gathering data from various sources based on the objective. One reason for gathering data from multiple sources is to get more accurate results: more data generally helps, provided it is relevant and of good quality.

How is Data Gathering done?

Data can be collected in one of the following ways:

APIs (such as those offered by Google, Amazon, Twitter, or the New York Times)
Databases (such as those hosted on AWS or GCP)
Open-source repositories (such as Kaggle or the UCI Machine Learning Repository)
Web scraping (use with caution: whether it is permitted depends on the website’s terms of service and on local law)

“The order of the Defining the objective and Data gathering steps can be changed. Sometimes the data is already at hand and the objective is defined later; at other times the objective is decided first and the data is gathered afterwards.”
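
As a minimal sketch of combining data from more than one source, the example below merges two small tables with Pandas. The table names, columns, and values are hypothetical stand-ins for a store database export and an API response; they are not taken from a real dataset.

```python
import pandas as pd

# Hypothetical sales records, e.g. exported from a store database.
sales = pd.DataFrame({
    "product_id": [1, 2, 3],
    "units_sold": [120, 45, 300],
})

# Hypothetical inventory records, e.g. returned by an internal API.
inventory = pd.DataFrame({
    "product_id": [1, 2, 4],
    "units_in_stock": [30, 80, 15],
})

# Combine both sources on the shared key. An outer join keeps products that
# appear in only one source, and indicator=True adds a "_merge" column
# showing where each row came from, so gaps stay visible.
dataset = sales.merge(inventory, on="product_id", how="outer", indicator=True)
print(dataset)
```

Rows marked left_only or right_only in the _merge column point to records that one source is missing, which is worth checking before moving on to cleaning.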

Step 3: Data Cleaning

Source: Pixabay

What’s Data Cleaning?

Data cleaning is the process of removing, modifying or formatting data that is incorrect, irrelevant or duplicated.

When to clean the data?

Once we have the dataset ready, we will clean the data.

Why is data cleaning necessary?

Data Cleaning helps in preparing the data for Exploratory Data Analysis.

How to do Data Cleaning?

We use libraries like Pandas and NumPy to do data cleaning, and apply the following key steps to determine whether the dataset needs cleaning.

1. Check how many rows and columns are in the dataset.

2. Look for duplicate features by going through the meta info provided.

3. Identify numerical and categorical features in the gathered data and check whether formatting is required.

“Formatting can be something like changing data types of the features, correcting the typos or removing the special characters from the data if there are any.”

“If you are working with real-time data, it’s recommended to save the cleaned dataset in a cloud database before the next steps.”
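
The checks above can be sketched with Pandas roughly as follows. The raw table is hypothetical, and the sketch also checks for duplicate rows, which is a related check to the duplicate-feature check described above rather than the same thing.

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a numeric column stored
# as text, and stray whitespace and special characters.
raw = pd.DataFrame({
    "item": ["Milk ", "Bread#", "Milk ", "eggs"],
    "price": ["2.50", "1.20", "2.50", "3.10"],
})

# 1. How many rows and columns are in the dataset?
print(raw.shape)

# 2. Are there duplicate column names or duplicate rows?
print(raw.columns.duplicated().any())
print(raw.duplicated().sum())

# 3. Which features are numerical and which are categorical?
print(raw.dtypes)

# Formatting: drop duplicate rows, strip special characters and whitespace,
# fix inconsistent casing, and convert the price column to a numeric type.
cleaned = raw.drop_duplicates().copy()
cleaned["item"] = (
    cleaned["item"]
    .str.replace(r"[^A-Za-z ]", "", regex=True)
    .str.strip()
    .str.title()
)
cleaned["price"] = cleaned["price"].astype(float)
print(cleaned)
```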

Step 4: Exploratory Data Analysis (EDA)

Source: Pixabay

What’s EDA?

In simple terms, EDA means understanding and analyzing the data using statistical measures (like mean and median) and visualization techniques (like univariate and bivariate analysis).

When to perform EDA?

EDA is performed on the cleaned data, right after the data cleaning stage.

Why is EDA necessary?

Exploratory Data Analysis is considered a fundamental and crucial step in solving any machine learning use case, as it helps us identify trends or patterns in the data.

How to perform EDA?

Python libraries such as Pandas, NumPy, Statsmodels, Matplotlib, Seaborn, and Plotly are used to perform Exploratory Data Analysis.

While doing EDA, some of the basic common questions we ask are:

1. What are the independent and dependent features/labels in the collected data?

2. Is the selected label/dependent feature Categorical or Numerical?

3. Are there any missing values in the features/variables?

4. What are the summary statistics (like mean etc.) for Numerical features?

5. What are the summary statistics (like mode etc.) for Categorical features?

6. Are the features/variables normally distributed or skewed?

7. Are there any outliers in the features/variables?

8. Which independent features are correlated with the dependent feature?

9. Is there any correlation between the independent features?

“So, we will try to understand the data by finding answers to the above questions both visually (by plotting graphs) and statistically (with hypothesis tests such as normality tests).”

“With larger datasets, it is harder to draw insights from the data directly. Hence, at this stage we sometimes use unsupervised learning techniques like clustering to identify hidden groups/clusters in the data, which helps us understand it better.”
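
Several of the questions above can be answered statistically with a few Pandas calls. The sketch below uses a synthetic, randomly generated table (the column names are assumptions made for illustration) and uses the common 1.5 x IQR rule as one possible way to flag outliers.

```python
import numpy as np
import pandas as pd

# Synthetic data: one roughly normal feature, one skewed feature,
# and one categorical feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.normal(10, 2, 200),
    "units_sold": rng.exponential(50, 200),
    "category": rng.choice(["dairy", "bakery"], 200),
})
df.loc[::25, "price"] = np.nan  # inject a few missing values

numeric_cols = ["price", "units_sold"]

print(df.isna().sum())            # Q3: missing values per feature
print(df[numeric_cols].describe())  # Q4: summary statistics (numerical)
print(df["category"].mode()[0])   # Q5: most frequent category
print(df[numeric_cols].skew())    # Q6: skewness (near 0 means symmetric)

# Q7: flag outliers with the 1.5 * IQR rule.
q1, q3 = df["units_sold"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["units_sold"] < q1 - 1.5 * iqr) |
              (df["units_sold"] > q3 + 1.5 * iqr)]
print(len(outliers))

# Q8/Q9: pairwise correlation between numerical features.
correlations = df[numeric_cols].corr()
print(correlations)
```

The visual side of EDA (histograms, box plots, scatter plots) would typically be done with Matplotlib, Seaborn, or Plotly on the same table.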

Step 5: Feature Engineering

Source: Pixabay

What’s Feature Engineering?

A feature refers to a column in a dataset, while engineering can mean manipulating, transforming, or constructing; together they’re known as Feature Engineering. Simply put, Feature Engineering is transforming existing features or constructing new ones.

When to do Feature Engineering?

Feature Engineering is done immediately after Exploratory Data Analysis (EDA).

Why is Feature Engineering necessary?

Feature Engineering transforms raw data/features into features that are suitable for machine learning algorithms. This step is necessary because it helps improve the machine learning model’s performance and accuracy.

Algorithm: Algorithms are mathematical procedures applied to given data.

Model: The outcome of a machine learning algorithm is a generalized equation for the given data, and this generalized equation is called a model.

How to do Feature Engineering?

We use libraries like Pandas, NumPy, and Scikit-learn to do Feature Engineering.

Feature Engineering techniques include:

1. Handling Missing Values

2. Handling Skewness

3. Treating Outliers

4. Encoding

5. Handling Imbalanced data

6. Scaling the features

7. Creating new features from the existing features
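
A minimal sketch of a few of these techniques follows, using a small hypothetical table. The specific choices (median imputation, a log transform for skewness, one-hot encoding, standard scaling) are just one option for each technique, and outlier treatment and imbalanced-data handling are left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical grocery data with a missing price and a skewed sales column.
df = pd.DataFrame({
    "price": [2.5, np.nan, 3.1, 40.0],
    "units_sold": [10, 200, 35, 5000],
    "category": ["dairy", "bakery", "dairy", "produce"],
})

# 1. Handling missing values: fill the gap with the column median.
imputer = SimpleImputer(strategy="median")
df["price"] = imputer.fit_transform(df[["price"]]).ravel()

# 2. Handling skewness: log1p compresses the long right tail.
df["log_units_sold"] = np.log1p(df["units_sold"])

# 4. Encoding: one-hot encode the categorical feature.
df = pd.get_dummies(df, columns=["category"])

# 7. Creating a new feature from existing ones.
df["revenue"] = df["price"] * df["units_sold"]

# 6. Scaling: standardize numeric features to zero mean and unit variance.
numeric_cols = ["price", "log_units_sold", "revenue"]
scaled = StandardScaler().fit_transform(df[numeric_cols])
print(scaled.round(2))
```

In a real project, the imputer and scaler should be fitted on the training split only and then applied to the validation and test splits, so that no information leaks from the held-out data.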

Step 6: Feature Selection

Source: Pixabay

What’s Feature Selection?

Feature Selection is the process of selecting the best set of independent features or columns that are required to train a machine learning algorithm.

When to do Feature Selection?

Feature Selection is performed right after the feature engineering step.

Why is Feature Selection necessary?

Feature Selection is necessary for the following reasons:

Improves Machine Learning Model performance.
Reduces training time of machine learning algorithms.
Improves the generalization of the model.

How to do Feature Selection?

We use Python libraries like Statsmodels or Scikit-learn to do feature selection.

Each of the following methods can be used for selecting the best independent features:

1. Filter methods

2. Wrapper methods

3. Embedded or intrinsic methods

“If the number of selected input features is very large (for example, greater than the number of rows/records in the dataset), then we can use unsupervised learning techniques like dimensionality reduction at this stage to reduce the total number of inputs to the model.”
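
The three families of methods can be illustrated with Scikit-learn as below. This is a sketch on synthetic data; the particular selectors (an ANOVA F-test filter, recursive feature elimination as the wrapper, and random forest importances as the embedded method) are example choices, not the only ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# 1. Filter method: score each feature independently (ANOVA F-test)
#    and keep the top 3.
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# 2. Wrapper method: repeatedly fit a model and drop the weakest
#    feature until 3 remain (recursive feature elimination).
wrapper_selector = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=3
).fit(X, y)

# 3. Embedded method: use importances learned during model training;
#    by default, features above the mean importance are kept.
embedded_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X, y)

print(filter_selector.get_support(indices=True))
print(wrapper_selector.get_support(indices=True))
print(embedded_selector.get_support(indices=True))
```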

Step 7: Model Building

Source: Pixabay

What’s Model Building?

Building a machine learning model is about coming up with a generalized equation for data using machine learning algorithms.

“Machine learning algorithms are not only used to build models; sometimes they are also used for filling missing values, detecting outliers, etc.”

When should you build a model?

Model building starts immediately after feature selection, using the selected independent features.

Why is Model Building necessary?

Building a machine learning model helps businesses predict future outcomes.

How to build a model?

Scikit-learn is used to build machine learning models.

Basic Steps to create a machine learning model:

1. Create two variables to store the dependent and independent features separately.

2. Split both variables into train, validation, and test sets, or use cross-validation techniques to split the data.

Train set - To train the algorithms

Validation set - To optimize the model

Test set - To evaluate the model

Cross-validation techniques are used to split the data when you are working with small datasets.

3. Build a model on the training set.

4. What models can you build?

Machine learning algorithms are broadly categorized into two types: supervised and unsupervised machine learning algorithms.

Predictive models are built using supervised machine learning algorithms.

The models built using supervised machine learning algorithms are known as supervised machine learning models.
There are two types of supervised machine learning models that can be built:
- Regression models: Some of the regression models are Linear Regression, Decision Tree Regressor, Random Forest Regressor, Support Vector Regression.
- Classification models: Some of the classification models are Logistic Regression, K-Nearest Neighbors, Decision Tree Classifier, Support Vector Machine (classifier), Random Forest Classifier, XGBoost.
Unsupervised machine learning algorithms are not used to build predictive models; rather, they are used either to identify hidden groups/clusters in the data or to reduce the dimensions of the data.
Some of the unsupervised learning algorithms are clustering algorithms (like K-means clustering) and dimensionality reduction techniques (like PCA).
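
The basic steps above can be sketched with Scikit-learn as follows. The data is synthetic, and the 60/20/20 split ratio and the choice of Logistic Regression are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Step 1: independent features (X) and dependent feature (y), kept separately.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Step 2: split into train (60%), validation (20%), and test (20%) sets.
# train_test_split is called twice because it only produces two parts.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=0, stratify=y_temp
)

# Step 3: build (fit) a classification model on the training set only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_val, y_val))  # accuracy on the validation set

# Alternative for small datasets: 5-fold cross-validation, where each
# fold takes a turn as the held-out set.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(cv_scores.mean())
```

The test set is deliberately left untouched here; it is meant for the final evaluation after the model has been chosen and tuned.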

Step 8: Model Evaluation

Source: Pixabay

What’s Model Evaluation?

In simple terms, model evaluation means checking how accurate the model’s predictions are, that is, determining how well the model performs on the train and test datasets.

When to evaluate the model?

As soon as model building is done, the next step is to evaluate it.

Why is model evaluation necessary?

In general, we build many machine learning models using different machine learning algorithms; evaluating them helps in choosing the model that gives the best results.

How to evaluate a model?

We use the Scikit-learn library to evaluate models using evaluation metrics.

Metrics are divided into two categories as shown:

Regression Model Metrics: Mean Squared Error, Root Mean Squared Error, Mean Absolute Error
Classification Model Metrics: Accuracy (Confusion Matrix), Recall, Precision, F1-Score, Specificity, ROC (Receiver Operating Characteristic), AUC (Area Under the Curve).
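
As a small worked example, the metrics above can be computed with sklearn.metrics. The true values, predictions, and scores below are made-up numbers chosen so the results are easy to verify by hand. Specificity has no dedicated function in the sketch, so it is derived from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, mean_absolute_error,
    mean_squared_error, precision_score, recall_score, roc_auc_score,
)

# --- Regression metrics (made-up values) ---
y_true_reg = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true_reg, y_pred_reg)   # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                                # square root of the MSE
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # (1 + 0 + 2 + 1) / 4 = 1.0

# --- Classification metrics (made-up labels and scores) ---
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6]  # predicted probability of class 1

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
specificity = tn / (tn + fp)                 # true negative rate
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve

print(mse, rmse, mae)
print(accuracy, precision, recall, f1, specificity, auc)
```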

Step 9: Model Optimization

Source: Pixabay

What’s Model Optimization?

Most machine learning models have hyperparameters that can be tuned or adjusted. For example, Ridge Regression has a regularization term as a hyperparameter; similarly, a Decision Tree model has hyperparameters like the desired depth or the number of leaves in the tree.

The process of tuning these hyperparameters to determine the combination that gives the best model performance is known as hyperparameter optimization or hyperparameter tuning.

When to optimize the model?

After calculating the evaluation metrics, we choose the models with the best results and then tune their hyperparameters to enhance the results.

Why is Model Optimization necessary?

Optimization improves the performance of machine learning models, which in turn can make their predictions more accurate.

How to do Model Optimization?

We make use of libraries like Scikit-learn, or frameworks like Optuna, to optimize the model by tuning hyperparameters.

Hyperparameter tuning approaches include:

Grid Search
Random Search
Bayesian Optimization
Genetic Algorithms
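
The first two approaches are available directly in Scikit-learn. The sketch below tunes the Ridge regularization strength with a grid search and the Decision Tree depth and leaf count with a random search; the data is synthetic and the candidate values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Grid Search: try every candidate value of the regularization strength.
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)

# Random Search: sample 8 combinations out of the 24 possible ones.
random_search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4, 5, 6, 8],
        "max_leaf_nodes": [5, 10, 20, 40],
    },
    n_iter=8,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)
print(random_search.best_params_)
```

Grid search is exhaustive and becomes expensive as the number of hyperparameters grows, while random search trades completeness for a fixed budget of trials.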

Step 10: Conclusion

Source: Pixabay

Finally, we choose the hyperparameter-optimized model with the best metrics and use that model in production.

After all these steps, if you are still not happy with the machine learning model’s performance, you can repeat the entire process from Step 2 through Step 9. Remember, machine learning is an iterative, trial-and-error process, and a model’s performance also depends on the sample of data we gathered.

That’s it for this blog. I tried my best to keep it as simple as possible, and I hope you all got an idea of how to build and optimize a machine learning model.

As part of this series, we will implement all the above-mentioned steps on Telco Customer data and come up with the best model to predict whether a customer churns.

Thanks for reading!!

This guest blog was written by Jaanvi.

