Standardization and Normalization Techniques in Machine Learning - Part 07
Vinod Kumar G R
Co-founder of ApexIQ | Driving AI Innovation with LLMs & GenAI | Passionate about Transformative AI Solutions
Data is rarely perfect, and it often comes in various shapes and forms, with values that span different scales and ranges. Ensuring that your data is in the right form can make all the difference when training machine learning models.
This is where standardization and normalization come into play, offering strategies to prepare your data for optimal model performance.
In this article, we will explore these techniques, their differences, and the scenarios where each is best applied. Whether you’re dealing with feature scaling in the broader context or looking to understand how to make your data machine-learning-ready, the insights you gain here will be invaluable.
In the last article, we discussed feature scaling and its different types in machine learning. That deep dive provided a solid foundation for understanding the fundamental concepts clearly.
I’ll give you a simple example of when we use scaling methods.
Suppose you are dealing with an image dataset: the data consists of image pixels with values ranging from 0 to 255. Although these are continuous numeric values, their wide range makes it harder for the model to capture all the patterns, so we apply a scaling method to bring the data into a common, smaller range where the model can learn more easily. A small sketch of this idea follows below.
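As a minimal sketch (assuming the image is already loaded as a NumPy array of 8-bit pixel values; the array here is made up for illustration), dividing by 255 rescales every pixel into the [0, 1] range:
import numpy as np
# a hypothetical 2x3 grayscale image with pixel values in [0, 255]
pixels = np.array([[0, 64, 128],
                   [192, 230, 255]], dtype=np.uint8)
# rescale into [0, 1] by dividing by the maximum possible pixel value
scaled_pixels = pixels.astype(np.float32) / 255.0
print(scaled_pixels)
# [[0.         0.2509804  0.5019608 ]
#  [0.7529412  0.9019608  1.        ]]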
Now let’s get into the topic,
1. Standardization
Note: Standardization does not scale data to the range (0, 1); instead, it rescales the data to have a mean of 0 and a standard deviation of 1. [This distinction will become clearer when we discuss normalization.]
The mathematical formula for standardization:
x' = (x - mean(x)) / std(x)
where x is the original value, mean(x) is the mean of the feature, and std(x) is its standard deviation.
Explanation
You might have doubts about where you need to use this standardization technique.
I got the image from Google. The top graph in the image shows the original data, which is skewed (right-skewed data has a long tail extending to the right, while left-skewed data has a long tail extending to the left).
You can see that the values range roughly between 100 and 200. With such large, widely spread values, it is challenging for the model to capture the important patterns in the data.
When you apply the standardization, it scales the data to have a mean of 0 and a standard deviation of 1, which brings the data to the center of the graph.
You can see in the bottom graph that the data now falls at the center, around 0. This standardization step improves the model's training process.
Practical Implementation
"""
This is the basic code for standard scaling implementation
"""
# import the StandardScaler library from sklearn
from sklearn.preprocessing import StandardScaler
# load Sample data
data = [[1.0], [2.0], [3.0], [4.0], [5.0]]
# Create a StandardScaler instance
scaler = StandardScaler()
# Fit the scaler to the data and transform it in one step
scaled_data = scaler.fit_transform(data)
# Print the scaled data
print(scaled_data)
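To confirm what the scaler does, here is a minimal sketch that reproduces the same result manually with NumPy (StandardScaler uses the population standard deviation, i.e. ddof=0):
import numpy as np
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# apply the standardization formula directly: x' = (x - mean(x)) / std(x)
manual = (data - data.mean()) / data.std(ddof=0)
print(manual)
# [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
print(manual.mean(), manual.std())  # mean ~0, standard deviation ~1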
I have written the full code in the Colab notebook linked below.
Google Colaboratory
You can see the practical implementation of standardization in the Colab link given above.
Please go through the Colab notebook once, and if you have any questions, ping me at the email provided at the end of this article; I’ll try to clarify.
2. Normalization
Let me first define what normalization is.
Normalization is the process of transforming the features (variables) in a dataset to a common scale, typically within the range of (0, 1) or (-1, 1).
The objective of normalization is to ensure that all features have similar scales, which helps prevent certain features from dominating the modeling process due to their larger numerical values.
We have seen why feature scaling is important for machine learning in a previous article. If you haven’t read the previous article, please go through the Feature Scaling in Machine Learning article.
I got this image from Google. You can see the plots before scaling (the actual data) and after applying normalization and standardization. When you apply normalization, the data falls into the range (0, 1).
Above, in the standardization content, I mentioned a note: “StandardScaler does not scale data into a range of (0, 1); instead, it scales the data to have a mean of 0 and a standard deviation of 1.” This is the main difference between standardization and normalization.
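Here is a minimal sketch that makes this difference concrete (assuming a single toy feature with values 1 through 5): StandardScaler produces an unbounded, zero-centered result, while MinMaxScaler bounds the result to [0, 1].
from sklearn.preprocessing import StandardScaler, MinMaxScaler
data = [[1.0], [2.0], [3.0], [4.0], [5.0]]
# standardization: mean 0, standard deviation 1, values are NOT bounded
print(StandardScaler().fit_transform(data).ravel())
# [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
# normalization (min-max): values are bounded to [0, 1]
print(MinMaxScaler().fit_transform(data).ravel())
# [0.   0.25 0.5  0.75 1.  ]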
Different Methods in Normalization
Yes, there are mainly 4 different methods of normalization, and they are:
1. Min-Max Scaling
2. Mean Normalization Scaling
3. Max-Absolute Scaling
4. Robust Scaling (Using IQR Method)
The choice of method depends on the characteristics of your dataset and the requirements of your modeling task.
1. Min-Max Scaling
Min-Max scaling, also known as min-max normalization, transforms data into a specific range, often [0, 1] or [-1, 1]. It rescales the data so that the minimum value maps to 0 and the maximum value maps to 1 (or the minimum to -1 and the maximum to 1 when using the [-1, 1] range).
Mathematical Formula
For [0, 1] range:
x_normalized = (x - min(x)) / (max(x) - min(x))
For [-1, 1] range:
x_normalized = 2 * ((x - min(x)) / (max(x) - min(x))) - 1
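As a minimal sketch, assuming the same toy feature as above, both ranges can be produced with sklearn's MinMaxScaler via its feature_range parameter:
from sklearn.preprocessing import MinMaxScaler
data = [[1.0], [2.0], [3.0], [4.0], [5.0]]
# default range [0, 1]
print(MinMaxScaler().fit_transform(data).ravel())
# [0.   0.25 0.5  0.75 1.  ]
# range [-1, 1]
print(MinMaxScaler(feature_range=(-1, 1)).fit_transform(data).ravel())
# [-1.  -0.5  0.   0.5  1. ]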
Advantages:
- Preserves the relationships between the original data points.
- Produces values in a fixed, bounded range, which suits distance-based models and neural networks.
Disadvantages:
- Highly sensitive to outliers, since a single extreme value stretches the range and squashes the rest of the data.
- The minimum and maximum learned from training data may not cover unseen values.
2. Mean Normalization Scaling
Mean normalization centers the data around 0 by subtracting the mean, and then scales it by the range of the data (max - min), so the values fall roughly within (-1, 1). Note that it is distinct from Z-score standardization, which divides by the standard deviation instead of the range.
Mathematical Formula:
x_normalized = (x - mean(x)) / (max(x) - min(x))
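sklearn has no dedicated mean-normalization scaler, so here is a minimal NumPy sketch of the formula above:
import numpy as np
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# mean normalization: center on the mean, scale by the data range
normalized = (data - data.mean()) / (data.max() - data.min())
print(normalized)
# [-0.5  -0.25  0.    0.25  0.5 ]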
Advantages:
- Centers the data around 0, which can help optimization algorithms such as gradient descent converge.
- Keeps values in a bounded range of roughly (-1, 1).
Disadvantages:
- Still sensitive to outliers, because the range (max - min) is used as the scaling factor.
- Less commonly supported out of the box by libraries than min-max or standard scaling.
3. Max-Absolute Scaling
Max-absolute scaling scales data to the [-1, 1] range by dividing each data point by the maximum absolute value in the dataset.
Mathematical Formula
x_normalized = x / max(|x|)
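A minimal sketch of the formula, written directly in NumPy (the combined sklearn code further below uses MaxAbsScaler for the same purpose); the mixed-sign values here are made up to show the [-1, 1] output range:
import numpy as np
data = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
# max-absolute scaling: divide by the largest absolute value
scaled = data / np.max(np.abs(data))
print(scaled)
# [ 0.2 -0.4  0.6 -0.8  1. ]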
Advantages:
- Does not shift or center the data, so zero entries stay zero, which preserves sparsity in sparse datasets.
- Keeps all values within [-1, 1].
Disadvantages:
- Very sensitive to outliers, since a single extreme value determines the scaling factor.
4. Robust Scaling (Using IQR Method)
Robust scaling, often referred to as IQR scaling, is a method that scales data using the interquartile range (IQR). It is robust to outliers because it relies on the middle 50% of the data.
Mathematical Formula
x_normalized = (x - median(x)) / IQR(x)
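A minimal sketch of the formula with NumPy (sklearn's RobustScaler implements the same idea, using the 25th-75th percentile range by default):
import numpy as np
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# robust scaling: center on the median, scale by the interquartile range
q1, q3 = np.percentile(data, [25, 75])
scaled = (data - np.median(data)) / (q3 - q1)
print(scaled)
# [-1.  -0.5  0.   0.5  1. ]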
Advantages:
- Robust to outliers, since the median and IQR are barely affected by extreme values.
Disadvantages:
- The output is not bounded to a fixed range such as [0, 1].
In these formulas, x represents the original data point; min(x) and max(x) are the minimum and maximum values of the feature; mean(x) and median(x) are its mean and median; std(x) is its standard deviation; and IQR(x) is its interquartile range (the difference between the 75th and 25th percentiles).
Practical Implementation
"""
This is the basic code for MinMaxScaler implementation
"""
# import the MinMaxScaler library from sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
# load Sample data
data = [[1.0], [2.0], [3.0], [4.0], [5.0]]
# Create a MinMaxScaler instance
min_max_scaler = MinMaxScaler()
normalize_scaler = Normalizer()
max_abs_scaler = MaxAbsScaler()
robust_Scaler = RobustScaler()
# Fit and Transform the scaler to the data and transform the data
scaled_data = min_max_scaler.fit_transform(data)
# Print the scaled data
print(scaled_data)
This is simple code to implement the normalization methods. Now let's apply them to a small sample and see the insights in the data; a short sketch follows below, and the Colab notebook walks through a fuller example dataset.
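As a minimal sketch (the feature values here are made up for illustration), compare min-max scaling and robust scaling on a feature that contains one outlier; the outlier squashes the min-max output, while robust scaling keeps the bulk of the data well spread:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler
# a hypothetical feature with one large outlier (100.0)
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
print(MinMaxScaler().fit_transform(data).ravel())
# [0.         0.01010101 0.02020202 0.03030303 1.        ]  <- bulk squashed near 0
print(RobustScaler().fit_transform(data).ravel())
# [-1.  -0.5  0.   0.5  48.5]  <- bulk stays well spread; only the outlier is extreme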
Google Colaboratory
Please go through this Colab notebook, where I have explained the normalization methods with an example dataset.
If you have any queries just write a mail(mentioned below), and I’ll try to respond.
That’s it for today's topic; we’ll discuss another topic in the next article.
Thank you for taking the time to read this article.
I hope it has provided you with valuable insights into the world of feature scaling and how it can be used to enhance the performance of machine learning models. I’m excited to share these hands-on insights and make the content more engaging.
Stay tuned for upcoming articles.
EMAIL -> [email protected]
Previous article: 6. Feature Scaling and Different Feature Scaling Methods in ML.
Next article: 8. Data Encoding in ML.
YouTube Channel