Feature Scaling & Normalization – The Effect of Standardization for Machine Learning Algorithms

About standardization

The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with

μ=0 and σ=1

where μ is the mean (average) and σ is the standard deviation from the mean; standard scores (also called z scores) of the samples are calculated as follows:

z = (x − μ) / σ
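As a minimal sketch of this formula using NumPy (the feature values below are made up for illustration):

```python
import numpy as np

# made-up feature values (e.g., alcohol content in percent/volume)
x = np.array([12.9, 13.5, 14.1, 12.4, 13.8])

# z-score standardization: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean())  # ~0.0 (up to floating-point error)
print(z.std())   # 1.0
```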

Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms. Intuitively, we can think of gradient descent as a prominent example (an optimization algorithm often used in logistic regression, SVMs, perceptrons, neural networks, etc.); with features being on different scales, certain weights may update faster than others since the feature values x_j play a role in the weight updates

Δw_j = η Σ_i (t^(i) − o^(i)) x_j^(i)

so that

w_j := w_j + Δw_j,

where η is the learning rate, t the target class label, and o the actual output. Other intuitive examples include K-Nearest Neighbor algorithms and clustering algorithms that use, for example, Euclidean distance measures – in fact, tree-based classifiers are probably the only classifiers where feature scaling doesn’t make a difference.
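To make the effect of the feature scale on the update concrete, here is a short sketch (the toy targets, outputs, and feature values are made up) showing that the same feature expressed in larger units produces a proportionally larger weight update:

```python
import numpy as np

eta = 0.01                        # learning rate
t = np.array([1.0, 0.0, 1.0])     # made-up target labels
o = np.array([0.8, 0.3, 0.6])     # made-up current model outputs

# the same underlying feature, once in "small" units and once in "large" units
x_small = np.array([0.1, 0.2, 0.15])   # e.g., measured in kilograms
x_large = x_small * 1000.0             # same quantity expressed in grams

# gradient-descent-style update for one weight: delta_w_j = eta * sum_i (t_i - o_i) * x_ij
dw_small = eta * np.dot(t - o, x_small)
dw_large = eta * np.dot(t - o, x_large)

print(dw_small, dw_large)   # the second update is 1000x larger, only because of the unit
```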

To quote from the scikit-learn documentation:

“Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).”

In fact, the only family of algorithms that I could think of being scale-invariant are tree-based methods. Let’s take the general CART decision tree algorithm. Without going into much depth regarding information gain and impurity measures, we can think of the decision as “is feature x_i >= some_val?” Intuitively, we can see that it really doesn’t matter on which scale this feature is (centimeters, Fahrenheit, a standardized scale – it really doesn’t matter).
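The scale invariance of trees can be checked empirically; the following sketch (made-up data, scikit-learn) fits the same decision tree on raw and standardized versions of a feature matrix, and the resulting predictions should be identical even though the split thresholds differ:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=20.0, scale=5.0, size=(100, 2))   # made-up features, e.g., in centimeters
y = (X[:, 0] > 20).astype(int)                       # label depends only on a threshold

X_std = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_std = DecisionTreeClassifier(random_state=0).fit(X_std, y)

# the thresholds differ, but the induced partition (and thus the predictions) match
print(np.array_equal(tree_raw.predict(X), tree_std.predict(X_std)))  # True
```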

Some examples of algorithms where feature scaling matters are:

- k-nearest neighbors with a Euclidean distance measure, if we want all features to contribute equally

- k-means (see k-nearest neighbors)

- logistic regression, SVMs, perceptrons, neural networks etc. if you are using gradient descent/ascent-based optimization, otherwise some weights will update much faster than others

- linear discriminant analysis, principal component analysis, kernel principal component analysis, since you want to find the directions that maximize the variance (under the constraint that those directions/eigenvectors/principal components are orthogonal); you want to have features on the same scale since you’d otherwise emphasize variables on “larger measurement scales” more (see the short PCA sketch below).

There are many more cases than I can possibly list here … I always recommend you to think about the algorithm and what it’s doing, and then it typically becomes obvious whether you want to scale your features or not.
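Here is a hedged sketch of the PCA point, with two made-up, independently drawn features on very different measurement scales; without standardization the first principal component is dominated by the large-scale feature, while after standardization both features contribute with comparable weight:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
# two made-up features: one with small values (e.g., grams), one with large values (e.g., milligrams)
X = np.column_stack([rng.normal(5, 1, 200), rng.normal(5000, 1000, 200)])

pca_raw = PCA(n_components=1).fit(X)
pca_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print(pca_raw.components_)  # almost entirely along the "large-scale" feature
print(pca_std.components_)  # roughly equal weights after standardization
```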

In addition, we’d also want to think about whether we want to “standardize” or “normalize” (here: scaling to [0, 1] range) our data. Some algorithms assume that our data is centered at 0. For example, if we initialize the weights of a small multi-layer perceptron with tanh activation units to 0 or small random values centered around zero, we want to update the model weights “equally.” As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.

About Min-Max scaling

An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also simply called “normalization” - a common cause for ambiguities).

In this approach, the data is scaled to a fixed range - usually 0 to 1.

The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers.

A Min-Max scaling is typically done via the following equation:

X_norm = (X − X_min) / (X_max − X_min)
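A minimal NumPy sketch of this equation (feature values made up):

```python
import numpy as np

x = np.array([1.0, 5.0, 3.0, 10.0])   # made-up feature values

# Min-Max scaling to the [0, 1] range
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_minmax)   # [0.         0.44444444 0.22222222 1.        ]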

Z-score standardization or Min-Max scaling?

“Standardization or Min-Max scaling?” - There is no obvious answer to this question: it really depends on the application.

For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question and if the PCA computes the components via the correlation matrix instead of the covariance matrix; but more about PCA in my previous article).

However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.

Standardizing and normalizing - how it can be done using scikit-learn

Of course, we could make use of NumPy’s vectorization capabilities to calculate the z-scores for standardization and to normalize the data using the equations that were mentioned in the previous sections. However, there is an even more convenient approach using the preprocessing module from Python’s open-source machine learning library scikit-learn.
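As a sketch of that approach (the small feature matrix below is made up), scikit-learn’s StandardScaler and MinMaxScaler follow the usual fit/transform pattern:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[12.9, 1.5],
              [13.5, 2.1],
              [14.1, 5.3]])   # made-up samples with two features on different scales

# Z-score standardization
std_scaler = preprocessing.StandardScaler().fit(X)
X_std = std_scaler.transform(X)

# Min-Max scaling to [0, 1]
minmax_scaler = preprocessing.MinMaxScaler().fit(X)
X_minmax = minmax_scaler.transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))       # ~[0, 0] and [1, 1]
print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # [0, 0] and [1, 1]
```

A practical note on the fit/transform split: the scaler should be fitted on the training data only, and the same fitted parameters (mean/std or min/max) reused to transform new or test data, so that all data ends up on the scale the model was trained on.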

For the following examples and discussion, we will have a look at the free “Wine” Dataset that is deposited on the UCI machine learning repository.

Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The Wine dataset consists of 3 different classes, where each row corresponds to a particular wine sample.

The class labels (1, 2, 3) are listed in the first column, and the columns 2-14 correspond to 13 different attributes (features):

1) Alcohol

2) Malic acid

...

Loading the wine dataset

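The loading code was linked externally in the original post; a sketch with pandas might look like the following (the URL points to where the raw file has historically been hosted on the UCI repository and may change over time):

```python
import pandas as pd

# the Wine dataset from the UCI machine learning repository (the raw file has no header row)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'

# keep only the class label and the first two features for this example
df = pd.read_csv(url, header=None, usecols=[0, 1, 2])
df.columns = ['Class label', 'Alcohol', 'Malic acid']

print(df.head())
```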

As we can see in the table above, the features Alcohol (percent/volume) and Malic acid (g/l) are measured on different scales, so that feature scaling is important prior to any comparison or combination of...
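Continuing the sketch above (assuming the df from the loading example), the two features could be standardized and min-max scaled like this:

```python
from sklearn import preprocessing

# fit both scalers on the two wine features loaded above
std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])

minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])

print('Mean after standardization:', df_std.mean(axis=0))
print('Min/max after min-max scaling:', df_minmax.min(axis=0), df_minmax.max(axis=0))
```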

Continue reading on Gladwin Analytics

Author: Sebastian Raschka

Originally published on Sebastian Raschka's blog

About Anandh Shanmugaraj:

Anandh is the Founder and CEO of Gladwin Analytics - the world's most exclusive professional network of Big Data, Analytics, Internet of Things, Research and Cloud Computing professionals. Join us and access people, jobs, news, companies, universities, updates and insights that make you the best in data science.
