
Importance of Data Normalisation for Data Science and Machine Learning Models

Joseph Sefara

Senior Data Scientist Specialist

Published: 2 March 2020

Normalisation is a technique often applied as part of data preparation for machine learning. The goal of normalisation is to change the values of numeric columns in the data set to a common scale without distorting differences in the ranges of values. Not every data set requires normalisation for machine learning; it is needed only when features have different ranges.
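As a small illustration of rescaling to a common range, the sketch below applies min-max scaling in plain Python. The helper name min_max_scale and the sample values are made up for illustration, and min-max scaling is only one of several normalisation methods.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so they span the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample values for two features on very different scales
ages = [18, 25, 40, 60, 82]
incomes = [20_000, 35_000, 80_000, 150_000, 500_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

print(scaled_ages)
print(scaled_incomes)
```

After scaling, both columns lie between 0 and 1, while the relative spacing between values within each column is unchanged.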

For example, consider a data set containing two features, age (x1) and income (x2), where age ranges from 0–100 while income ranges from 20,000–500,000. Income is about 1,000 times larger than age, so these two features are in very different ranges. When we do further analysis, such as multivariate linear regression, the income attribute will intrinsically influence the result more because of its larger values. But this doesn't necessarily mean it is more important as a predictor.
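To make the effect concrete, here is a minimal sketch that compares how strongly each raw feature pulls on a linear model. It computes the gradient of the mean squared error with respect to each weight when all weights start at zero. The three data rows, the target values and the function name are invented for illustration.

```python
# Hypothetical rows: age in years, income in currency units, and a made-up target
ages = [25, 40, 60]
incomes = [30_000, 80_000, 200_000]
targets = [1.0, 2.0, 3.0]


def mse_gradient_at_zero(feature, targets):
    """Gradient of mean squared error w.r.t. one weight, with all weights at 0."""
    n = len(targets)
    # With all weights at zero the prediction is 0, so the error is (0 - target)
    return (2 / n) * sum((0.0 - t) * x for x, t in zip(feature, targets))


grad_age = mse_gradient_at_zero(ages, targets)
grad_income = mse_gradient_at_zero(incomes, targets)
ratio = grad_income / grad_age

print(f"gradient for age weight:    {grad_age:.1f}")
print(f"gradient for income weight: {grad_income:.1f}")
print(f"income / age ratio:         {ratio:.0f}")
```

The income gradient is thousands of times larger purely because of the feature's units, not because income is a better predictor in this toy data.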

To explain further, let's build two deep neural network models: one trained on the original data and one trained on normalised data. At the end I will compare the results of the two models and show the effect of normalisation on their accuracy.

Below is a Neural Network Model built using original data:

# Using the covertype dataset from Kaggle to predict forest cover type

# Import pandas, scikit-learn, tensorflow and keras
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras

# Read the data from csv file
df = pd.read_csv('covtype.csv')

# Select predictors
x = df[df.columns[:54]]

# Target variable 
y = df.Cover_Type

# Split data into train and test 
x_train, x_test, y_train, y_test = train_test_split(x, y , train_size = 0.7, random_state =  90)

# y is a multi-class categorical variable, so the output layer uses softmax
# and the loss function is sparse categorical cross-entropy.

model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu,
                       input_shape=(x_train.shape[1],)),
    keras.layers.Dense(64, activation=tf.nn.relu),
    # 8 output units so that integer labels up to 7 are valid class indices
    keras.layers.Dense(8, activation='softmax')
])

# 'adam' selects the Adam optimizer with default settings;
# tf.train.AdamOptimizer() exists only in TensorFlow 1.x
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history1 = model.fit(x_train, y_train, epochs=26,
                     batch_size=60, validation_data=(x_test, y_test))
Output:
....
Epoch 26/26 — 17s 42us/step — loss: 8.2614 — acc: 0.4874 — val_loss: 8.2531 — val_acc: 0.4880

Validation accuracy of the above model is just 48.80%.
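One possible reading of these numbers, which is an interpretation and not something the training log itself states: a loss of about 8.26 together with an accuracy of about 0.49 is what you would expect if the network assigns almost all probability to a single class for every sample. The sketch below checks the arithmetic, assuming Keras clips predicted probabilities at its default epsilon of 1e-7 before taking the logarithm.

```python
import math

# Assumption: predicted probabilities are clipped at the Keras default epsilon
KERAS_EPSILON = 1e-7

reported_accuracy = 0.4874  # training accuracy at epoch 26, from the log above
reported_loss = 8.2614      # training loss at epoch 26, from the log above

# Cross-entropy for one sample whose true class got (clipped) probability ~0
loss_when_wrong = -math.log(KERAS_EPSILON)

# Correctly classified samples contribute ~0 loss, wrong ones loss_when_wrong
estimated_loss = (1 - reported_accuracy) * loss_when_wrong

print(f"loss per wrong sample: {loss_when_wrong:.4f}")
print(f"estimated mean loss:   {estimated_loss:.4f}")
print(f"reported mean loss:    {reported_loss:.4f}")
```

If this reading is right, the model has saturated on one class and its gradients are close to zero, which would fit a flat accuracy curve.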

Now let's normalise the data first and then build a deep neural network model. There are different methods to normalise data; here I will normalise the features by removing the mean and scaling to unit variance.
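Removing the mean and scaling to unit variance means computing z = (x - mean) / standard deviation for each feature. A minimal plain-Python sketch follows, using invented numbers and hypothetical helper names. It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses, and it takes the statistics from the training values only before applying them to the test values.

```python
from statistics import mean, pstdev


def fit_standardiser(train_values):
    """Return the mean and population standard deviation of the training values."""
    return mean(train_values), pstdev(train_values)


def apply_standardiser(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one feature, already split into train and test
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_standardiser(train)
train_scaled = apply_standardiser(train, mu, sigma)
# The test data reuses the training mean and SD; it is never used for fitting
test_scaled = apply_standardiser(test, mu, sigma)

print(train_scaled)
print(test_scaled)
```

The scikit-learn code that follows does the same thing for the first ten columns of the data set.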

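from sklearn import preprocessing

df = pd.read_csv('covtype.csv')
# Select the same 54 predictors as before; [:55] would also pull in the
# Cover_Type target, assuming it is the 55th column of the file
x = df[df.columns[:54]]
y = df.Cover_Type
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=90)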

# Cast to float so the scaled values can be written back without dtype conflicts
x_train = x_train.astype('float64')
x_test = x_test.astype('float64')

# Select the numerical columns that need to be normalised
train_norm = x_train[x_train.columns[0:10]]
test_norm = x_test[x_test.columns[0:10]]

# Fit the scaler on the training data only, then normalise the training data
std_scale = preprocessing.StandardScaler().fit(train_norm)
x_train_norm = std_scale.transform(train_norm)

# Convert the numpy array back to a dataframe and write it into x_train
training_norm_col = pd.DataFrame(x_train_norm, index=train_norm.index, columns=train_norm.columns)
x_train.update(training_norm_col)
print(x_train.head())

# Normalise the testing data using the mean and SD of the training set
x_test_norm = std_scale.transform(test_norm)
testing_norm_col = pd.DataFrame(x_test_norm, index=test_norm.index, columns=test_norm.columns)
x_test.update(testing_norm_col)
print(x_test.head())

# Build the neural network model with normalised data
model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu,
                       input_shape=(x_train.shape[1],)),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(8, activation='softmax')
])

# 'adam' replaces tf.train.AdamOptimizer(), which exists only in TensorFlow 1.x
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history2 = model.fit(x_train, y_train, epochs=26, batch_size=60,
                     validation_data=(x_test, y_test))

# Output:
....................
Epoch 26/26 - 16s 34us/step - loss: 0.2703 - acc: 0.8907 - val_loss: 0.2773 - val_acc: 0.8893

Validation accuracy of the model is 88.93%, which is considerably better.

From the accuracy graphs of the two models, we see that model 1 (left-hand graph) has very low validation accuracy (48%), and its accuracy curve is a straight line for both the training and test data. A straight line means that accuracy does not change with the number of epochs: even at epoch 26 it is the same as it was at epoch 1. The reason for the flat line and the low accuracy is that the model is not able to learn in 26 epochs. Because the features do not have similar ranges of values, the gradients can oscillate back and forth and take a long time to find their way to a global or local minimum.

To overcome this learning problem, we normalise the data. We make sure that the different features take on similar ranges of values so that gradient descent can converge more quickly. In the right-hand graph, we can see that after normalising the data, the accuracy of model 2 increases with every epoch and reaches 88.93% at epoch 26.
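The convergence argument can be reproduced on a toy problem. The sketch below is not the covertype experiment: it fits a two-feature linear regression with plain gradient descent, once on raw features and once on standardised ones, and compares the error after the same number of steps. The synthetic data, the coefficients, the learning rates and the function names are all assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev


def standardise(column):
    """Remove the mean and scale to unit variance (population SD)."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def final_mse_after_gd(x1, x2, y, learning_rate, steps):
    """Run batch gradient descent on y ~ w1*x1 + w2*x2 + b and return the final MSE."""
    w1 = w2 = b = 0.0
    n = len(y)
    for _ in range(steps):
        errors = [w1 * a + w2 * c + b - t for a, c, t in zip(x1, x2, y)]
        grad_w1 = (2 / n) * sum(e * a for e, a in zip(errors, x1))
        grad_w2 = (2 / n) * sum(e * c for e, c in zip(errors, x2))
        grad_b = (2 / n) * sum(errors)
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
        b -= learning_rate * grad_b
    errors = [w1 * a + w2 * c + b - t for a, c, t in zip(x1, x2, y)]
    return sum(e * e for e in errors) / n


# Synthetic data: age in 0-100, income in 20,000-500,000, noiseless linear target
random.seed(0)
ages = [random.uniform(0, 100) for _ in range(200)]
incomes = [random.uniform(20_000, 500_000) for _ in range(200)]
targets = [0.5 * a + 0.0001 * i for a, i in zip(ages, incomes)]

# With raw features the learning rate has to be tiny, otherwise the large
# income values make the updates diverge
mse_raw = final_mse_after_gd(ages, incomes, targets, learning_rate=1e-12, steps=200)

# With standardised features a much larger learning rate is stable
mse_scaled = final_mse_after_gd(standardise(ages), standardise(incomes), targets,
                                learning_rate=0.1, steps=200)

print(f"MSE after 200 steps, raw features:          {mse_raw:.4f}")
print(f"MSE after 200 steps, standardised features: {mse_scaled:.10f}")
```

On the raw features, the step size that keeps the income direction stable is far too small for the age weight and the bias to make progress, so the error stays high. On the standardised features the same number of steps is enough to fit the target almost exactly.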

I write about machine learning and data science. If any of those topics interest you, read more here and follow me on LinkedIn and Twitter.
