Machine Learning - Feature Scaling Techniques
Gaurav Pahuja
Senior Data Scientist | DatSci 2019 Finalist | Python/Plotly-Dash | R/R-Shiny | Oracle SQL/BI | SQL | Machine Learning | Deep Learning | Techfitlab
Standardisation, Normalisation and Binning in Python
What is Feature Scaling?
Feature scaling is a method used to bring the independent variables (features) of a dataset into a comparable range. In some scenarios, it also speeds up the calculations in an algorithm. If feature scaling is not done, a machine learning algorithm tends to give more weight to features with larger numeric values and less weight to features with smaller values, regardless of the units in which they are measured.
Which Algorithms require Feature Scaling?
In general, algorithms that exploit distances or similarities between data samples, such as k-NN, K-Means, PCA and SVM, are sensitive to feature transformations. Similarity here is defined by the distance between points: the smaller the distance between two points, the more similar they are, and vice versa.
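As a rough illustration of why distance-based methods care about scale, the sketch below compares Euclidean distances before and after standardising. The three samples, and the choice of age and income as features, are made up for this example and do not come from the article's dataset.

```python
import numpy as np

# Hypothetical samples: column 0 is age in years, column 1 is income in euros
X = np.array([
    [25.0, 50_000.0],   # person A
    [60.0, 50_500.0],   # person B: very different age, similar income
    [26.0, 58_000.0],   # person C: similar age, different income
])

def euclidean(u, v):
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw distances: income dominates because its numeric range is far larger
d_ab_raw = euclidean(X[0], X[1])   # roughly 501
d_ac_raw = euclidean(X[0], X[2])   # roughly 8000

# Standardise each column (zero mean, unit standard deviation)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_ab_std = euclidean(X_std[0], X_std[1])
d_ac_std = euclidean(X_std[0], X_std[2])

print("raw:   ", d_ab_raw, d_ac_raw)
print("scaled:", d_ab_std, d_ac_std)
```

On the raw values the 35-year age gap between A and B is almost invisible next to the income differences. After standardising, both features contribute on a comparable footing and the two distances come out close to each other.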
Feature scaling is a critical step when using neural network models, as a large spread of input values can lead to large errors in the gradient values, causing the weights to change dramatically and making the learning process unstable.
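One way to make this concrete, under the simplifying assumption of a linear least-squares model rather than a full neural network, is to look at the condition number of XᵀX. For least squares the loss curvature is proportional to XᵀX, and a large condition number means gradient descent needs a very small step size in some directions and converges slowly in others. The feature scales below (1 and 1000) are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical features on very different scales
X = np.column_stack([
    rng.normal(0.0, 1.0, n),      # values of order 1
    rng.normal(0.0, 1000.0, n),   # values of order 1000
])

def condition_number(A):
    # Ratio of the largest to the smallest singular value of A^T A
    return float(np.linalg.cond(A.T @ A))

# Standardise each column (zero mean, unit standard deviation)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = condition_number(X)
cond_std = condition_number(X_std)
print("condition number, raw:   ", cond_raw)
print("condition number, scaled:", cond_std)
```

The raw features give a condition number several orders of magnitude larger than the standardised ones, which is the numerical counterpart of the unstable weight updates described above.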
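Linear and Logistic Regression models are also affected by feature scale, because regression coefficients are directly influenced by the scale of the features. This matters most when the model is regularised: the penalty acts on the size of the coefficients, so features on different scales are penalised unevenly. Without regularisation, rescaling a feature changes its coefficient but not the fitted predictions. [1]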
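The dependence of regression coefficients on feature scale can be seen in a small sketch. The same made-up feature is expressed once in metres and once in kilometres, and an ordinary least-squares line is fitted with NumPy. The data, the variable names and the helper `ols_slope` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
distance_m = rng.uniform(100.0, 5000.0, n)          # hypothetical feature, in metres
y = 0.002 * distance_m + rng.normal(0.0, 0.1, n)    # hypothetical target

def ols_slope(x, y):
    # Least-squares fit of y = intercept + slope * x
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(coef[1])

slope_m = ols_slope(distance_m, y)             # slope per metre
slope_km = ols_slope(distance_m / 1000.0, y)   # slope per kilometre
print(slope_m, slope_km)
```

The slope per kilometre is 1000 times the slope per metre, even though both fits describe exactly the same relationship. Coefficients are therefore only comparable across features once the features share a scale.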
Graphical-model based classifiers (e.g. Naive Bayes, LDA), as well as Decision Trees and tree-based ensemble methods (e.g. RF, XGBoost, AdaBoost), are invariant to feature scaling. Scaling is therefore optional for these models, but it can still be a good idea to apply it when the same preprocessed data also feeds scale-sensitive models or models trained with Gradient Descent, which tend to converge faster on scaled inputs.
Standardisation, Normalisation and Binning
Standardisation
Standardisation is a scaling technique in which the values are centred around the mean with unit standard deviation, so that each feature ends up with a mean of 0 and a standard deviation of 1. Standardisation is required when the features of the input dataset have large differences between their ranges, or simply when they are measured in different units, e.g. kWh, metres, miles and more.
Z-score is one of the most popular methods to standardise data, and can be done by subtracting the mean and dividing by the standard deviation for each value of each feature.
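Equation for Standardisation (Z-score):
$$z = \frac{x - \mu}{\sigma}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.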
Method 1
First, we calculate the mean and the standard deviation of each feature in the dataset. Then we calculate the Z-score by subtracting the mean and dividing by the standard deviation, as shown in the equation above.
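# Column-wise mean and standard deviation of the training set
train_mean = train_df.mean()
train_std = train_df.std()
# Z-score: subtract the mean, then divide by the standard deviation
train_df = (train_df - train_mean) / train_std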
Output:
Let’s visualise each of the features with a violinplot after standardising the dataset.
import matplotlib.pyplot as plt
import seaborn as sns
# Reshape to long format: one row per (column, value) pair
df_std = train_df.melt(var_name='Columns', value_name='Standardise')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Columns', y='Standardise', data=df_std)
ax.tick_params(axis='x', labelrotation=90)  # rotate the column names so they do not overlap
Output:
Method 2
In the second method we use the StandardScaler class from the sklearn.preprocessing module. It can be combined with ColumnTransformer from the sklearn.compose module to leave categorical or target columns untouched while standardising the rest; in the example below, the demand column is excluded from scaling. We pass the columns that we want to scale and then call fit_transform(), which returns the scaled values as an array. Finally, we convert the array back into a pandas DataFrame, passing the column names and the index.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# Columns to scale; the target column 'demand' is passed through unchanged
feature_cols = ['temp', 'rain', 'msl', 'dewpt', 'rhum', 'Day sin', 'Day cos', 'Year sin', 'Year cos', 'Wx', 'Wy']
col_names = feature_cols + ['demand']
ct = ColumnTransformer([('scaler', StandardScaler(), feature_cols)], remainder='passthrough')
# fit_transform returns an array: scaled columns first, passthrough column last
train_df = pd.DataFrame(ct.fit_transform(train_df[col_names]), columns=col_names, index=train_df.index)
Output:
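A detail worth checking when comparing Method 1 and Method 2: pandas `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The two methods therefore give slightly different numbers unless `ddof=0` is passed to pandas. The toy `temp` values below are invented for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"temp": [10.0, 12.0, 14.0, 16.0, 18.0]})  # toy values

# Method 1 style: pandas std() defaults to ddof=1 (sample standard deviation)
manual = (df - df.mean()) / df.std()

# Method 2 style: StandardScaler divides by the ddof=0 standard deviation
sk = StandardScaler().fit_transform(df)

# Matching the scaler exactly requires ddof=0 in pandas
manual_ddof0 = (df - df.mean()) / df.std(ddof=0)

print(manual["temp"].to_numpy())
print(sk.ravel())
print(manual_ddof0["temp"].to_numpy())
```

On a large training set the difference is negligible, but it explains why the two methods do not agree to the last decimal place.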
Normalisation
Normalisation is another scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. The goal of normalisation is to bring the numeric columns in the dataset onto a common scale, without distorting the relative differences between values or losing information.
Equation for Normalisation:
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
Method 1
First, we calculate the minimum and the maximum values of each feature in the dataset. Then we calculate X-norm by subtracting the minimum and dividing by the difference between the maximum and the minimum, as shown in the equation above.
# Column-wise minimum and maximum of the training set
train_min = train_df.min()
train_max = train_df.max()
# Min-max scaling: every column ends up in the range [0, 1]
train_df = (train_df - train_min) / (train_max - train_min)
Output:
Let’s visualise each of the features with a violinplot after normalising the dataset.
df_mm = train_df.melt(var_name='Columns', value_name='Normalised')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Columns', y='Normalised', data=df_mm)
_ = ax.set_xticklabels(train_df.keys())
ax.set_ylim([-1, 2])
Output:
Method 2
In the second method we use the MinMaxScaler class from the sklearn.preprocessing module. It can be combined with ColumnTransformer from the sklearn.compose module to leave categorical or target columns untouched while normalising the rest; in the example below, the demand column is excluded from scaling. We pass the columns that we want to scale and then call fit_transform(), which returns the scaled values as an array. Finally, we convert the array back into a pandas DataFrame, passing the column names and the index.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
# Columns to scale; the target column 'demand' is passed through unchanged
feature_cols = ['temp', 'rain', 'msl', 'dewpt', 'rhum', 'Day sin', 'Day cos', 'Year sin', 'Year cos', 'Wx', 'Wy']
col_names = feature_cols + ['demand']
ct = ColumnTransformer([('scaler', MinMaxScaler(), feature_cols)], remainder='passthrough')
# fit_transform returns an array: scaled columns first, passthrough column last
train_df = pd.DataFrame(ct.fit_transform(train_df[col_names]), columns=col_names, index=train_df.index)
Output:
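As a quick sanity check on the two normalisation methods, the sketch below applies the manual min-max formula and `MinMaxScaler` to the same small table and confirms that they agree. The `temp` and `rain` values are toy numbers, not taken from the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"temp": [5.0, 10.0, 20.0], "rain": [0.0, 2.0, 8.0]})  # toy values

# Method 1 style: (x - min) / (max - min), column by column
manual = (df - df.min()) / (df.max() - df.min())

# Method 2 style: MinMaxScaler with its default feature_range of (0, 1)
sk = MinMaxScaler().fit_transform(df)

print(manual)
print(sk)
```

Each column runs from 0 at its minimum to 1 at its maximum in both versions, with intermediate values placed proportionally in between.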
Binning
Binning is used to transform a continuous or numerical variable into a categorical feature. It is a useful technique for reducing the influence of outliers or extreme values on the model, and it helps to control heavily skewed data. It also gives a quick picture of how the data is distributed across its range, provided the bin sizes are chosen sensibly.
In this example we will use pandas.qcut(), which discretises a variable into equal-sized buckets based on rank or on sample quantiles. For example, 1000 values split into 10 quantiles produce a Categorical object indicating quantile membership for each data point.
One problem with pandas.qcut() is that it chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of a quantile). When many records share a value, several bin edges can coincide and qcut fails. One way to solve this is to introduce a minimal amount of noise, which artificially creates unique bin edges. [2]
import numpy as np
import pandas as pd
col_names = ['temp', 'rain', 'msl', 'dewpt', 'rhum', 'Day sin', 'Day cos', 'Year sin', 'Year cos', 'Wx', 'Wy', 'demand']
nbins = 10
def jitter(a_series, noise_reduction=1000000):
    # Uniform noise centred on zero, with a total width of std / noise_reduction
    return (np.random.random(len(a_series)) * a_series.std() / noise_reduction) - (a_series.std() / (2 * noise_reduction))
for feature in col_names:
    # labels=False returns the integer bin index (0 to nbins - 1) for each value
    train_df[feature] = pd.qcut(train_df[feature] + jitter(train_df[feature]), nbins, labels=False)
Output:
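To see the duplicate-edge problem in isolation, the sketch below first bins 1000 distinct values into 10 quantiles, then tries the same on a made-up rainfall-like series in which most values are exactly zero. It also shows the `duplicates="drop"` option of `pandas.qcut()`, a built-in alternative to jitter that merges the coinciding edges and so returns fewer bins than requested.

```python
import numpy as np
import pandas as pd

# 1000 distinct values split into 10 quantile bins: 100 values per bin
bins = pd.qcut(pd.Series(np.arange(1000)), 10, labels=False)
counts = bins.value_counts().sort_index()
print(counts.tolist())

# Hypothetical rainfall-like series where most values are exactly zero
rain = pd.Series([0.0] * 8 + [1.0, 2.0])
try:
    pd.qcut(rain, 4, labels=False)
    raised = False
except ValueError:
    # Several quantile edges coincide at 0.0, so the bin edges are not unique
    raised = True
print("qcut raised ValueError:", raised)

# Built-in alternative to jitter: drop the duplicate edges
merged = pd.qcut(rain, 4, labels=False, duplicates="drop")
print("bins after dropping duplicate edges:", merged.nunique())
```

Dropping duplicate edges avoids the error but collapses this series into a single bin, which is one reason to prefer a small amount of jitter when a fixed number of bins is needed.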
Let’s visualise each of the features with a violinplot after binning the dataset.
df_mm = train_df.melt(var_name='Columns', value_name='Binning')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Columns', y='Binning', data=df_mm)
_ = ax.set_xticklabels(train_df.keys())
ax.set_ylim([-1, 10])
Output:
Data Leakage
In a real-life problem, you should transform your test dataset with the same scaling method, using the parameters learned from the training dataset, to avoid "data leakage".
If you scale the data by computing the mean and standard deviation (or the minimum and maximum) over the whole dataset, test set included, you are using knowledge about the distribution of the test set to set the scale of the training set, which 'leaks' information.
The best approach is to calculate the mean and standard deviation (or the minimum and maximum) on the training set only, scale the training set with them, and then at test time use those same training statistics to scale the test set.
# Use one method or the other, not both
# Method 1: reuse the mean and standard deviation computed on the training set
test_df = (test_df - train_mean) / train_std
# Method 2: fit the transformer on the training set only, then reuse it
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
feature_cols = ['temp', 'rain', 'msl', 'dewpt', 'rhum', 'Day sin', 'Day cos', 'Year sin', 'Year cos', 'Wx', 'Wy']
col_names = feature_cols + ['demand']
ct = ColumnTransformer([('scaler', StandardScaler(), feature_cols)], remainder='passthrough')
# fit_transform learns the mean and standard deviation from the training data
train_df = pd.DataFrame(ct.fit_transform(train_df[col_names]), columns=col_names, index=train_df.index)
# Test Dataset: transform only (no fit), so the training statistics are reused
test_df = pd.DataFrame(ct.transform(test_df[col_names]), columns=col_names, index=test_df.index)
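The difference between `transform` and `fit_transform` on the test set can be checked on synthetic data. In the sketch below the test set is deliberately drawn from a shifted distribution; the means (10 and 14) and sample sizes are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: the test set is drawn from a shifted distribution
X_train = rng.normal(10.0, 2.0, size=(200, 1))
X_test = rng.normal(14.0, 2.0, size=(50, 1))

scaler = StandardScaler().fit(X_train)       # statistics come from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # transform, not fit_transform

# Refitting on the test set uses test-set statistics and hides the shift
X_test_refit = StandardScaler().fit_transform(X_test)

print("test mean, training statistics:", X_test_scaled.mean())
print("test mean, refitted on test:   ", X_test_refit.mean())
```

With the training statistics, the scaled test set keeps a mean well above zero, so the shift remains visible to the model. Refitting on the test set forces its mean to zero, which both uses information that would not be available at prediction time and puts the test data on a different scale from the training data.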
Summary
In this article, we covered what feature scaling is and which algorithms require it. We learned about different techniques that can be used to scale features, namely Standardisation, Normalisation and Binning, and looked at examples of applying each of them.
Framework: Jupyter Notebook, Language: Python, Libraries: sklearn, pandas, seaborn and matplotlib.
References
[2] Stack Overflow