Machine Learning 101 - Part 1: A Tutorial on Data Preprocessing
Machine learning is the future of the tech industry. It plays a big role in virtually every aspect of our daily lives, from facial recognition and movie recommendations to driving route optimization and cyber attack detection, to name a few. At the same time, while there is a huge demand for machine learning experts, the supply is lagging far behind. Having some machine learning knowledge is a huge bonus for your career. The goal of this Machine Learning 101 series is to introduce you to basic machine learning algorithms and show you how to implement them properly. Some well-known machine learning algorithms include linear regression, polynomial regression, support vector regression, decision tree regression, random forest regression, and so on. We will try to cover most of these algorithms in the Machine Learning 101 series. To start, let's turn our attention to data preprocessing.
Installing Python and Anaconda
Before we dive into the machine learning world, we need to be well equipped with some essential tools. Throughout this Machine Learning 101 series, I am going to use Python as the main programming language. To make the best use of Python, I recommend downloading Anaconda. Anaconda is the fastest and easiest way to do Python and R data science and machine learning on Linux, Windows, and Mac OS X, and it is an industry standard for developing, testing, and training on a single machine. Most of the Python machine learning packages come preinstalled with Anaconda; you just need to import them as needed. After Anaconda is installed, launch the Anaconda Navigator application. From the landing page of Anaconda Navigator, launch the Spyder application. Spyder is a versatile Python IDE, and we will be using it throughout the tutorial.
In the top right corner of Spyder, you can choose which directory to use as your working directory. In the top right pane, you can switch to the File explorer tab to see which files are in the current working directory, or to the Variable explorer tab to inspect the variables that have been created.
Importing Libraries
There are a few essential libraries that we need to import for almost every machine learning algorithm.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
numpy is the fundamental package for scientific computing with Python. matplotlib.pyplot is mainly used to plot and visualize results. pandas is a Python library for data manipulation and analysis.
Getting the Dataset
The dataset may come in different formats, but for this tutorial we are only working with csv datasets. We use the pandas library to import the dataset. Here is a screenshot of Data.csv:
A washer manufacturer would like to find out which families are likely to buy its latest washer. It collected data from thousands of families. The independent variables include the state where each family resides, the family size, and the annual income. The dependent variable is whether the surveyed family purchased the washer or not. The dataset shown above is a sample of the data.
dataset = pd.read_csv("Data.csv")  # import Data.csv into the dataset variable
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
The following is a screenshot of the imported dataset:
X is the matrix of independent variables, while y is the dependent variable vector. In the X = dataset.iloc[:, :-1].values command, the first : inside the brackets means taking all the rows, i.e., all the rows from index 0 to index 9. It is worth noting that all rows and columns of the dataset are zero-indexed. The :-1 after the comma means taking all the columns except the last one. This makes sense because the last column holds the dependent variable. In the y = dataset.iloc[:, 3].values command, we take all the rows of column 3 of the dataset as our dependent variable (ignore the first index column, it is not part of the dataset).
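As a small aside (a variation, not part of the original script), the dependent variable column can also be selected with a negative index, which avoids hard-coding the column number:
y = dataset.iloc[:, -1].values  # the last column of the dataset, without hard-coding its index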
Dealing With Missing Data
Sometimes the dataset contains missing data. For example, in the sample data shown above, row 6 is missing the family size and row 4 is missing the annual income. How do we deal with missing data? One way is to remove the whole observation, but this is often dangerous because the removed observation may contain crucial information; in general, removing the whole observation is not recommended. The most common solution is to compute the mean of the whole column and replace the missing data with that mean. Let's use the mean strategy to replace the missing data:
# Dealing with missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
First of all, we import the Imputer object from the sklearn.preprocessing library. Imputer is the object that deals with missing data. Then we create an Imputer instance and specify the fitting strategy. In our example, missing values are identified by "NaN". There are three strategies for replacing missing data: "mean" replaces missing values with the mean, "median" replaces them with the median, and "most_frequent" replaces them with the most frequent value. In our example, we choose the "mean" strategy. If axis=0, the imputer imputes along columns; if axis=1, it imputes along rows. In our example, it makes sense to impute along columns.
The imputer = imputer.fit(X[:, 1:3]) command fits the imputer to X[:, 1:3]. We chose 1:3 (excluding column 3) because we only need to replace missing data in column 1 and column 2.
The X[:, 1:3] = imputer.transform(X[:, 1:3]) command transforms the old X[:, 1:3], replacing the missing data with the column means, and returns the new X[:, 1:3].
If you look at the value of X, you will notice the previous NaN values have been replaced by the column mean. Well done! We are one step closer to the final goal.
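A side note for readers on newer versions of scikit-learn: the Imputer class has since been deprecated and removed in favor of SimpleImputer, which lives in sklearn.impute and always imputes along columns (there is no axis parameter). A minimal equivalent sketch, assuming a recent scikit-learn installation:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # replace NaN with the column mean
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])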
Dealing With Categorical Data
We have successfully replaced the missing data with column means, but we still can't run machine learning algorithms on our dataset. One reason is that the first column contains string values (i.e., categorical data), while machine learning algorithms require numerical values. Let's walk through the code snippet step by step to see how we transform the categorical data into numerical data.
# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
First of all, we import LabelEncoder from the sklearn.preprocessing library. LabelEncoder simply transforms each category into a different numerical label. Then we create an instance named labelencoder_X of the LabelEncoder class. We use the fit_transform method of the labelencoder_X instance to transform column 0 of the matrix X and return the new X[:, 0]. The following is the value of X after being transformed by LabelEncoder:
We notice that 'New York' is labelled 2, 'California' is labelled 0, and 'New Jersey' is labelled 1. But do you see the potential issue here? Since 2 is greater than 1 and 1 is greater than 0, the equations of the machine learning model will treat 'New York' as having a higher value than 'New Jersey', and 'New Jersey' as having a higher value than 'California', which is not the case. These are three categories with no ordinal relationship between them. To solve this issue, we need to use so-called dummy variables. Instead of one column, we need three columns, equal to the number of categories. Each column represents one state, the value of each cell is either 1 or 0, and exactly one cell in each row has the value 1. The configuration is shown below:
We need to import the OneHotEncoder object from the sklearn.preprocessing library to create the dummy variables. The onehotencoder = OneHotEncoder(categorical_features=[0]) command creates an instance of the OneHotEncoder object and specifies that we need to create dummy variables for column 0. The X = onehotencoder.fit_transform(X).toarray() command transforms the old X into a new X with the dummy variables. The following screenshot shows the value of the new X with dummy variables:
As expected, the single categorical column has been transformed into three columns of dummy variables. The next task is to encode the dependent variable y. Since the dependent variable takes only two values and its label ordering doesn't matter, we only need the LabelEncoder to encode it. The newly encoded y values are shown as follows:
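A side note for newer versions of scikit-learn: the categorical_features parameter has been removed from OneHotEncoder, and the recommended way to one-hot encode a single column is to wrap the encoder in a ColumnTransformer. A minimal sketch, assuming a recent scikit-learn installation:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([("state", OneHotEncoder(), [0])], remainder="passthrough")  # dummy-encode column 0, keep the rest
X = ct.fit_transform(X)  # call .toarray() on the result if a sparse matrix is returned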
By now, we can already see the light at the end of the tunnel.
Splitting the Dataset into the Training Set and Test Set
The goal of machine learning is to use the knowledge gained from training data to predict results for unseen data. So naturally we would like to split the dataset into a training set and a test set. To split the dataset, we need to import the train_test_split function from the sklearn.model_selection library.
# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2, random_state = 0)
In the parameter list of the train_test_split function, test_size specifies the fraction of the dataset used as the test set. Usually we choose a small ratio of the test size to the total size; 0.2 is a common choice, and rarely will we choose a ratio larger than 0.4. The random_state parameter is the seed used by the random number generator, and it is optional. If you don't specify random_state, your results may differ from run to run. Just for the purpose of this tutorial, if you want your results to be exactly the same on every run, you can set random_state to a fixed integer; I chose 0 here. Awesome, we are just one step away from the final goal.
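As a quick sanity check, you can print the shapes of the returned arrays (the numbers in the comments assume the 10-row sample dataset used in this tutorial):
print(X_train.shape, X_test.shape)  # with 10 rows and test_size=0.2: 8 training rows, 2 test rows
print(y_train.shape, y_test.shape)  # 8 training labels, 2 test labels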
Feature Scaling
Finally, let's review the independent variables matrix X after so many transformations:
If we look at the family size and annual income variables, we notice that these two variables are not on the same scale: family size ranges from 2 to 6, while annual income ranges from 48500 to 115000. This causes problems because many machine learning models are based on Euclidean distances. In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance d from p to q (or from q to p) is given by the Pythagorean formula:
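d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}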
If we think of family size and annual income as two coordinates of the observations, then the Euclidean distance between two observations will be dominated by the annual income, simply because the values of annual income are much larger than those of the family size. To the machine learning model, it will look as if the family size variable doesn't exist. So it is absolutely necessary to put the variables on the same scale, i.e., to make their values fall in the same range. There are a couple of common feature scaling methods. One common method is called min-max normalization, shown below:
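x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Here x is the original feature value, x' is the scaled value, and min(x) and max(x) are the minimum and maximum of the feature.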
Another common method is called Mean normalization as shown below:
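x' = \frac{x - \mathrm{mean}(x)}{\max(x) - \min(x)}
where mean(x) is the average value of the feature x.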
There is another common method called Standardization as shown below:
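x' = \frac{x - \mathrm{mean}(x)}{\sigma}
where \sigma is the standard deviation of the feature x.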
The feature scaling functions in machine learning libraries usually use one of the above methods.
In this tutorial, we will use the StandardScaler object from the sklearn.preprocessing library to achieve feature scaling.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
After importing the StandardScaler object, we create an instance of it and apply the fit_transform method to the X_train set. Since the sc_X instance has already been fitted on X_train, we don't need to fit it on X_test again; we just call the transform method of sc_X on X_test, which guarantees that X_test is scaled in the same fashion as X_train. Here is what X_train looks like after feature scaling:
Regarding feature scaling, there remain a few questions.
One question is whether we need to scale the dummy variables. If you google this question, there doesn't seem to be a consensus. My opinion is that it depends on the context. In most cases your dummy variables, which only take the values 0 and 1, won't break your machine learning models even if you leave them unscaled, and leaving them unscaled keeps their original interpretation in your model. On the other hand, if you scale the dummy variables, then everything is on the same scale and you may get better prediction results. In this tutorial, we chose to scale the dummy variables.
Another question is whether we should scale the dependent variable. For our example, the answer is no, because our dependent variable is a classification variable that takes only two values. But in other problems where the dependent variable takes a large range of values, we need to apply feature scaling to the dependent variable y as well.
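For illustration, if y were a continuous target, a separate scaler could be fitted on it. A minimal sketch with hypothetical variable names (StandardScaler expects a 2-D array, hence the reshape):
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()  # fit on the training target only
y_test = sc_y.transform(y_test.reshape(-1, 1)).ravel()  # reuse the same scaling for the test target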
Before we wrap up the feature scaling section, there are a few notes worth pointing out. One note is that even when a machine learning model is not based on Euclidean distances, we often still do feature scaling because iterative, gradient-based algorithms converge much faster on scaled features; purely tree-based models such as decision tree regression are an exception and are not sensitive to feature scaling. Another note is that although feature scaling is recommended, most of the time we don't need to do it explicitly because many library implementations handle feature scaling internally.
Conclusion
Congratulations if you made it this far! Let's review what we have learned. First, we learned how to use the pandas library to read csv data. Second, we learned how to use the Imputer object to fill in missing data. Third, we learned how to use the LabelEncoder and OneHotEncoder objects to transform categorical data into numerical data. Then we learned how to split the dataset into a training set and a test set. Last but not least, we learned how to use the StandardScaler object to put the variables on the same scale. To sum up what we have learned, here is a Python template for data preprocessing:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 26 02:54:58 2018
@author: Hua Wang
"""
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing csv data
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Dealing with missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Data preprocessing is a critical step in our machine learning journey, and I am glad that we got this part covered. The best way to learn is to follow the example and see the results yourself. In part 2 of Machine Learning 101, I am going to introduce the simple linear regression algorithm. Stay tuned.
References
You can download the Python script and dataset for this tutorial from my Github repo: Data Preprocessing.