Building a Machine Learning Pipeline
Siddharth Singh
Experienced Senior Software Engineer | 6 Years in System Design and Backend Services | Formerly at Meta and Oracle | Proven Leader in Project Design and Execution
In this article, we will learn what supervised learning is and how to use it for regression tasks.
Supervised Learning
In machine learning and artificial intelligence, supervised learning refers to a class of systems and algorithms that determine a predictive model using data points with known outcomes.
Regression models are used to predict a continuous value. Predicting the price of a house given its features, such as size and location, is one of the most common examples of regression.
Basic steps in any machine learning project:
# Importing all the modules that will be used in this notebook
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import svm
1. Getting the data
Real-world data is filled with errors and missing values, and cleaning it up before you can use it to learn machine learning can be hard and time-consuming. Fortunately, thousands of datasets are already available from various data sources that you can download and use.
For this article, we will be using the California housing data. We will visualize it, clean it, train different models on it, and see how to select the right model.
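If you do not have the CSV locally yet, a minimal download sketch could look like this (the URL is a placeholder; point it at wherever your copy of the dataset is hosted):
import os
import urllib.request

DATA_URL = "https://example.com/datasets/housing.csv"  # placeholder URL, replace with your data source
DATA_DIR = "datasets"

# create the datasets folder and download the file only if it is not already there
os.makedirs(DATA_DIR, exist_ok=True)
csv_path = os.path.join(DATA_DIR, "housing.csv")
if not os.path.exists(csv_path):
    urllib.request.urlretrieve(DATA_URL, csv_path)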
2. Discover and visualize data to gain insights
Before we begin with any ML project, it is important to visualize the data. Visualizing the data can often give you insights into the data that were previously not known to you.
housing_df = pd.read_csv(os.path.join("datasets","housing.csv"))
housing_df.head()
housing_df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                s=housing_df["population"]/100, label="population",
                figsize=(10,7), c="median_house_value",
                cmap=plt.get_cmap("jet"), colorbar=True)
From this map, we can see that the houses away from the ocean are cheaper than ones closer to it. So the column "ocean_proximity" might have some effect on the house price.
To see which attributes are correlated with the house value, we can look at the correlation matrix. We see that median_income has the highest correlation with the house value.
# numeric_only=True avoids an error from the text column in newer pandas versions
correlation_matrix = housing_df.corr(numeric_only=True)
correlation_matrix["median_house_value"].sort_values(ascending=False)
# We can plot a scatter matrix to see which attributes affect the others
attributes = ["median_house_value","median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing_df[attributes], figsize=(12,8))
3. Prepare the data for machine learning algorithms
3.1 Splitting data into test/train set
Before we begin training our machine learning model, it is important that we divide our data into train and test sets. We don't touch test data until we have finalized our model. Then in the end we use the test set to see how accurately our model will behave in production.
Stratified Split: We need to make sure that the test set is a representative sample of the whole dataset.
For example, suppose income is a very important attribute for deciding the house price and there are 5 income levels. Using a stratified split we can ensure that if the train set has [10, 20, 30, 20, 20]% of its rows in the 5 income levels, then the test set has the same ratio of income levels.
housing_df["income_cat"] = pd.cut(housing_df["median_income"],
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bins=[0.0,1.5,3.0, 4.5,6, np.inf],
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?labels=[1,2,3,4,5])
# create a stratified test/train split
# use the same random_state to get the same split of test/train each time you run this code
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=2)
for train_index, test_index in split.split(housing_df, housing_df["income_cat"]):
    strat_train_set = housing_df.loc[train_index]
    strat_test_set = housing_df.loc[test_index]

# We only needed income_cat to split the data, so we can drop it now
strat_train_set.drop("income_cat", axis=1, inplace=True)
strat_test_set.drop("income_cat", axis=1, inplace=True)
strat_train_set.head()
For training supervised machine learning models, we need to provide both the data and the "label" (the value we want to predict, here the median house value).
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
3.2 Data cleaning
The data we provide to the machine learning algorithm should not contain any missing values.
How to deal with missing data?
One way is to remove all the rows that have missing values. Though this is the easiest way, it might not always be possible to remove those rows; sometimes more than half of the rows will have a missing value in one of the columns (attributes).
The other way is to fill these missing values with something. For example, if the house age is missing, we can fill it in with the mean age of all the houses.
Check the data types and non-null count of each attribute. If any attribute has the type "object", check its values; it might have NA or missing values.
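A quick way to check this with pandas:
# info() prints the dtype and non-null count of every column;
# "object" columns like ocean_proximity hold text categories
housing.info()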
Transformers
There are many ways to get the mean/median and set it in the missing places. One of the ways is to use SimpleImputer, which comes with sklearn.
SimpleImputer is a transformer, which transforms the data in the way you want. Using this is especially helpful when we create data pipelines to apply multiple transformations to the data.
# Checking how many attributes have null values
housing_temp = housing.copy()
housing_temp.isnull().sum(axis=0)
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 156
population 0
households 0
median_income 0
ocean_proximity 0
A transformer that fills in median values can only be applied to columns/attributes that have numerical values, so we need to give it a list of numerical attributes. If the missing values are represented in some other format, you can change the arguments used to create the SimpleImputer object (for example its missing_values parameter).
housing_num = housing.drop("ocean_proximity", axis=1)
num_attrs = list(housing_num)
s_imputer = SimpleImputer(strategy="median")
housing_temp[num_attrs] = s_imputer.fit_transform(housing_temp[num_attrs])
Now, if you check again, all the missing values have been filled. Notice we only made the change on a copy of the original data; the original data still has missing values. To clean the data it is better to build data pipelines: a pipeline applies all the transformations like the one above, and you can reuse the same pipeline to clean/transform new data.
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
ocean_proximity 0
dtype: int64
Converting categorical data to numerical
Machine learning models need their input in numerical form, so if we provide categorical data like ocean_proximity directly, it will not work. To fix this, we can convert it into numerical values. There are 5 different values for ocean_proximity, so we can give each category a number, for example inland = 0, near ocean = 1, and so on.
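A minimal sketch of this integer encoding, using Scikit-Learn's OrdinalEncoder on the training features (this is just for illustration; it is not part of the final pipeline):
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)  # each category becomes an integer
ordinal_encoder.categories_  # the mapping from integer to category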
OneHotEncoder
The problem with this approach is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be true for ordered categories like good, average, and bad, but it is not necessarily true for ocean proximity. To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "near ocean" and 0 otherwise, and so on for the other categories. Since we have 5 different categories, we will get 5 attributes when we encode it this way.
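For example, a minimal sketch using the OneHotEncoder already imported above (for illustration; in the final pipeline the encoder is applied through the ColumnTransformer):
# OneHotEncoder produces one binary column per category (a sparse matrix by default)
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])
cat_encoder.categories_          # the learned list of categories
housing_cat_1hot.toarray()[:3]   # first three rows as a dense array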
Creating custom transformers
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. These transformers need to work with pipelines, which you can achieve by implementing fit() and transform() in your class (adding TransformerMixin as a base class gives you fit_transform() for free). If you want to give some parameters to this class and later tune them, you should also add BaseEstimator as a base class, which provides the get_params() and set_params() methods used in hyperparameter tuning.
# Creating custom transformers
class AttributeAdder(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # nothing to learn, so just return self
        return self

    def transform(self, X):
        # column indices in the numerical part of the data
        total_rooms_ix = 3     # X.columns.get_loc("total_rooms")
        total_bedrooms_ix = 4  # X.columns.get_loc("total_bedrooms")
        population_ix = 5      # X.columns.get_loc("population")
        households_ix = 6      # X.columns.get_loc("households")
        rooms_per_household = X[:, total_rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        bedrooms_per_room = X[:, total_bedrooms_ix] / X[:, total_rooms_ix]
        # append the three new attributes as extra columns
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
3.3 Creating a data processing pipeline
Now we will create a pipeline that will do all the data transformations we want on our data. You can perform all these operations manually one by one too but having a pipeline makes it a lot easier to manage the data transformations. Also once the pipeline is ready you can use it to transform the test data when we want to test the accuracy of our system.
Column Transformer
By default, a transformer applies its transformation to every column it is given. If we want to apply some transformations only to certain columns, we can use the ColumnTransformer.
3.4 Feature Scaling
Machine learning algorithms don't perform well when the input numerical attributes have very different scales. For example, the total number of rooms ranges from 6 to 39320, while income ranges from 0 to 15.
There are two common ways of scaling: min-max scaling (normalization), which shifts and rescales values so that they range from 0 to 1, and standardization, which subtracts the mean and divides by the standard deviation (this is what StandardScaler does).
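For illustration, here is a minimal sketch of both approaches applied to single columns (in the pipeline below we use StandardScaler):
from sklearn.preprocessing import MinMaxScaler

# min-max scaling squeezes values into the [0, 1] range
rooms_minmax = MinMaxScaler().fit_transform(housing[["total_rooms"]])
# standardization gives zero mean and unit variance
income_std = StandardScaler().fit_transform(housing[["median_income"]])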
Now we will create the data pipeline where we apply all the transformations to clean and modify the data.
housing_num = housing.drop("ocean_proximity", axis=1)
# extract all the numerical attributes
num_attrs = list(housing_num)
# categorical attributes
cat_attrs = ["ocean_proximity"]
# This first pipeline applies three transformers to the numerical columns:
# 1. SimpleImputer(strategy="median") -> replace missing values with the median
# 2. AttributeAdder()                 -> our custom transformer from above
# 3. StandardScaler()                 -> feature scaling
data_preprocessing_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attr_adder', AttributeAdder()),
    ('std_scaler', StandardScaler())
])

# This ColumnTransformer applies the numerical pipeline above to the numerical columns
# and a OneHotEncoder to the categorical column.
# If there are columns you don't want to modify, set remainder="passthrough" when creating it.
full_pipeline = ColumnTransformer([
    ('num_data_pipe', data_preprocessing_pipe, num_attrs),
    ('cat_attrs', OneHotEncoder(), cat_attrs)
])
processed_data = full_pipeline.fit_transform(housing)
processed_data.shape
4. Select and train a model
Training a model and testing its accuracy
For now, we will use linear regression, train it on our data, and test its accuracy on the train set. We should not test accuracy on the test set until we have finalized our model. If we test different models on our test set and pick the best one, it might not perform that well with unseen data, because the model we selected is biased to give better results on the test set. For this reason we use cross-validation, which we will see soon.
lin_reg = LinearRegression()
lin_reg.fit(processed_data, housing_labels)
# Calculate the train set error using the root mean squared error (RMSE)
np.sqrt(mean_squared_error(lin_reg.predict(processed_data), housing_labels))
68265.90542805477
So our model's predictions are off by roughly ±$68,265. To get a more reliable estimate of this error, we can use a cross-validation score. Now let's see if we can reduce this error.
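As a minimal sketch, here is how cross-validation would look with the same linear model (scikit-learn reports negated MSE for this scoring, so we flip the sign before taking the square root):
# 10-fold cross-validation on the training data
lin_scores = cross_val_score(lin_reg, processed_data, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print(lin_rmse_scores.mean(), lin_rmse_scores.std())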
Model Selection
Which machine learning algorithm will give the best result will depend on the dataset you have. So if you are not sure that one particular algorithm will give the best result on your data, you can try multiple algorithms.
The accuracy will also depend on the hyperparameters that you use with the machine learning algorithms, so we would also need to tune those hyperparameters to get the best values. This technique is called hyperparameter tuning.
To search for the best hyperparameters, sklearn comes with the classes GridSearchCV and RandomizedSearchCV.
# Here we will search for the best parameters for the random forest algorithm.
# If this takes a long time to compute, either reduce the number of values for each hyperparameter or use randomized search (see below).
# Hyperparameters for RandomForestRegressor:
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]
rf_reg = RandomForestRegressor()
grid_search = GridSearchCV(rf_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(processed_data, housing_labels)
# Now let's see which parameters got the best result
print("best parameters are: {}\n\n".format(grid_search.best_params_))
# to see the scores for all combinations of parameters we can use:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
So we improved our score from almost 70,000 to 50,000. To see if we can do better, we can try different algorithms and try a grid search on each algorithm.
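If the grid search above takes too long, RandomizedSearchCV samples a fixed number of hyperparameter combinations instead of trying every one; a minimal sketch (the distributions and n_iter value are just illustrative choices):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# sample hyperparameter values from these ranges instead of an exhaustive grid
param_dist = {'n_estimators': randint(3, 50), 'max_features': randint(2, 8)}
rnd_search = RandomizedSearchCV(RandomForestRegressor(), param_dist, n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(processed_data, housing_labels)
print(rnd_search.best_params_)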
Auto Model Selection
There is no particular ML model that will give the best result on all different kinds of problems. So to find out which model works best for you, you can try all of the ones you think might work and see which one gives the best result.
You can list all the models you want to try in a dictionary/JSON file, with all the hyperparameters that you want to try with each, and just loop over each model to see which one gives you the best result.
As these models don't depend on each other you can execute them in parallel as well.
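One simple way to do that is the n_jobs parameter accepted by GridSearchCV and cross_val_score; a minimal sketch reusing the param_grid defined earlier (n_jobs=-1 uses all available CPU cores):
# run the cross-validation fits of a single grid search in parallel
parallel_grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5,
                                    scoring='neg_mean_squared_error', n_jobs=-1)
parallel_grid_search.fit(processed_data, housing_labels)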
# Auto Model Selection
all_models = {
    'svm': {
        'model': svm.SVR(),  # SVR (not SVC), since this is a regression task
        'params': {
            'C': [1, 10, 20],
            # 'precomputed' is omitted: it requires a precomputed kernel matrix
            'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'degree': [2, 3, 4],
            'gamma': ['scale', 'auto']
        }
    },
    'Random_forest': {
        'model': RandomForestRegressor(),
        'params': [
            {'n_estimators': [10, 30, 50], 'max_features': [2, 4, 6, 8]},
            {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
        ]
    }
}
def display_scores(scores):
    print("Scores: {}".format(scores))
    print("mean: {}".format(scores.mean()))
    print("Standard_deviation: {}".format(scores.std()))

for model in all_models:
    # grid search for the best hyperparameters of this model
    model_grid_search = GridSearchCV(all_models[model]['model'], all_models[model]['params'],
                                     cv=5, scoring='neg_mean_squared_error')
    model_grid_search.fit(processed_data, housing_labels)
    # cross-validate the best estimator found for this model
    scores = cross_val_score(model_grid_search.best_estimator_, processed_data, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    print("\n{} grid search best Scores:".format(model))
    display_scores(np.sqrt(-scores))
As we can see, the average cross-validation error is even less than the error we got earlier by testing directly on the train data, indicating that the models are not overfitting.
There are a lot of machine learning algorithms that you can try instead of just these two and you might get a better result using one of them.
Evaluating the model on the test set
Now that we have finalized our model, we can test it on our test set. This will give us an idea of how the model will perform in the real world with unseen data. Remember that unseen data and test data still need to be preprocessed in the same way (impute missing values, encode categories, scale the inputs, etc.), which is what the full_pipeline.transform() call below does.
final_model = grid_search.best_estimator_
housing_test_unprocessed = strat_test_set.drop(["median_house_value"], axis=1)
housing_test_labels = strat_test_set["median_house_value"].copy()
housing_test_prepared = full_pipeline.transform(housing_test_unprocessed)
housing_test_preds = final_model.predict(housing_test_prepared)
mse = mean_squared_error(housing_test_labels, housing_test_preds)
print("Final score Random Forest: {}".format(np.sqrt(mse)))
Final score Random Forest: 47722.5592620724
Overfitting
So we got a final root mean squared error of about 48,000, a lot better than the 70,000 we had got with linear regression. The error on the test set is less than the train and cross-validation errors, indicating the model is not overfitting. If you see that the test error is a lot worse than the train error, it means the model is overfitting.
To reduce overfitting, we can use regularization techniques like Ridge regression, Lasso regression, and Elastic Net.
So when should you use plain Linear Regression, Ridge, Lasso, or Elastic Net?
It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are useful, you should prefer Lasso or Elastic Net because they tend to reduce the useless features' weights to zero.
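These regularized models are drop-in replacements for LinearRegression; a minimal sketch (the alpha and l1_ratio values are just illustrative):
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge_reg = Ridge(alpha=1.0)
lasso_reg = Lasso(alpha=0.1)
elastic_reg = ElasticNet(alpha=0.1, l1_ratio=0.5)

# fit each regularized model and report its train-set RMSE
for reg in (ridge_reg, lasso_reg, elastic_reg):
    reg.fit(processed_data, housing_labels)
    rmse = np.sqrt(mean_squared_error(housing_labels, reg.predict(processed_data)))
    print(type(reg).__name__, rmse)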