Building a Machine Learning Pipeline
Siddharth Singh
Experienced Senior Software Engineer | 6 Years in System Design and Backend Services | Formerly at Meta and Oracle | Proven Leader in Project Design and Execution
In this article, we will learn what supervised learning is and how to use it for regression tasks.
Supervised Learning
In machine learning and artificial intelligence, supervised learning refers to a class of systems and algorithms that determine a predictive model using data points with known outcomes.
Regression models are used to predict a continuous value. Predicting the price of a house given its features, such as size and location, is one of the most common examples of regression.
Basic steps in any machine learning project:
# Importing all the modules that will be used in this notebook
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import svm
1. Getting the data
Real-world data is filled with errors and missing values, and cleaning it up before you can use it to learn machine learning can be hard and time-consuming. Fortunately, thousands of datasets are already available from various data sources that you can download and use.
For this article, we will be using the California housing data. We will visualize it, clean it, train different models on it, and see how to select the right model.
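If you do not have the CSV locally yet, a minimal download sketch could look like this (the URL is a placeholder; point it at wherever your copy of the dataset is hosted):
import os
import urllib.request

DATA_URL = "https://example.com/datasets/housing.csv"  # placeholder URL, replace with your data source
DATA_DIR = "datasets"

# create the datasets folder and download the file only if it is not already there
os.makedirs(DATA_DIR, exist_ok=True)
csv_path = os.path.join(DATA_DIR, "housing.csv")
if not os.path.exists(csv_path):
    urllib.request.urlretrieve(DATA_URL, csv_path)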
2. Discover and visualize data to gain insights
Before we begin with any ML project, it is important to visualize the data. Visualizing the data can often give you insights into the data that were previously not known to you.
housing_df = pd.read_csv(os.path.join("datasets","housing.csv"))
housing_df.head()
housing_df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                s=housing_df["population"]/100, label="population",
                figsize=(10,7), c="median_house_value",
                cmap=plt.get_cmap("jet"), colorbar=True)
From this map, we can see that the houses away from the ocean are cheaper than ones closer to it. So the column "ocean_proximity" might have some effect on the house price.
To see which attributes are correlated with the house value, we can look at the correlation matrix. We see that median_income has the highest correlation with the house value.
# numeric_only=True avoids an error from the text column in newer pandas versions
correlation_matrix = housing_df.corr(numeric_only=True)
correlation_matrix["median_house_value"].sort_values(ascending=False)
# We can plot a scatter matrix to see which attributes affect the others
attributes = ["median_house_value","median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing_df[attributes], figsize=(12,8))
3. Prepare the data for machine learning algorithms
3.1 Splitting data into test/train set
Before we begin training our machine learning model, it is important that we divide our data into train and test sets. We don't touch test data until we have finalized our model. Then in the end we use the test set to see how accurately our model will behave in production.
Stratified Split: We need to make sure that the test set is a representative sample of the whole dataset.
For example, suppose income is a very important attribute for deciding the house price and there are 5 income levels. Using a stratified split we can ensure that if the train set has [10, 20, 30, 20, 20]% of its rows in the 5 income levels, then the test set has the same ratio of income levels.
housing_df["income_cat"] = pd.cut(housing_df["median_income"],
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?bins=[0.0,1.5,3.0, 4.5,6, np.inf],
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?labels=[1,2,3,4,5])
# create a stratified test/train split
# use the same random_state to get the same split of test/train each time you run this code
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=2)
for train_index, test_index in split.split(housing_df, housing_df["income_cat"]):
    strat_train_set = housing_df.loc[train_index]
    strat_test_set = housing_df.loc[test_index]

# We only needed income_cat to split the data, so we can drop it now
strat_train_set.drop("income_cat", axis=1, inplace=True)
strat_test_set.drop("income_cat", axis=1, inplace=True)
strat_train_set.head()
For training supervised machine learning models, we need to provide both the data and the "label" (the value we want to predict, here the median house value).
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
3.2 Data cleaning
The data we provide to the machine learning algorithm should not contain any missing values.
How to deal with missing data?
One way is to remove all the rows that have missing values. Though this is the easiest way, it might not always be possible to remove those rows; sometimes more than half of the rows will have a missing value in one of the columns (attributes).
The other way is to fill these missing values with something. For example, if the house age is missing, we can fill it in with the mean age of all the houses.
Check the data types and non-null count of each attribute. If any attribute has the type "object", check its values; it might have NA or missing values.
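A quick way to check this with pandas:
# info() prints the dtype and non-null count of every column;
# "object" columns like ocean_proximity hold text categories
housing.info()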
Transformers
There are many ways to get the mean/median and set it in the missing places. One of the ways is to use SimpleImputer, which comes with sklearn.
SimpleImputer is a transformer, which transforms the data in the way you want. Using this is especially helpful when we create data pipelines to apply multiple transformations to the data.
# Checking how many attributes have null values
housing_temp = housing.copy()
housing_temp.isnull().sum(axis=0)
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 156
population 0
households 0
median_income 0
ocean_proximity 0
A transformer that fills in median values can only be applied to columns/attributes that have numerical values, so we need to give it a list of numerical attributes. If the missing values are represented in some other format, you can change the arguments used to create the SimpleImputer object (for example its missing_values parameter).
housing_num = housing.drop("ocean_proximity", axis=1)
num_attrs = list(housing_num)
s_imputer = SimpleImputer(strategy="median")
housing_temp[num_attrs] = s_imputer.fit_transform(housing_temp[num_attrs])
Now, if you check again, all the missing values have been filled. Notice we only made the change on a copy of the original data; the original data still has missing values. To clean the data it is better to build data pipelines: a pipeline applies all the transformations like the one above, and you can reuse the same pipeline to clean/transform new data.
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
ocean_proximity 0
dtype: int64
Converting categorical data to numerical
Machine learning models need their input in numerical form, so if we provide categorical data like ocean_proximity directly, it will not work. To fix this, we can convert it into numerical values. There are 5 different values for ocean_proximity, so we can give each category a number, for example inland = 0, near ocean = 1, and so on.
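A minimal sketch of this integer encoding, using Scikit-Learn's OrdinalEncoder on the training features (this is just for illustration; it is not part of the final pipeline):
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)  # each category becomes an integer
ordinal_encoder.categories_  # the mapping from integer to category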
OneHotEncoder
The problem with this approach is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be true for ordered categories like good, average, and bad, but it is not necessarily true for ocean proximity. To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "near ocean" and 0 otherwise, and so on for the other categories. Since we have 5 different categories, we will get 5 attributes when we encode it this way.
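For example, a minimal sketch using the OneHotEncoder already imported above (for illustration; in the final pipeline the encoder is applied through the ColumnTransformer):
# OneHotEncoder produces one binary column per category (a sparse matrix by default)
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])
cat_encoder.categories_          # the learned list of categories
housing_cat_1hot.toarray()[:3]   # first three rows as a dense array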
Creating custom transformers
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. These transformers need to work with pipelines, which you can achieve by implementing fit() and transform() in your class (adding TransformerMixin as a base class gives you fit_transform() for free). If you want to give some parameters to this class and later tune them, you should also add BaseEstimator as a base class, which provides the get_params() and set_params() methods used in hyperparameter tuning.
# Creating custom transformers
class AttributeAdder(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # nothing to learn, so just return self
        return self

    def transform(self, X):
        # column indices in the numerical part of the data
        total_rooms_ix = 3     # X.columns.get_loc("total_rooms")
        total_bedrooms_ix = 4  # X.columns.get_loc("total_bedrooms")
        population_ix = 5      # X.columns.get_loc("population")
        households_ix = 6      # X.columns.get_loc("households")
        rooms_per_household = X[:, total_rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        bedrooms_per_room = X[:, total_bedrooms_ix] / X[:, total_rooms_ix]
        # append the three new attributes as extra columns
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
3.3 Creating a data processing pipeline
Now we will create a pipeline that will do all the data transformations we want on our data. You can perform all these operations manually one by one too but having a pipeline makes it a lot easier to manage the data transformations. Also once the pipeline is ready you can use it to transform the test data when we want to test the accuracy of our system.
Column Transformer
By default, a transformer applies its transformation to every column it is given. If we want to apply some transformations only to certain columns, we can use the ColumnTransformer.
3.4 Feature Scaling
Machine learning algorithms don't perform well when the input numerical attributes have very different scales. For example, the total number of rooms ranges from 6 to 39320, while income ranges from 0 to 15.
There are two common ways of scaling: min-max scaling (normalization), which shifts and rescales values so that they range from 0 to 1, and standardization, which subtracts the mean and divides by the standard deviation (this is what StandardScaler does).
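For illustration, here is a minimal sketch of both approaches applied to single columns (in the pipeline below we use StandardScaler):
from sklearn.preprocessing import MinMaxScaler

# min-max scaling squeezes values into the [0, 1] range
rooms_minmax = MinMaxScaler().fit_transform(housing[["total_rooms"]])
# standardization gives zero mean and unit variance
income_std = StandardScaler().fit_transform(housing[["median_income"]])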
Now we will create the data pipeline where we apply all the transformations to clean and modify the data.
housing_num = housing.drop("ocean_proximity", axis=1)
# extract all the numerical attributes
num_attrs = list(housing_num)
# categorical attributes
cat_attrs = ["ocean_proximity"]
# This first pipeline applies three transformers to the numerical columns:
# 1. SimpleImputer(strategy="median") -> replace missing values with the median
# 2. AttributeAdder()                 -> our custom transformer from above
# 3. StandardScaler()                 -> feature scaling
data_preprocessing_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attr_adder', AttributeAdder()),
    ('std_scaler', StandardScaler())
])

# This ColumnTransformer applies the numerical pipeline above to the numerical columns
# and a OneHotEncoder to the categorical column.
# If there are columns you don't want to modify, set remainder="passthrough" when creating it.
full_pipeline = ColumnTransformer([
    ('num_data_pipe', data_preprocessing_pipe, num_attrs),
    ('cat_attrs', OneHotEncoder(), cat_attrs)
])
processed_data = full_pipeline.fit_transform(housing)
processed_data.shape
4. Select and train a model
Training a model and testing its accuracy
For now, we will use linear regression, train it on our data, and test its accuracy on the train set. We should not test accuracy on the test set until we have finalized our model. If we test different models on our test set and pick the best one, it might not perform that well with unseen data, because the model we selected is biased to give better results on the test set. For this reason we use cross-validation, which we will see soon.
lin_reg = LinearRegression()
lin_reg.fit(processed_data, housing_labels)
# Calculate the train set error using the root mean squared error (RMSE)
np.sqrt(mean_squared_error(lin_reg.predict(processed_data), housing_labels))
68265.90542805477
So our model's predictions are off by roughly ±$68,265. To get a more reliable estimate of this error, we can use a cross-validation score. Now let's see if we can reduce this error.
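As a minimal sketch, here is how cross-validation would look with the same linear model (scikit-learn reports negated MSE for this scoring, so we flip the sign before taking the square root):
# 10-fold cross-validation on the training data
lin_scores = cross_val_score(lin_reg, processed_data, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print(lin_rmse_scores.mean(), lin_rmse_scores.std())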
Model Selection
Which machine learning algorithm will give the best result will depend on the dataset you have. So if you are not sure that one particular algorithm will give the best result on your data, you can try multiple algorithms.
The accuracy will also depend on the hyperparameters that you use with the machine learning algorithms, so we would also need to tune those hyperparameters to get the best values. This technique is called hyperparameter tuning.
To search for the best hyperparameters, sklearn comes with the classes GridSearchCV and RandomizedSearchCV.
# Here we will search for the best parameters for the random forest algorithm.
# If this takes a long time to compute, either reduce the number of values for each hyperparameter or use randomized search (see below).
# Hyperparameters for RandomForestRegressor:
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]
rf_reg = RandomForestRegressor()
grid_search = GridSearchCV(rf_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(processed_data, housing_labels)
# Now let's see which parameters got the best result
print("best parameters are: {}\n\n".format(grid_search.best_params_))
# to see the scores for all combinations of parameters we can use:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
So we improved our score from almost 70,000 to 50,000. To see if we can do better, we can try different algorithms and try a grid search on each algorithm.
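If the grid search above takes too long, RandomizedSearchCV samples a fixed number of hyperparameter combinations instead of trying every one; a minimal sketch (the distributions and n_iter value are just illustrative choices):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# sample hyperparameter values from these ranges instead of an exhaustive grid
param_dist = {'n_estimators': randint(3, 50), 'max_features': randint(2, 8)}
rnd_search = RandomizedSearchCV(RandomForestRegressor(), param_dist, n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(processed_data, housing_labels)
print(rnd_search.best_params_)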
Auto Model Selection
There is no particular ML model that will give the best result on all different kinds of problems. So to find out which model works best for you, you can try all of the ones you think might work and see which one gives the best result.
You can list all the models you want to try in a dictionary/JSON file, with all the hyperparameters that you want to try with each, and just loop over each model to see which one gives you the best result.
As these models don't depend on each other you can execute them in parallel as well.
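One simple way to do that is the n_jobs parameter accepted by GridSearchCV and cross_val_score; a minimal sketch reusing the param_grid defined earlier (n_jobs=-1 uses all available CPU cores):
# run the cross-validation fits of a single grid search in parallel
parallel_grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5,
                                    scoring='neg_mean_squared_error', n_jobs=-1)
parallel_grid_search.fit(processed_data, housing_labels)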
# Auto Model Selection
all_models = {
    'svm': {
        'model': svm.SVR(),  # SVR (not SVC), since this is a regression task
        'params': {
            'C': [1, 10, 20],
            # 'precomputed' is omitted: it requires a precomputed kernel matrix
            'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'degree': [2, 3, 4],
            'gamma': ['scale', 'auto']
        }
    },
    'Random_forest': {
        'model': RandomForestRegressor(),
        'params': [
            {'n_estimators': [10, 30, 50], 'max_features': [2, 4, 6, 8]},
            {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
        ]
    }
}
def display_scores(scores):
    print("Scores: {}".format(scores))
    print("mean: {}".format(scores.mean()))
    print("Standard_deviation: {}".format(scores.std()))

for model in all_models:
    # grid search for the best hyperparameters of this model
    model_grid_search = GridSearchCV(all_models[model]['model'], all_models[model]['params'],
                                     cv=5, scoring='neg_mean_squared_error')
    model_grid_search.fit(processed_data, housing_labels)
    # cross-validate the best estimator found for this model
    scores = cross_val_score(model_grid_search.best_estimator_, processed_data, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    print("\n{} grid search best Scores:".format(model))
    display_scores(np.sqrt(-scores))
As we can see, the average cross-validation error is even less than the error we got earlier by testing directly on the train data, indicating that the models are not overfitting.
There are a lot of machine learning algorithms that you can try instead of just these two and you might get a better result using one of them.
Evaluating the model on the test set
Now that we have finalized our model, we can test it on our test set. This will give us an idea of how the model will perform in the real world with unseen data. Remember that unseen data and test data still need to be preprocessed in the same way (impute missing values, encode categories, scale the inputs, etc.), which is what the full_pipeline.transform() call below does.
final_model = grid_search.best_estimator_
housing_test_unprocessed = strat_test_set.drop(["median_house_value"], axis=1)
housing_test_labels = strat_test_set["median_house_value"].copy()
housing_test_prepared = full_pipeline.transform(housing_test_unprocessed)
housing_test_preds = final_model.predict(housing_test_prepared)
mse = mean_squared_error(housing_test_labels, housing_test_preds)
print("Final score Random Forest: {}".format(np.sqrt(mse)))
Final score Random Forest: 47722.5592620724
Overfitting
So we got a final root mean squared error of about 48,000, a lot better than the 70,000 we had got with linear regression. The error on the test set is less than the train and cross-validation errors, indicating the model is not overfitting. If you see that the test error is a lot worse than the train error, it means the model is overfitting.
To reduce overfitting, we can use regularization techniques like Ridge regression, Lasso regression, and Elastic Net.
So when should you use plain Linear Regression, Ridge, Lasso, or Elastic Net?
It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are useful, you should prefer Lasso or Elastic Net because they tend to reduce the useless features' weights to zero.
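These regularized models are drop-in replacements for LinearRegression; a minimal sketch (the alpha and l1_ratio values are just illustrative):
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge_reg = Ridge(alpha=1.0)
lasso_reg = Lasso(alpha=0.1)
elastic_reg = ElasticNet(alpha=0.1, l1_ratio=0.5)

# fit each regularized model and report its train-set RMSE
for reg in (ridge_reg, lasso_reg, elastic_reg):
    reg.fit(processed_data, housing_labels)
    rmse = np.sqrt(mean_squared_error(housing_labels, reg.predict(processed_data)))
    print(type(reg).__name__, rmse)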