10 Machine Learning Regressors in Python

In the last tutorial, we completed the Data Pre-Processing step, where we applied techniques for variable transformation and selection, dimensionality reduction, and sampling for machine learning.

Now we can move on to the next steps in the Data Science process. We'll apply the rest of the model-building process with various Regression Algorithms to understand how to use machine learning in Python. Classification algorithms are covered in a separate tutorial.

We will not go into detail about each algorithm. Instead, the purpose here is to understand the end-to-end process of building a Machine Learning model: training, evaluation, and prediction.

Jupyter Notebook

See the Jupyter Notebook for the concepts we'll cover on building machine learning models and my LinkedIn profile for other Data Science articles and tutorials.

We have worked with classification algorithms elsewhere, and now we will address regression algorithms. Both classification and regression are subcategories of supervised learning: we give the algorithm input data and known output data, and in classification we predict classes, while in regression we predict numerical values.

The process of constructing the model is the same regardless of the algorithm. What changes, in essence, is the algorithm used and the metric for evaluating the model; the rest of the process is standard. Some techniques may differ slightly depending on the data set we are working with, but the overall process stays the same.

Business Problem Definition

Let's create a predictive model that can estimate the price of homes based on several variables (characteristics) of homes in Boston neighborhoods. That is, based on a series of attributes, we will predict a numeric value through regression.

Critical Metrics for Regression Assessment

We need metrics to evaluate the outcome of a regression model, and the choice of metric will define how we measure its performance. Note that scikit-learn does not implement all of the performance metrics below:

  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE) — Square Root of the MSE
  • Mean Absolute Error (MAE)
  • R Squared (R2) — Coefficient of Determination
  • Adjusted R Squared (Adjusted R2)
  • Mean Square Percentage Error (MSPE)
  • Mean Absolute Percentage Error (MAPE)
  • Root Mean Squared Logarithmic Error (RMSLE)

1. MSE — the magnitude of model error

from sklearn.metrics import mean_squared_error

Maybe it's the most accessible metric to understand: N is the number of observations in the dataset, Yi are the historical (observed) values of the target that have already been collected, and ŷ is the model's prediction. We square the differences so we don't have negative values, and then average them.

The algorithm is fed X (input predictor variables) and Y (output target variable) during training. It learns the mathematical relationships between them and makes a prediction defined as ŷ.

After the prediction, we calculate the difference between the model's forecast and the historical value of the target variable. This calculation returns an error measure, the mean squared error. The value of the MSE tells us whether or not the model performs well: the smaller it is, the better the model.

It is perhaps the simplest and most common metric for regression evaluation, though arguably not the most interpretable. The MSE measures the average squared error of our predictions: for each point, it calculates the squared difference between the prediction and the actual value of the target variable and then averages those values.

The higher this value, the worse the model. Of course, this value will never be negative, since the individual prediction errors are squared, and it would be zero for a perfect model.
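As a minimal sketch of this formula (with made-up numbers, not the Boston data), the same calculation can be written by hand with NumPy:

# Hypothetical illustration of the MSE formula (values invented for this sketch)
import numpy as np

y_true = np.array([24.0, 21.6, 34.7, 33.4])   # observed target values (Yi)
y_pred = np.array([25.1, 22.0, 31.5, 30.2])   # model predictions (y^)

# Average of the squared differences between observed and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print("Hand-computed MSE:", mse)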

# MSE - Mean Squared Error

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error 
from sklearn.linear_model import LinearRegression

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = LinearRegression()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is::", mse)        

Once we have the predictions in Y_pred, we call mean_squared_error to apply the MSE. The function receives Y_test, the actual values of Y, and Y_pred, the predictions we calculated on the test set.

We have an MSE of 28.53 for this model. Ideally, we would apply pre-processing, variable transformation, and standardization to this data set to reduce the value of the MSE. The smaller the MSE, the better our model.

2. RMSE

We use the MSE or the RMSE depending on the type of interpretation we want for the final result. Remember that two models should always be compared using the same metric.

To get the RMSE, we simply take the square root of the MSE that we computed earlier with mean_squared_error.
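A minimal sketch, assuming the Y_test and Y_pred arrays from the MSE example above are still available:

# RMSE - Root Mean Squared Error (square root of the MSE)
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("The RMSE of the model is:", rmse)

Because the RMSE is expressed in the same unit as the target variable, it is often easier to interpret than the MSE.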

3. MAE

from sklearn.metrics import mean_absolute_error

In some situations, usually when we have outliers in the dataset, we prefer to work with absolute values. For this, we use the mean absolute error, that is, the average of the absolute differences between forecasts and actual values. Thus, we use absolute values instead of the squared errors of the MSE.

A value of 0 indicates no error, a perfect prediction, which is very rare in practice. Our job as Data Scientists is to reduce this rate as much as possible.
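As a minimal sketch of the formula (again with invented numbers), the MAE is just the average of the absolute differences:

# Hypothetical illustration of the MAE formula (values invented for this sketch)
import numpy as np

y_true = np.array([24.0, 21.6, 34.7, 33.4])   # observed target values
y_pred = np.array([25.1, 22.0, 31.5, 30.2])   # model predictions

mae = np.mean(np.abs(y_true - y_pred))   # average absolute difference
print("Hand-computed MAE:", mae)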

# MAE - Mean Absolute Error

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error 
from sklearn.linear_model import LinearRegression

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = LinearRegression()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric Result
mae = mean_absolute_error(Y_test, Y_pred)
print("The MAE of the model is", mae)        

With Linear Regression, we have an MAE of 3.45, but be careful: we can't compare the MAE to the MSE of 28.5! We should compare different models using the same metric.

4. R2 — Coefficient of Determination

from sklearn.metrics import r2_score

The advantage of R2 is that it returns a coefficient that typically ranges from 0 to 1 (it can even be negative for very poor models). The higher the value, the better the model, unlike the previous metrics, where lower error is better.

This metric reflects how well the predictions match the observed values, and we compute it with r2_score.
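As a minimal sketch (invented numbers again), R2 compares the residual sum of squares of the model against the total sum of squares around the mean, which is what r2_score computes:

# Hypothetical illustration of the R2 formula (values invented for this sketch)
import numpy as np

y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.1, 22.0, 31.5, 30.2])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print("Hand-computed R2:", r2)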

# R2

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = LinearRegression()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric Result
r2 = r2_score(Y_test, Y_pred)
print("The R2 of the model is:", r2)        

1. Linear Regression

from sklearn.linear_model import LinearRegression

Linear Regression is the most straightforward algorithm of all, and it has two main variants:

  • Simple Linear Regression: one input variable
  • Multiple Linear Regression: many input variables

Linear regression assumes that the data follow a Normal Distribution, that the variables are relevant to the construction of the model, and that they are not collinear (that is, not highly correlated with each other). It is up to the Data Scientist to feed the algorithm with truly relevant variables.

# Linear Regression

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = LinearRegression()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

Here we deal with multiple linear regression, since we have more than one predictor variable. We use the LinearRegression algorithm from the linear_model module, load the data, place the predictor variables in X and the target variable in Y, randomly split the data into training and test sets, and then create the Linear Regression model and train it on the training data.

Finally, we make the predictions. Once the predictions are made, we feed them into the MSE metric to calculate the error of the forecasts.

With this Linear Regression algorithm, we have an MSE of 28.53 without pre-processing the data. Can we improve this error just by changing the algorithm? We could also apply normalization, standardization, variable transformation, variable selection, and cross-validation, but to keep the focus on the process, we will change only the algorithm.

2. Ridge Regression

from sklearn.linear_model import Ridge


Ridge is a Linear Regression algorithm in which the Loss Function is modified to penalize model complexity. The Loss Function is also called the Cost Function or Error Function.

When we build the model, we need to automate the training process. To do that, we typically use the Gradient Descent algorithm, which needs a Cost Function to optimize: by reducing the Cost Function, we consequently reduce the model's error rate.

Machine Learning is, at heart, an optimization problem: we want to minimize the cost function and thus reduce the model's error rate. So we have the Linear Regression algorithm, the optimization algorithm, and a Cost Function, which together train the model to find the best mathematical relationship between input and output data.
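Ridge implements this idea with an L2 penalty (the sum of the squared coefficients) added to the cost function; its strength is controlled by the alpha parameter. A minimal sketch, assuming the X_train and Y_train split used throughout this tutorial:

# Ridge with an explicit penalty strength (alpha = 1.0 is the scikit-learn default)
from sklearn.linear_model import Ridge

model = Ridge(alpha = 1.0)    # larger alpha shrinks the coefficients more; alpha = 0 is plain linear regression
model.fit(X_train, Y_train)
print(model.coef_)            # the shrunken coefficients learned by the model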

# Ridge Regression

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = Ridge()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

The MSE of the Ridge Regression model is 29.29. That is, we could not improve the performance of the model; on the contrary, we made it slightly worse by changing the algorithm.

3. Lasso Regression

from sklearn.linear_model import Lasso

Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a modification of linear regression, and like Ridge Regression, the Loss Function is modified to minimize model complexity.

The difference is in the penalty applied: Lasso penalizes the sum of the absolute values of the coefficients (an L1 penalty), which gives a cost function that encourages simpler models.
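Because of the L1 penalty, some coefficients can be driven exactly to zero, so Lasso also acts as a form of variable selection. A minimal sketch, assuming the same X_train and Y_train split:

# Lasso with an explicit penalty strength (alpha = 1.0 is the scikit-learn default)
from sklearn.linear_model import Lasso

model = Lasso(alpha = 1.0)    # larger alpha drives more coefficients exactly to zero
model.fit(X_train, Y_train)
print(model.coef_)            # some entries may be exactly 0, i.e. variables dropped by the penalty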

# Lasso Regression

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing with the train_test_split()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = Lasso()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metrics Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

The MSE with Lasso Regression was 33.39. By changing the algorithm, we increased the error again. We use the MSE to make this comparison.

4. ElasticNet Regression (Ridge ~ Lasso)

from sklearn.linear_model import ElasticNet

ElasticNet is a regularized regression that combines the properties of Ridge and Lasso regression. The objective is to minimize the complexity of the model by penalizing it with both the absolute values (L1) and the squares (L2) of the coefficients.

Therefore, the algorithms seen so far are all variations of the same idea: Linear Regression, its Ridge and Lasso modifications, and now ElasticNet.
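In scikit-learn, the balance between the two penalties is controlled by l1_ratio (0 behaves like Ridge, 1 behaves like Lasso), while alpha controls the overall penalty strength. A minimal sketch, assuming the same training and test split:

# ElasticNet mixing L1 and L2 penalties (alpha = 1.0 and l1_ratio = 0.5 are the scikit-learn defaults)
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha = 1.0, l1_ratio = 0.5)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)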

# ElasticNet Regression

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing with train_test_split()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = ElasticNet()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

5. KNN

from sklearn.neighbors import KNeighborsRegressor

We can use KNN for both classification and regression; for regression, we use the KNeighborsRegressor algorithm.
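KNeighborsRegressor predicts a value by averaging the target values of the k closest training observations; k is set through the n_neighbors parameter. A minimal sketch, assuming the same training and test split:

# KNN regression with an explicit k (n_neighbors = 5 is the scikit-learn default)
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors = 5)   # each prediction averages the 5 nearest training points
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)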

# KNN

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = KNeighborsRegressor()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

The MSE with the KNN Regressor was 47.70, considerably worse. KNN is a much simpler algorithm and, for this dataset, it is probably not the best choice.

6. CART — Classification and Regression Trees

from sklearn.tree import DecisionTreeRegressor

Like KNN, decision trees can be used for both classification and regression.
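Decision trees grown without limits tend to overfit the training data; parameters such as max_depth keep the tree small, and random_state makes the result reproducible. A minimal sketch under those assumptions, using the same split:

# Decision tree regression with a limited depth (parameter values chosen for illustration)
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(max_depth = 5, random_state = 5)   # max_depth limits tree growth to reduce overfitting
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)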

# Classification and Regression Trees

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = DecisionTreeRegressor()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

The MSE with CART was 30.03, already a clear improvement over KNN. In general, decision trees perform very well.

7. SVM — Support Vector Machine

from sklearn.svm import SVR

We use SVC for classification and SVR for regression; the rest of the process is the same.

# Support Vector Machine

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR 

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Creating model
model = SVR()

# Training model
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Metric result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

The MSE with SVR was 93.21, the worst performance so far. SVM is a much more complex algorithm, and because no pre-processing has been done, it handles the raw data poorly.

We must treat the data a little better: a more complex algorithm usually requires much more pre-processing to deliver good results.
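As a hedged illustration of that point (this is not part of the original notebook), SVR usually benefits from standardized inputs; wrapping the scaler and the model in a Pipeline, with the same train/test split, might look like this:

# SVR preceded by standardization of the predictors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

pipeline = Pipeline([
    ('scaler', StandardScaler()),   # scale each predictor to zero mean and unit variance
    ('svr', SVR())
])

pipeline.fit(X_train, Y_train)
Y_pred = pipeline.predict(X_test)
print("The MSE of the model is:", mean_squared_error(Y_test, Y_pred))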

Model Optimization — Parameter Adjustment

Optimizing a regression model follows the same rules as for classification, with no significant difference.

All machine learning algorithms are parameterized, which means you can adjust predictive model performance by tuning parameters.

The goal is to find the best combination of parameters for each machine learning algorithm. This process is also called hyperparameter optimization. scikit-learn offers two methods for automatic parameter optimization:

1. Grid Search Parameter Tuning

from sklearn.model_selection import GridSearchCV

This method methodically tries every combination of the parameter values we specify, creating a grid.

Let’s try this method using the Ridge Regression algorithm to see how we can optimize this algorithm in practice.

# Grid Search Parameter Tuning

# Import modules
from pandas import read_csv
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Setting the values that will be tested
alpha_values = np.array([1,0.1,0.01,0.001,0.0001,0])
grid_values = dict(alpha = alpha_values)

# Creating model
model = Ridge()
# Creating grid
grid = GridSearchCV(estimator = model, param_grid = grid_values)
grid.fit(X, Y)

# Print the result of the best parameter for the algorithm
print("Best Model Parameters:", grid.best_estimator_)        

We create a dictionary called grid_values with the parameter values we want to try. Then we instantiate the Ridge model and call GridSearchCV to test each combination of parameters.

The output will be the best parameters for the Ridge algorithm with GridSearchCV.
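Besides best_estimator_, the fitted grid object also exposes the best cross-validated score and the winning parameter combination; a short sketch, assuming the grid fitted above:

# Inspecting the grid search results
print("Best score:", grid.best_score_)        # best cross-validated score (R2 by default for regressors)
print("Best parameters:", grid.best_params_)  # e.g. the winning value of alpha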

2. Random Search Parameter Tuning

from sklearn.model_selection import RandomizedSearchCV

This method generates samples of algorithm parameters from a uniform random distribution for a fixed number of iterations.

A model is built and tested for each combination of parameters.

This example shows that an alpha value very close to 1 presents the best results.

# Random Search Parameter Tuning

# Import modules
from pandas import read_csv
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Setting parameters
grid_values = {'alpha': uniform()}
seed = 7

# Creating model
model = Ridge()
iterations = 100
rsearch = RandomizedSearchCV(estimator = model, 
                             param_distributions = grid_values, 
                             n_iter = iterations, 
                             random_state = seed)
rsearch.fit(X, Y)
# Result
print("Best Model Parameters:", rsearch.best_estimator_)        

Therefore, we compare the models according to the same metric and choose the one with the best value.

Save and Load the Model

import pickle
# Saving result

# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
import pickle

# Loading data
file = 'https://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values 

# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]

# Setting parameters
test_size = 0.35
seed = 7

# Divides the data into training and testing 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)

# Creating model
model = Ridge()

# Training model
model.fit(X_train, Y_train)

# Saving model
file = 'final_regression_model.sav'
pickle.dump(model, open(file, 'wb'))
print("Model saved!")

# Loading model
final_regressor_model = pickle.load(open(file, 'rb'))
print("Model saved!")

# Making Predictions 
Y_pred = final_regressor_model.predict(X_test)

# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)        

Here we had an overview of the machine learning process, that is, building the models; the focus was not to detail how each algorithm works. Instead, our goal was to understand the process.

The data scientist’s job is to master as much as possible everything we’ve seen from pre-processing, model selection, performance metrics, model optimization, and forecasting.

And there we have it. I hope you have found this helpful. Thank you for reading.
