Fit & predict for regression
Abu Chowdhury, PMP®, MSFE, MSCS, BSEE
Mortgage World Bankers - Predictive modeling for residential & commercial Lending in NY, NJ, CT, PA, FL
If your problem requires a continuous outcome, regression is the tool best suited to it. We will cover the fundamental concepts of regression and apply them to predict life expectancy in a given country using Gapminder data. First, we will fit a linear regression and predict life expectancy using just one feature: the 'fertility' column of the Gapminder dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139 entries, 0 to 138
Data columns (total 10 columns):
population 139 non-null float64
fertility 139 non-null float64
HIV 139 non-null float64
CO2 139 non-null float64
BMI_male 139 non-null float64
GDP 139 non-null float64
BMI_female 139 non-null float64
life 139 non-null float64
child_mortality 139 non-null float64
Region 139 non-null object
dtypes: float64(9), object(1)
memory usage: 10.9+ KB
Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.
A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis shows a strong negative correlation, so a linear regression should be able to capture this trend. We will fit a linear regression, predict life expectancy over the range of fertility values, and overlay the predictions on the plot as a regression line. We will also compute and print the R^2 score using scikit-learn's .score() method.
- Import LinearRegression from sklearn.linear_model.
- Create a LinearRegression regressor called reg.
- Set up the prediction space to range from the minimum to the maximum of X_fertility. This has been done for you.
- Fit the regressor to the data (X_fertility and y) and compute its predictions using the .predict() method and the prediction_space array.
- Compute and print the R^2 score using the .score() method.
- Overlay the plot with your linear regression line.
# Import LinearRegression
from sklearn.linear_model import LinearRegression
import numpy as np
# Create the regressor: reg
reg = LinearRegression()
# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)
# Fit the model to the data
reg.fit(X_fertility, y)
# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
# Print R^2
print(reg.score(X_fertility, y))
import matplotlib.pyplot as plt
plt.scatter(X_fertility, y, color='blue', alpha=0.5, label='Observed data')
plt.ylabel('Life Expectancy')
plt.xlabel('Fertility')
plt.title('Fit & predict for regression .@achowdhu')
# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3, label='Linear fit')
plt.legend()
plt.show()
Output: 0.6192442167740035
Notice how the line captures the underlying trend in the data. And the performance is quite decent for this basic regression model with only one feature!
What Is Goodness-of-Fit for a Linear Model?
Definition: Residual = Observed value - Fitted value
Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals.
In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.
Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.
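As an illustration, here is a minimal sketch of a residual plot for the single-feature model above (assuming reg, X_fertility, y, and plt from the earlier code block); an unbiased fit should scatter the residuals randomly around zero:
# Residuals = observed - fitted, using reg, X_fertility and y from the earlier block
residuals = y - reg.predict(X_fertility)
plt.scatter(X_fertility, residuals, alpha=0.5)
plt.axhline(0, color='black', linewidth=1)   # reference line at zero
plt.xlabel('Fertility')
plt.ylabel('Residual (observed - fitted)')
plt.title('Residual plot')
plt.show()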
What Is R-squared?
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, there are important caveats to this guideline, discussed below.
RMSE
The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data, that is, how close the observed data points are to the model's predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. It can be interpreted as the standard deviation of the residuals (the unexplained variation) and has the useful property of being in the same units as the response variable. Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.
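To make the definitions above concrete, here is a minimal sketch that computes both statistics directly from the residuals; y_true and y_pred are placeholder names for an array of observed values and the corresponding model predictions:
import numpy as np
# y_true: observed values, y_pred: model predictions (placeholder arrays)
residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                    # unexplained variation
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot                    # explained variation / total variation
rmse = np.sqrt(np.mean(residuals ** 2))            # same units as the response
print(r_squared, rmse)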
The best measure of model fit depends on the researcher’s objectives, and more than one is often useful. The statistics discussed above apply to regression models that use OLS estimation. Many types of regression models, however, such as mixed models, generalized linear models, and event history models, use maximum likelihood estimation, and these statistics are not available for such models.
Train/test split for regression
The train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.
We will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R^2 score, we will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.
- Import LinearRegression from sklearn.linear_model, mean_squared_error from sklearn.metrics, and train_test_split from sklearn.model_selection.
- Using X and y, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
- Create a linear regression regressor called reg_all, fit it to the training set, and evaluate it on the test set.
- Compute and print the R^2 score using the .score() method on the test set.
- Compute and print the RMSE. To do this, first compute the Mean Squared Error using the mean_squared_error() function with the arguments y_test and y_pred, and then take its square root using np.sqrt().
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# Create the regressor: reg_all
reg_all = LinearRegression()
# Fit the regressor to the training data
reg_all.fit(X_train, y_train)
# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)
# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
Output:
R^2: 0.838046873142936
Root Mean Squared Error: 3.2476010800377213
Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? We will address it next, along with how to better validate your models.
Cross-validation motivation
● Model performance is dependent on the way the data is split
● Not representative of the model’s ability to generalize
● Solution: Cross-validation!
Cross-validation and model performance
● 5 folds = 5-fold CV
● 10 folds = 10-fold CV
● k folds = k-fold CV
● More folds = more computationally expensive
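To see what these folds look like in practice, here is a minimal sketch using scikit-learn's KFold on a tiny toy array (the array and fold count are illustrative only); each of the 5 folds serves as the test set exactly once while the remaining 4 are used for training:
from sklearn.model_selection import KFold
import numpy as np
X_toy = np.arange(10).reshape(-1, 1)   # 10 samples, purely illustrative
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X_toy):
    print("train:", train_idx, "test:", test_idx)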
5-fold cross-validation
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used: over the course of cross-validation, every observation is used for both training and testing. Here, we will practice 5-fold cross-validation on the Gapminder data. By default, scikit-learn's cross_val_score() function uses R^2 as the metric of choice for regression. Since we are performing 5-fold cross-validation, the function will return 5 scores. We will compute these 5 scores and then take their average.
The DataFrame has been loaded as df and split into the feature/target variable arrays X and y. The modules pandas and numpy have been imported as pd and np, respectively.
- Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.
- Create a linear regression regressor called reg.
- Use the cross_val_score() function to perform 5-fold cross-validation on X and y.
- Compute and print the average cross-validation score. We can use NumPy's mean() function to compute the average.
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Create a linear regression object: reg
reg = LinearRegression()
# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)
# Print the 5-fold cross-validation scores
print(cv_scores)
# Print the average 5-fold cross-validation score
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
Output: [ 0.81720569 0.82917058 0.90214134 0.80633989 0.94495637]
Average 5-Fold CV Score: 0.8599627722793232
K-Fold CV comparison
Cross-validation is essential, but remember that the more folds you use, the more computationally expensive it becomes. Here, we will perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.
In the IPython Shell, we can use %timeit to see how long 3-fold CV takes compared to 10-fold CV by executing the following with cv=3 and then cv=10:
%timeit cross_val_score(reg, X, y, cv = ____)
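Outside the IPython Shell, a rough equivalent is a minimal sketch using Python's standard timeit module; it assumes reg, X, y, and cross_val_score are already defined as in the exercise below, and the repeat count is an arbitrary choice:
import timeit
# number=10 repeats each timing 10 times; adjust for your machine
t3 = timeit.timeit(lambda: cross_val_score(reg, X, y, cv=3), number=10)
t10 = timeit.timeit(lambda: cross_val_score(reg, X, y, cv=10), number=10)
print("3-fold: {:.3f}s  10-fold: {:.3f}s".format(t3, t10))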
pandas and numpy are available in the workspace as pd and np. The DataFrame has been loaded as df and the feature/target variable arrays X and y have been created.
- Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.
- Create a linear regression regressor called reg.
- Perform 3-fold CV and then 10-fold CV. Compare the resulting mean scores.
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Create a linear regression object: reg
reg = LinearRegression()
# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(np.mean(cvscores_3))
# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(np.mean(cvscores_10))
Output:
0.871871278262
0.843612862013
%timeit cross_val_score(reg, X, y, cv=3)
100 loops, best of 3: 8.89 ms per loop
%timeit cross_val_score(reg, X, y, cv=10)
10 loops, best of 3: 34.4 ms per loop
As the %timeit results show, 10-fold cross-validation takes noticeably longer to run than 3-fold cross-validation.
Why regularize?
● Recall: Linear regression minimizes a loss function
● It chooses a coefficient for each feature variable
● Large coefficients can lead to over-fitting
● Penalizing large coefficients: Regularization
Lasso regression for feature selection
● Can be used to select important features of a dataset
● Shrinks the coefficients of less important features to exactly 0
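As a rough sketch of what these penalties look like (illustrative only; scikit-learn's actual objectives also include scaling constants), the lasso and ridge penalty terms added to the least-squares loss can be written as:
import numpy as np
# Illustrative only: w is a coefficient vector, alpha the regularization strength
def squared_error(w, X, y):
    return np.sum((y - X @ w) ** 2)        # ordinary least-squares loss
def lasso_penalty(w, alpha):
    return alpha * np.sum(np.abs(w))       # L1 norm: can drive coefficients to exactly 0
def ridge_penalty(w, alpha):
    return alpha * np.sum(w ** 2)          # squared L2 norm: shrinks coefficients toward 0
# Regularized loss = squared_error(...) + lasso_penalty(...)  (lasso)
#                  = squared_error(...) + ridge_penalty(...)  (ridge)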
Regularization I: Lasso
On the Boston housing data, for example, lasso selects the 'RM' feature as the most important for predicting house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.
We will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.
The feature and target variable arrays have been pre-loaded as X and y.
- Import Lasso from sklearn.linear_model.
- Instantiate a Lasso regressor with an alpha of 0.4 and specify normalize=True.
- Fit the regressor to the data and compute the coefficients using the coef_ attribute.
- Plot the coefficients on the y-axis and column names on the x-axis.
# Import Lasso
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt
# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha=0.4, normalize=True)
# Fit the regressor to the data
lasso.fit(X, y)
# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)
# Plot the coefficients (df_columns holds the feature column names from df)
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()
Output: [-0. -0. -0. 0. 0. 0. -0. -0.07087587]
According to the lasso algorithm, it seems like 'child_mortality' is the most important feature when predicting life expectancy.
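To read the coefficients by name rather than by position, a small follow-up sketch (assuming lasso_coef and df_columns from the block above) is:
import pandas as pd
coef_by_feature = pd.Series(lasso_coef, index=df_columns)
print(coef_by_feature)   # only child_mortality keeps a nonzero coefficient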
Regularization II: Ridge
Lasso is great for feature selection, but when building regression models, ridge regression should be your first choice. Recall that lasso performs regularization by adding to the loss function a penalty term: the absolute value of each coefficient multiplied by some alpha. This is known as L1 regularization because the penalty term is the L1 norm of the coefficients. This is not the only way to regularize, however. If instead you add the sum of the squared values of the coefficients multiplied by some alpha, as in ridge regression, you are using the L2 norm. In this exercise, you will practice fitting ridge regression models over a range of different alphas and plot cross-validated R^2 scores for each, using the following function, which plots the R^2 score as well as the standard error for each alpha:
def display_plot(cv_scores, cv_scores_std):
    # Accept lists or arrays
    cv_scores = np.array(cv_scores)
    cv_scores_std = np.array(cv_scores_std)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores)
    std_error = cv_scores_std / np.sqrt(10)
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
Don't worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the R^2 score varies with different alphas, and to understand the importance of selecting the right value for alpha.
- Instantiate a Ridge regressor and specify normalize=True.
- Inside the for loop:
- Specify the alpha value for the regressor to use.
- Perform 10-fold cross-validation on the regressor with the specified alpha. The data is available in the arrays X and y.
- Append the average and the standard deviation of the computed cross-validated scores. NumPy has been pre-imported for you as np.
- Use the display_plot() function to visualize the scores and standard deviations.
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []
# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)
# Compute scores over range of alphas
for alpha in alpha_space:
    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))
# Display the plot
display_plot(ridge_scores, ridge_scores_std)
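Once the scores are collected, one way to read off a reasonable alpha (a minimal sketch using the lists built above) is:
# Read off the alpha with the highest mean CV score
best_idx = np.argmax(ridge_scores)
print("Best alpha: {}".format(alpha_space[best_idx]))
print("Best mean CV score: {}".format(ridge_scores[best_idx]))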