Building a logistic regression model and the ROC curve; Hyperparameter tuning with GridSearchCV
Abu Chowdhury, PMP®, MSFE, MSCS, BSEE
Mortgage World Bankers - Predictive modeling for residential & commercial lending in NY, NJ, CT, PA, FL
Logistic regression for binary classification
● Logistic regression outputs probabilities
● If the probability ‘p’ is greater than 0.5:
● The data is labeled ‘1’
● If the probability ‘p’ is less than 0.5:
● The data is labeled ‘0’
Probability thresholds
● By default, logistic regression threshold = 0.5
● Not specific to logistic regression
● k-NN classifiers also have thresholds
● What happens if we vary the threshold?
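To make the idea of varying the threshold concrete, here is a minimal sketch using a small synthetic dataset; the variable names and the 0.2/0.8 thresholds are chosen purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset, used only to illustrate thresholding
X_demo, y_demo = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

clf = LogisticRegression()
clf.fit(X_tr, y_tr)

probs = clf.predict_proba(X_te)[:, 1]        # P(class = 1) for each test sample

labels_default = (probs > 0.5).astype(int)   # the default threshold of 0.5
labels_strict = (probs > 0.8).astype(int)    # higher threshold: fewer samples labeled 1
labels_lenient = (probs > 0.2).astype(int)   # lower threshold: more samples labeled 1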
Time to build your first logistic regression model! scikit-learn makes it very easy to try different models, since the train-test-split/instantiate/fit/predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. We will see this now as we train a logistic regression model on exactly the same data. Will it outperform k-NN? There's only one way to find out!
The feature and target variable arrays X and y have been pre-loaded, and train_test_split has been imported for you from sklearn.model_selection.
Import:
LogisticRegression from sklearn.linear_model.
confusion_matrix and classification_report from sklearn.metrics.
Create training and test sets with 40% (or 0.4) of the data used for testing. Use a random state of 42. This has been done for you.
Instantiate a LogisticRegression classifier called logreg.
Fit the classifier to the training data and predict the labels of the test set.
Compute and print the confusion matrix and classification report, and see how logistic regression compares to k-NN. Here is the confusion_matrix and classification report for k-NN. For details: https://www.dhirubhai.net/pulse/how-good-your-model-abu-chowdhury-pmp-msfe-mscs-bsee/
Here is the program and output (confusion_matrix and classification report) for logistic regression:
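The original listing was attached as a screenshot; a sketch of the same workflow, assuming the PIMA diabetes feature array X and target array y are pre-loaded as described above, looks like this:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# 60/40 train-test split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate, fit, and predict with the default settings
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Evaluate against the held-out test labels
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))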
We now know how to use logistic regression for binary classification - great work! There is a 7% improvement over k-NN. Logistic regression is used in a variety of machine learning applications and will become a vital part of your data science toolbox.
- Have a look at the definitions of precision and recall. Are true negatives taken into consideration here? Also consider what would happen in extreme cases. That is, what does a recall of 1 or 0 correspond to? What about precision?
True negatives do not appear at all in the definitions of precision and recall.
When the threshold is very close to 1, precision tends towards 1, because the classifier only predicts the positive class when it is very confident.
A recall of 1 corresponds to a classifier with a low threshold in which all females who contract diabetes were correctly classified as such, at the expense of many misclassifications of those who did not have diabetes.
Precision is undefined for a classifier which makes no positive predictions, that is, classifies everyone as not having diabetes.
Notice how a high precision corresponds to a low recall: The classifier has a high threshold to ensure the positive predictions it makes are correct, which means it may miss some positive labels that have lower probabilities.
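For reference, the definitions can be made explicit with a few lines continuing from the sketch above (y_test and y_pred as before); note that true negatives appear in neither formula:

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)   # of the samples predicted positive, how many are truly positive
recall = tp / (tp + fn)      # of the truly positive samples, how many were predicted positive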
Plotting an ROC curve
The popular Hosmer-Lemeshow test for logistic regression can be viewed as assessing whether the model is well calibrated. In this post we'll look at one approach to assessing the discrimination of a fitted logistic model, via the receiver operating characteristic (ROC) curve.
The receiver operating characteristic (ROC) curve
Now we come to the ROC curve, which is simply a plot of the values of sensitivity against one minus specificity, as the value of the cut-point c is increased from 0 through to 1.
A model with high discrimination ability will have high sensitivity and specificity simultaneously, leading to an ROC curve which goes close to the top left corner of the plot. A model with no discrimination ability will have an ROC curve which is the 45 degree diagonal line.
A logistic regression doesn't "agree" with anything because the nature of the outcome is 0/1 and the nature of the prediction is a continuous probability. Agreement requires comparable scales: 0.999 does not equal 1. One way of developing a classifier from a probability is by dichotomizing at a threshold. The obvious limitation with that approach: the threshold is arbitrary and can be artificially chosen to produce very high or very low sensitivity (or specificity). Thus, the ROC considers all possible thresholds.
A discriminating model is capable of ranking people in terms of their risk. The predicted risk from the model could be way off, but if you want to design a substudy or clinical trial to recruit "high risk" participants, such a model gives you a way forward. Preventative tamoxifen is recommended for women in the highest risk category of breast cancer as the result of such a study.
Discrimination != Calibration. If my model assigns all non-events a probability of 0.45 and all events a probability of 0.46, the discrimination is perfect, even if the incidence/prevalence is <0.001.
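A toy example (hypothetical numbers, not from the diabetes data) makes that last point concrete:

import numpy as np
from sklearn.metrics import roc_auc_score

# 999 non-events all scored 0.45 and one event scored 0.46: the probabilities
# are badly calibrated, yet every event outranks every non-event, so AUC = 1.0
y_true = np.array([0] * 999 + [1])
y_score = np.array([0.45] * 999 + [0.46])
print(roc_auc_score(y_true, y_score))        # prints 1.0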
You now have a new addition to your toolbox of classifiers!
Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. Most classifiers in scikit-learn have a .predict_proba() method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, we will now evaluate its performance by plotting an ROC curve. In doing so, we will make use of the .predict_proba() method and become familiar with its functionality.
Here, you'll continue working with the PIMA Indians diabetes dataset. The classifier has already been fit to the training data and is available as logreg.
- Import roc_curve from sklearn.metrics.
- Using the logreg classifier, which has been fit to the training data, compute the predicted probabilities of the labels of the test set X_test. Save the result as y_pred_prob.
- Use the roc_curve() function with y_test and y_pred_prob and unpack the result into the variables fpr, tpr, and thresholds.
- Plot the ROC curve with fpr on the x-axis and tpr on the y-axis.
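The plotting code itself was shown as a screenshot in the original post; a sketch of it, assuming logreg, X_test, and y_test are available as described, is:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Predicted probabilities of the positive class (second column of predict_proba)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# Compute false positive rate, true positive rate, and the thresholds used
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot([0, 1], [0, 1], 'k--')              # 45-degree line: no-discrimination model
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()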
This ROC curve provides a nice visual way to assess your classifier's performance.
Interpretation of the area under the ROC curve
Although it is not obvious from its definition, the area under the ROC curve (AUC) has a somewhat appealing interpretation. It turns out that the AUC is the probability that if you were to take a random pair of observations, one with P=1 and one with P=0, the observation with P=1 has a higher predicted probability than the other. The AUC thus gives the probability that the model correctly ranks such pairs of observations.
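This ranking interpretation can be checked numerically on made-up scores (a hypothetical example, not part of the original analysis):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)        # random 0/1 outcomes
y_score = rng.random(200) + 0.5 * y_true     # noisy scores, higher on average for events

# Fraction of (event, non-event) pairs in which the event gets the higher score
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pair_fraction = (pos[:, None] > neg[None, :]).mean()

print(pair_fraction)                         # matches...
print(roc_auc_score(y_true, y_score))        # ...the AUC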
In the biomedical context of risk prediction modelling, the AUC has been criticized by some. In the risk prediction context, individuals have their risk of developing (for example) coronary heart disease over the next 10 years predicted. Thus a measure of discrimination which examines the predicted probability of pairs of individuals, one with P=1 and one with P=0, does not really match the prospective risk prediction setting, where we do not have such pairs.
AUC computation
Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The area under this ROC curve would be 0.5. This is one way in which the AUC is an informative metric to evaluate a model: if the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
In this exercise, you'll calculate AUC scores using the roc_auc_score() function from sklearn.metrics as well as by performing cross-validation on the diabetes dataset.
X and y, along with training and test sets X_train, X_test, y_train, y_test, have been pre-loaded for you, and a logistic regression classifier logreg has been fit to the training data.
- Import roc_auc_score from sklearn.metrics and cross_val_score from sklearn.model_selection.
- Using the logreg classifier, which has been fit to the training data, compute the predicted probabilities of the labels of the test set X_test. Save the result as y_pred_prob.
- Compute the AUC score using the roc_auc_score() function, the test set labels y_test, and the predicted probabilities y_pred_prob.
- Compute the AUC scores by performing 5-fold cross-validation. Use the cross_val_score() function and specify the scoring parameter to be 'roc_auc'.
- Use the command from y import x to import x from y.
- Use the .predict_proba() method on logreg to compute the predicted probabilities. Be sure to access the 2nd column of the resulting array.
- Pass in y_test and y_pred_prob as arguments to the roc_auc_score() function to calculate the AUC score.
- You have to specify the additional keyword argument scoring='roc_auc' inside cross_val_score() to compute the AUC scores by performing cross-validation. Be sure to also specify cv=5 and pass in the feature and target variable arrays X and y in the correct order.
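A sketch of the corresponding code, assuming logreg, X, y, and the train/test split are available as described above:

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Predicted probabilities of the positive class on the test set
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# AUC on the held-out test set
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))

# AUC estimated by 5-fold cross-validation on the full data
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))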
AUC: 0.8255758614125261
AUC scores computed using 5-fold cross-validation: [0.80185185 0.80666667 0.81481481 0.86245283 0.8554717 ]
You now have a number of different methods you can use to evaluate your model's performance.
For more on risk prediction, and other approaches to assessing the discrimination of logistic (and other) regression models, see Steyerberg's Clinical Prediction Models book, an (open access) article published in Epidemiology, and Harrell's Regression Modeling Strategies book.
Hyperparameter tuning with GridSearchCV
Hyperparameter tuning
● Linear regression: Choosing parameters
● Ridge/lasso regression: Choosing alpha
● k-Nearest Neighbors: Choosing n_neighbors
● Parameters like alpha and k: Hyperparameters
● Hyperparameters cannot be learned by fitting the model
Choosing the correct hyperparameter
● Try a bunch of different hyperparameter values
● Fit all of them separately
● See how well each performs
● Choose the best performing one
● It is essential to use cross-validation
Previously we saw how to tune the n_neighbors parameter of KNeighborsClassifier() using GridSearchCV on the voting dataset. Here, we will use GridSearchCV with logistic regression on the diabetes dataset instead!
Like the alpha parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: C. C controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large C can lead to an overfit model, while a small C can lead to an underfit model.
The hyperparameter space for C has been set up for you. Your job is to use GridSearchCV and logistic regression to find the optimal C in this hyperparameter space. The feature array is available as X and the target variable array as y.
You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, we will indeed want to hold out a portion of the data for evaluation purposes, as in the hold-out set examples below.
- Import LogisticRegression from sklearn.linear_model and GridSearchCV from sklearn.model_selection.
- Set up the hyperparameter grid by using c_space as the grid of values to tune C over.
- Instantiate a logistic regression classifier called logreg.
- Use GridSearchCV with 5-fold cross-validation to tune C:
- Inside GridSearchCV(), specify the classifier, parameter grid, and number of folds to use.
- Use the .fit() method on the GridSearchCV object to fit it to the data X and y.
- Print the best parameter and best score obtained from GridSearchCV by accessing the best_params_ and best_score_ attributes of logreg_cv.
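A sketch of the corresponding code follows. The c_space grid was pre-defined in the exercise; np.logspace(-5, 8, 15) is assumed here so the sketch is self-contained:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid over C (inverse of regularization strength)
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

logreg = LogisticRegression()

# 5-fold grid-search cross-validation over C, fit on the full X and y
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)

print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))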
Tuned Logistic Regression Parameters: {'C': 3.727593720314938}
Best score is 0.7708333333333334
It looks like a 'C' of 3.727 results in the best performance.
Hyperparameter tuning with RandomizedSearchCV
GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using RandomizedSearchCV in this exercise and see how this works.
Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have .fit() and .predict() methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as max_features, max_depth, and min_samples_leaf: This makes it an ideal use case for RandomizedSearchCV.
As before, the feature array X and target variable array y of the diabetes dataset have been pre-loaded. The hyperparameter settings have been specified for you. Your goal is to use RandomizedSearchCV to find the optimal hyperparameters. Go for it!
- Import DecisionTreeClassifier from sklearn.tree and RandomizedSearchCV from sklearn.model_selection.
- Specify the parameters and distributions to sample from. This has been done for you.
- Instantiate a DecisionTreeClassifier.
- Use RandomizedSearchCV with 5-fold cross-validation to tune the hyperparameters:
- Inside RandomizedSearchCV(), specify the classifier, parameter distribution, and number of folds to use.
- Use the .fit() method on the RandomizedSearchCV object to fit it to the data X and y.
- Print the best parameter and best score obtained from RandomizedSearchCV by accessing the best_params_ and best_score_ attributes of tree_cv.
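A sketch of the corresponding code; the exact parameter distributions were pre-specified in the exercise, and the ones below (using scipy.stats.randint) are an assumption consistent with the reported results:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Parameter distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

tree = DecisionTreeClassifier()

# Randomized search: a fixed number of settings sampled from param_dist
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))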
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 5, 'min_samples_leaf': 4}
Best score is 0.73046875
You'll see a lot more of decision trees and RandomizedSearchCV as you continue your machine learning journey. Note that RandomizedSearchCV will never outperform GridSearchCV. Instead, it is valuable because it saves on computation time.
Hold-out set in practice I: Classification
Hold-out set reasoning
● How well can the model perform on never before seen data?
● Using ALL data for cross-validation is not ideal
● Split data into training and hold-out set at the beginning
● Perform grid search cross-validation on training set
● Choose best hyperparameters and evaluate on hold-out set
You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as X and y.
In addition to C, logistic regression has a 'penalty' hyperparameter which specifies whether to use 'l1' or 'l2' regularization. You can create a hold-out set, tune the 'C' and 'penalty' hyperparameters of a logistic regression classifier using GridSearchCV on the training set, and then evaluate its performance against the hold-out set.
- Create the hyperparameter grid:
- Use the array c_space as the grid of values for 'C'.
- For 'penalty', specify a list consisting of 'l1' and 'l2'.
- Instantiate a logistic regression classifier.
- Create training and test sets. Use a test_size of 0.4 and random_state of 42. In practice, the test set here will function as the hold-out set.
- Tune the hyperparameters on the training set using GridSearchCV with 5-folds. This involves first instantiating the GridSearchCV object with the correct parameters and then fitting it to the training data.
- Print the best parameter and best score obtained from GridSearchCV by accessing the best_params_ and best_score_ attributes of logreg_cv.
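A sketch of the corresponding code; c_space is again assumed to be np.logspace(-5, 8, 15), and solver='liblinear' is specified because recent scikit-learn defaults do not support the 'l1' penalty:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hyperparameter grid over both C and the penalty type
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# liblinear supports both 'l1' and 'l2' penalties
logreg = LogisticRegression(solver='liblinear')

# Hold-out split: the test set acts as unseen data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Tune on the training set only
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X_train, y_train)

print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))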
Tuned Logistic Regression Parameter: {'C': 0.4393970560760795, 'penalty': 'l1'}
Tuned Logistic Regression Accuracy: 0.7652173913043478
The idea is to tune the model's hyperparameters on the training set, and then evaluate its performance on the hold-out set which it has never seen before.
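The ElasticNet results below come from applying the same hold-out idea to a regression problem (in the original course, the Gapminder dataset rather than the diabetes data). A sketch of that workflow, assuming a continuous target y and l1_space defined as np.linspace(0, 1, 30):

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Grid over the l1_ratio mixing parameter
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

elastic_net = ElasticNet()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Tune l1_ratio on the training set, then score on the hold-out set
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)
gm_cv.fit(X_train, y_train)

y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)                  # R squared on the hold-out set
mse = mean_squared_error(y_test, y_pred)          # MSE on the hold-out set

print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))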
Tuned ElasticNet l1 ratio: {'l1_ratio': 0.034482758620689655}
Tuned ElasticNet R squared: 0.25110015989224843
Tuned ElasticNet MSE: 0.16587834626775252
Now that you understand how to fine-tune your models, it's time to learn about preprocessing techniques and how to piece together all the different stages of the machine learning process into a pipeline!