How to choose hyperparameters for a model with GridSearchCV.

It has always been a challenge to find hyperparameters that are just right for your model. Different methods call for different choices: polynomial regression, for instance, requires you to pick the degree of the polynomial, while decision trees require a maximum depth, a minimum number of samples to split a node, and a minimum number of samples per leaf.

Hyperparameters define the complexity of the model. If the model is too simple, it generalizes our data too much, which we call underfitting. On the other hand, a model that is too complex tends to memorize the answers: it performs almost perfectly on the training data but loses accuracy on the testing data, which is called overfitting.

You can rely on your intuition, and that will probably be fine when you are playing with a single hyperparameter. However, if you have two or more parameters to adjust, the number of possible combinations grows dramatically, making it hard to pick just the right combination for your model.

To determine the optimal hyperparameters of your model, the Grid Search method can be used. It is a straightforward method in which you try every possible combination of the candidate values for each hyperparameter. Of course, going over all possible combinations by hand might take forever, but scikit-learn makes your life a lot easier.
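To make the idea concrete, here is a minimal hand-written sketch of what a grid search does under the hood. The toy dataset and the candidate values below are made up purely for illustration; GridSearchCV automates exactly this loop and handles the cross-validation bookkeeping for you.

from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data just to make the sketch runnable
X, y = make_classification(n_samples=200, random_state=42)

# Candidate values for two hyperparameters: 3 x 3 = 9 combinations
grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 3, 5]}

best_score, best_params = -1.0, None
for max_depth, min_samples_leaf in product(grid['max_depth'], grid['min_samples_leaf']):
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=42)
    # Score each combination with 5-fold cross-validation and keep the best one
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, (max_depth, min_samples_leaf)

print(best_params, best_score)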

For our example we will train a Decision Tree classifier and then test it on the well-known Titanic survival dataset. Let's start by loading the dataset in a Jupyter Notebook and displaying some of its rows.

import numpy as np
import pandas as pd
from IPython.display import display 
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
full_data = pd.read_csv('titanic_data.csv')

# Print the first few entries of the RMS Titanic data
full_data.head()

The output is a table showing the first few rows of the dataset. The features found in this table are:

  • Survived: Outcome of survival (0 = No; 1 = Yes)
  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  • Name: Name of passenger
  • Sex: Sex of the passenger
  • Age: Age of the passenger (Some entries contain NaN)
  • SibSp: Number of siblings and spouses of the passenger aboard
  • Parch: Number of parents and children of the passenger aboard
  • Ticket: Ticket number of the passenger
  • Fare: Fare paid by the passenger
  • Cabin: Cabin number of the passenger (Some entries contain NaN)
  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Because we want to predict whether a passenger survived or not, we will move the Survived variable to a separate data series called outcomes and remove it from the initial data frame. We also assume that the Name variable does not carry any information useful for prediction, so it will be dropped as well.

# Store the 'Survived' feature in a new variable
outcomes = full_data['Survived']

# Remove 'Survived' and 'Name' from the dataset
features_raw = full_data.drop(['Survived', 'Name'], axis=1)
# Show the new dataset with 'Survived' removed
display(features_raw.head())

To preprocess our data, we will one-hot encode the categorical features with the pandas get_dummies function. We will also substitute NaN values with 0.0:

# One-hot encode the categorical features
features = pd.get_dummies(features_raw)
# Replace missing values with 0.0
features = features.fillna(0.0)
display(features.head())

Now we will split our data into training and testing sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, outcomes,
                                                    test_size=0.2, random_state=42)

We will train our model on the training set three times: first with the default hyperparameters (max_depth=None, min_samples_split=2, min_samples_leaf=1), then with arbitrarily chosen parameters, and finally with the parameters obtained from GridSearchCV.

So, let's start by training our model with the default parameters of DecisionTreeClassifier:

# Import the Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

# Define the classifier
model = DecisionTreeClassifier()
# Fit the model to the training data
model.fit(X_train, y_train)

In the next step we will test our model:

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score, f1_score, make_scorer

train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

# Calculate the F1 score of the model (true labels first, predictions second)
print('The training F1 Score is', f1_score(y_train, y_train_pred))
print('The testing F1 Score is', f1_score(y_test, y_test_pred))


#Results
#The training accuracy is 1.0
#The test accuracy is 0.815642458101
#The training F1 Score is 1.0
#The testing F1 Score is 0.765957446809


We can see that the training accuracy is 1.0 while the test accuracy is only about 0.82. This indicates that the model is overfitted: it performs perfectly on the training data with 100% accuracy, but the testing accuracy is only 81.5%. Another metric that was calculated is the F1 score, which is bounded between 0 and 1. It also reflects the performance of the model, and the higher the F1 score the better. Later we will compare this F1 score with the F1 scores of the models below.
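As a reminder, the F1 score is the harmonic mean of precision and recall. Here is a minimal sketch with made-up labels, purely to illustrate how the value is computed:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true labels and predictions, just for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# f1_score returns the harmonic mean of precision and recall
print(f1_score(y_true, y_pred))
print(2 * precision * recall / (precision + recall))  # same value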

The following arbitrarily chosen parameters will be used for the second model: max_depth=2, min_samples_split=6, min_samples_leaf=3.

# Define the model with the arbitrary chosen parameters 
model_arb = DecisionTreeClassifier(max_depth=2, min_samples_leaf=3, 
                                   min_samples_split=6) 
model_arb.fit(X_train, y_train)

# Make predictions
y_arb_train_pred = model_arb.predict(X_train)
y_arb_test_pred = model_arb.predict(X_test)

# Calculate the accuracy
train_accuracy_arb = accuracy_score(y_train, y_arb_train_pred)
test_accuracy_arb = accuracy_score(y_test, y_arb_test_pred)
print('The training accuracy is', train_accuracy_arb)
print('The test accuracy is', test_accuracy_arb)
print('The training F1 Score is', f1_score(y_train, y_arb_train_pred))
print('The testing F1 Score is', f1_score(y_test, y_arb_test_pred))

#Results
#The training accuracy is 0.7921348314606742
#The test accuracy is 0.7653631284916201
#The training F1 Score is 0.628140703517588
#The testing F1 Score is 0.631578947368421
 
  

We can see that the accuracy decreased for both the training and the testing sets, to 0.79 and 0.77 respectively. The F1 score dropped as well, to about 0.63 on both the training and the testing sets.

Finally, we can use the GridSearchCV class, which provides the machinery to find the best hyperparameters for your model. All the hyperparameter values we want to try are listed in the parameters dictionary, and our metric for picking the best combination is the F1 score. The model, the parameters dictionary and the scorer object are then passed to GridSearchCV. With 11 candidate values for each of the three parameters, the grid contains 11 × 11 × 11 = 1331 combinations, each of which is evaluated with cross-validation.

from sklearn.model_selection import GridSearchCV

dtgs_model = DecisionTreeClassifier(random_state=42)

# Create the dictionary of parameter values you wish to try.
parameters = {'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]}

# Make an f1_score scoring object.
scorer = make_scorer(f1_score)

# Perform grid search on the classifier using 'scorer' as the scoring method.
grid_obj = GridSearchCV(dtgs_model, parameters, scoring=scorer)

# Fit the grid search object to the training data and find the optimal parameters.
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best estimator.
best_clf = grid_fit.best_estimator_

# Fit the new model.
best_clf.fit(X_train, y_train)

# Make predictions using the new model.
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# Calculate the accuracy
train_accuracy_dtgs = accuracy_score(y_train, best_train_predictions)
test_accuracy_dtgs = accuracy_score(y_test, best_test_predictions)
print('The training accuracy is', train_accuracy_dtgs)
print('The test accuracy is', test_accuracy_dtgs)

# Calculate the f1_score of the new model.
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))


# Let's also explore what parameters ended up being used in the new model.
best_clf

#Results
#The training accuracy is 0.8764044943820225
#The test accuracy is 0.8491620111731844
#The training F1 Score is 0.8253968253968255
#The testing F1 Score is 0.8085106382978724
#DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
#            max_features=None, max_leaf_nodes=None,
#            min_impurity_decrease=0.0, min_impurity_split=None,
#            min_samples_leaf=7, min_samples_split=2,
#            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
#            splitter='best')

 
  

We can see that the model with hyperparameters max_depth=8, min_samples_leaf=7 and min_samples_split=2 has the highest testing accuracy (0.849) and F1 score (0.809) compared to the previous models. These hyperparameters were found by a cross-validated grid search over the parameter grid: the criterion for choosing the best model was the highest F1 score among all combinations of hyperparameters defined in the parameters dictionary. It is also important to note that the training accuracy decreased from 1.0 to 0.876 compared to the model with default parameters. This indicates that the obtained model generalizes better and is not overfitted like the first model. At the same time, the training accuracy is higher here than in the model with the arbitrarily chosen parameters.
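If you want to check which combination won and how every combination scored during cross-validation, you can inspect the fitted grid search object directly. A small sketch, reusing the grid_fit object and the pandas and display imports from earlier in the notebook:

# The winning combination and its mean cross-validated F1 score
print(grid_fit.best_params_)
print(grid_fit.best_score_)

# cv_results_ stores the scores of every combination tried;
# sort by rank and look at the five best ones
results = pd.DataFrame(grid_fit.cv_results_)
display(results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']].head())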

To sum up, the Grid Search method and its implementation in scikit-learn is an efficient way to find hyperparameters for a model. It is crucial to pick parameters with which the model generalizes the data well, avoiding models that are too simple (underfitted) or too complicated (overfitted).

References:

  1. Machine Learning course, Udacity
  2. scikit-learn documentation

Faisal Ansari

DATA ANALYST | Total 10 YOE | Excel | SQL | Python | Power BI | Tableau

10 months ago

Nicely explained tuning of the hyperparameters of a classifier using GridSearchCV, and one-hot encoding is used efficiently. For those who are new to data science, note that we can use the range() function instead of giving a list of parameters: parameters = {'max_depth': list(range(2, 13)), 'min_samples_leaf': list(range(2, 13)), 'min_samples_split': list(range(2, 13))}

Juan Jose Piñero

Senior Data Scientist and Biostatistician

1 year ago

What if in the final model I don't want to automatically choose the best parameters but fix one of them?

