How to choose hyperparameters for a model with GridSearchCV.

It has always been a challenge to find hyperparameters that are just right for your model. Different methods call for different choices: polynomial regression, for instance, requires you to pick the degree of the polynomial, while decision trees require a maximum depth, a minimum number of samples to split a node, and a minimum number of samples per leaf.

Hyperparameters define the complexity of the model. If the model is too simple, it generalizes our data too much, which we call underfitting. On the other hand, a model that is too complex tends to memorize the answers: it performs almost perfectly on the training data but loses accuracy on the testing data, which is called overfitting.

You can rely on your intuition, and that will probably be fine when you are playing with a single hyperparameter. However, if you have two or more parameters to adjust, the number of possible combinations grows dramatically, making it hard to pick just the right combination for your model.

To determine the optimal hyperparameters of your model, the Grid Search method can be used. It is a straightforward method in which you try every possible combination of the candidate values for each hyperparameter. Of course, going over all possible combinations by hand might take forever, but scikit-learn makes your life a lot easier.
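To make the idea concrete, here is a minimal hand-written sketch of what a grid search does under the hood. The toy dataset and the candidate values below are made up purely for illustration; GridSearchCV automates exactly this loop and handles the cross-validation bookkeeping for you.

from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data just to make the sketch runnable
X, y = make_classification(n_samples=200, random_state=42)

# Candidate values for two hyperparameters: 3 x 3 = 9 combinations
grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 3, 5]}

best_score, best_params = -1.0, None
for max_depth, min_samples_leaf in product(grid['max_depth'], grid['min_samples_leaf']):
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=42)
    # Score each combination with 5-fold cross-validation and keep the best one
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, (max_depth, min_samples_leaf)

print(best_params, best_score)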

For our example we will train a Decision Tree classifier and then test it on the well-known Titanic survival dataset. Let's start by loading the dataset in a Jupyter Notebook and displaying some of its rows.

import numpy as np
import pandas as pd
from IPython.display import display 
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
full_data = pd.read_csv('titanic_data.csv')

# Print the first few entries of the RMS Titanic data
full_data.head()

The output is a table showing the first few rows of the dataset. The features found in this table are:

  • Survived: Outcome of survival (0 = No; 1 = Yes)
  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  • Name: Name of passenger
  • Sex: Sex of the passenger
  • Age: Age of the passenger (Some entries contain NaN)
  • SibSp: Number of siblings and spouses of the passenger aboard
  • Parch: Number of parents and children of the passenger aboard
  • Ticket: Ticket number of the passenger
  • Fare: Fare paid by the passenger
  • Cabin: Cabin number of the passenger (Some entries contain NaN)
  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Because we want to predict whether a passenger survived or not, we will move the Survived variable to a separate data series called outcomes and remove it from the initial data frame. We also assume that the Name variable does not carry any information useful for prediction, so it will be dropped as well.

# Store the 'Survived' feature in a new variable
outcomes = full_data['Survived']

# Remove 'Survived' and 'Name' from the dataset
features_raw = full_data.drop(['Survived', 'Name'], axis=1)
# Show the new dataset with 'Survived' removed
display(features_raw.head())

To preprocess our data, we will one-hot encode the categorical features with the pandas get_dummies function. We will also substitute NaN values with 0.0:

# One-hot encode the categorical features
features = pd.get_dummies(features_raw)
# Replace missing values with 0.0
features = features.fillna(0.0)
display(features.head())

Now we will split our data into training and testing sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, outcomes,
                                                    test_size=0.2, random_state=42)

We will train our model on the training set three times: first with the default hyperparameters (max_depth=None, min_samples_split=2, min_samples_leaf=1), then with arbitrarily chosen parameters, and finally with the parameters obtained from GridSearchCV.

So, let's start by training our model with the default parameters of DecisionTreeClassifier:

# Import the Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

# Define the classifier
model = DecisionTreeClassifier()
# Fit the model to the training data
model.fit(X_train, y_train)

In the next step we will test our model:

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score, f1_score, make_scorer

train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

# Calculate the F1 score of the model (true labels first, predictions second)
print('The training F1 Score is', f1_score(y_train, y_train_pred))
print('The testing F1 Score is', f1_score(y_test, y_test_pred))


#Results
#The training accuracy is 1.0
#The test accuracy is 0.815642458101
#The training F1 Score is 1.0
#The testing F1 Score is 0.765957446809


We can see that the training accuracy is 1.0 while the test accuracy is only about 0.82. This indicates that the model is overfitted: it performs perfectly on the training data with 100% accuracy, but the testing accuracy is only 81.5%. Another metric that was calculated is the F1 score, which is bounded between 0 and 1. It also reflects the performance of the model, and the higher the F1 score the better. Later we will compare this F1 score with the F1 scores of the models below.
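As a reminder, the F1 score is the harmonic mean of precision and recall. Here is a minimal sketch with made-up labels, purely to illustrate how the value is computed:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true labels and predictions, just for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# f1_score returns the harmonic mean of precision and recall
print(f1_score(y_true, y_pred))
print(2 * precision * recall / (precision + recall))  # same value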

The following arbitrarily chosen parameters will be used for the second model: max_depth=2, min_samples_split=6, min_samples_leaf=3.

# Define the model with the arbitrary chosen parameters 
model_arb = DecisionTreeClassifier(max_depth=2, min_samples_leaf=3, 
                                   min_samples_split=6) 
model_arb.fit(X_train, y_train)

# Make predictions
y_arb_train_pred = model_arb.predict(X_train)
y_arb_test_pred = model_arb.predict(X_test)

# Calculate the accuracy
train_accuracy_arb = accuracy_score(y_train, y_arb_train_pred)
test_accuracy_arb = accuracy_score(y_test, y_arb_test_pred)
print('The training accuracy is', train_accuracy_arb)
print('The test accuracy is', test_accuracy_arb)
print('The training F1 Score is', f1_score(y_train, y_arb_train_pred))
print('The testing F1 Score is', f1_score(y_test, y_arb_test_pred))

#Results
#The training accuracy is 0.7921348314606742
#The test accuracy is 0.7653631284916201
#The training F1 Score is 0.628140703517588
#The testing F1 Score is 0.631578947368421
 
  

We can see that the accuracy decreased for both the training and the testing sets, to 0.79 and 0.77 respectively. The F1 score dropped as well, to about 0.63 on both the training and the testing sets.

Finally, we can use the GridSearchCV class, which provides the machinery to find the best hyperparameters for your model. All the hyperparameter values we want to try are listed in the parameters dictionary, and our metric for picking the best combination is the F1 score. The model, the parameters dictionary and the scorer object are then passed to GridSearchCV. With 11 candidate values for each of the three parameters, the grid contains 11 × 11 × 11 = 1331 combinations, each of which is evaluated with cross-validation.

from sklearn.model_selection import GridSearchCV

dtgs_model = DecisionTreeClassifier(random_state=42)

# Create the dictionary of parameter values you wish to try.
parameters = {'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]}

# Make an f1_score scoring object.
scorer = make_scorer(f1_score)

# Perform grid search on the classifier using 'scorer' as the scoring method.
grid_obj = GridSearchCV(dtgs_model, parameters, scoring=scorer)

# Fit the grid search object to the training data and find the optimal parameters.
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best estimator.
best_clf = grid_fit.best_estimator_

# Fit the new model.
best_clf.fit(X_train, y_train)

# Make predictions using the new model.
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# Calculate the accuracy
train_accuracy_dtgs = accuracy_score(y_train, best_train_predictions)
test_accuracy_dtgs = accuracy_score(y_test, best_test_predictions)
print('The training accuracy is', train_accuracy_dtgs)
print('The test accuracy is', test_accuracy_dtgs)

# Calculate the f1_score of the new model.
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))


# Let's also explore what parameters ended up being used in the new model.
best_clf

#Results
#The training accuracy is 0.8764044943820225
#The test accuracy is 0.8491620111731844
#The training F1 Score is 0.8253968253968255
#The testing F1 Score is 0.8085106382978724
#DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
#            max_features=None, max_leaf_nodes=None,
#            min_impurity_decrease=0.0, min_impurity_split=None,
#            min_samples_leaf=7, min_samples_split=2,
#            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
#            splitter='best')

 
  

We can see that the model with hyperparameters max_depth=8, min_samples_leaf=7 and min_samples_split=2 has the highest testing accuracy (0.849) and F1 score (0.809) compared to the previous models. These hyperparameters were found by a cross-validated grid search over the parameter grid: the criterion for choosing the best model was the highest F1 score among all combinations of hyperparameters defined in the parameters dictionary. It is also important to note that the training accuracy decreased from 1.0 to 0.876 compared to the model with default parameters. This indicates that the obtained model generalizes better and is not overfitted like the first model. At the same time, the training accuracy is higher here than in the model with the arbitrarily chosen parameters.
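If you want to check which combination won and how every combination scored during cross-validation, you can inspect the fitted grid search object directly. A small sketch, reusing the grid_fit object and the pandas and display imports from earlier in the notebook:

# The winning combination and its mean cross-validated F1 score
print(grid_fit.best_params_)
print(grid_fit.best_score_)

# cv_results_ stores the scores of every combination tried;
# sort by rank and look at the five best ones
results = pd.DataFrame(grid_fit.cv_results_)
display(results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']].head())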

To sum up, the Grid Search method and its implementation in scikit-learn is an efficient way to find hyperparameters for a model. It is crucial to pick parameters with which the model generalizes the data well, avoiding models that are too simple (underfitted) or too complicated (overfitted).

References:

  1. Machine Learning course, Udacity
  2. scikit-learn documentation

Faisal Ansari

DATA ANALYST | Total 10 YOE | Excel | SQL | Python | Power BI | Tableau

10 months ago

Nicely explained tuning of the hyperparameters of a classifier using GridSearchCV, and one-hot encoding is used efficiently. For those who are new to data science, note that we can use the range() function instead of giving a list of parameters: parameters = {'max_depth': list(range(2, 13)), 'min_samples_leaf': list(range(2, 13)), 'min_samples_split': list(range(2, 13))}

Juan Jose Piñero

Senior Data Scientist and Biostatistician

1 year ago

What if in the final model I don't want to automatically choose the best parameters but fix one of them?

