ML Pipelines for Model Tuning
It happened to me many times before scikit-learn ML pipelines entered my life: staring blankly at a dataset and a collection of models, wondering how to find the one that fits best. Pipelines were a bit confusing in the beginning, but after using them in several projects, I cannot do without them. ML pipelines are remarkable for their simplicity and neatness and, if used properly, can increase your (and your model's) productivity by leaps and bounds.
To demonstrate, I model the Telco customer churn dataset, which is freely available on Kaggle. The data describes customer 'churn': in simple terms, the company needs to understand why and when a customer may choose to leave. First, let's import the necessary modules and load the data:
import pandas as pd
import numpy as np
df = pd.read_csv('path/data')
Now have a look at the dataset and try to understand it. The quick way to check the columns and data types of a pandas dataframe is:
df.info()
The output is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerid 7043 non-null object
1 gender 7043 non-null object
2 seniorcitizen 7043 non-null int64
3 partner 7043 non-null object
4 dependents 7043 non-null object
5 tenure 7043 non-null float64
6 phoneservice 7043 non-null object
7 multiplelines 7043 non-null object
8 internetservice 7043 non-null object
9 onlinesecurity 7043 non-null object
10 onlinebackup 7043 non-null object
11 deviceprotection 7043 non-null object
12 techsupport 7043 non-null object
13 streamingtv 7043 non-null object
14 streamingmovies 7043 non-null object
15 contract 7043 non-null object
16 paperlessbilling 7043 non-null object
17 paymentmethod 7043 non-null object
18 monthlycharges 7043 non-null float64
19 totalcharges 7043 non-null float64
20 churn 7043 non-null object
dtypes: float64(3), int64(1), object(17)
memory usage: 1.1+ MB
The data fields are mostly self-explanatory, but I'll introduce some here:
I'll change some of the data types (seniorcitizen, tenure and totalcharges) to make them suitable for modeling:
df = df.astype({"seniorcitizen": int})
df = df.astype({"tenure": 'float64'})
The totalcharges column contains some blank strings, so it needs a little more effort:
df['totalcharges'] = df.totalcharges.apply(lambda x: pd.to_numeric(x) if x != ' ' else 0)
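As an alternative to the row-wise lambda, the same conversion can be sketched with a vectorized call. This is only an illustration on a made-up three-value series (the name `raw` and its values are assumptions, not rows from the dataset); it relies on `pd.to_numeric` with `errors='coerce'`, which turns unparseable strings into NaN.

```python
import pandas as pd

# Hypothetical stand-in for the raw totalcharges column, including one blank entry
raw = pd.Series(['29.85', ' ', '1889.5'], name='totalcharges')

# errors='coerce' maps strings that cannot be parsed to NaN;
# fillna(0) then mirrors the "else 0" branch of the lambda version
clean = pd.to_numeric(raw, errors='coerce').fillna(0)
print(clean.tolist())
```

Whether 0 is the right fill value is a modeling decision; the sketch simply keeps the choice made in the article.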
Now it's time for some quick exploratory analysis. One good thing about this dataset is that it has no missing values apart from the blank totalcharges entries handled above, but there can be hidden problems too, like outliers. We'll check for those with the help of a pandas boxplot:
df.boxplot(column = ['monthlycharges', 'tenure'], by = ['churn', 'gender'], figsize=(15, 5))
This gives a few quick observations:
For now, we accept the outliers in data and jump directly to feature correlations. The idea is to drop out correlated and/or redundant features.
import seaborn as sns
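# correlations are only defined for numeric columns, so select those explicitly
sns.heatmap(df.select_dtypes('number').corr(), cmap = 'coolwarm', annot = True)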
The outcome: tenure and totalcharges are correlated; monthlycharges and totalcharges are correlated. We drop totalcharges from the dataset and proceed.
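Reading a heatmap by eye works for a handful of columns; for more, the correlated pairs can be listed programmatically. The following is a rough sketch on synthetic data, not on the churn dataset: the frame `toy`, the uniform ranges and the 0.5 threshold are all assumptions chosen for illustration, with totalcharges built as tenure times monthlycharges to imitate the relationship seen above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: totalcharges is constructed from the other two columns
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    'tenure': rng.uniform(1, 72, n),
    'monthlycharges': rng.uniform(20, 120, n),
})
toy['totalcharges'] = toy['tenure'] * toy['monthlycharges']

# Absolute pairwise correlations; report each pair above an (assumed) threshold once
corr = toy.corr().abs()
threshold = 0.5
cols = corr.columns
pairs = [(cols[i], cols[j])
         for i in range(len(cols))
         for j in range(i + 1, len(cols))
         if corr.iloc[i, j] > threshold]
print(pairs)
```

A column that appears in several flagged pairs, like totalcharges here, is a natural candidate to drop.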
Importing required modules:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.inspection import permutation_importance
from sklearn import set_config
I will use the make_pipeline shortcut instead of the regular Pipeline for its utter simplicity. I have chosen five classification models: logistic regression, k-nearest neighbors, decision tree, random forest (an ensemble method) and finally XGBoost (another ensemble method).
# classification models loading
clf = LogisticRegression()
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier()
rfcl = RandomForestClassifier()
xgb = XGBClassifier()
In our dataset, we have both categorical and numerical data. Most classifiers expect numeric input and do not understand categorical data as 'categories', so we need to encode it. We will do that via one-hot encoding (OHE). The numerical data also needs to be standardized, for which I used StandardScaler. Notice that for binary variables, OHE creates a redundant, perfectly correlated feature. This is addressed by setting drop = 'if_binary' in OneHotEncoder.
num_transformer = StandardScaler()
num_col = make_column_selector(dtype_include = ['float64'])
cat_transformer = OneHotEncoder(drop = 'if_binary')
cat_col = make_column_selector(dtype_include = ['object'])
preproc = make_column_transformer(
    (num_transformer, num_col),
    (cat_transformer, cat_col),
    remainder = 'passthrough'
)
Before diving into the pipelines, let's have a quick look at our data and tidy it up with some trimming and transformations. It's necessary to have sensible data before splitting into training and testing sets. For example, we definitely need to drop a few columns, such as customerid (useless for modeling) and totalcharges (highly correlated with tenure and monthlycharges). The target column, 'churn', also needs to be transformed to the integer values 0 and 1.
# note: churn 'Yes' maps to 0, so class 1 stands for customers who stay
df['target'] = df.churn.apply(lambda x: 0 if x == 'Yes' else 1)
X = df.drop(columns = ['customerid','totalcharges','churn', 'target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
The much-awaited pipeline started taking form above. Numerical and categorical columns are chosen with the make_column_selector function (a helper from sklearn.compose), and the numerical or categorical transformation is applied accordingly. Next, the preprocessor is constructed by joining the parallel transformations (categorical and numerical) using the make_column_transformer shortcut. Note that this method does not require explicitly naming each step. Then how do we find out the parameter names while tuning the model? We'll find out soon. Let's complete the pipeline with a model:
clf_pipe = make_pipeline(preproc, clf)
Let's make the pipeline a little more extensive by adding cross-validation and parameter tuning. This can be achieved in different ways. The simplest is a grid search CV over a pre-defined parameter grid. Random search CV can also be used, but for this example let's stick to grid search.
clf_params = {'logisticregression__C' : [1, 10, 50, 100, 200]
             }
clf_grid = GridSearchCV(clf_pipe, param_grid = clf_params,
                        cv = 5, scoring = 'f1', n_jobs = -2, verbose = 1)
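For comparison, the random search mentioned above could look roughly like this. It is a sketch on synthetic data from `make_classification`, not the churn data; the log-uniform range of 1 to 200 is an assumption that simply spans the same interval as the grid, and all variable names are hypothetical.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem standing in for the real data
Xr, yr = make_classification(n_samples=300, n_features=6, random_state=0)
rand_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Instead of a fixed list, C is sampled n_iter times from a continuous distribution
rand_search = RandomizedSearchCV(
    rand_pipe,
    param_distributions={'logisticregression__C': loguniform(1, 200)},
    n_iter=10, cv=5, scoring='f1', random_state=0,
)
rand_search.fit(Xr, yr)
best_C = rand_search.best_params_['logisticregression__C']
print(best_C)
```

The number of fits is fixed by n_iter rather than by the grid size, which is the main attraction when several parameters are tuned at once.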
The tuning parameters and scoring are now in place. The scoring metric is the f1-score because our target is imbalanced and we need a balance between precision and recall. The class imbalance can be checked quickly with:
df.churn.value_counts()
which gives the output:
No 5174
Yes 1869
Name: churn, dtype: int64
It is a good thing that more customers stay than leave, but you get the idea: you have to be extra cautious when the data is imbalanced.
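A quick worked example shows why accuracy is a poor guide here. The sketch below uses the class counts from the output above and a dummy "model" that always predicts the majority class. For illustration only, the minority class is labeled 1, which is an assumption of this example and differs from the target encoding used in the article.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the value_counts output above
n_stay, n_churn = 5174, 1869

# Illustration only: 0 = stays (majority), 1 = churns (minority)
y_true = np.array([0] * n_stay + [1] * n_churn)

# A dummy "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(round(acc, 3), f1)
```

The dummy model is right about 73% of the time without finding a single churner, while its f1-score for the churn class is zero.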
A word on how to construct the parameter grid: with the make_pipeline shortcut, step names are not given explicitly; they are generated from the lowercased class names. To find the available parameter keys, type:
clf_pipe.get_params().keys()
>>dict_keys(['memory', 'steps', 'verbose',
'columntransformer', 'logisticregression',
'columntransformer__n_jobs', 'columntransformer__remainder',
'columntransformer__sparse_threshold', 'columntransformer__transformer_weights',
'columntransformer__transformers', 'columntransformer__verbose',
'columntransformer__verbose_feature_names_out', 'columntransformer__standardscaler',
'columntransformer__onehotencoder', 'columntransformer__standardscaler__copy',
'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std',
'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop',
'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown',
'columntransformer__onehotencoder__sparse', 'logisticregression__C', 'logisticregression__class_weight',
'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling',
'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class',
'logisticregression__n_jobs', 'logisticregression__penalty', 'logisticregression__random_state',
'logisticregression__solver', 'logisticregression__tol', 'logisticregression__verbose',
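The naming scheme behind these keys can be checked on a stripped-down pipeline. This is a minimal sketch with a hypothetical two-step pipeline (`toy_pipe`), not the article's pipeline: each step is named after its lowercased class, and nested parameters are addressed as step name, double underscore, parameter name.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated automatically from the class names
step_names = [name for name, _ in toy_pipe.steps]
print(step_names)

# Keys ending in '__C' show how the regularization strength is addressed
c_keys = [key for key in toy_pipe.get_params() if key.endswith('__C')]
print(c_keys)

# The same key works with set_params and in a GridSearchCV parameter grid
toy_pipe.set_params(logisticregression__C=10)
```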
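And to find the available scorers:
from sklearn.metrics import SCORERS
# in recent scikit-learn releases, where SCORERS is no longer available,
# sklearn.metrics.get_scorer_names() lists the same names
SCORERS.keys()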
>>dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error',
'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error',
'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance',
'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr',
'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy',
'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score',
'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score',
'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score',
'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted',
'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1',
'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro',
'jaccard_samples', 'jaccard_weighted'])
Now comes the time for fitting:
clf_grid.fit(X_train, y_train)
The grid search takes a while, depending on how many fits have to be done. Once it finishes, you can access the best score of the model as:
clf_grid.best_score_
And the best parameters:
clf_grid.best_params_
>>{'logisticregression__C': 200}
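With the default refit = True, GridSearchCV also retrains the best configuration on the whole training set, so the tuned pipeline is available directly. The sketch below illustrates this on synthetic data from `make_classification`; the data, the class weights and the small C grid are assumptions for the example, and all names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in for the real data
Xs, ys = make_classification(n_samples=500, n_features=8,
                             weights=[0.73, 0.27], random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, stratify=ys, random_state=42)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, param_grid={'logisticregression__C': [0.1, 1, 10]},
                    cv=5, scoring='f1')
grid.fit(Xs_train, ys_train)

# The winning pipeline, already refitted on all of Xs_train
best_pipe = grid.best_estimator_

# grid.score applies the same 'f1' scorer to held-out data
test_f1 = grid.score(Xs_test, ys_test)
print(grid.best_params_, round(grid.best_score_, 3), round(test_f1, 3))
```

Rebuilding the model by hand from best_params_ gives the same configuration; best_estimator_ just saves the extra step.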
The final model can be built with this information and evaluated on the test data. I used five models in total, but to save time and keep things from getting boring, I will show the complete grid search and final model selection only for XGBoost. The procedure to find the pipeline parameters is the same (name_of_pipeline.get_params().keys()), so you can use any relevant model of your choice. Let's get started with the XGBoost model:
xgb_pipe = make_pipeline(preproc, xgb)
xgb_params = {'xgbclassifier__max_depth' : [5, 7, 10, 15, 20],
              'xgbclassifier__min_child_weight' : [3, 5, 10, 15, 20],
              'xgbclassifier__n_estimators': [1, 3, 5, 10, 15, 20, 25, 30]
             }
xgb_grid = GridSearchCV(xgb_pipe, param_grid = xgb_params,
                        scoring = 'f1', cv = 5, n_jobs = -2, verbose = 1)
xgb_grid.fit(X_train, y_train)
Then we find out the best score and parameters:
xgb_grid.best_score_
xgb_grid.best_params_
>>{'xgbclassifier__max_depth': 5,
'xgbclassifier__min_child_weight': 20,
'xgbclassifier__n_estimators': 10}
Time to build the final model:
xgb_fin = XGBClassifier(max_depth = 5, min_child_weight = 20,
                        n_estimators = 10)
xgb_pipe_fin = make_pipeline(preproc, xgb_fin)
xgb_pipe_fin.fit(X_train, y_train)
Let's check the fit on the test data:
The score is slightly lower than the validation score, so there is possibly a mild overfitting issue. In the next write-up, we shall see more visual ways to address overfitting.
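For readers who want to see how the f1-score relates to precision and recall, here is a small worked example. The labels and predictions are made up (ten hypothetical customers), not results from the model above.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions: 4 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 4 / 5
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 4 / 6
f1 = f1_score(y_true, y_pred)         # harmonic mean of p and r = 8 / 11

print(round(p, 3), round(r, 3), round(f1, 3))
```

Because it is a harmonic mean, the f1-score is pulled toward the weaker of the two values, which is why it rewards a balance between precision and recall.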
We got a reasonably good balance between precision and recall with the grid search method. Now we will have a quick look at what the company can learn from our final model. For this purpose, we will use permutation importance, a technique that shuffles the values of one column at a time and measures how much the f1-score drops. The larger the drop, the more important the feature.
from sklearn.inspection import permutation_importance
result_train = permutation_importance(xgb_pipe_fin, X_train, y_train,
                                      scoring = 'f1', n_repeats = 20, random_state = 42, n_jobs = -2)
sorted_feature_idx_train = result_train.importances_mean.argsort()
feature_df_train = pd.DataFrame(result_train.importances[sorted_feature_idx_train].T,
                                columns = X_train.columns[sorted_feature_idx_train])
result_test = permutation_importance(xgb_pipe_fin, X_test, y_test,
                                     scoring = 'f1', n_repeats = 20, random_state = 42, n_jobs = -2)
# reuse the training-set order so both plots list the features in the same sequence
sorted_feature_idx_test = sorted_feature_idx_train
feature_df_test = pd.DataFrame(result_test.importances[sorted_feature_idx_test].T,
                               columns = X_test.columns[sorted_feature_idx_test])
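To make the mechanics concrete, the idea behind permutation importance can be sketched by hand. This is a simplified illustration on synthetic data, not what `permutation_importance` does internally in every detail: it uses a single shuffle per column instead of repeated ones, and the data, model and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Synthetic data in which only column 0 carries signal
Xp = rng.normal(size=(400, 3))
yp = (Xp[:, 0] > 0).astype(int)

model = LogisticRegression().fit(Xp, yp)
baseline = f1_score(yp, model.predict(Xp))

# Shuffle one column at a time and record how far the f1-score falls
drops = []
for col in range(Xp.shape[1]):
    X_shuffled = Xp.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    drops.append(baseline - f1_score(yp, model.predict(X_shuffled)))

print(np.round(drops, 3))
```

Shuffling the informative column breaks its link to the target and the score falls sharply, while shuffling a noise column leaves it almost unchanged.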
We plot the train and test importances together, both ordered by the sorted training-set importance, to check for changes in the feature ranking. This gives an idea of whether there is any over- or underfitting issue:
import matplotlib.pyplot as plt
feature_df_train.mean(axis = 0).plot(kind='barh', title = 'reduction of f1 score', figsize=(20,20), color = 'red', alpha = 0.5)
feature_df_test.mean(axis = 0).plot(kind='barh', title = 'reduction of f1 score', figsize=(20,20), color = 'blue', alpha = 0.5)
As expected, tenure length and contract type are the most important criteria for churn. There is a difference between train and test feature importance. In the future, with more visual parameter tuning criteria, we will try to minimise this. On the bright side, the above plot can be used as a diagnostic tool to check model drift in a dynamic model.
