ML Pipelines for Model Tuning
image credit: Generated by the author with DALL-E


Before Scikit-learn ML pipelines were a thing in my life, it happened to me many times: staring blankly at a dataset and a collection of models, wondering how to find the one that fits best. Pipelines were a bit confusing in the beginning, but after using them in several projects, I cannot do without them. ML pipelines are remarkable for their simplicity and neatness, and if used properly, they can also increase your (and your model's) productivity by leaps and bounds.

To demonstrate, I model the telco customer churn dataset, which is freely available on Kaggle. The data concerns the phenomenon of customer 'churn': in simple terms, the company needs to understand why and when a customer may choose to leave. First, let's import some necessary modules and load the data:

import pandas as pd
import numpy as np

df = pd.read_csv('path/data')        

Now let's have a look at the dataset and try to understand it. The quick way to check the columns and data types of a pandas DataFrame is:

df.info()        

The output is:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerid        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   seniorcitizen     7043 non-null   int64  
 3   partner           7043 non-null   object 
 4   dependents        7043 non-null   object 
 5   tenure            7043 non-null   float64
 6   phoneservice      7043 non-null   object 
 7   multiplelines     7043 non-null   object 
 8   internetservice   7043 non-null   object 
 9   onlinesecurity    7043 non-null   object 
 10  onlinebackup      7043 non-null   object 
 11  deviceprotection  7043 non-null   object 
 12  techsupport       7043 non-null   object 
 13  streamingtv       7043 non-null   object 
 14  streamingmovies   7043 non-null   object 
 15  contract          7043 non-null   object 
 16  paperlessbilling  7043 non-null   object 
 17  paymentmethod     7043 non-null   object 
 18  monthlycharges    7043 non-null   float64
 19  totalcharges      7043 non-null   float64
 20  churn             7043 non-null   object 
dtypes: float64(3), int64(1), object(17)
memory usage: 1.1+ MB

The data fields are mostly self-explanatory, but I'll introduce some here:

  • customerid - a unique ID for each customer
  • gender - gender of the customer
  • seniorcitizen - whether the customer is a senior citizen, as 0 or 1
  • tenure - how long the customer has stayed with the company, in months
  • phoneservice, internetservice etc. - extra services with phone contract
  • contract - type of contract (month-to-month, one-year, or two-year)
  • monthlycharges - monthly charges
  • totalcharges - total amount paid by the customer
  • churn - our target variable as Yes or No

I'll change some of the data types (seniorcitizen, tenure and totalcharges) to make them suitable for modeling:

df = df.astype({"seniorcitizen": int})
df = df.astype({"tenure": 'float64'})

The totalcharges column has some empty strings, so we need to put in a little more effort there:

df['totalcharges'] = df.totalcharges.apply(lambda x: pd.to_numeric(x) if x != ' ' else 0)        
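An equivalent vectorized form (my variant, with the same effect as the apply above):

# coerce anything non-numeric (the blank strings) to NaN, then fill with 0
df['totalcharges'] = pd.to_numeric(df.totalcharges, errors = 'coerce').fillna(0)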

Now it's time for some quick exploratory analysis. One good thing about this dataset is that it has no missing values, but there can be hidden problems too, such as outliers. We'll check for those with the help of a pandas boxplot:


df.boxplot(column = ['monthlycharges', 'tenure'], by = ['churn', 'gender'], figsize=(15, 5))        
[Boxplots of monthlycharges and tenure, grouped by churn and gender]

This gives a few quick observations:

  • Customers who churn have higher median monthly charges.
  • The median tenure of churning customers is lower.
  • Gender does not significantly affect the churning population.

For now, we accept the outliers in the data and jump directly to feature correlations. The idea is to drop correlated and/or redundant features.

import seaborn as sns
sns.heatmap(df.corr(), cmap = 'coolwarm', annot = True)        
[Correlation heatmap of the numeric features]
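A small compatibility note: on pandas 2.0 and newer, corr() no longer silently drops non-numeric columns, so you may need to be explicit:

# restrict the correlation to numeric columns on newer pandas
sns.heatmap(df.corr(numeric_only = True), cmap = 'coolwarm', annot = True)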

The outcome: tenure and totalcharges are correlated, and so are monthlycharges and totalcharges. We will therefore drop totalcharges when we assemble the feature matrix before the train/test split below.


Importing required modules:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_validate

from sklearn.pipeline import make_pipeline

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.inspection import permutation_importance
from sklearn import set_config

I will use the make_pipeline shortcut instead of the regular Pipeline for its utter simplicity. I have chosen five classification models: logistic regression, k-nearest neighbors, decision tree, random forest (an ensemble method) and finally XGBoost (another ensemble method).

clf = LogisticRegression()
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier()
rfcl = RandomForestClassifier()
xgb = XGBClassifier()        

In our dataset, we have both categorical and numerical data. Classification models do not understand categorical data as 'categories', so there is an obvious need to address that. We will do so via one-hot encoding (OHE). The numerical data also need to be rescaled; I used StandardScaler, which standardizes each feature to zero mean and unit variance. Notice that for binary variables, OHE creates a redundant, perfectly correlated feature; this is addressed by setting drop = 'if_binary' in the encoder.

num_transformer = StandardScaler()
num_col = make_column_selector(dtype_include = ['float64'])
cat_transformer = OneHotEncoder(drop = 'if_binary')
cat_col = make_column_selector(dtype_include = ['object'])
preproc = make_column_transformer(
                                 (num_transformer, num_col),
                                 (cat_transformer, cat_col),
                                 remainder = 'passthrough'
)
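To see what drop = 'if_binary' does, here is a minimal sketch on a toy column (the sparse_output argument assumes scikit-learn 1.2+; use sparse = False on older versions):

# a two-level feature yields a single 0/1 column instead of two redundant ones
enc = OneHotEncoder(drop = 'if_binary', sparse_output = False)
enc.fit_transform(pd.DataFrame({'partner': ['Yes', 'No', 'Yes']}))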

Before diving into the pipelines, let's trim and transform the data so we have something sensible to split into training and testing sets. We definitely need to drop a few columns: customerid (totally useless for modeling purposes) and totalcharges (highly correlated with both tenure and monthlycharges) are the obvious cases. The target column, churn, also needs to be transformed to integer values 0 and 1.

df['target'] = df.churn.apply(lambda x: 0 if x == 'Yes' else 1)
X = df.drop(columns = ['customerid','totalcharges','churn', 'target'])
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
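Given the class imbalance we will see shortly, a stratified split is a sensible variant (my tweak, not part of the original run):

# stratify = y keeps the churn Yes/No ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)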

The much-awaited pipeline started taking form above: numerical and categorical columns are picked out with the make_column_selector function, and the corresponding transformations are applied to each. The preprocessor is then constructed by joining the two parallel transformations with the make_column_transformer shortcut. Note that these shortcuts do not require explicitly naming each step, so how do we find out the parameter names while tuning the model? We'll find out soon. Let's complete the pipeline with a model:

clf_pipe = make_pipeline(preproc, clf)        
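As a side note, the set_config import from earlier lets a notebook render the assembled pipeline as a diagram, which is handy for verifying its structure:

set_config(display = 'diagram')   # in a notebook, clf_pipe now displays as an interactive diagram
clf_pipe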

Let's make the pipeline a little more extensive, with cross validation across train and validation folds and parameter-tuning capabilities. This can be achieved in different ways. The simplest is a grid search CV with a pre-defined parameter grid. A random search CV can also be used, but for this example let's stick to grid search.

clf_params = {'logisticregression__C' : [1, 10, 50, 100, 200]}

clf_grid = GridSearchCV(clf_pipe, param_grid = clf_params,
            cv = 5, scoring = 'f1', n_jobs = -2, verbose = 1)
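For comparison, a randomized search over the same pipeline could look like the sketch below; the candidate range for C is illustrative:

clf_rand = RandomizedSearchCV(clf_pipe,
           param_distributions = {'logisticregression__C': np.logspace(-2, 3, 50)},
           n_iter = 20, cv = 5, scoring = 'f1', n_jobs = -2,
           random_state = 42, verbose = 1)
clf_rand.fit(X_train, y_train)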

The tuning parameters and the scoring are now in place. The scoring parameter used is f1 because our target is imbalanced and we need a balance between precision and recall. The imbalance can be quickly checked using

df.churn.value_counts()        

This gives the output:

No     5174
Yes    1869
Name: churn, dtype: int64        

It's a good thing for the company that more customers stay than leave, but you get the idea: you have to be extra cautious when the data is imbalanced.
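To see the imbalance as proportions rather than raw counts:

df.churn.value_counts(normalize = True)   # roughly 73% No vs. 27% Yes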

Also, a word on how to construct the parameter grid: with the make_pipeline shortcut, the pipeline step names are never written out explicitly. To find the available parameter keys, type

clf_pipe.get_params().keys()

>>dict_keys(['memory', 'steps', 'verbose', 
'columntransformer', 'logisticregression', 
'columntransformer__n_jobs', 'columntransformer__remainder', 
'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 
'columntransformer__transformers', 'columntransformer__verbose', 
'columntransformer__verbose_feature_names_out', 'columntransformer__standardscaler', 
'columntransformer__onehotencoder', 'columntransformer__standardscaler__copy', 
'columntransformer__standardscaler__with_mean', 'columntransformer__standardscaler__with_std',
 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 
'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown',
 'columntransformer__onehotencoder__sparse', 'logisticregression__C', 'logisticregression__class_weight',
 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling',
 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 
'logisticregression__n_jobs', 'logisticregression__penalty', 'logisticregression__random_state', 
'logisticregression__solver', 'logisticregression__tol', 'logisticregression__verbose',
 'logisticregression__warm_start'])        

And to find the available scorers:

from sklearn.metrics import SCORERS   # deprecated on newer scikit-learn; use sklearn.metrics.get_scorer_names() instead
SCORERS.keys()

>>dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 
'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 
'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 
'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 
'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 
'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score',
 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score',
'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score',
 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted',
 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 
'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro',
 'jaccard_samples', 'jaccard_weighted'])        

Now comes the time for fitting:

clf_grid.fit(X_train, y_train)        

The grid search takes a while, depending on how many fits are to be done. Once it finishes, you can access the best score and parameters of the model:

clf_grid.best_score_
>>0.8689816906667585        

And the best parameters:

clf_grid.best_params_
>>{'logisticregression__C': 200}        
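Beyond the single best score, the cross_validate helper imported earlier can report several metrics for the tuned estimator at once; a quick sketch:

# evaluate the best pipeline found by the grid search with two scorers
cv_results = cross_validate(clf_grid.best_estimator_, X_train, y_train,
             cv = 5, scoring = ['f1', 'accuracy'], n_jobs = -2)
cv_results['test_f1'].mean(), cv_results['test_accuracy'].mean()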

The final model can be built with this information and evaluated on the test data. I used five models in total, but to save some time and keep things from getting boring, I will show the complete grid search and final model selection only for XGBoost. The procedure for finding the pipeline parameters is the same (name_of_pipeline.get_params().keys()), so you can use any relevant model of your choice. Let's get started with the XGBoost model:

xgb_pipe = make_pipeline(preproc, xgb)
xgb_params = {'xgbclassifier__max_depth' : [5, 7, 10, 15, 20],
              'xgbclassifier__min_child_weight' : [3, 5, 10, 15, 20],
              'xgbclassifier__n_estimators': [1, 3, 5, 10, 15, 20, 25, 30]}
xgb_grid = GridSearchCV(xgb_pipe, param_grid = xgb_params,
           scoring = 'f1', cv = 5, n_jobs = -2, verbose = 1)

xgb_grid.fit(X_train, y_train)

Then we look up the best score and parameters:

xgb_grid.best_score_
>>0.8734928830131871

xgb_grid.best_params_
>>{'xgbclassifier__max_depth': 5,
 'xgbclassifier__min_child_weight': 20,
 'xgbclassifier__n_estimators': 10}

Time to build the final model:

xgb_fin = XGBClassifier(max_depth = 5, min_child_weight = 20,\
          n_estimators = 10

xgb_pipe_fin = make_pipeline(preproc, xgb_fin))
xgb_pipe_fin.fit(X_train, y_train)        

Let's check the fit on the test data:

xgb_fin_score = f1_score(y_test, xgb_pipe_fin.predict(X_test))
xgb_fin_score
>>0.8732710280373832
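For an explicit look at the precision/recall balance on the test set, here is a quick check (classification_report is a standard sklearn helper, not used above):

from sklearn.metrics import classification_report
print(classification_report(y_test, xgb_pipe_fin.predict(X_test)))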

The test score is slightly lower than the validation score, so there is possibly a mild overfitting issue. In the next write-up, we shall see more visual ways to address overfitting.

We got a reasonably good balance between precision and recall with the grid search method. Now let's take a quick look at what the company can learn from our final model. For this purpose, we will use permutation importance: a technique that shuffles one feature at a time and measures how much the f1-score drops. The bigger the drop, the more important the feature.

from sklearn.inspection import permutation_importance

result_train = permutation_importance(xgb_pipe_fin, X_train, y_train,
               scoring = 'f1', n_repeats = 20, random_state = 42, n_jobs = -2)
sorted_feature_idx = result_train.importances_mean.argsort()
feature_df_train = pd.DataFrame(result_train.importances[sorted_feature_idx].T,
                   columns = X_train.columns[sorted_feature_idx])

result_test = permutation_importance(xgb_pipe_fin, X_test, y_test,
              scoring = 'f1', n_repeats = 20, random_state = 42, n_jobs = -2)
# reuse the training-set ordering so the two plots line up feature by feature
feature_df_test = pd.DataFrame(result_test.importances[sorted_feature_idx].T,
                  columns = X_test.columns[sorted_feature_idx])

We plot the train and test importances together, both in the order of the sorted training-set importances, to check for any change in the feature ranking. This gives an idea of whether there is any over- or underfitting issue:

import matplotlib.pyplot as plt

feature_df_train.mean(axis = 0).plot(kind = 'barh', figsize = (20, 20), color = 'red', alpha = 0.5, label = 'train')
feature_df_test.mean(axis = 0).plot(kind = 'barh', figsize = (20, 20), color = 'blue', alpha = 0.5, label = 'test')
plt.title('reduction of f1 score')
plt.legend()
plt.show()
[Horizontal bar chart: mean reduction in f1 score per feature, train (red) vs. test (blue)]

As expected, tenure length and contract type are the most important criteria for churn. There is a difference between the train and test feature importances; in the future, with more visual parameter-tuning criteria, we will try to minimise it. On the bright side, the above plot can be used as a diagnostic tool to check for model drift in a dynamic model.