Data Science Quick Tip #004: Using Custom Transformers in Scikit-Learn Pipelines!
David Hundley
Staff Machine Learning Engineer at State Farm | AI/ML Blogger | Livecoding Streamer
Hi there, all! We’re back again with a follow-up to the last post’s tip on how to create Scikit-Learn pipelines in general. In case you missed that, you can now check it out at this link. (It’s now officially published on Towards Data Science. w00t!) And as always, if you want to follow along directly with this post’s code, you can find it here at my personal GitHub.
To quickly recap where we left off, in the last post we successfully created a Scikit-Learn pipeline that handles the data transformation, scaling, and inference all in one clean little package. But so far, we’ve only made use of Scikit-Learn’s built-in transformers within our pipeline. As great as those transformers are, wouldn’t it be great if we could make use of our own custom transformations? Of course! I’d say it’s not only great, it’s necessary. If you recall from last week’s post, we built a model off a single feature. That’s not very predictive!
So we’re going to remedy that by adding two transformers to handle two additional fields from the training dataset. (I know, going from 1 to 3 features still isn’t great. But hey, at least we tripled it?) The original variable we started with was “Sex” (aka gender), and now we’re going to add in transformers for the “Age” and “Embarked” columns.
Before we jump into our new custom transformers, let’s take care of our library imports. You might recall a lot of these from the last post, but we’re adding a couple of extras. Don’t worry too much about what they are for now; we’ll cover them further down the post.
# Importing the libraries we'll be using for this project
import pandas as pd
import joblib

from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
And we’ll go ahead and do a quick import of our training data.
# Importing the training dataset
raw_train = pd.read_csv('../data/titanic/train.csv')

# Splitting the training data into appropriate training and validation sets
X = raw_train.drop(columns = ['Survived'])
y = raw_train[['Survived']]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 42)
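By the way, if you want to see for yourself why imputation comes up later in this post, a quick (and purely optional) null count on the columns we’ll be working with makes it obvious. This is just a sanity-check sketch, not part of the pipeline itself.

# Optional sanity check: counting missing values in the columns we'll be transforming
print(X_train[['Sex', 'Age', 'Embarked']].isnull().sum())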
Okay, from here on out, we actually won’t be altering the structure of the Scikit-Learn pipeline itself. Sure, we’ll be adding to it, but remember, I intentionally designed my data preprocessor in such a way that it would be easy to add onto. Just to quickly recap from the last post, here’s what the code to build the original pipeline looked like.
# Creating a preprocessor to transform the 'Sex' column
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex'])
])

# Creating our pipeline that first preprocesses the data, then scales the data, then fits the data to a RandomForestClassifier
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])
The first thing we need to do before adding our custom transformers to the pipeline is, of course, to write the functions themselves! As you might have guessed, the custom transformers are built right on top of regular functions, so you can write any Python function you want for the transformer.**** (We’ll get to all those asterisks later…)
Alright, we talked about adding two transformers for two new variables, so let’s get to creating our two custom Python functions! Touching on the “Age” column first, we’re going to have a little extra fun with this variable. Now, I genuinely don’t know whether age itself is a predictive variable here, but my guess is that if “Age” is predictive in any meaningful way, it will be as age categories / age bins. With that in mind, I segregated the ages into categories like “child”, “adult”, “elder”, and more. Again, I have no idea whether this will be more performant than using the raw numbers, but it lets us have some fun! Here’s what the code to do that looks like:
# Creating a function to appropriately engineer the 'Age' column
def create_age_bins(col):
    '''Engineers age bin variables for pipeline'''

    # Defining / instantiating the necessary variables
    age_bins = [-1, 12, 18, 25, 50, 100]
    age_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
    age_imputer = SimpleImputer(strategy = 'median')
    age_ohe = OneHotEncoder()

    # Performing basic imputation for nulls
    imputed = age_imputer.fit_transform(col)
    ages_filled = pd.DataFrame(data = imputed, columns = ['Age'])

    # Segregating ages into age bins
    age_cat_cols = pd.cut(ages_filled['Age'], bins = age_bins, labels = age_labels)
    age_cats = pd.DataFrame(data = age_cat_cols, columns = ['Age'])

    # One hot encoding new age bins
    ages_encoded = age_ohe.fit_transform(age_cats[['Age']])
    ages_encoded = pd.DataFrame(data = ages_encoded.toarray())

    return ages_encoded
Alright, next up is the “Embarked” column. Now, this one is *almost* ready for a straight one hot encoding, but the reason we can’t jump straight there is that this column has some nulls in it. Those need to be addressed first, so here’s the custom transformer we’ll be making use of.
# Creating function to appropriately engineer the 'Embarked' column
def create_embarked_columns(col):
    '''Engineers the embarked variables for pipeline'''

    # Instantiating the transformer objects
    embarked_imputer = SimpleImputer(strategy = 'most_frequent')
    embarked_ohe = OneHotEncoder()

    # Performing basic imputation for nulls
    imputed = embarked_imputer.fit_transform(col)
    embarked_filled = pd.DataFrame(data = imputed, columns = ['Embarked'])

    # Performing OHE on the col data
    embarked_columns = embarked_ohe.fit_transform(embarked_filled[['Embarked']])
    embarked_columns_df = pd.DataFrame(data = embarked_columns.toarray())

    return embarked_columns_df
Now that we have our custom functions written, we can finally get them added to our pipeline. And wouldn’t you know it, Scikit-Learn has a class just for wrapping custom functions like these: FunctionTransformer. It’s pretty easy to use, so let’s see how it looks when we add it to our original pipeline.
# Creating a preprocessor to transform the 'Sex', 'Age', and 'Embarked' columns
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex']),
    ('age_transformer', FunctionTransformer(create_age_bins, validate = False), ['Age']),
    ('embarked_transformer', FunctionTransformer(create_embarked_columns, validate = False), ['Embarked'])
])

# Creating our pipeline that first preprocesses the data, then scales the data, then fits the data to a RandomForestClassifier
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])
Easy peasy, right? It’s just a simple matter of pointing Scikit-Learn’s FunctionTransformer at your custom function and applying it to the designated column. From here, all that’s left is fitting and exporting the model.
# Fitting the training data to our pipeline
rfc_pipeline.fit(X_train, y_train)

# Saving our pipeline to a binary pickle file
joblib.dump(rfc_pipeline, 'model/rfc_pipeline.pkl')
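Since we imported a few metrics up top and carved off a validation set earlier, here’s a quick sketch of how you might sanity check the fitted pipeline before shipping it off. (I’m not quoting any particular numbers; your results will depend on the split and the hyperparameters.)

# Generating predictions and probabilities from the fitted pipeline on the validation set
val_preds = rfc_pipeline.predict(X_val)
val_probs = rfc_pipeline.predict_proba(X_val)[:, 1]

# Scoring with the metrics we imported at the top of the post
print(accuracy_score(y_val, val_preds))
print(roc_auc_score(y_val, val_probs))
print(confusion_matrix(y_val, val_preds))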
****RETURN TO THE ASTERISKS TIME!!!
So…….. there’s sort of a downside to using custom transformers….
The serialized model does NOT store the code itself for ANY custom Python function. (At least… not in a way that I’ve figured out yet.) That means that in order to make use of the deserialized model, the pickle must be able to reference the same function code it was trained with outside of its own binary contents. Or in layman’s terms: you need to add your custom Python functions to whatever deployment script you write for a model like this.
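To make that concrete, here’s a minimal sketch of what a separate scoring script might look like. The file names and the new data here are purely hypothetical; the key point is that create_age_bins and create_embarked_columns have to be defined in the script before joblib can load the pickle.

# score.py -- hypothetical deployment / scoring script
import pandas as pd
import joblib
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# The custom functions from training must be defined here (pasted in verbatim),
# because the pickle only stores a reference to them, not their code
def create_age_bins(col):
    '''Engineers age bin variables for pipeline (same body as in training)'''
    age_bins = [-1, 12, 18, 25, 50, 100]
    age_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
    imputed = SimpleImputer(strategy = 'median').fit_transform(col)
    ages_filled = pd.DataFrame(data = imputed, columns = ['Age'])
    age_cat_cols = pd.cut(ages_filled['Age'], bins = age_bins, labels = age_labels)
    age_cats = pd.DataFrame(data = age_cat_cols, columns = ['Age'])
    return pd.DataFrame(data = OneHotEncoder().fit_transform(age_cats[['Age']]).toarray())

def create_embarked_columns(col):
    '''Engineers the embarked variables for pipeline (same body as in training)'''
    imputed = SimpleImputer(strategy = 'most_frequent').fit_transform(col)
    embarked_filled = pd.DataFrame(data = imputed, columns = ['Embarked'])
    return pd.DataFrame(data = OneHotEncoder().fit_transform(embarked_filled[['Embarked']]).toarray())

# With the functions back in scope, the pipeline loads and predicts as usual
rfc_pipeline = joblib.load('model/rfc_pipeline.pkl')
new_passengers = pd.read_csv('new_passengers.csv')   # hypothetical new data
print(rfc_pipeline.predict(new_passengers))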
Now, is this sort of annoying? Yes. But does it give me a reason to *not* use custom transformations? That’s an easy and firm NO. I recognize it isn’t convenient to have to ship extra custom code alongside your pipeline, but the trade-off is a set of transformations that will likely make your model perform much better than it otherwise would.
So yeah, that part sort of stinks, but I would still choose to include custom transformers pretty much every time. Most datasets contain a wide breadth of features that simply won’t break down into easy transformations like imputation or one hot encoding. Real data is messy and often requires a lot of special cleaning, and custom transformers are just the right fit for the job.
And that wraps it up for this post! Hope you enjoyed it. If you’d like me to cover anything specific in a future post, please let me know! I have some more ideas rolling around in my head, so definitely stay tuned.