Data Science Quick Tip #004: Using Custom Transformers in Scikit-Learn Pipelines!

Hi there, all! We’re back with a follow-up to the last post’s tip on how to create Scikit-Learn pipelines in general. In case you missed it, you can now check it out at this link. (Where it is now officially published to Towards Data Science. w00t!) And as always, if you want to follow along directly with this post’s code, you can find it here at my personal GitHub.

To quickly recap where we left off, we had successfully created a Scikit-Learn pipeline that handles the data transformation, scaling, and inference all in one clean little package. But so far, we’ve only made use of Scikit-Learn’s default transformers within our pipeline. As great as those transformers are, wouldn’t it be great if we could make use of our own custom transformations? Well, of course! I’d say it’s not only great, it’s necessary. If you recall from last week’s post, we built a model off a single feature. That’s not very predictive!

So we’re going to remedy that by adding two transformers to handle two additional fields from the training dataset. (I know, going from 1 to 3 features still isn’t great. But hey, at least we’re tripling our feature count?) The original variable we started with was “Sex” (aka gender), and now we’re going to add in transformers for the “Age” and “Embarked” columns.

Before we jump into our new custom transformers, let’s do our library imports. You might recall a lot of these from the last post, but we’re adding a couple extras. Don’t worry too much about what they are now as we’ll cover that further on down the post.

# Importing the libraries we'll be using for this project
import pandas as pd
import joblib
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

And we’ll go ahead and do a quick import of our training data.

# Importing the training dataset
raw_train = pd.read_csv('../data/titanic/train.csv')
# Splitting the training data into appropriate training and validation sets
X = raw_train.drop(columns = ['Survived'])
y = raw_train[['Survived']]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 42)

Okay, from here on out, we actually won’t be altering the structure of the Scikit-Learn pipeline itself. Sure, we’ll be adding to it, but remember, I intentionally designed my data preprocessor so that it would be easy to add onto. Just to quickly recap from the last post, here’s what the code to build the original pipeline looked like.

# Creating a preprocessor to transform the 'Sex' column
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex'])
])
# Creating our pipeline that first preprocesses the data, then scales the data, then fits the data to a RandomForestClassifier
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])

The first thing to do before adding our custom transformers to the pipeline is, of course, to create the functions behind them! As you might have guessed, these custom transformers are built right on top of regular functions, so you can write any Python function you want for the transformer.**** (We’ll get to all those asterisks later…)
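Just to illustrate the pattern before we write the real ones, here’s a minimal sketch. (The “Fare”-doubling function below is purely a toy example for illustration and isn’t part of the actual model.) Any plain Python function that takes in the raw column(s) and returns the transformed values can be wrapped up and dropped into a pipeline, as we’ll see shortly.

# Toy example only: a plain Python function that doubles the 'Fare' column
def double_fare(col):
    '''Toy transformer that simply doubles the Fare column'''
    return pd.DataFrame(data = col.values * 2, columns = ['Fare_doubled'])

# Wrapping it with FunctionTransformer so it could slot into a pipeline
fare_doubler = FunctionTransformer(double_fare, validate = False)
# fare_doubler.transform(raw_train[['Fare']]) would return the doubled fares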

Alright, so we talked about adding two transformers for two new variables, so let’s get to creating our two custom Python functions! Touching on the “Age” column first, we’re going to have a little extra fun with this variable. Now, I genuinely don’t know whether age itself is a predictive variable here, but I guessed that if “Age” is predictive in any meaningful way, it would be as age categories / age bins. With that in mind, I binned the ages into categories like “child”, “teen”, “young_adult”, “adult”, and “elder”. Again, I have no idea whether this will be more performant than using the raw integers, but it lets us have some fun! Here’s what the code to do that looks like:

# Creating a function to appropriately engineer the 'Age' column
def create_age_bins(col):
    '''Engineers age bin variables for pipeline'''

    # Defining / instantiating the necessary variables
    age_bins = [-1, 12, 18, 25, 50, 100]
    age_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
    age_imputer = SimpleImputer(strategy = 'median')
    age_ohe = OneHotEncoder()

    # Performing basic imputation for nulls
    imputed = age_imputer.fit_transform(col)
    ages_filled = pd.DataFrame(data = imputed, columns = ['Age'])

    # Segregating ages into age bins
    age_cat_cols = pd.cut(ages_filled['Age'], bins = age_bins, labels = age_labels)
    age_cats = pd.DataFrame(data = age_cat_cols, columns = ['Age'])

    # One hot encoding new age bins
    ages_encoded = age_ohe.fit_transform(age_cats[['Age']])
    ages_encoded = pd.DataFrame(data = ages_encoded.toarray())

    return ages_encoded
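If you want to spot check the function on its own before wiring it into the pipeline, something like this should work. (This quick check is my own addition and isn’t required for the pipeline itself.)

# Optional sanity check: calling the function directly on the raw 'Age' column
age_check = create_age_bins(raw_train[['Age']])
print(age_check.shape)    # one row per passenger, one column per age bin
print(age_check.head())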

Alright, next up is the “Embarked” column. Now, this one is *almost* ready for a straight one hot encoding, but the reason we can’t jump straight there is that the column has some nulls in it. Those need to be addressed first, so here’s the custom transformer we’ll be making use of:

# Creating function to appropriately engineer the 'Embarked' column
def create_embarked_columns(col):
    '''Engineers the embarked variables for pipeline'''

    # Instantiating the transformer objects
    embarked_imputer = SimpleImputer(strategy = 'most_frequent')
    embarked_ohe = OneHotEncoder()

    # Performing basic imputation for nulls
    imputed = embarked_imputer.fit_transform(col)
    embarked_filled = pd.DataFrame(data = imputed, columns = ['Embarked'])

    # Performing OHE on the col data
    embarked_columns = embarked_ohe.fit_transform(embarked_filled[['Embarked']])
    embarked_columns_df = pd.DataFrame(data = embarked_columns.toarray())

    return embarked_columns_df
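As with the age function, you can give this one a quick spin on its own if you like (again, just an optional check on my part). Since “Embarked” only has three ports (C, Q, and S), we’d expect three one hot columns back.

# Optional sanity check: three embarkation ports should yield three columns
embarked_check = create_embarked_columns(raw_train[['Embarked']])
print(embarked_check.shape)    # one row per passenger, three one hot columns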

Now that we have our custom functions written, we can finally get them added to our pipeline. And wouldn’t you know it, Scikit-Learn has a wrapper made just for handling these custom transformers called FunctionTransformer. It’s pretty easy to implement, so let’s see how it looks when we add it to our original pipeline.

# Creating a preprocessor to transform the 'Sex', 'Age', and 'Embarked' columns
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex']),
    ('age_transformer', FunctionTransformer(create_age_bins, validate = False), ['Age']),
    ('embarked_transformer', FunctionTransformer(create_embarked_columns, validate = False), ['Embarked'])
])
# Creating our pipeline that first preprocesses the data, then scales the data, then fits the data to a RandomForestClassifier
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])

Easy peasy, right? It’s just a matter of using Scikit-Learn’s FunctionTransformer to point at the correct custom function and apply it to the designated column. From here on out, it’s a simple fit and export of the model.

# Fitting the training data to our pipeline
rfc_pipeline.fit(X_train, y_train)
# Saving our pipeline to a binary pickle file
joblib.dump(rfc_pipeline, 'model/rfc_pipeline.pkl')
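The original snippet goes straight to the export, but this is also a natural spot to use those metric imports from the top of the post and sanity check the fitted pipeline against the validation split we set aside earlier. Something along these lines would do it (my own quick check, not part of the original walkthrough):

# Quick validation check on the held-out split before shipping the pickle
val_preds = rfc_pipeline.predict(X_val)
print(f'Accuracy: {accuracy_score(y_val, val_preds)}')
print(f'ROC AUC: {roc_auc_score(y_val, val_preds)}')
print(f'Confusion matrix:\n{confusion_matrix(y_val, val_preds)}')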

****RETURN TO THE ASTERISKS TIME!!!

So… there’s sort of a downside to using custom transformers…

The serialized model does NOT store the code for ANY custom Python function. (At least… not in a way that I’ve figured out yet.) That means that in order to make use of the deserialized model, the pickle must be able to reference the same custom function code outside of its own binary contents. Or in layman’s terms: you need to add your custom Python functions to whatever deployment script you write for a model like this.
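To make that concrete, here’s a rough sketch of what a deployment / scoring script might look like. (The file path comes from the export step above; everything else, like the “new_passenger_data” name, is hypothetical. The key point is simply that the function definitions have to exist under the same names before the pickle is loaded.)

# Hypothetical scoring script for the pickled pipeline
import pandas as pd
import joblib

# The custom functions must be defined here (or imported) exactly as they were
# when the pipeline was pickled; the pickle only stores references to them.
# (You'd also need whatever imports those bodies use, e.g. SimpleImputer and OneHotEncoder.)
def create_age_bins(col):
    ...    # same body as in the training script

def create_embarked_columns(col):
    ...    # same body as in the training script

# With the functions available, the pipeline loads and predicts as usual
rfc_pipeline = joblib.load('model/rfc_pipeline.pkl')
# predictions = rfc_pipeline.predict(new_passenger_data)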

Now, is this sort of annoying? Yes. But does it give me a reason to *not* use custom transformations? That’s an easy and firm NO. I recognize it’s not convenient to have to ship extra custom code alongside your pipeline, but the trade-off is a transformation that will likely make your model perform much better than it would otherwise.

So yeah, that part stinks a bit, but I would most likely still choose to include custom transformers every time. Most datasets contain a wide breadth of features that simply won’t break down into easy, simple transformations like imputation or one hot encoding. Real data is messy and often requires a lot of special cleaning, and these custom transformers are just the right fit for the job.

And that wraps it up for this post! Hope you enjoyed it. If you’d like me to cover anything specific in a future post, please let me know! I have some more ideas rolling around in my head, so definitely stay tuned.
