Data Science Quick Tip #004: Using Custom Transformers in Scikit-Learn Pipelines!
David Hundley
Staff Machine Learning Engineer at State Farm | AI/ML Blogger | Livecoding Streamer
Hi there, all! We’re back again with a follow-up to the last post’s tip on how to create Scikit-Learn pipelines in general. In case you missed that, you can now check it out at this link. (It’s now officially published on Towards Data Science. w00t!) And as always, if you want to follow along directly with this post’s code, you can find it here at my personal GitHub.
To quickly recap where we left off, in the last post we successfully created a Scikit-Learn pipeline that handles the data transformation, scaling, and inference all in one clean little package. But so far, we’ve only made use of Scikit-Learn’s built-in transformers within our pipeline. As great as those transformers are, wouldn’t it be great if we could make use of our own custom transformations? Of course! I’d say it’s not only great, it’s necessary. If you recall from last week’s post, we built a model off a single feature. That’s not very predictive!
So we’re going to remedy that by adding two transformers to handle two additional fields from the training dataset. (I know, going from 1 to 3 features still isn’t great. But hey, at least we tripled it?) The original variable we started with was “Sex” (aka gender), and now we’re going to add in transformers for the “Age” and “Embarked” columns.
Before we jump into our new custom transformers, let’s take care of our library imports. You might recall a lot of these from the last post, but we’re adding a couple of extras. Don’t worry too much about what they are for now; we’ll cover them further down the post.
# Importing the libraries we'll be using for this project
import pandas as pd
import joblib

from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
And we’ll go ahead and do a quick import of our training data.
# Importing the training dataset
raw_train = pd.read_csv('../data/titanic/train.csv')

# Splitting the training data into appropriate training and validation sets
X = raw_train.drop(columns = ['Survived'])
y = raw_train[['Survived']]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 42)
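By the way, if you want to see for yourself why imputation comes up later in this post, a quick (and purely optional) null count on the columns we’ll be working with makes it obvious. This is just a sanity-check sketch, not part of the pipeline itself.

# Optional sanity check: counting missing values in the columns we'll be transforming
print(X_train[['Sex', 'Age', 'Embarked']].isnull().sum())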
Okay, from here on out, we actually won’t be altering the structure of the Scikit-Learn pipeline itself. Sure, we’ll be adding to it, but remember, I intentionally designed my data preprocessor in such a way that it would be easy to add onto. Just to quickly recap from the last post, here’s what the code to build the original pipeline looked like.
# Creating a preprocessor to transform the 'Sex' column
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex'])
])

# Creating our pipeline that first preprocesses the data, then scales the data, then fits the data to a RandomForestClassifier
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])
The first thing we need to do before adding our custom transformers to the pipeline is, of course, to write the functions themselves! As you might have guessed, the custom transformers are built right on top of regular functions, so you can write any Python function you want for the transformer.**** (We’ll get to all those asterisks later…)
Alright, we talked about adding two transformers for two new variables, so let’s get to creating our two custom Python functions! Touching on the “Age” column first, we’re going to have a little extra fun with this variable. Now, I genuinely don’t know whether age itself is a predictive variable here, but my guess is that if “Age” is predictive in any meaningful way, it will be as age categories / age bins. With that in mind, I segregated the ages into categories like “child”, “adult”, “elder”, and more. Again, I have no idea whether this will be more performant than using the raw numbers, but it lets us have some fun! Here’s what the code to do that looks like:
# Creating a function to appropriately engineer the 'Age' column
def create_age_bins(col):
    '''Engineers age bin variables for pipeline'''

    # Defining / instantiating the necessary variables
    age_bins = [-1, 12, 18, 25, 50, 100]
    age_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
    age_imputer = SimpleImputer(strategy = 'median')
    age_ohe = OneHotEncoder()

    # Performing basic imputation for nulls
    imputed = age_imputer.fit_transform(col)
    ages_filled = pd.DataFrame(data = imputed, columns = ['Age'])

    # Segregating ages into age bins
    age_cat_cols = pd.cut(ages_filled['Age'], bins = age_bins, labels = age_labels)
    age_cats = pd.DataFrame(data = age_cat_cols, columns = ['Age'])

    # One hot encoding new age bins
    ages_encoded = age_ohe.fit_transform(age_cats[['Age']])
    ages_encoded = pd.DataFrame(data = ages_encoded.toarray())

    return ages_encoded
Alright, next up is the “Embarked” column. Now, this one is *almost* ready for a straight one hot encoding, but the reason we can’t jump straight there is that this column has some nulls in it. Those need to be addressed first, so here’s the custom transformer we’ll be making use of.
# Creating function to appropriately engineer the 'Embarked' column
def create_embarked_columns(col):
    '''Engineers the embarked variables for pipeline'''

    # Instantiating the transformer objects
    embarked_imputer = SimpleImputer(strategy = 'most_frequent')
    embarked_ohe = OneHotEncoder()

    # Performing basic imputation for nulls
    imputed = embarked_imputer.fit_transform(col)
    embarked_filled = pd.DataFrame(data = imputed, columns = ['Embarked'])

    # Performing OHE on the col data
    embarked_columns = embarked_ohe.fit_transform(embarked_filled[['Embarked']])
    embarked_columns_df = pd.DataFrame(data = embarked_columns.toarray())

    return embarked_columns_df
Now that we have our custom functions written, we can finally get them added to our pipeline. And wouldn’t you know it, Scikit-Learn has a class just for wrapping custom functions like these: FunctionTransformer. It’s pretty easy to use, so let’s see how it looks when we add it to our original pipeline.
# Creating a preprocessor to transform the 'Sex', 'Age', and 'Embarked' columns
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex']),
    ('age_transformer', FunctionTransformer(create_age_bins, validate = False), ['Age']),
    ('embarked_transformer', FunctionTransformer(create_embarked_columns, validate = False), ['Embarked'])
])

# Creating our pipeline that first preprocesses the data, then scales the data, then fits the data to a RandomForestClassifier
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])
Easy peasy, right? It’s just a simple matter of pointing Scikit-Learn’s FunctionTransformer at your custom function and applying it to the designated column. From here, all that’s left is fitting and exporting the model.
# Fitting the training data to our pipeline
rfc_pipeline.fit(X_train, y_train)

# Saving our pipeline to a binary pickle file
joblib.dump(rfc_pipeline, 'model/rfc_pipeline.pkl')
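Since we imported a few metrics up top and carved off a validation set earlier, here’s a quick sketch of how you might sanity check the fitted pipeline before shipping it off. (I’m not quoting any particular numbers; your results will depend on the split and the hyperparameters.)

# Generating predictions and probabilities from the fitted pipeline on the validation set
val_preds = rfc_pipeline.predict(X_val)
val_probs = rfc_pipeline.predict_proba(X_val)[:, 1]

# Scoring with the metrics we imported at the top of the post
print(accuracy_score(y_val, val_preds))
print(roc_auc_score(y_val, val_probs))
print(confusion_matrix(y_val, val_preds))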
****RETURN TO THE ASTERISKS TIME!!!
So…….. there’s sort of a downside to using custom transformers….
The serialized model does NOT store the code itself for ANY custom Python function. (At least… not in a way that I’ve figured out yet.) That means that in order to make use of the deserialized model, the pickle must be able to reference the same function code it was trained with outside of its own binary contents. Or in layman’s terms: you need to add your custom Python functions to whatever deployment script you write for a model like this.
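To make that concrete, here’s a minimal sketch of what a separate scoring script might look like. The file names and the new data here are purely hypothetical; the key point is that create_age_bins and create_embarked_columns have to be defined in the script before joblib can load the pickle.

# score.py -- hypothetical deployment / scoring script
import pandas as pd
import joblib
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# The custom functions from training must be defined here (pasted in verbatim),
# because the pickle only stores a reference to them, not their code
def create_age_bins(col):
    '''Engineers age bin variables for pipeline (same body as in training)'''
    age_bins = [-1, 12, 18, 25, 50, 100]
    age_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
    imputed = SimpleImputer(strategy = 'median').fit_transform(col)
    ages_filled = pd.DataFrame(data = imputed, columns = ['Age'])
    age_cat_cols = pd.cut(ages_filled['Age'], bins = age_bins, labels = age_labels)
    age_cats = pd.DataFrame(data = age_cat_cols, columns = ['Age'])
    return pd.DataFrame(data = OneHotEncoder().fit_transform(age_cats[['Age']]).toarray())

def create_embarked_columns(col):
    '''Engineers the embarked variables for pipeline (same body as in training)'''
    imputed = SimpleImputer(strategy = 'most_frequent').fit_transform(col)
    embarked_filled = pd.DataFrame(data = imputed, columns = ['Embarked'])
    return pd.DataFrame(data = OneHotEncoder().fit_transform(embarked_filled[['Embarked']]).toarray())

# With the functions back in scope, the pipeline loads and predicts as usual
rfc_pipeline = joblib.load('model/rfc_pipeline.pkl')
new_passengers = pd.read_csv('new_passengers.csv')   # hypothetical new data
print(rfc_pipeline.predict(new_passengers))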
Now, is this sort of annoying? Yes. But does it give me a reason to *not* use custom transformations? That’s an easy and firm NO. I recognize it isn’t convenient to have to ship extra custom code alongside your pipeline, but the trade-off is a set of transformations that will likely make your model perform much better than it otherwise would.
So yeah, that part sort of stinks, but I would still choose to include custom transformers pretty much every time. Most datasets contain a wide breadth of features that simply won’t break down into easy transformations like imputation or one hot encoding. Real data is messy and often requires a lot of special cleaning, and custom transformers are just the right fit for the job.
And that wraps it up for this post! Hope you enjoyed it. If you’d like me to cover anything specific in a future post, please let me know! I have some more ideas rolling around in my head, so definitely stay tuned.