Quickly Reverse a One-Hot Encoding
a cat prior to catboosting

An expanded oversampling technique: SMOTe with a One-Hot-Reversal

Oversampling involves mimicking existing data and adding it to the dataset. Synthetic Minority Oversampling Technique (SMOTe) creates new examples of the minority class by interpolating between an existing minority observation and one of its nearest minority-class neighbors. Because each synthetic point lies between real neighbors, it stays in regions where the minority class already occurs, so unwarranted randomness is not added to the training set.
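
To make the interpolation idea concrete, here is a minimal NumPy sketch. It is not the reference SMOTe implementation: the function name smote_sketch, its parameters, and the toy points are hypothetical, and it only illustrates picking a minority sample, picking one of its k nearest minority neighbors, and placing a new point somewhere on the segment between them.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest per row

    base = rng.integers(0, len(X), size=n_new)  # random base sample per new row
    pick = rng.integers(0, k, size=n_new)       # which of its k neighbors to use
    mate = neighbors[base, pick]
    gap = rng.random((n_new, 1))                # position along the segment, in [0, 1)

    return X[base] + gap * (X[mate] - X[base])

# Hypothetical minority-class points.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_sketch(X_min, n_new=5)
print(synthetic)
```

Every synthetic row is a convex combination of two real rows, which is why the new points stay inside the region the minority class already occupies.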

This process is simple for numeric-only data, since distances between observations can be computed directly. Categorical data must first be given a numeric representation. In the following example I will use a one-hot encoding of the categorical features. In addition, I will show how to return to the original dataframe shape once SMOTe has been run.

I will be using a hypothetical binary classification problem; multiclass classification can be handled in a similar way.

SMOTe with numeric data


SMOTe with mixed data

I use a scikit-learn Pipeline (sklearn.pipeline.Pipeline) to create the one-hot encodings in a way that is repeatable for the model construction process.


One-Hot-Reversal

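A minimal sketch of the reversal using only pandas. The helper name reverse_one_hot is hypothetical, and it assumes the encoded columns are named `<column>_<level>` (the pd.get_dummies convention) and that no categorical column name is a prefix of another. For each original column it picks the level with the largest indicator value, which also resolves the fractional indicators that interpolation can produce.

```python
import pandas as pd

def reverse_one_hot(encoded, cat_cols, sep="_"):
    """Collapse each group of `<col><sep><level>` columns back into one column."""
    out = pd.DataFrame(index=encoded.index)
    used = []
    for col in cat_cols:
        prefix = col + sep
        group = [c for c in encoded.columns if c.startswith(prefix)]
        # idxmax returns the name of the strongest indicator column per row;
        # stripping the prefix leaves the category level.
        out[col] = encoded[group].idxmax(axis=1).str[len(prefix):]
        used.extend(group)
    # Carry the remaining (numeric) columns over untouched.
    return pd.concat([out, encoded.drop(columns=used)], axis=1)

# Hypothetical original frame and its one-hot encoding.
original = pd.DataFrame({
    "color": ["red", "blue", "green"],
    "size": ["S", "L", "M"],
    "weight": [9.5, 12.1, 10.3],
})
encoded = pd.get_dummies(original, columns=["color", "size"], dtype=float)

# A made-up synthetic row with fractional indicators, as interpolation can yield.
synthetic = pd.DataFrame([{
    "weight": 11.0,
    "color_blue": 0.3, "color_green": 0.0, "color_red": 0.7,
    "size_L": 0.0, "size_M": 1.0, "size_S": 0.0,
}])

mixed = pd.concat([encoded, synthetic], ignore_index=True)
restored = reverse_one_hot(mixed, ["color", "size"])
print(restored)
```

The restored frame has one column per original feature again, so it can be handed to a model that expects raw categorical values.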

Why would you do this?

Some model architectures have a built-in one-hot encoding process that is performance-optimized beyond what sklearn can offer. An initial one-hot encoding is still needed to create the synthetic data with SMOTe, so leveraging the built-in encoding engine requires returning the data to its original format afterwards. CatBoost is one such architecture.
