Quickly Reverse a One-Hot Encoding
Jacqueline Rollins
Data Scientist, M.S. | Fraud Detection | Payment Services
An expanded oversampling technique: SMOTe with a One-Hot-Reversal
Oversampling involves mimicking existing data and adding it to the dataset. Synthetic Minority Oversampling Technique (SMOTe) creates new examples of the minority class by interpolating between an existing minority observation and its nearest minority-class neighbors. Because every synthetic point lies between real data points, unwarranted randomness is not added to the training set.
This process is simple for numeric-only data, since distances can be computed. With categorical data, numeric representations must be created. In the following example I will use a one-hot-encoding of categorical features. In addition, I will show how to return to the original dataframe shape once SMOTe has been run.
I will be using a hypothetical binary classification problem; multiclass classification can be handled in a similar way.
SMOTe with numeric data
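To make the interpolation idea concrete, here is a minimal sketch of the core SMOTe step on numeric data, written with the standard library only. It is not a production implementation: the function name `smote_sketch`, its parameters, and the toy points are hypothetical and chosen purely for illustration. In practice a library implementation would be used.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base`, excluding `base` itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Hypothetical minority-class observations with two numeric features
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_sketch(minority, n_new=3)
```

Each synthetic point falls between two real minority observations, which is why distances (and therefore numeric features) are required.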
SMOTe with mixed data
I use a scikit-learn Pipeline (sklearn.pipeline.Pipeline) to create the one-hot encodings so that the encoding step is repeatable throughout the model construction process.
One-Hot-Reversal
Why would you do this?
Some model architectures have a built-in one-hot encoding process that is performance-optimized beyond what sklearn can offer. An initial one-hot encoding is still needed to create the synthetic data, so leveraging the built-in encoding engine requires returning to the original data format afterward. CatBoost is one such architecture.
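One way to perform the reversal is sketched below with pandas. The helper name `reverse_one_hot`, the `prefix_value` column naming scheme, and the toy data are assumptions for illustration. The sketch takes the largest value in each group of one-hot columns, since synthetic rows produced by interpolation may hold fractional values rather than clean 0s and 1s.

```python
import pandas as pd

def reverse_one_hot(df, prefixes, sep="_"):
    """Collapse each group of one-hot columns back into one categorical column."""
    out = df.copy()
    for prefix in prefixes:
        cols = [c for c in df.columns if c.startswith(prefix + sep)]
        # idxmax returns the name of the column holding the largest value in
        # each row, so fractional values on synthetic rows are handled too.
        out[prefix] = df[cols].idxmax(axis=1).str[len(prefix) + len(sep):]
        out = out.drop(columns=cols)
    return out

# Hypothetical encoded data; the last row mimics a synthetic observation
encoded = pd.DataFrame({
    "amount": [10.0, 250.0, 120.3],
    "channel_app": [0.0, 1.0, 0.7],
    "channel_web": [1.0, 0.0, 0.3],
})
restored = reverse_one_hot(encoded, prefixes=["channel"])
```

The restored frame has a single `channel` column again, which can be declared as a categorical feature to a model that does its own encoding.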