When Categorical Data Goes Wrong

I ran into an issue whilst working on a machine learning project involving categorical data and thought I would write a brief tutorial about what I learned. The model had a considerable amount of categorical data, and I ran into several issues which can briefly be summarized as:

  • Categories that were present in the training set were not always present in the testing data
  • Categories that were present in the testing set were not always present in the training data
  • Categories from “real world” (i.e. non-testing or non-training) data were not present in the training or testing data

Handling Categorical Data: A Brief Tutorial

In Python, one of the unfortunate things about the scikit-learn/pandas modules is that they don’t really deal with categorical data very well. In the last few years, the Pandas community has introduced a “categorical” datatype. Unfortunately, this datatype does not carry over to scikit-learn, so if you have categorical data, you still have to encode it. Now there are tons of tutorials on the interweb about how to do this, so in the interests of time, I’ll show you the main methods:
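For example, here is a minimal sketch (with made-up data) of the gap: pandas happily stores the column as categorical, but you still have to encode it before handing it to a scikit-learn model.

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# pandas understands this as a categorical dtype...
df['color'] = df['color'].astype('category')
print(df['color'].dtype)             # category
print(df['color'].cat.categories)    # Index(['blue', 'green', 'red'], dtype='object')

# ...but scikit-learn estimators still see strings, so the column
# has to be encoded (one hot, ordinal, etc.) before fitting a model.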

get_dummies() in Pandas

The most conventional approach, and perhaps the easiest, is pandas’ get_dummies() function, which takes a given column or columns as input and returns dummy columns for each category value. (Full docs here). Thus you can do the following:

df = pd.get_dummies(df)

This turns the original table into one with a dummy column for each category value.

As you can see, each category is encoded into a separate column with the column name followed by an underscore and the category variable. If the data is a member of that category, the column has a value of 1 otherwise the value is zero, hence the name One Hot Encoding. 
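As a quick sketch with a made-up column (not the table from the original screenshots); passing dtype=int keeps the 0/1 output regardless of pandas version:

import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'B', 'C', 'A']})

print(pd.get_dummies(df, dtype=int))
#    cat1_A  cat1_B  cat1_C
# 0       1       0       0
# 1       0       1       0
# 2       0       0       1
# 3       1       0       0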

In general this works, but the pandas method has the problem that it cannot be used as part of a Scikit-Learn pipeline. Scikit-learn also has a OneHotEncoder which you can use to do basically the same thing. 

Personally, I find scikit’s OneHotEncoder to be a bit more difficult to use, so I didn’t really use it much; however, in my recent project I realized that I actually had to, for a reason I’ll get to in a bit. 

Scikit-Learn’s OneHotEncoder

Scikit-Learn has the OneHotEncoder() (docs here), which does more or less the same thing as the pandas version. It does have several limitations and quirks. The first is that the data types within your category columns must be the same; i.e. if you have a mix of ints and strings, no go.
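If you hit that, one simple workaround (a sketch, assuming the mixed-type column really is categorical; 'cat1' is a made-up column name) is to cast the offending columns to strings before encoding:

# cast mixed int/string category columns to a single type before encoding
df['cat1'] = df['cat1'].astype(str)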

Secondly, scikit’s encoder returns either a numpy array or a sparse matrix as a result. Personally, this was annoying for me, as I wanted to see which categories were useful as features, and in order to do so you have to reconstruct a dataframe, which is a headache. The code follows scikit’s usual pattern of fit(), transform(). Here is example code of how to use scikit’s one hot encoder:

from sklearn.preprocessing import OneHotEncoder

# ignore categories at transform time that were never seen during fit
encoder = OneHotEncoder(handle_unknown='ignore')

encoded_data = encoder.fit_transform(df[<category_columns>])

There are two advantages that I see in scikit’s method over pandas’. The first is that when you fit the scikit encoder, it “remembers” what categories it has seen, and you can set it to ignore unknown categories, whereas pandas has no such memory and will simply convert every column it is given into dummy variables. The second is that you can include the OneHotEncoder in a pipeline, which seemed advantageous as well. However, these advantages did not outweigh the difficulty of getting the data back into a dataframe with column labels. Also, I kept getting errors relating to datatypes and got really frustrated.
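For what it’s worth, here is a sketch of the reconstruction step that I found to be a headache, assuming a recent scikit-learn (1.2+, where sparse_output and get_feature_names_out() are available); category_columns is a hypothetical list of column names:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

category_columns = ['cat1', 'cat2']   # hypothetical column names

# sparse_output=False returns a dense array we can wrap in a DataFrame
# (older scikit-learn versions use sparse=False instead)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = encoder.fit_transform(df[category_columns])

# rebuild a DataFrame with readable column labels
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(category_columns),
    index=df.index,
)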

The original problem I was having was that you couldn’t guarantee that all categories would be present in both the training and testing set, so the solution I came up with was to write a function that switched the category value to “OTHER” if the category was not one of the top few. But I didn’t like this approach because it required me to maintain a list of categories, and what would happen if that list changed over time? Surely there’s a better way… 
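For reference, here is a rough sketch of the kind of helper I mean; the names and the cutoff are made up:

def collapse_rare_categories(series, top_n=5, other_label='OTHER'):
    """Keep the top_n most frequent categories and map everything else to OTHER."""
    top_categories = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_categories), other_label)

# e.g. df['city'] = collapse_rare_categories(df['city'], top_n=10)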

Feature-Engine: A Better Solution

So what if I told you there was a way to encode categorical data such that you could:

  • Handle missing categories in either testing, training or real world data
  • Export the data to a DataFrame for easy analysis of the newly created features
  • Automatically aggregate categories with few values into an “other” category

Well you can’t, so get over it. Ok, just kidding. I wouldn’t write a whole blog post to have it end like that… or would I? As it turns out, I stumbled upon a really useful module called feature-engine which contains some extremely useful tools for feature engineering that frankly should be included in Scikit-Learn. This module contains a collection of really useful stuff, but I’m just going to focus on the OneHotCategoricalEncoder. (Docs here)

Let’s say you wanted to encode the data above. Using the OneHotCategoricalEncoder(), you could create an encoder object as shown below:

from feature_engine import categorical_encoders as ce
import pandas as pd

# set up the encoder
encoder = ce.OneHotCategoricalEncoder(
    top_categories=3,
    drop_last=False)

# fit the encoder, then transform the data
encoder.fit(df)
encoder.transform(df)

Now, once we have the encoder object, we can encode our data using the fit()/transform() or the fit_transform() methods as shown above. Our toy data set above only has 3 categories, but what if it had 300? Feature-Engine provides an option in the constructor, top_categories, which has the effect of collapsing the categories into a more manageable number. For example, you could set top_categories to 10, and that would get you the 10 most frequently occurring category columns, with all others collapsed into an “other” column. That’s a nice feature! Well done!

There’s more. In our previous example, we had three categories when we fit the data: ‘A’, ‘B’ and ‘C’. So what happens if another category shows up that did not appear in the training data? Good question, and one that is not explicitly addressed in the documentation. So I tried this out, and if you have top_categories set, the encoder will ignore the unknown categories. Whether this is good design is debatable, but what it does mean is that it will work much better in real world applications. 
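A quick way to reproduce that check, with toy data rather than anything from the docs (behavior as described above):

import pandas as pd
from feature_engine import categorical_encoders as ce

train = pd.DataFrame({'cat1': ['A', 'B', 'C', 'A', 'B']})
test = pd.DataFrame({'cat1': ['A', 'D']})   # 'D' never appears in training

encoder = ce.OneHotCategoricalEncoder(top_categories=2, drop_last=False)
encoder.fit(train)

# with top_categories set, the unseen category 'D' is simply ignored
# (it gets zeros in every dummy column) instead of raising an error
print(encoder.transform(test))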

Since the OneHotCategoricalEncoder uses scikit-learn’s fit()/fit_transform()/transform() interface, it can be used in a Pipeline object. Finally, and perhaps most importantly for me, the OneHotCategoricalEncoder returns a pandas DataFrame rather than a numpy array or sparse matrix. The reason this mattered to me was that I wanted to see which categorical columns are actually adding value to the model and which are not. Doing this from a numpy array without column references is exceedingly difficult. 
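As a sketch of what that might look like (the model choice and parameters here are just illustrative, not from the original project):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from feature_engine import categorical_encoders as ce

pipe = Pipeline([
    ('encode', ce.OneHotCategoricalEncoder(top_categories=10, drop_last=False)),
    ('model', LogisticRegression(max_iter=1000)),
])

# X_train is a DataFrame that still contains the raw categorical columns
# pipe.fit(X_train, y_train)
# pipe.predict(X_test)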

TL;DR

In conclusion, both scikit-learn’s and pandas’ traditional ways of encoding categorical variables have significant disadvantages, so if you have categorical data in your model, I would strongly recommend taking a look at Feature-Engine’s OneHotCategoricalEncoder.
