Understanding the Dummy Variable Trap and How to Avoid It
What is a Dummy Variable?
A dummy variable is a way to represent categories as numbers. Each category gets a 1 or a 0. For example, "apple" could be 1 and "not apple" could be 0.
What is the Dummy Variable Trap?
In machine learning, we often need to encode categorical variables as numerical values for models to process them. This is done using dummy variables, where each category in a variable is represented by a binary variable (0 or 1).
For instance, if we have a categorical variable "fruit" with three categories (apple, banana, and cherry), we might create one dummy variable per category: "APPLE", "BANANA", and "CHERRY". However, "APPLE" is already implied by the absence of "BANANA" and "CHERRY", so the third variable carries no new information.
The dummy variable trap happens when one dummy variable can be predicted exactly from the others, making it redundant. In a linear model with an intercept, this means the coefficients cannot be uniquely estimated.
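The redundancy can be checked numerically. The sketch below is illustrative only: it assumes a linear model with an intercept, uses a small made-up fruit column, and relies on NumPy's matrix_rank to count independent columns. The names fruit, full, and design are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration
fruit = pd.Series(['apple', 'banana', 'cherry', 'banana', 'apple'])

# One dummy per category, nothing dropped
full = pd.get_dummies(fruit, dtype=int)

# Design matrix of a linear model with an intercept:
# a column of ones followed by all three dummy columns
design = np.column_stack([np.ones(len(full)), full.to_numpy()])

# The dummies sum to 1 in every row, which duplicates the intercept column
row_sums = full.sum(axis=1).tolist()
rank = int(np.linalg.matrix_rank(design))

print(design.shape[1])  # number of columns: 4
print(rank)             # number of independent columns: 3
```

Four columns but only three independent ones is exactly the situation described above: one column can be rebuilt from the others.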
How to Avoid Dummy Variable Trap?
To avoid this, drop one dummy variable from each categorical feature. In the above example, we would drop "APPLE" and use only "BANANA" and "CHERRY". "APPLE" then becomes the baseline, represented by zeros in both columns, so the model can still distinguish all three categories without redundancy.
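As a minimal sketch of the fruit example, assuming pandas and a made-up four-row column, get_dummies with drop_first=True removes the first category in sorted order, which here happens to be "apple":

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({'fruit': ['apple', 'banana', 'cherry', 'banana']})

# drop_first=True removes the first category in sorted order, here 'apple'
dummies = pd.get_dummies(df['fruit'], drop_first=True, dtype=int)
dummies.columns = dummies.columns.str.upper()  # match the naming used in the text

print(dummies)
```

The first row (apple) has zeros in both remaining columns, so no information is lost.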
Mathematical Explanation
The dummy variable trap occurs due to perfect multicollinearity, where one variable can be perfectly predicted using others.
Mathematically, if you have n categories, you need (n - 1) dummy variables to avoid redundancy.
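To see where the n - 1 comes from, assume a linear regression with an intercept (an assumption for illustration; the symbols D, beta, and epsilon below are not from the original text). Every observation belongs to exactly one category, so the n dummies always sum to 1:

```latex
\[
D_1 + D_2 + \dots + D_n = 1
\quad\Longrightarrow\quad
D_1 = 1 - D_2 - \dots - D_n .
\]
\[
y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \dots + \beta_n D_n + \varepsilon
  = (\beta_0 + \beta_1) + (\beta_2 - \beta_1) D_2 + \dots + (\beta_n - \beta_1) D_n + \varepsilon .
\]
```

The model with all n dummies can be rewritten with only n - 1 of them, so the extra coefficient cannot be pinned down separately. Each remaining coefficient measures the difference from the dropped category.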
Example with Vehicle Types
Let's consider a different example with vehicle types. Suppose we have a categorical variable "vehicle_type" with four categories: car, bike, bus, and truck. We can create three dummy variables: "BIKE", "BUS", and "TRUCK". Here, "CAR" is implied by the absence of the other three.
Correct Encoding:
vehicle_type  BIKE  BUS  TRUCK
car           0     0    0
bike          1     0    0
bus           0     1    0
truck         0     0    1
Solution:
To avoid the dummy variable trap, always drop one dummy variable from each categorical feature. In this case, drop "CAR":
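import pandas as pd

# Sample data
data = {'vehicle_type': ['car', 'bike', 'bus', 'truck']}
df = pd.DataFrame(data)

# drop_first=True would drop the first category in sorted order ('bike'), not 'car',
# so create all dummies as integers and drop the 'car' baseline explicitly
dummies = pd.get_dummies(df['vehicle_type'], dtype=int).drop(columns='car')
dummies.columns = dummies.columns.str.upper()  # BIKE, BUS, TRUCK

# Replace the original column with the dummy columns
df = pd.concat([df, dummies], axis=1).drop(columns='vehicle_type')
print(df)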
Output:
   BIKE  BUS  TRUCK
0     0    0      0
1     1    0      0
2     0    1      0
3     0    0      1
By dropping "CAR", we avoid redundancy and ensure that the model can correctly interpret the data without falling into the dummy variable trap.
Simple Real-Life Examples
Example 1: Ice Cream Flavors
Imagine you have three ice cream flavors: chocolate, vanilla, and strawberry. You only need dummy variables for two of them, say chocolate and vanilla, because a flavor that is neither chocolate nor vanilla must be strawberry.
Example 2: T-Shirt Sizes
For t-shirt sizes small, medium, and large, you can use two dummy variables: medium and large. If neither medium nor large is present, the t-shirt must be small.
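The t-shirt example can be sketched as follows, assuming pandas and a made-up list of sizes. Because drop_first=True drops the first category, one way to make "small" the baseline is to declare the category order explicitly. Otherwise sorted order would drop "large" instead.

```python
import pandas as pd

# Hypothetical sample data for illustration
sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Put the intended baseline ('small') first in the category order
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'])
ordered = sizes.astype(size_type)

# drop_first=True now drops 'small', leaving 'medium' and 'large'
dummies = pd.get_dummies(ordered, drop_first=True, dtype=int)

print(dummies)
```

Rows for small t-shirts have zeros in both columns, matching the description above.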