Encoding Techniques
Encoding is the process of converting categorical data (words, labels) into numerical form so that machine learning models can understand it.
Why is encoding needed? Most machine learning algorithms can only operate on numeric inputs, so text categories such as city names or product labels must be mapped to numbers before a model can use them as features.
Types of Encoding Techniques:
Label Encoding (Integer Encoding) - Best for Binary Categories
from sklearn.preprocessing import LabelEncoder
data = ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco']
encoder = LabelEncoder()
encoded = encoder.fit_transform(data)
print(encoded) # Output: [2, 3, 1, 0, 3]
One-Hot Encoding (OHE) - Best for Nominal Data with Few Unique Categories
import pandas as pd
data = pd.DataFrame({'City': ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco']})
# One-Hot Encode
encoded = pd.get_dummies(data, columns=['City'])
print(encoded)
   City_Austin  City_Los Angeles  City_New York  City_San Francisco
0            0                 0              1                   0
1            0                 0              0                   1
2            0                 1              0                   0
3            1                 0              0                   0
4            0                 0              0                   1
But when applied to high-cardinality categorical variables (e.g., thousands of city names, product IDs, or customer names), it leads to:
1. Too many features (high dimensionality); see the sketch below
2. Sparsity (most values will be 0s, increasing memory usage)
3. Overfitting (the model memorizes noise instead of learning patterns)
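To make the first two points concrete, here is a minimal sketch on synthetic data (the Customer_ID column, the 10,000 rows, and the 5,000 distinct IDs are made-up numbers for illustration):

import numpy as np
import pandas as pd

# Synthetic high-cardinality column: 10,000 rows, ~5,000 distinct (made-up) customer IDs
rng = np.random.default_rng(0)
data = pd.DataFrame({'Customer_ID': rng.integers(0, 5000, size=10_000).astype(str)})

encoded = pd.get_dummies(data, columns=['Customer_ID'])
print(encoded.shape)                 # roughly (10000, 5000): one new column per unique ID
print((encoded == 0).mean().mean())  # the vast majority of entries are 0 (sparse)

One column per unique value means the feature matrix grows with the number of categories, and nearly every cell is zero.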
Why Does One-Hot Encoding Cause Overfitting?
Overfitting occurs when the model learns spurious correlations instead of general patterns.
Example: Predicting House Prices Based on City Names
City             Price ($1000s)
New York         800
San Francisco    900
Austin           700
Boston           750
If we one-hot encode City, we get:
City_NewYork  City_SanFrancisco  City_Austin  City_Boston  Price ($1000s)
1             0                  0            0            800
0             1                  0            0            900
0             0                  1            0            700
0             0                  0            1            750
What's the problem? Each city appears only once, so the model can simply memorize the exact price attached to each one-hot column instead of learning any general relationship between the features and the price.
Result: Overfitting → Model performs well on training data but fails on unseen data.
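A minimal sketch of that memorization effect, using a plain LinearRegression (an illustrative model choice) on the four example rows above, prices in $1000s:

import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({'City':  ['New York', 'San Francisco', 'Austin', 'Boston'],
                      'Price': [800, 900, 700, 750]})

# One column per city -> the model can assign each city its exact training price
X_train = pd.get_dummies(train[['City']], columns=['City'])
model = LinearRegression().fit(X_train, train['Price'])

print(model.predict(X_train))  # approximately [800. 900. 700. 750.]: training prices reproduced exactly
# A city that never appeared in training (e.g., 'Chicago') has no column at all,
# so nothing the model learned transfers to unseen data.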
Ordinal Encoding - Best for Ordered Categories
from sklearn.preprocessing import OrdinalEncoder
data = [['Small'], ['Medium'], ['Large'], ['Small'], ['Large']]
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']]) # Define order
encoded = encoder.fit_transform(data)
print(encoded) # Output: [[0.], [1.], [2.], [0.], [2.]]
Target Encoding (Mean Encoding) - Best for High-Cardinality Categorical Data
data = pd.DataFrame({'City': ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco'],
'Price': [500000, 600000, 550000, 450000, 620000]})
# Compute Mean Price per City
target_encoded = data.groupby('City')['Price'].mean()
# Replace City with Encoded Value
data['City_Encoded'] = data['City'].map(target_encoded)
print(data)
Example Output:

            City   Price  City_Encoded
0       New York  500000        500000
1  San Francisco  600000        610000
2    Los Angeles  550000        550000
3         Austin  450000        450000
4  San Francisco  620000        610000
But if done incorrectly, it can leak target information from the test set into the training features, leading to artificially high accuracy.
Why Does This Happen? Let's say we are predicting employee promotions and use Target Encoding on the Department feature.
Department    Promoted (Target)
HR            1
HR            0
IT            1
IT            1
IT            0
Target Encoding calculates the mean promotion rate per department:

Department    Target Encoding Value
HR            0.5
IT            0.67
Problem: If we calculate these values using all data (including the test set), the model gets access to information it shouldn't have during training.
How This Leads to Overfitting
1. The model sees patterns from the target during training (it knows which departments have high promotion rates).
2. Because test data was used in the encoding, the test set is no longer truly "unseen."
3. The model performs extremely well on training data but fails on new data.
Example: Incorrect vs. Correct Target Encoding
Wrong Way (Causes Data Leakage)
import pandas as pd
from category_encoders import TargetEncoder
# Simulated Dataset
data = pd.DataFrame({'Department': ['HR', 'HR', 'IT', 'IT', 'IT'], 'Promoted': [1, 0, 1, 1, 0]})
# WRONG: Encoding using the entire dataset (leaks info from target)
encoder = TargetEncoder()
data['Department_Encoded'] = encoder.fit_transform(data['Department'], data['Promoted'])
print(data)
Problem: The target means are calculated on the entire dataset before any train/test split, so the encoding already contains information from rows that will later form the test set.
Right Way (Avoids Data Leakage)
from sklearn.model_selection import train_test_split
# Split dataset FIRST (before encoding)
X_train, X_test, y_train, y_test = train_test_split(data[['Department']], data['Promoted'], test_size=0.2, random_state=42)
# Correct: Apply Target Encoding only on training data
encoder = TargetEncoder()
X_train['Department_Encoded'] = encoder.fit_transform(X_train['Department'], y_train)
# Apply SAME transformation on test set WITHOUT using test target values
X_test['Department_Encoded'] = encoder.transform(X_test['Department'])
print(X_train.head(), X_test.head())
Now the test set never "sees" the target values during encoding.
How to Prevent Data Leakage in Target Encoding
1. Split the data into training and test sets before fitting the encoder.
2. Fit the encoder on the training set only, then reuse it to transform the test set.
3. On small datasets, use out-of-fold (K-fold) target encoding, sketched below.
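In out-of-fold target encoding, each row is encoded with target means computed only from the other folds, so no row ever contributes to its own encoding. A minimal sketch on the Department/Promoted example (the 5-fold split and the global-mean fallback are illustrative choices):

import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({'Department': ['HR', 'HR', 'IT', 'IT', 'IT'],
                     'Promoted':   [1,    0,    1,    1,    0]})

global_mean = data['Promoted'].mean()
data['Department_Encoded'] = global_mean  # fallback for categories unseen in a fold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fit_idx, enc_idx in kf.split(data):
    # Means are computed only on the "other" folds, never on the rows being encoded
    fold_means = data.iloc[fit_idx].groupby('Department')['Promoted'].mean()
    data.loc[data.index[enc_idx], 'Department_Encoded'] = (
        data.iloc[enc_idx]['Department'].map(fold_means).fillna(global_mean).values
    )

print(data)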
Frequency Encoding - Best for Large Datasets
data = pd.DataFrame({'Product': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'C', 'C']})
# Compute Frequency
freq_encoded = data['Product'].value_counts(normalize=True)
# Replace Product with Encoded Value
data['Product_Encoded'] = data['Product'].map(freq_encoded)
print(data)
Example Output (first rows shown):

  Product  Product_Encoded
0       A            0.333
1       B            0.222
2       C            0.444
Drawbacks of Frequency Encoding:
1. Can lose category meaning: two different categories with the same frequency receive the same code.
2. Can cause overfitting. Solution: apply smoothing techniques (e.g., Laplace smoothing), as sketched below.
3. Doesn't work well if frequencies are nearly uniform. Solution: use One-Hot Encoding or Target Encoding if categories are evenly distributed.
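As a minimal sketch of the smoothing idea from point 2, here is Laplace-smoothed frequency encoding applied to the Product example (alpha = 1 is an illustrative choice); counts are pulled toward the uniform rate 1/k, which dampens the influence of very rare categories:

import pandas as pd

data = pd.DataFrame({'Product': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'C', 'C']})

counts = data['Product'].value_counts()   # raw count per category
alpha = 1                                 # smoothing strength (illustrative choice)
k = counts.shape[0]                       # number of distinct categories
n = len(data)                             # total number of rows

# Laplace-smoothed frequency: (count + alpha) / (n + alpha * k)
smoothed = (counts + alpha) / (n + alpha * k)
data['Product_Encoded'] = data['Product'].map(smoothed)
print(data)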
Hash Encoding - Best for Large-Scale Categorical Data
import category_encoders as ce
data = pd.DataFrame({'User_ID': ['A123', 'B456', 'C789', 'A123', 'C789']})
encoder = ce.HashingEncoder(cols=['User_ID'], n_components=4)  # hash each User_ID into 4 feature columns
data_encoded = encoder.fit_transform(data)
print(data_encoded)
Best for large-scale datasets in NLP & recommender systems.
Encoding Type        Best For                                            When to Avoid
Label Encoding       Binary categories (Yes/No)                          Non-ordered categories (e.g., cities)
One-Hot Encoding     Nominal data with few unique values                 Too many unique categories
Ordinal Encoding     Ordered categories (e.g., Small < Medium < Large)   No natural order
Target Encoding      High-cardinality categorical features               Small datasets (overfitting risk)
Frequency Encoding   Large datasets with repetitive categories           No meaningful frequency differences
Hash Encoding        Huge datasets (User IDs, URLs)                      When an interpretable category mapping is needed