Encoding Techniques


Encoding is the process of converting categorical data (words, labels) into numerical form so that machine learning models can understand it.

Why is encoding needed?

  1. Most ML models can’t work with raw text; they expect numeric input (see the quick sketch below).
  2. Encoding converts categorical features into numerical values while preserving their meaning.
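
As a quick illustration of point 1 (a minimal sketch with scikit-learn; the toy column names and values here are made up), fitting a model directly on a text column simply raises an error:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with a text feature and a numeric target
df = pd.DataFrame({'City': ['New York', 'Austin', 'Boston'],
                   'Price': [800, 700, 750]})

model = LinearRegression()
# Raises ValueError: could not convert string to float: 'New York'
model.fit(df[['City']], df['Price'])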


Types of Encoding Techniques

Label Encoding (Integer Encoding) - Best for Binary Categories

  1. Converts categories into numbers (0, 1, 2, ...).
  2. Simple but can cause problems in ML models (models might assume an order).
  3. Use when there are only 2 unique values (Yes/No, Male/Female).
  4. Avoid when the categories don’t have an order (e.g., Dog/Bird/Cat).

from sklearn.preprocessing import LabelEncoder

data = ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco']
encoder = LabelEncoder()
encoded = encoder.fit_transform(data)

# LabelEncoder assigns integers alphabetically: Austin=0, Los Angeles=1, New York=2, San Francisco=3
print(encoded)  # Output: [2 3 1 0 3]


One-Hot Encoding (OHE) - Best for Nominal (Unordered) Categories

  1. Creates new binary columns (0/1) for each category.
  2. Best for nominal (unordered) categories like cities, colors, product types.
  3. Use when categories are nominal (no order) & unique values are low (<10).
  4. Avoid when too many categories (hundreds of cities will create hundreds of columns).

import pandas as pd

data = pd.DataFrame({'City': ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco']})

# One-Hot Encode (dtype=int gives 0/1 columns instead of True/False on recent pandas)
encoded = pd.get_dummies(data, columns=['City'], dtype=int)
print(encoded)
        
   City_Austin  City_Los Angeles  City_New York  City_San Francisco
0            0                 0              1                   0
1            0                 0              0                   1
2            0                 1              0                   0
3            1                 0              0                   0
4            0                 0              0                   1

  1. No ordering issue!
  2. More categories = More columns = High memory usage.
  3. Best for linear models like Logistic Regression and SVMs (see the pipeline sketch below).
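
As a rough sketch of how this is typically wired together (assuming scikit-learn; the cities and labels below are purely illustrative), a ColumnTransformer applies One-Hot Encoding and feeds the result into Logistic Regression in a single Pipeline:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data
X = pd.DataFrame({'City': ['New York', 'Austin', 'Boston', 'Austin']})
y = [1, 0, 1, 0]

# One-hot encode the City column, then fit Logistic Regression on the result
pipe = Pipeline([
    ('encode', ColumnTransformer([('ohe', OneHotEncoder(), ['City'])])),
    ('model', LogisticRegression())
])
pipe.fit(X, y)
print(pipe.predict(pd.DataFrame({'City': ['Austin']})))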

But when applied to high-cardinality categorical variables (e.g., thousands of city names, product IDs, or customer names), it leads to:

  1. Too many features (high dimensionality)
  2. Sparsity (most values will be 0s, increasing memory usage)
  3. Overfitting (the model memorizes noise instead of learning patterns)

Why Does One-Hot Encoding Cause Overfitting?

Overfitting occurs when the model learns spurious correlations instead of general patterns.

Example: Predicting House Prices Based on City Names


City	Price ($1000s)
New York	800
San Francisco	900
Austin	700
Boston	750

If we one-hot encode City, we get:

City_NewYork	City_SanFrancisco	City_Austin	City_Boston	Price ($1000s)
1	0	0	0	800
0	1	0	0	900
0	0	1	0	700
0	0	0	1	750        

What’s the problem?

  • If there are 1,000 cities, we get 1,000 new columns!
  • The model memorizes the city-to-price mapping instead of learning meaningful relationships.
  • New cities at prediction time? The model struggles, because it has never seen them (see the sketch below).

Result: Overfitting → the model performs well on training data but fails on unseen data.
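
A minimal sketch of both issues, assuming scikit-learn's OneHotEncoder (the city names are invented): a high-cardinality column turns into one column per category, and a city never seen during fitting produces an all-zero row unless the encoder is told how to handle unknowns:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with 1,000 distinct cities
train = pd.DataFrame({'City': [f'City_{i}' for i in range(1000)]})

encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(train)
print(X_train.shape)  # (1000, 1000) -> one sparse column per city

# A city unseen during training maps to an all-zero row (no information at all)
unseen = pd.DataFrame({'City': ['Atlantis']})
print(encoder.transform(unseen).nnz)  # 0 active features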


Ordinal Encoding - Best for Ordered Categories

  1. Best for ordered categories (e.g., Small < Medium < Large).
  2. Assigns increasing numbers that preserve the order (e.g., Small=0, Medium=1, Large=2).
  3. Avoid when order doesn’t matter (e.g., Animals: Cat/Dog/Bird).

from sklearn.preprocessing import OrdinalEncoder

data = [['Small'], ['Medium'], ['Large'], ['Small'], ['Large']]
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])  # Define order
encoded = encoder.fit_transform(data)

print(encoded)  # Output: [[0.] [1.] [2.] [0.] [2.]] (floats; the defined order is preserved)

  1. Great for decision trees and gradient boosting models.
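
An equivalent lightweight alternative (a sketch in plain pandas, reusing the sizes from the example above) is to spell out the order in a mapping and apply it with map, which keeps the intended ranking visible in the code:

import pandas as pd

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Small', 'Large']})

# Explicit order: Small < Medium < Large
order = {'Small': 0, 'Medium': 1, 'Large': 2}
df['Size_Encoded'] = df['Size'].map(order)
print(df)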


Target Encoding (Mean Encoding) - Best for High-Cardinality Categorical Data

  1. Replaces categories with the mean of the target variable (good for categorical variables in regression).
  2. Works well for high-cardinality features (too many categories).
  3. Avoid when dataset is small (overfitting risk).

data = pd.DataFrame({'City': ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco'],
                     'Price': [500000, 600000, 550000, 450000, 620000]})

# Compute Mean Price per City
target_encoded = data.groupby('City')['Price'].mean()

# Replace City with Encoded Value
data['City_Encoded'] = data['City'].map(target_encoded)
print(data)        
Example Output:

        City            Price   City_Encoded
0       New York        500000  500000
1       San Francisco   600000  610000
2       Los Angeles     550000  550000
3       Austin          450000  450000
4       San Francisco   620000  610000

  1. Works well for high-cardinality features (e.g., hundreds of cities).
  2. Can cause data leakage if not used properly.
  3. Best for tree-based models like XGBoost & CatBoost.

But if done incorrectly, it can leak target information from the test set into training, leading to artificially high accuracy.

Why Does This Happen?

Let’s say we are predicting employee promotions and use Target Encoding on the Department feature.


Department      Promoted (Target)
HR              1
HR              0
IT              1
IT              1
IT              0

Target Encoding calculates the mean promotion rate per department:

Department      Target Encoding Value
HR              0.50
IT              0.67

Problem: If we calculate these values using all data (including the test set), the model gets access to information it shouldn’t have during training.

How This Leads to Overfitting

  1. The model sees patterns from the target during training (it knows which departments have high promotion rates).
  2. Since test data is used in the encoding, the test set is no longer truly "unseen."
  3. The model performs extremely well on training data but fails on new data.

Example: Incorrect vs. Correct Target Encoding

Wrong Way (Causes Data Leakage)


import pandas as pd
from category_encoders import TargetEncoder

# Simulated Dataset
data = pd.DataFrame({'Department': ['HR', 'HR', 'IT', 'IT', 'IT'], 'Promoted': [1, 0, 1, 1, 0]})

# WRONG: Encoding using the entire dataset (leaks info from target)
encoder = TargetEncoder()
data['Department_Encoded'] = encoder.fit_transform(data['Department'], data['Promoted'])

print(data)          

The model learns from the entire dataset, including the test set. Problem: target means are calculated before splitting the data, so the model effectively sees test-set information during training.

Right Way (Avoids Data Leakage)


from sklearn.model_selection import train_test_split

# Split dataset FIRST (before encoding)
X_train, X_test, y_train, y_test = train_test_split(data[['Department']], data['Promoted'], test_size=0.2, random_state=42)

# Correct: Apply Target Encoding only on training data
encoder = TargetEncoder()
X_train['Department_Encoded'] = encoder.fit_transform(X_train['Department'], y_train)

# Apply SAME transformation on test set WITHOUT using test target values
X_test['Department_Encoded'] = encoder.transform(X_test['Department'])

print(X_train.head(), X_test.head())        

Now, the test set never "sees" the target values during encoding.

How to Prevent Data Leakage in Target Encoding

  1. Always split the data first, before encoding.
  2. Use cross-validation (CV) to compute mean target values out-of-fold (see the sketch below).
  3. Apply encoding only on the training set, then reuse the same fitted transformation on the test set.
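
Here is a minimal hand-rolled sketch of point 2, using scikit-learn's KFold on the toy promotion data from above (a real pipeline would use a larger dataset and possibly a library encoder): each row's encoded value is computed only from the other folds, so no row ever contributes to its own encoding.

import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({'Department': ['HR', 'HR', 'IT', 'IT', 'IT'],
                     'Promoted': [1, 0, 1, 1, 0]})

global_mean = data['Promoted'].mean()
data['Department_Encoded'] = 0.0

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fit_idx, enc_idx in kf.split(data):
    # Target means computed only on the other folds
    fold_means = data.iloc[fit_idx].groupby('Department')['Promoted'].mean()
    encoded = data.iloc[enc_idx]['Department'].map(fold_means).fillna(global_mean)
    data.loc[data.index[enc_idx], 'Department_Encoded'] = encoded.values

print(data)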


Frequency Encoding - Best for Large Datasets

  1. Replaces categories with how often they appear in the dataset.
  2. Useful for large datasets with many categories.
  3. Use when categories appear repeatedly in different amounts (e.g., Customer Purchases).
  4. Avoid when frequency has no meaning.

data = pd.DataFrame({'Product': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'C', 'C']})

# Compute Frequency
freq_encoded = data['Product'].value_counts(normalize=True)

# Replace Product with Encoded Value
data['Product_Encoded'] = data['Product'].map(freq_encoded)
print(data)        
Example Output (first rows shown):

        Product  Product_Encoded
0       A        0.333
1       B        0.222
2       C        0.444

  1. Preserves information about how often each category occurs in the data.
  2. May not capture more complex relationships between a category and the target.

1. Can Lose Category Meaning

  • Frequency Encoding doesn’t preserve category identity: different categories with similar counts get similar values.
  • Example: If "Tesla" and "Ford" have the same frequency, the model treats them as the same, even though they’re very different brands.

2. Can Cause Overfitting

  • If the dataset is small, frequency values may not generalize well.
  • The model might learn dataset-specific patterns that don’t exist in new data.
  • Works best with large datasets, which reduce this noise.

Solution: Apply smoothing techniques (e.g., Laplace smoothing), as sketched below.
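
A rough sketch of Laplace (add-one) smoothing on top of frequency encoding, in plain pandas and reusing the product data from above (the smoothing constant alpha is an assumption to tune): adding a small constant to every count keeps rare categories from getting extreme values.

import pandas as pd

data = pd.DataFrame({'Product': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'C', 'C']})

counts = data['Product'].value_counts()
alpha = 1  # smoothing constant (assumed; tune for your data)
n_categories = counts.shape[0]

# Smoothed frequency: (count + alpha) / (total + alpha * number_of_categories)
smoothed = (counts + alpha) / (len(data) + alpha * n_categories)

data['Product_Encoded'] = data['Product'].map(smoothed)
print(data)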

3. Doesn’t Work Well If Frequency Is Uniform

  • If all categories appear almost equally often, the encoding carries no useful signal.
  • There is no meaningful variation for the model to learn from.

Solution: Use One-Hot Encoding or Target Encoding if categories are evenly distributed.


Hash Encoding - Best for Large-Scale Categorical Data

  1. Use when there are thousands of unique values (e.g., User IDs, URLs).
  2. Avoid when interpretability is needed (you lose exact category mapping).
  3. Interpretability is how easily humans can understand why a model makes a decision.

import category_encoders as ce
data = pd.DataFrame({'User_ID': ['A123', 'B456', 'C789', 'A123', 'C789']})

encoder = ce.HashingEncoder(cols=['User_ID'], n_components=4)
data_encoded = encoder.fit_transform(data)
print(data_encoded)        

Best for large-scale datasets in NLP & recommender systems.
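
One property worth noting (a sketch continuing the example above with category_encoders; the new User_ID value is made up): hashing has no fitted vocabulary, so an ID never seen during fitting still maps to the same fixed set of columns, at the cost of possible collisions between different IDs.

import pandas as pd
import category_encoders as ce

data = pd.DataFrame({'User_ID': ['A123', 'B456', 'C789', 'A123', 'C789']})
encoder = ce.HashingEncoder(cols=['User_ID'], n_components=4)
encoder.fit(data)

# An unseen ID is hashed into the same 4 columns without refitting
new_users = pd.DataFrame({'User_ID': ['Z999']})
print(encoder.transform(new_users))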

Encoding Type        Best For                                            When to Avoid
Label Encoding       Binary categories (Yes/No)                          Non-ordered categories (e.g., Cities)
One-Hot Encoding     Nominal data (low unique values)                    Too many unique categories
Ordinal Encoding     Ordered categories (e.g., Small < Medium < Large)   No natural order
Target Encoding      High-cardinality categorical features               Small datasets (overfitting risk)
Frequency Encoding   Large datasets with repetitive categories           No meaningful frequency
Hash Encoding        Huge datasets (User IDs, URLs)                      When exact category mapping is needed

