Encoding Techniques
Encoding is the process of converting categorical data (words, labels) into numerical form so that machine learning models can understand it.
Why is encoding needed? Most machine learning algorithms can only operate on numeric inputs, so text categories such as city names or product labels must be mapped to numbers before a model can use them as features.
Types of Encoding Techniques:
Label Encoding (Integer Encoding) - Best for Binary Categories
from sklearn.preprocessing import LabelEncoder
data = ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco']
encoder = LabelEncoder()
encoded = encoder.fit_transform(data)
print(encoded) # Output: [2, 3, 1, 0, 3]
One-Hot Encoding (OHE) - Best for Nominal Data with Few Unique Categories
import pandas as pd
data = pd.DataFrame({'City': ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco']})
# One-Hot Encode
encoded = pd.get_dummies(data, columns=['City'])
print(encoded)
   City_Austin  City_Los Angeles  City_New York  City_San Francisco
0            0                 0              1                   0
1            0                 0              0                   1
2            0                 1              0                   0
3            1                 0              0                   0
4            0                 0              0                   1
But when applied to high-cardinality categorical variables (e.g., thousands of city names, product IDs, or customer names), it leads to:
1. Too many features (high dimensionality); see the sketch below
2. Sparsity (most values will be 0s, increasing memory usage)
3. Overfitting (the model memorizes noise instead of learning patterns)
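To make the first two points concrete, here is a minimal sketch on synthetic data (the Customer_ID column, the 10,000 rows, and the 5,000 distinct IDs are made-up numbers for illustration):

import numpy as np
import pandas as pd

# Synthetic high-cardinality column: 10,000 rows, ~5,000 distinct (made-up) customer IDs
rng = np.random.default_rng(0)
data = pd.DataFrame({'Customer_ID': rng.integers(0, 5000, size=10_000).astype(str)})

encoded = pd.get_dummies(data, columns=['Customer_ID'])
print(encoded.shape)                 # roughly (10000, 5000): one new column per unique ID
print((encoded == 0).mean().mean())  # the vast majority of entries are 0 (sparse)

One column per unique value means the feature matrix grows with the number of categories, and nearly every cell is zero.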
Why Does One-Hot Encoding Cause Overfitting?
Overfitting occurs when the model learns spurious correlations instead of general patterns.
Example: Predicting House Prices Based on City Names
City             Price ($1000s)
New York         800
San Francisco    900
Austin           700
Boston           750
If we one-hot encode City, we get:
City_NewYork  City_SanFrancisco  City_Austin  City_Boston  Price ($1000s)
1             0                  0            0            800
0             1                  0            0            900
0             0                  1            0            700
0             0                  0            1            750
What's the problem? Each city appears only once, so the model can simply memorize the exact price attached to each one-hot column instead of learning any general relationship between the features and the price.
Result: Overfitting → Model performs well on training data but fails on unseen data.
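A minimal sketch of that memorization effect, using a plain LinearRegression (an illustrative model choice) on the four example rows above, prices in $1000s:

import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({'City':  ['New York', 'San Francisco', 'Austin', 'Boston'],
                      'Price': [800, 900, 700, 750]})

# One column per city -> the model can assign each city its exact training price
X_train = pd.get_dummies(train[['City']], columns=['City'])
model = LinearRegression().fit(X_train, train['Price'])

print(model.predict(X_train))  # approximately [800. 900. 700. 750.]: training prices reproduced exactly
# A city that never appeared in training (e.g., 'Chicago') has no column at all,
# so nothing the model learned transfers to unseen data.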
Ordinal Encoding - Best for Ordered Categories
from sklearn.preprocessing import OrdinalEncoder
data = [['Small'], ['Medium'], ['Large'], ['Small'], ['Large']]
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']]) # Define order
encoded = encoder.fit_transform(data)
print(encoded) # Output: [[0.], [1.], [2.], [0.], [2.]]
Target Encoding (Mean Encoding) - Best for High-Cardinality Categorical Data
data = pd.DataFrame({'City': ['New York', 'San Francisco', 'Los Angeles', 'Austin', 'San Francisco'],
'Price': [500000, 600000, 550000, 450000, 620000]})
# Compute Mean Price per City
target_encoded = data.groupby('City')['Price'].mean()
# Replace City with Encoded Value
data['City_Encoded'] = data['City'].map(target_encoded)
print(data)
Example Output:

            City   Price  City_Encoded
0       New York  500000        500000
1  San Francisco  600000        610000
2    Los Angeles  550000        550000
3         Austin  450000        450000
4  San Francisco  620000        610000
But if done incorrectly, it can leak target information from the test set into the training features, leading to artificially high accuracy.
Why Does This Happen? Let's say we are predicting employee promotions and use Target Encoding on the Department feature.
Department    Promoted (Target)
HR            1
HR            0
IT            1
IT            1
IT            0
Target Encoding calculates the mean promotion rate per department:

Department    Target Encoding Value
HR            0.5
IT            0.67
Problem: If we calculate these values using all data (including the test set), the model gets access to information it shouldn't have during training.
How This Leads to Overfitting
1. The model sees patterns from the target during training (it knows which departments have high promotion rates).
2. Because test data was used in the encoding, the test set is no longer truly "unseen."
3. The model performs extremely well on training data but fails on new data.
Example: Incorrect vs. Correct Target Encoding
Wrong Way (Causes Data Leakage)
import pandas as pd
from category_encoders import TargetEncoder
# Simulated Dataset
data = pd.DataFrame({'Department': ['HR', 'HR', 'IT', 'IT', 'IT'], 'Promoted': [1, 0, 1, 1, 0]})
# WRONG: Encoding using the entire dataset (leaks info from target)
encoder = TargetEncoder()
data['Department_Encoded'] = encoder.fit_transform(data['Department'], data['Promoted'])
print(data)
Problem: The target means are calculated on the entire dataset before any train/test split, so the encoding already contains information from rows that will later form the test set.
Right Way (Avoids Data Leakage)
from sklearn.model_selection import train_test_split
# Split dataset FIRST (before encoding)
X_train, X_test, y_train, y_test = train_test_split(data[['Department']], data['Promoted'], test_size=0.2, random_state=42)
# Correct: Apply Target Encoding only on training data
encoder = TargetEncoder()
X_train['Department_Encoded'] = encoder.fit_transform(X_train['Department'], y_train)
# Apply SAME transformation on test set WITHOUT using test target values
X_test['Department_Encoded'] = encoder.transform(X_test['Department'])
print(X_train.head(), X_test.head())
Now the test set never "sees" the target values during encoding.
How to Prevent Data Leakage in Target Encoding
1. Split the data into training and test sets before fitting the encoder.
2. Fit the encoder on the training set only, then reuse it to transform the test set.
3. On small datasets, use out-of-fold (K-fold) target encoding, sketched below.
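In out-of-fold target encoding, each row is encoded with target means computed only from the other folds, so no row ever contributes to its own encoding. A minimal sketch on the Department/Promoted example (the 5-fold split and the global-mean fallback are illustrative choices):

import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({'Department': ['HR', 'HR', 'IT', 'IT', 'IT'],
                     'Promoted':   [1,    0,    1,    1,    0]})

global_mean = data['Promoted'].mean()
data['Department_Encoded'] = global_mean  # fallback for categories unseen in a fold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fit_idx, enc_idx in kf.split(data):
    # Means are computed only on the "other" folds, never on the rows being encoded
    fold_means = data.iloc[fit_idx].groupby('Department')['Promoted'].mean()
    data.loc[data.index[enc_idx], 'Department_Encoded'] = (
        data.iloc[enc_idx]['Department'].map(fold_means).fillna(global_mean).values
    )

print(data)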
Frequency Encoding - Best for Large Datasets
data = pd.DataFrame({'Product': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'C', 'C']})
# Compute Frequency
freq_encoded = data['Product'].value_counts(normalize=True)
# Replace Product with Encoded Value
data['Product_Encoded'] = data['Product'].map(freq_encoded)
print(data)
Example Output (first rows shown):

  Product  Product_Encoded
0       A            0.333
1       B            0.222
2       C            0.444
Drawbacks of Frequency Encoding:
1. Can lose category meaning: two different categories with the same frequency receive the same code.
2. Can cause overfitting. Solution: apply smoothing techniques (e.g., Laplace smoothing), as sketched below.
3. Doesn't work well if frequencies are nearly uniform. Solution: use One-Hot Encoding or Target Encoding if categories are evenly distributed.
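As a minimal sketch of the smoothing idea from point 2, here is Laplace-smoothed frequency encoding applied to the Product example (alpha = 1 is an illustrative choice); counts are pulled toward the uniform rate 1/k, which dampens the influence of very rare categories:

import pandas as pd

data = pd.DataFrame({'Product': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'C', 'C']})

counts = data['Product'].value_counts()   # raw count per category
alpha = 1                                 # smoothing strength (illustrative choice)
k = counts.shape[0]                       # number of distinct categories
n = len(data)                             # total number of rows

# Laplace-smoothed frequency: (count + alpha) / (n + alpha * k)
smoothed = (counts + alpha) / (n + alpha * k)
data['Product_Encoded'] = data['Product'].map(smoothed)
print(data)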
Hash Encoding - Best for Large-Scale Categorical Data
import category_encoders as ce
data = pd.DataFrame({'User_ID': ['A123', 'B456', 'C789', 'A123', 'C789']})
encoder = ce.HashingEncoder(cols=['User_ID'], n_components=4)  # hash each User_ID into 4 feature columns
data_encoded = encoder.fit_transform(data)
print(data_encoded)
Best for large-scale datasets in NLP & recommender systems.
Encoding Type        Best For                                            When to Avoid
Label Encoding       Binary categories (Yes/No)                          Non-ordered categories (e.g., cities)
One-Hot Encoding     Nominal data with few unique values                 Too many unique categories
Ordinal Encoding     Ordered categories (e.g., Small < Medium < Large)   No natural order
Target Encoding      High-cardinality categorical features               Small datasets (overfitting risk)
Frequency Encoding   Large datasets with repetitive categories           No meaningful frequency differences
Hash Encoding        Huge datasets (User IDs, URLs)                      When an interpretable category mapping is needed