Data Encoding in Machine Learning - Part 08
Vinod Kumar G R
Co-founder of ApexIQ | Driving AI Innovation with LLMs & GenAI | Passionate about Transformative AI Solutions
Do machine learning algorithms read text the way we do? Have you ever wondered how a machine learning model takes in text, gets trained on it, and later answers the questions you ask?
Feeding raw text directly to a machine learning model usually results in an error, as if the model were telling you, “What the hell! What is this? I can’t understand what you’re feeding me.”
Let me tell you what is happening here and how models actually get trained. Most algorithms work only on numbers, so a significant step in machine learning is converting the data into a numerical format before feeding it to your algorithm. Several techniques exist for doing this conversion.
The machine never understands text directly; you need to convert it into a numerical format first.
In this article, we will discuss the different techniques that we use to convert text data into numerical data so that your algorithm can capture the patterns in the data for better performance.
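To make the problem concrete, here is a minimal sketch of what happens when raw text reaches a model. The toy data (`X_text`, `y`) is made up for illustration, and `LogisticRegression` is just one example of a scikit-learn estimator; the exact error message may differ between versions.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: a single text feature and a binary target
X_text = [["red"], ["blue"], ["red"], ["green"]]
y = [1, 0, 1, 0]

model = LogisticRegression()
raw_text_error = None
try:
    # scikit-learn tries to cast the strings to floats and fails
    model.fit(X_text, y)
except ValueError as err:
    raw_text_error = err
    print("Model rejected raw text:", err)
```

The fit call fails before any training happens, which is why the encoding step has to come first.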
What is Data Encoding?
Encoding data refers to the conversion of categorical or text data into a numerical format that can be easily understood and processed by algorithms.
Here we need to convert the categorical (text) data, but the question is which categorical data? All of it?
Categorical data comes in two different types: nominal, where the categories have no inherent order, and ordinal, where they do.
Having seen the two types, you might wonder why the distinction matters.
Imagine you’re building a recommendation system for an online food delivery website. The system aims to provide users with restaurant recommendations based on their preferences. In this scenario, you’re dealing with ordinal categorical data.
For instance, users can rate restaurants on a scale of 1 to 5, with categories like “Poor,” “Fair,” “Good,” “Very Good,” and “Excellent.” The ratings have a clear order from the least favorable (Poor) to the most favorable (Excellent), but the intervals between these categories might not be precisely equal.
In this ordinal categorical data, the order matters — you’d want to recommend restaurants with higher ratings before those with lower ratings.
The system needs to understand that “Very Good” is a better recommendation than “Fair” but may not precisely quantify the difference between the two.
This is where the ordinal nature of the categorical data becomes essential in building an effective recommendation model for restaurant searches.
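The restaurant example above can be sketched in a few lines of plain Python. The restaurant names and the `rating_rank` mapping are hypothetical; the point is only that ordinal labels need an explicit order before they can be compared.

```python
# Rating scale from the example, listed from least to most favorable
rating_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]

# Map each label to its position in the scale (Poor -> 0, ..., Excellent -> 4)
rating_rank = {label: rank for rank, label in enumerate(rating_order)}

# Hypothetical restaurants with their ratings
restaurants = [
    ("Spice Hub", "Fair"),
    ("Green Bowl", "Excellent"),
    ("Pasta Point", "Very Good"),
    ("Quick Bites", "Poor"),
]

# Recommend higher-rated restaurants first
ranked = sorted(restaurants, key=lambda item: rating_rank[item[1]], reverse=True)
recommended = [name for name, _ in ranked]
print(recommended)
```

The numbers only say that one rating is better than another; they do not claim that the gap between “Fair” and “Good” equals the gap between “Good” and “Very Good”.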
Different Encoding techniques
Let’s discuss these techniques one by one. You can also encode with the pandas library; we will look at the pandas approach in the one-hot encoding section itself.
1. One Hot Encoding
This technique is applied to nominal categorical data. One-hot encoding converts a categorical column into a binary matrix: it creates one binary column for each category and marks the presence or absence of that category with a 1 or 0, respectively. This is the most common technique used for categorical data.
You can see in the above image, how the categorical column is converted into numerical data.
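The same conversion can be sketched by hand in plain Python. The `colors` column is a made-up example, and sorting the categories is just one convenient way to fix the column order.

```python
# Hypothetical nominal column
colors = ["red", "green", "blue", "green"]

# One output column per distinct category, in sorted order
categories = sorted(set(colors))

# Each row gets a 1 in the column of its own category and 0 elsewhere
one_hot_rows = [
    [1 if value == category else 0 for category in categories]
    for value in colors
]

print(categories)
for value, row in zip(colors, one_hot_rows):
    print(value, row)
```

Every row contains exactly one 1, which is where the name “one-hot” comes from.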
The pandas library produces the same result, and you can compare both encodings in the practical examples (Colab notebook link given below).
Let’s see sample code that uses OneHotEncoder:
# import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
# this library is used for one hot encoding
from sklearn.preprocessing import OneHotEncoder
# Load the iris dataset from Seaborn
iris = sns.load_dataset("iris")
# Display the first few rows of the dataset
iris.head()
# Select the categorical columns (species is included as an example)
categorical_columns = iris.select_dtypes(include=['object'])
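# Initialize the OneHotEncoder to return a dense array
# (scikit-learn 1.2+ uses sparse_output; older versions used sparse=False)
one_hot_encoder = OneHotEncoder(sparse_output=False)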
# Fit and transform the categorical columns using one-hot encoding
one_hot_encoded = one_hot_encoder.fit_transform(categorical_columns)
# Convert the one-hot encoded result to a DataFrame for better visualization
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out())
# Display the one-hot encoded DataFrame
one_hot_df.head()
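In the above code, we select the categorical column (species), fit the OneHotEncoder on it, and wrap the resulting array in a DataFrame with one column per species.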
Google Colaboratory
I have written code with detailed explanations in the colab notebook. Click the above link.
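For comparison, here is a minimal sketch of the pandas route using `pd.get_dummies`. It uses a tiny hand-made frame rather than the full iris dataset, so the data is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical small frame standing in for the iris 'species' column
df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# get_dummies creates one 0/1 column per category, prefixed with the column name
dummies = pd.get_dummies(df, columns=["species"], dtype=int)

print(dummies)
```

One practical difference is that a fitted `OneHotEncoder` remembers its categories and can be reused on new data, while `get_dummies` only looks at the frame it is given.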
2. Ordinal Encoding
Ordinal encoding is another technique used in machine learning to represent categorical variables, particularly when the categories have a meaningful order or hierarchy. Unlike one-hot encoding, which creates binary vectors for each category, ordinal encoding assigns a unique numerical value to each category based on its order or importance.
The image shows an example of a customer rating column and what happens when you apply ordinal encoding to it. Remember that when you apply ordinal encoding with the sklearn library, the library does not infer the meaningful order of the categories on its own; by default it simply sorts them. You must explicitly specify the desired order for the categories within each column.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.datasets import load_iris
import pandas as pd
# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[iris.target]
# Display a snippet of the original dataset
iris_df.head()
# Define the order for ordinal encoding
species_order = ['setosa', 'versicolor', 'virginica']
# Select the column to encode as a DataFrame (OrdinalEncoder expects 2D input)
encode_column = iris_df[['species']]
# Create an instance of the OrdinalEncoder with the specified order
encoder = OrdinalEncoder(categories=[species_order])
# Fit and transform the 'species' column; ravel() flattens the (n, 1) result
iris_df['species_encoded'] = encoder.fit_transform(encode_column).ravel()
# Display the dataset with the encoded 'species' column
iris_df[['species', 'species_encoded']].head()
Google Colaboratory
I have written code with detailed explanations in the colab notebook. Click the above link.
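To see why the order has to be specified, here is a minimal sketch on the customer rating scale from earlier. The `ratings` frame is made up for illustration; it compares the default behaviour of `OrdinalEncoder` with an explicit `categories` list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rating column, listed from worst to best
rating_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
ratings = pd.DataFrame({"rating": rating_order})

# Default: categories are sorted alphabetically, so the codes ignore the real order
default_codes = OrdinalEncoder().fit_transform(ratings).ravel().tolist()

# Explicit order: codes follow the scale from Poor (0) to Excellent (4)
ordered_codes = (
    OrdinalEncoder(categories=[rating_order]).fit_transform(ratings).ravel().tolist()
)

print("default:", default_codes)
print("ordered:", ordered_codes)
```

With the default setting, “Excellent” gets the lowest code because it comes first alphabetically, which is the opposite of what the rating scale means.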
The last encoding technique is label encoding.
3. Label Encoding
Label encoding is a technique used in machine learning to convert categorical data into numerical format. Unlike one-hot encoding or ordinal encoding, label encoding doesn’t necessarily consider any inherent order or hierarchy among the categories. Instead, it assigns a unique numerical label to each category, essentially converting them into numerical representations.
Label encoding resembles ordinal encoding in that each category becomes a single integer, but the integers follow the sorted order of the labels rather than an order you specify.
NOTE: LabelEncoder is recommended only for the target feature (y), not for input features (X).
Simple code:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
# Load the tips dataset from seaborn
tips = sns.load_dataset("tips")
# Display a few rows of the original dataset
tips.head()
# Apply label encoding to the 'smoker' column
label_encoder = LabelEncoder()
tips['smoker_encoded'] = label_encoder.fit_transform(tips['smoker'])
# Display a few rows of the dataset after label encoding
tips[['smoker', 'smoker_encoded']].head()
Google Colaboratory
You can go through the Colab notebook linked above for a better understanding.
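Since label encoding is meant for the target, here is a minimal sketch of that use. The target list `y` is made up for illustration; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column (y) with three classes
y = ["setosa", "virginica", "versicolor", "setosa"]

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# classes_ holds the labels in sorted order; the index of each label is its code
classes = list(label_encoder.classes_)

# inverse_transform maps the integer codes back to the original labels
y_decoded = label_encoder.inverse_transform(y_encoded)

print(classes)
print(y_encoded)
print(list(y_decoded))
```

Keeping the fitted encoder around lets you turn a model's integer predictions back into readable class names.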
I recommend going through the official sklearn documentation as well. The encoders have other parameters you can adjust to suit your data.
Check out the official pages at the links given below.
Those are the encoding techniques for this article. Take a sample public dataset, for example from Kaggle, and try applying them.
That’s it for this article. In the next article, we’ll discuss another topic, Column Transformers, in detail.
Thank you for taking the time to read this article.
I hope it has provided you with valuable insights into the world of data encoding and how it can be used to enhance the performance of machine learning models.
I’m excited to share these hands-on insights and make the content more engaging.
Stay tuned for upcoming articles.
Previous article: 7. Standardization and Normalization in ML.
Next article: 9. Data Transformations in ML.
YouTube Channel