Data Encoding in Machine Learning - Part 08

Do machine learning algorithms read text the way we do? Have you ever wondered how a machine learning model reads text, gets trained on it, and later responds to the questions you ask?

Feeding raw text directly to a machine learning model results in an error, as if the model were saying, “What is this? I can’t understand what you’re feeding me.”

Here is what actually happens during training. Before you feed data to an algorithm, you need to convert it into a numerical format, and this is a significant step in any machine learning workflow. Several encoding techniques exist for this conversion.

A model never works on the text itself; it works only on the numbers you derive from that text.
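To make this concrete, here is a minimal sketch of what happens when raw strings are passed to a scikit-learn estimator. The toy data and the hand-made `color_to_number` mapping are assumptions for illustration only (the mapping is arbitrary and just shows that numbers are accepted where text is not).

```python
from sklearn.linear_model import LogisticRegression

# Toy data (made up for this sketch): one text feature and a binary target
X_text = [["red"], ["blue"], ["red"], ["green"]]
y = [0, 1, 0, 1]

# Fitting on raw strings fails because the estimator expects numeric input
try:
    LogisticRegression().fit(X_text, y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("Model rejected raw text:", err)

# After converting the text to numbers, the same estimator trains without error
color_to_number = {"red": 0, "blue": 1, "green": 2}  # arbitrary illustrative mapping
X_numeric = [[color_to_number[row[0]]] for row in X_text]
model = LogisticRegression().fit(X_numeric, y)
predictions = model.predict(X_numeric)
```

The rest of the article covers better ways to do this conversion than an arbitrary hand-written mapping.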

In this article, we will discuss the different techniques that we use to convert text data into numerical data so that your algorithm can capture the patterns in the data for better performance.

What is Data Encoding?

Encoding data refers to the conversion of categorical or text data into a numerical format that can be easily understood and processed by algorithms.

So we need to convert categorical (text) data, but which categorical data? All of it, and all in the same way?

Categorical data comes in two types:

  1. Nominal Categorical Data: It represents categories or labels with no inherent order or ranking. In this type of data, categories are distinct and there is no specific order among them. Examples of nominal categorical data include colors, gender categories, or types of animals. Nominal data is used to classify items into distinct groups based on their shared characteristics.
  2. Ordinal Categorical Data: It represents categories with a meaningful order or ranking. While the categories have a relative position, the intervals between them are not necessarily uniform. In ordinal data, the order matters, but the precise degree of difference between categories is not defined. Examples of ordinal categorical data include educational levels (like high school, college, and graduate school) or customer satisfaction ratings (such as “poor,” “average,” “good,” and “excellent”). In ordinal data, you can say one category is greater or less than another, but you can’t quantify the exact difference between them.
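The difference between the two types can be sketched with pandas categoricals. This is only an illustration with made-up values; `pd.Categorical` with `ordered=True` is one way to tell pandas that the categories have a ranking.

```python
import pandas as pd

# Nominal: categories with no order (colors)
colors = pd.Series(pd.Categorical(["red", "green", "blue", "green"], ordered=False))

# Ordinal: categories with a meaningful order (satisfaction ratings)
ratings = pd.Series(
    pd.Categorical(
        ["good", "poor", "excellent", "average"],
        categories=["poor", "average", "good", "excellent"],
        ordered=True,
    )
)

# Ordered categories support "greater than" comparisons and max()
is_above_average = ratings > "average"
best_rating = ratings.max()

# Unordered categories do not: asking whether a color is "greater" raises TypeError
try:
    colors > "green"
    unordered_comparison_failed = False
except TypeError:
    unordered_comparison_failed = True
```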

Now that we have seen the two types, you might wonder why the distinction matters.

Imagine you’re building a recommendation system for an online food delivery website. The system aims to provide users with restaurant recommendations based on their preferences. In this scenario, you’re dealing with ordinal categorical data.

For instance, users can rate restaurants on a scale of 1 to 5, with categories like “Poor,” “Fair,” “Good,” “Very Good,” and “Excellent.” The ratings have a clear order from the least favorable (Poor) to the most favorable (Excellent), but the intervals between these categories might not be precisely equal.

In this ordinal categorical data, the order matters — you’d want to recommend restaurants with higher ratings before those with lower ratings.

The system needs to understand that “Very Good” is a better recommendation than “Fair” but may not precisely quantify the difference between the two.

This is where the ordinal nature of the categorical data becomes essential in building an effective recommendation model for restaurant searches.
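As a rough sketch of this idea, the ratings can be mapped to ranks and used to sort recommendations. The restaurant names and the `rating_rank` mapping are hypothetical; the ranks only express order, not how much better one rating is than another.

```python
# Hypothetical rank for each rating label, from least to most favorable
rating_rank = {"Poor": 1, "Fair": 2, "Good": 3, "Very Good": 4, "Excellent": 5}

# Hypothetical restaurants with their rating labels
restaurants = [
    ("Spice Hub", "Good"),
    ("Green Bowl", "Excellent"),
    ("Quick Bites", "Fair"),
    ("Pasta Point", "Very Good"),
]

# Recommend higher-rated restaurants first by sorting on the rank
recommended = sorted(restaurants, key=lambda item: rating_rank[item[1]], reverse=True)
names = [name for name, _ in recommended]
print(names)
```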

Different Encoding techniques

  1. One Hot Encoding
  2. Ordinal Encoding
  3. Label Encoding

Let’s discuss these techniques one by one. You can also encode with the pandas library; we will look at that in the one-hot encoding section.

1. One Hot Encoding

This technique is applied to nominal categorical data. One-hot encoding converts a categorical column into a binary matrix: it creates one binary column per category and marks the presence or absence of that category in each row with a 1 or 0, respectively. It is one of the most commonly used techniques for categorical data.

[Image: a categorical column converted into one-hot encoded columns. Source: Google]

The image shows how a categorical column is converted into numerical data:

  • Each unique category in the categorical column becomes a unique column in the one-hot encoding.
  • If a particular category is present in a data point (sample), the corresponding column for that category is set to 1.
  • All other columns are set to 0, indicating the absence of those categories in the given data point or row.
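These three rules can be written out by hand in a few lines. This is a minimal sketch with a made-up `colors` list, not how scikit-learn implements it internally.

```python
import numpy as np

colors = ["red", "green", "blue", "green"]

# Each unique category becomes one column (sorted for a stable column order)
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Start with all zeros: one row per sample, one column per category
one_hot = np.zeros((len(colors), len(categories)), dtype=int)

# Set the column of the category present in each row to 1
for row, value in enumerate(colors):
    one_hot[row, categories.index(value)] = 1

print(categories)
print(one_hot)
```

Every row contains exactly one 1, because each sample belongs to exactly one category.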

The pandas library does the same thing; you can compare both encodings in the practical examples (Colab notebook link given below).

Let’s see sample code that uses OneHotEncoder:
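For the pandas route, a small sketch using `pd.get_dummies` could look like this. The tiny DataFrame is made up for illustration; `dtype=int` is passed so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# A small made-up frame with one categorical column
df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# get_dummies creates one 0/1 column per category, prefixed with the column name
dummies = pd.get_dummies(df, columns=["species"], dtype=int)
print(dummies)
```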

# import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
# this library is used for one hot encoding
from sklearn.preprocessing import OneHotEncoder

# Load the iris dataset from Seaborn
iris = sns.load_dataset("iris")

# Display the first few rows of the dataset
iris.head()

# Select the categorical columns (species is included as an example)
categorical_columns = iris.select_dtypes(include=['object'])

# Initialize the OneHotEncoder
# (sparse_output replaced the older `sparse` argument in scikit-learn 1.2)
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the categorical columns using one-hot encoding
one_hot_encoded = one_hot_encoder.fit_transform(categorical_columns)

# Convert the one-hot encoded result to a DataFrame for better visualization
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out())

# Display the one-hot encoded DataFrame
one_hot_df.head()        

In the above code,

  • First, we load the built-in dataset from the Seaborn library.
  • We then apply one hot encoding on categorical columns.
  • Finally, we store the encoded data in a new DataFrame.

Google Colaboratory

I have written the code with detailed explanations in the Colab notebook linked above.

2. Ordinal Encoding

Ordinal encoding is another technique used in machine learning to represent categorical variables, particularly when the categories have a meaningful order or hierarchy. Unlike one-hot encoding, which creates binary vectors for each category, ordinal encoding assigns a unique numerical value to each category based on its order or importance.

Take a customer rating column as an example: ordinal encoding replaces each rating with an integer that reflects its rank. Keep in mind that scikit-learn’s OrdinalEncoder does not infer the meaningful order of the data on its own; by default it simply sorts the categories (alphabetically for strings). You must explicitly specify the desired order for the categories in each column.

from sklearn.preprocessing import OrdinalEncoder
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[iris.target]

# Display a snippet of the original dataset
iris_df.head()

# Define the order for ordinal encoding
species_order = ['setosa', 'versicolor', 'virginica']

# Select the column to encode as a DataFrame (OrdinalEncoder expects 2D input)
encode_column = iris_df[['species']]

# Create an instance of the OrdinalEncoder with the specified order
encoder = OrdinalEncoder(categories=[species_order])

# Fit and transform the 'species' column; ravel() flattens the (n, 1) result
iris_df['species_encoded'] = encoder.fit_transform(encode_column).ravel()

# Display the dataset with the encoded 'species' column
iris_df[['species', 'species_encoded']].head()        

  • We load the Iris dataset and create a DataFrame (iris_df) from it.
  • We specify the order for ordinal encoding (species_order).
  • We create an instance of the OrdinalEncoder from scikit-learn, passing the order as the categories parameter.
  • We fit and transform the ‘species’ column of the DataFrame using the fit_transform method of the encoder.
  • The result is a new column, ‘species_encoded’, containing the ordinal encoded values.
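The need to specify the order explicitly is easier to see on a column that really is ordinal. The sketch below uses a made-up customer rating column and compares the default behavior with an explicit order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up customer ratings
ratings = pd.DataFrame({"rating": ["good", "poor", "excellent", "average"]})

# Default: categories are sorted alphabetically
# (average=0, excellent=1, good=2, poor=3), which ignores the real ranking
default_codes = OrdinalEncoder().fit_transform(ratings).ravel()

# Explicit order: poor=0, average=1, good=2, excellent=3
rating_order = ["poor", "average", "good", "excellent"]
ordered_codes = OrdinalEncoder(categories=[rating_order]).fit_transform(ratings).ravel()

print(default_codes)  # [2. 3. 1. 0.]
print(ordered_codes)  # [2. 0. 3. 1.]
```

With the default, “poor” gets the highest code, which is the opposite of what the ranking means.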

Google Colaboratory

I have written the code with detailed explanations in the Colab notebook linked above.

The last encoding technique is label encoding.

3. Label Encoding

Label encoding is a technique used in machine learning to convert categorical data into numerical format. Unlike ordinal encoding, label encoding doesn’t consider any inherent order or hierarchy among the categories. Instead, it assigns a unique integer label to each category (scikit-learn’s LabelEncoder numbers the classes in sorted order).

Label encoding resembles ordinal encoding in that each category becomes a single integer, but it is not the same: it works on one column at a time and is meant for the target column.

NOTE: LabelEncoder is recommended only for the target feature (y), not for input features (X).

Simple code:

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Load the tips dataset from seaborn
tips = sns.load_dataset("tips")

# Display a few rows of the original dataset
tips.head()

# Apply label encoding to the 'smoker' column (treated here as the target, y)
label_encoder = LabelEncoder()
tips['smoker_encoded'] = label_encoder.fit_transform(tips['smoker'])

# Display a few rows of the dataset after label encoding
tips[['smoker', 'smoker_encoded']].head()        
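A fitted LabelEncoder also remembers the mapping, so predictions can be turned back into the original labels. Here is a small sketch with a made-up target `y`; `classes_` and `inverse_transform` are part of scikit-learn’s LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target labels
y = ["spam", "ham", "spam", "ham", "spam"]

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# classes_ holds the labels in sorted order; the position is the assigned integer
print(label_encoder.classes_)  # ['ham' 'spam']
print(y_encoded)               # [1 0 1 0 1]

# inverse_transform maps the integers back to the original labels
y_decoded = label_encoder.inverse_transform(y_encoded)
```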

Google Colaboratory

You can go through the Colab notebook linked above for a better understanding.

I recommend going through the official scikit-learn documentation as well. The encoders have other parameters that you can adjust for better model performance.

Check out the official pages at the links given below.
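As one example of such parameters, OneHotEncoder accepts `handle_unknown` and `drop`. The sketch below uses made-up color data; whether these settings help depends on your data and model.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up training data and new data containing an unseen category ("purple")
train = np.array([["red"], ["green"], ["blue"]])
new = np.array([["green"], ["purple"]])

# handle_unknown="ignore": unseen categories become all-zero rows instead of errors
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded_new = encoder.transform(new).toarray()  # columns: blue, green, red

# drop="first": drop the first category's column to avoid a redundant column
dropped = OneHotEncoder(drop="first").fit_transform(train).toarray()  # columns: green, red

print(encoded_new)
print(dropped)
```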

Sklearn One Hot Encoder

Sklearn Ordinal Encoder

Sklearn Label Encoder

These are the encoding techniques covered here. Take a sample public dataset, for example from Kaggle, and try applying them.

That’s it for this article. We’ll discuss another topic, Column Transformers, in detail in the next article.


Thank you for taking the time to read this article.

I hope it has provided you with valuable insights into data encoding and how it can be used to enhance the performance of machine learning models.

I’m excited to share these hands-on insights and make the content more engaging.

Stay tuned for upcoming articles.

Previous article: 7. Standardization and Normalization in ML.

Next article: 9. Data Transformations in ML.


YouTube Channel

