Data Encoding in Machine Learning

Data Encoding in Machine Learning

Data encoding plays a crucial role in machine learning, especially when dealing with categorical data or text data that cannot be directly fed into a model. Proper data encoding ensures that the data is in a numerical format that the machine learning algorithm can understand and learn from effectively.

Nominal or One Hot Encoding :

  • It is technique used to represent categorical data as numerical data , which is more suitable for Machine Learning Algorithms
  • In this method each category is represented as binary vector where each bit corresponds to a unique category
  • For example ,There is a feature "Color" in which it has three categories "Red", "Green", "Blue" .When applying One Hot Encoding the three categories are divided into three features where if "Red" category comes in a row all the rows except "Red" become 0 in the new feature created separately for "Red"
  • The main disadvantage of using One Hot Encoding is that if we have large number of categories ,if we apply One Hot Encoding to this ,many number of Features are created .
  • Another Disadvantage is Sparse Matrix that is 1s and 0s when we have n number of categories, leads to overfitting of the model

Label Encoding :

  • Label encoding involves assigning unique numerical values to each categories in the feature
  • The labels are usually arranged in alphabetical order or based on frequency of the category
  • For example , Consider a feature "Color" which has category "Green", "Blue" , "Red" when applying Label encoding to it gives Green-2,Blue-1,Red-3
  • The main disadvantage of using Label encoding is that when we are analyzing ordinal data there is some ranking in the category .If we apply Label Encoding to them it will give unique values based on alphabetical or frequency of the category. It leads to the inaccuracy of the model output.

Ordinal Encoding :

  • Ordinal Encoding is used to encode categorical data that have an intrinsic order or ranking .
  • In this technique each category is assigned a numerical value based on the position in the order.
  • For example , If we have a feature "Educational Qualification" with categories "Graduate", "Post Graduate", "High School" when applying Ordinal Encoding to it gives "High school" - 1, "Graduate" - 2, "Post Graduate - 3"

Target Guided Ordinal Encoding :

  • It is technique used to encode categorical variable based on their relation with the target variable
  • It is useful when we have a Categorical feature with large number of unique categories.
  • In this, we replace the unique categories with a numerical value based on the mean or median of the target variable for that category
  • This create an monotonic relationship between categories / value and Target variable , which can improve the predictive power of the model

要查看或添加评论,请登录

GOKUL . S的更多文章

  • Understanding Support Vector Machines (SVM)

    Understanding Support Vector Machines (SVM)

    Support Vector Machines (SVM) is a powerful machine learning algorithm used for both classification and regression…

    2 条评论
  • Understanding Logistic Regression: A Fundamental Tool in Machine Learning

    Understanding Logistic Regression: A Fundamental Tool in Machine Learning

    Understanding Logistic Regression: A Fundamental Tool in Machine Learning In the world of machine learning…

    1 条评论
  • What is Linear Regression?

    What is Linear Regression?

    Imagine you’re a shopkeeper, and you notice that as the temperature outside increases, more people buy cold drinks from…

  • Supervised Machine Learning: A Comprehensive Overview

    Supervised Machine Learning: A Comprehensive Overview

    In the realm of artificial intelligence (AI) and data science, supervised machine learning stands as a cornerstone…

  • Navigating the Future: The Integration of Machine Learning in Self-Driving Cars

    Navigating the Future: The Integration of Machine Learning in Self-Driving Cars

    Introduction: Self-driving cars represent a paradigm shift in transportation, promising safer roads, increased…

  • PANDAS LIBRARY

    PANDAS LIBRARY

    In the realm of data science and analytics, the ability to efficiently manipulate and analyze data is paramount. Enter…

  • Exploring Data Visualization with Seaborn: A Powerful Python Library

    Exploring Data Visualization with Seaborn: A Powerful Python Library

    In the vast landscape of data science and analysis, visualization serves as a powerful tool for understanding…

  • Mongo DB

    Mongo DB

    MongoDB is a document-oriented NoSQL database, designed for ease of development, scalability, and performance. Unlike…

  • Space X

    Space X

    Founded by visionary entrepreneur Elon Musk in 2002, SpaceX has become synonymous with innovation in space exploration.…

  • AMAZON WEB SERVICES

    AMAZON WEB SERVICES

    Unleashing the Power of Cloud Computing: A Deep Dive into Amazon Web Services (AWS) In the ever-evolving landscape of…

社区洞察

其他会员也浏览了