Data encoding plays a crucial role in machine learning, especially when dealing with categorical data or text data that cannot be directly fed into a model. Proper data encoding ensures that the data is in a numerical format that the machine learning algorithm can understand and learn from effectively.
Nominal or One Hot Encoding :
- It is technique used to represent categorical data as numerical data , which is more suitable for Machine Learning Algorithms
- In this method each category is represented as binary vector where each bit corresponds to a unique category
- For example ,There is a feature "Color" in which it has three categories "Red", "Green", "Blue" .When applying One Hot Encoding the three categories are divided into three features where if "Red" category comes in a row all the rows except "Red" become 0 in the new feature created separately for "Red"
- The main disadvantage of using One Hot Encoding is that if we have large number of categories ,if we apply One Hot Encoding to this ,many number of Features are created .
- Another Disadvantage is Sparse Matrix that is 1s and 0s when we have n number of categories, leads to overfitting of the model
- Label encoding involves assigning unique numerical values to each categories in the feature
- The labels are usually arranged in alphabetical order or based on frequency of the category
- For example , Consider a feature "Color" which has category "Green", "Blue" , "Red" when applying Label encoding to it gives Green-2,Blue-1,Red-3
- The main disadvantage of using Label encoding is that when we are analyzing ordinal data there is some ranking in the category .If we apply Label Encoding to them it will give unique values based on alphabetical or frequency of the category. It leads to the inaccuracy of the model output.
- Ordinal Encoding is used to encode categorical data that have an intrinsic order or ranking .
- In this technique each category is assigned a numerical value based on the position in the order.
- For example , If we have a feature "Educational Qualification" with categories "Graduate", "Post Graduate", "High School" when applying Ordinal Encoding to it gives "High school" - 1, "Graduate" - 2, "Post Graduate - 3"
Target Guided Ordinal Encoding :
- It is technique used to encode categorical variable based on their relation with the target variable
- It is useful when we have a Categorical feature with large number of unique categories.
- In this, we replace the unique categories with a numerical value based on the mean or median of the target variable for that category
- This create an monotonic relationship between categories / value and Target variable , which can improve the predictive power of the model