Encode Categorical Variables to Numeric Variables: Label encoder v/s One hot encoder
MANIDIPA CHAKRAVARTI
Senior Manager | Procurement Professional | SAP Ariba | Data Science Enthusiast
Label encoder v/s One hot encoder
Typically, a structured data set includes multiple columns, a combination of numerical and categorical variables. Most machine learning algorithms work only with numbers, not text, so categorical values must be converted to numbers before modelling. This process is called categorical encoding.
Categorical data describes categories or groups. One example is car brands such as Mercedes, BMW and Audi; another is car body types such as Hatchback, Convertible and Sedan.
In this article, we compare two very popular encoding techniques in detail, with statistical analysis:
Label Encoding
Encode each label as an integer between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it gets the same value it was assigned earlier.
For example: 0 - Hatchback, 1 - Convertible, 2 - Sedan, and so on.
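As a rough illustration of the idea, here is a minimal pure-Python sketch. The helper name label_encode and the sample list are made up for this example, and it numbers categories in order of first appearance so that it matches the mapping above. Libraries may number them differently; scikit-learn's LabelEncoder, for instance, assigns codes in sorted order of the labels.

```python
def label_encode(values):
    """Map each distinct label to an integer, in order of first appearance.

    Hypothetical helper for illustration; not the article's own code.
    """
    mapping = {}
    codes = []
    for value in values:
        # A new label gets the next unused integer; a repeated label
        # reuses the integer it was given earlier.
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# Sample data invented for this sketch.
body_types = ["Hatchback", "Convertible", "Sedan", "Hatchback", "Sedan"]
codes, mapping = label_encode(body_types)

print(mapping)  # {'Hatchback': 0, 'Convertible': 1, 'Sedan': 2}
print(codes)    # [0, 1, 2, 0, 2]
```

Note that the integers carry an implied order (0 < 1 < 2) even though body types have no natural ranking, which is one reason to compare this approach against one-hot encoding.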
One-Hot Encoding
Add one dummy (binary) variable for each unique category. Each row gets a 1 in the column of its own category and a 0 in all the others.
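A minimal pure-Python sketch of the same idea follows. The helper name one_hot_encode and the sample list are assumptions made for this example, and the columns are ordered alphabetically only to keep the output predictable. In practice a library routine such as pandas' get_dummies or scikit-learn's OneHotEncoder would usually do this job.

```python
def one_hot_encode(values):
    """Return the category names and one 0/1 row per input value.

    Hypothetical helper for illustration; not the article's own code.
    """
    # One dummy column per unique category (sorted for a stable order).
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Sample data invented for this sketch.
body_types = ["Hatchback", "Convertible", "Sedan", "Hatchback"]
categories, rows = one_hot_encode(body_types)

print(categories)  # ['Convertible', 'Hatchback', 'Sedan']
for row in rows:
    print(row)
# [0, 1, 0]
# [1, 0, 0]
# [0, 0, 1]
# [0, 1, 0]
```

Every row contains exactly one 1, so no ordering between categories is implied, at the cost of one extra column per category.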
We will pick the winner by comparing the RMSE (root mean squared error) of the models built with each encoding; the lower RMSE wins.
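For readers who have not met RMSE before, it is the square root of the average squared difference between actual and predicted values. The sketch below shows the comparison step only. The function name rmse, the labels A and B, and all the numbers are placeholders invented for illustration, not results from the models in this article.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder values for illustration only (not real model output).
y_true = [3.0, 5.0, 7.0]
pred_a = [2.0, 5.0, 9.0]  # predictions from a hypothetical model A
pred_b = [3.0, 4.0, 7.0]  # predictions from a hypothetical model B

rmse_a = rmse(y_true, pred_a)
rmse_b = rmse(y_true, pred_b)

# The model with the lower RMSE is the better fit on this data.
winner = "A" if rmse_a < rmse_b else "B"
print(f"RMSE A = {rmse_a:.3f}, RMSE B = {rmse_b:.3f}, winner = {winner}")
```

For a fair comparison, both models should be evaluated on the same held-out rows, so that the only thing that differs is the encoding.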
Below is the source code from GitHub. Happy Encoding :)