Encode Categorical Variables to Numeric Variables: Label Encoder vs One-Hot Encoder

Typically, any structured data set includes multiple columns, a combination of numerical as well as categorical variables. A machine learning algorithm can only understand numbers, not text, so the categorical columns must be converted to numbers before training. This process is called categorical encoding: converting categories to numbers.

Categorical data describes categories or groups. One example would be car brands like Mercedes, BMW, and Audi; another would be car body types like Hatchback, Convertible, and Sedan.

In this article, we do a detailed comparison, backed by statistical analysis, of two very popular encoding techniques.

Label Encoding

Label encoding assigns each label a value between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it gets the same value that was assigned earlier.

For example: 0 for Hatchback, 1 for Convertible, 2 for Sedan, and so on.
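
As a quick illustration (not the article's original GitHub code), here is a minimal sketch using scikit-learn's LabelEncoder; the toy data and the column name body_type are assumptions made for this example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the column name body_type is assumed for illustration.
df = pd.DataFrame({"body_type": ["Hatchback", "Convertible", "Sedan", "Hatchback"]})

le = LabelEncoder()
# fit_transform learns the distinct labels and maps each to an integer in 0..n_classes-1.
df["body_type_encoded"] = le.fit_transform(df["body_type"])

# LabelEncoder assigns codes in sorted label order, so here:
# Convertible -> 0, Hatchback -> 1, Sedan -> 2
print(df)

Note that scikit-learn orders the labels alphabetically, so the exact numbers may differ from the illustrative mapping above.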

One-Hot Encoding

One-hot encoding adds a dummy variable for each unique category and assigns it 1 when the row belongs to that category and 0 otherwise.
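
Again as an illustrative sketch (assuming the same hypothetical body_type column), one-hot encoding can be done with pandas' get_dummies:

import pandas as pd

# Hypothetical toy data; the column name body_type is assumed for illustration.
df = pd.DataFrame({"body_type": ["Hatchback", "Convertible", "Sedan", "Hatchback"]})

# get_dummies creates one 0/1 column per unique category of body_type.
one_hot = pd.get_dummies(df["body_type"], prefix="body_type", dtype=int)
print(one_hot)

scikit-learn's OneHotEncoder does the same job and integrates with pipelines; get_dummies is shown here only for brevity.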


We will infer the winner by comparing the RMSE of both models.
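
The full analysis lives in the GitHub code referenced below; purely as a hypothetical sketch of how such a comparison could be set up (the toy data, column names, and the choice of LinearRegression are all assumptions, not the original code), it might look like this:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: car body type vs. price; a real analysis would use a full dataset.
df = pd.DataFrame({
    "body_type": ["Hatchback", "Convertible", "Sedan", "Hatchback", "Sedan", "Convertible"] * 20,
    "price": [15000, 40000, 25000, 16000, 26000, 42000] * 20,
})
y = df["price"]

# Variant 1: the categorical feature as a single label-encoded column.
X_label = LabelEncoder().fit_transform(df["body_type"]).reshape(-1, 1)

# Variant 2: the categorical feature as one-hot dummy columns.
X_onehot = pd.get_dummies(df["body_type"], dtype=int)

def rmse_for(X, y):
    # Train a simple regression model and report RMSE on a held-out split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_test)
    return np.sqrt(mean_squared_error(y_test, preds))

print("RMSE (label encoding):  ", rmse_for(X_label, y))
print("RMSE (one-hot encoding):", rmse_for(X_onehot, y))

The lower RMSE wins; with a single nominal feature like this, one-hot encoding typically avoids the artificial ordering that label encoding imposes on a linear model.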

The full source code is available on GitHub. Happy Encoding :)

