Real World ML: The Hashing Trick: An Elegant Solution to Dynamic Category Encoding

Have you ever encountered a situation where your ML model is not doing well in production due to encountering a category it hasn't seen before?

Or perhaps your model treats new categories the same way it treats unpopular or unknown ones, leading to suboptimal performance?

These are common problems that arise when dealing with categorical features in real-world scenarios.

In this article, we'll explore a powerful technique called the hashing trick, which can help you tackle these challenges effectively.

Problem of Changing Categories in Production

Imagine you're building a recommender system for a big e-commerce platform like Amazon.

One of the features you want to incorporate is the product brand.

But the number of brands can reach into the hundreds of thousands.

To handle this, you might consider encoding each brand as a number or using one-hot encoding.

But, in production, your model crashes when it encounters a brand it hasn't seen before.

To mitigate this, you could create a catch-all category called "UNKNOWN" to handle unseen brands.

Now, your model doesn't crash anymore, but there is a new challenge.

If your model didn't see the "UNKNOWN" category in the training set, it may not recommend any products from the "UNKNOWN" brand, leading to complaints from sellers about their new brands not receiving traffic.

You might attempt to fix this by encoding only the top 99% most popular brands and categorizing the bottom 1% as "UNKNOWN."

This way, your model can at least handle "UNKNOWN" brands.

But this solution is short-lived: soon you notice that the click-through rate on product recommendations plummets.

New brands join your site regularly: some are new luxury brands, some are sketchy knockoffs, and others are established brands.

Unfortunately, your model treats all these new brands the same way it treats unpopular brands.

This scenario is common in various domains: predicting spam comments, analyzing new product types, identifying new website domains, or handling new user accounts.

In all these cases, the challenge of dynamically handling new categories without degrading model performance arises.

The Solution: The Hashing Trick

The hashing trick offers an elegant solution to this problem.

It involves using a hash function to generate a hashed value for each category, which then becomes the index of that category.

One potential issue with hash functions is collision, where two categories are assigned the same index.

With hashing, a new brand may share an index with any existing brand, rather than always being grouped with unpopular brands the way an "UNKNOWN" catch-all does.

To mitigate the impact of collisions, you can choose a large hash space or a strong hash function such as MurmurHash.

This method can be particularly useful in continual learning settings where your model learns from incoming examples in production.

Benefits of the Hashing Trick

The hashing trick offers several benefits when dealing with categorical features in ML:

  • Handling new categories: New categories are automatically mapped to hashed indices, eliminating the need for manual intervention or retraining for every new category.
  • Reducing memory usage: Hashed indices replace one-hot vectors and the original categorical values, which significantly reduces memory usage when the number of categories is large.
  • Computational efficiency: Hashing is cheap to compute, and the resulting indices can be used directly as positions in the feature vector.
  • Flexibility in continual learning: Useful in continual learning settings, where new categories keep emerging and can be incorporated without retraining.

Understanding the Hashing Trick

The hashing trick, also known as feature hashing, is a technique used to convert categorical data into numerical features.

This method leverages hash functions to map categories into a fixed number of indices in a hash table.

How It Works

  1. Hash Function: A hash function takes an input (in this case, a category) and returns a fixed-size value. The output appears random but is deterministic: the same category always produces the same value.
  2. Hash Space: Choose a hash space, which is the size of the hash table. The larger the hash space, the fewer the collisions.
  3. Index Assignment: For each category, apply the hash function and reduce the result modulo the hash space to get an index in the hash table. This index becomes the feature index for that category, as sketched below.
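
Here is a minimal sketch of these three steps in Python, assuming the third-party mmh3 package (a MurmurHash implementation) is available; the hash space size and the category_to_index helper name are illustrative choices.

```python
# Minimal sketch of the three steps above, using the third-party `mmh3`
# package (a MurmurHash implementation): pip install mmh3
import mmh3

HASH_SPACE = 1_000_000  # Step 2: size of the hash table; larger means fewer collisions

def category_to_index(category: str) -> int:
    # Step 1: hash the category deterministically (same input -> same output)
    hashed = mmh3.hash(category)
    # Step 3: fold the hash into the hash space to get the feature index
    return hashed % HASH_SPACE

print(category_to_index("Nike"))      # the same index every time for "Nike"
print(category_to_index("NewBrand"))  # an unseen category still gets a valid index
```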

Example

Consider a scenario where you have a list of product brands: ["Nike", "Adidas", "Puma", "Reebok", "NewBrand"].

Using a hash function, you convert these brands into indices within a hash space of size 10.

For instance, if the hash function maps:

  • "Nike" to 3
  • "Adidas" to 7
  • "Puma" to 5
  • "Reebok" to 2
  • "NewBrand" to 3

In this case, "Nike" and "NewBrand" share the same index due to a collision.

Despite collisions, this approach is effective because it allows the model to handle new categories dynamically.
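
To see how such a collision can arise, here is a small sketch that maps the brand list above into a hash space of size 10 using Python's built-in hashlib; the actual indices it produces will differ from the illustrative numbers listed above.

```python
import hashlib

brands = ["Nike", "Adidas", "Puma", "Reebok", "NewBrand"]
HASH_SPACE = 10  # deliberately tiny, so collisions are likely

def brand_index(brand: str) -> int:
    # Any deterministic hash function works; MD5 is used here only for illustration
    digest = hashlib.md5(brand.encode("utf-8")).hexdigest()
    return int(digest, 16) % HASH_SPACE

# Group brands by their hashed index to spot collisions
buckets = {}
for brand in brands:
    buckets.setdefault(brand_index(brand), []).append(brand)

print(buckets)  # any bucket holding more than one brand is a collision
```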

Implementation

Let's implement the hashing trick in Python using the FeatureHasher from sklearn.
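
Here is a minimal sketch of what that could look like; the brand strings and the tiny hash space (n_features=16) are illustrative choices to keep the output readable.

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string features; here, one "brand" string per product
products = [["Nike"], ["Adidas"], ["Puma"], ["Reebok"], ["NewBrand"]]

# n_features is the size of the hash space; keep it large in practice to limit collisions
hasher = FeatureHasher(n_features=16, input_type="string")

# transform() hashes each string to an index and returns a sparse feature matrix
X = hasher.transform(products)  # scipy.sparse matrix of shape (5, 16)
print(X.toarray())  # entries may be +1 or -1, since FeatureHasher uses a signed hash by default

# A brand never seen before is hashed the same way, with no retraining or remapping needed
X_new = hasher.transform([["BrandLaunchedYesterday"]])
print(X_new.toarray())
```

In practice you would keep a much larger hash space (FeatureHasher defaults to n_features=2**20), since a larger space makes collisions rarer at the cost of a wider, but still sparse, feature matrix.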

Conclusion

The hashing trick provides a powerful and efficient solution to the challenge of dynamic category encoding in ML models.

Its ability to handle new categories without requiring extensive retraining makes it particularly valuable in production environments.

By choosing an appropriate hash function, configuring a suitable hash space, and scaling features effectively, you can leverage the hashing trick to improve your model's robustness and performance.

Incorporating this trick into your ML workflows ensures that your models remain adaptive and resilient, even as the underlying data evolves.

It is not only beneficial for recommender systems but also extends to various other applications, including spam detection, network security, and text classification.

By understanding and applying the hashing trick, you can address the limitations of traditional encoding methods and build more effective, scalable, and adaptive ML models.

In a world where data is constantly changing, the hashing trick stands out as a reliable and elegant solution to the problem of dynamic category encoding.

If you like this article, share it with others.

It would help a lot!

And feel free to follow me for more articles like this.

