5 challenges you face when building cutting-edge recommender systems
Krishna Yogi Kolluru
Recommender systems play a pivotal role in shaping our digital experiences, guiding us through a sea of content on online platforms. From the allure of clickbait to the influence of popularity, these systems are not immune to biases impacting user recommendations.
In this blog, we’ll learn about the intricacies of five prevalent biases in recommender systems and explore recent research breakthroughs from industry giants like Google, YouTube, Netflix, Kuaishou, and more.
1 — Clickbait bias
The ubiquity of clickbait poses a significant challenge to recommender systems: if a model is trained using clicks as positives, it risks favouring clickbait content. Covington et al. (2016) propose weighted logistic regression to combat this. Applied to YouTube video recommendations, the technique leverages watch time to prioritize content with higher expected watch times, ultimately pushing clickbait lower in the recommendations.
Mathematically, it can be shown that such a weighted logistic regression model learns odds that approximate the expected watch time of a video: since each positive example contributes its watch time as weight, the learned odds work out to (total watch time) / (number of negatives), which is close to the expected watch time per impression when click probabilities are small. At serving time, videos are ranked by their predicted odds, placing videos with long expected watch times at the top of the recommendations and clickbait (with the lowest expected watch times) at the bottom.
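To make this concrete, here is a minimal sketch of the idea (my own illustration with scikit-learn and synthetic data; YouTube's production model is a deep neural network, not shown here). Positives are weighted by watch time, negatives get unit weight, and serving ranks by predicted odds:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per impression.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                  # impression features
clicked = rng.integers(0, 2, size=1000)         # 1 if the video was clicked
watch_time = np.where(clicked == 1, rng.exponential(120, size=1000), 0.0)

# Weighted logistic regression: positive examples are weighted by
# their watch time in seconds, negatives get unit weight.
sample_weight = np.where(clicked == 1, watch_time, 1.0)
model = LogisticRegression()
model.fit(X, clicked, sample_weight=sample_weight)

# At serving time, rank by the predicted odds p/(1-p), which
# approximate the expected watch time of each video.
p = model.predict_proba(X)[:, 1]
ranking = np.argsort(-(p / (1 - p)))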
2 — Duration bias
While weighted logistic regression addresses clickbait, it introduces a new bias of its own: duration bias, a tendency to favour long videos simply because they accumulate more watch time.
Think about a video catalogue that contains 10-second short-form videos along with 2-hour long-form videos. A watch time of 10 seconds means something completely different in the two cases: it’s a strong positive signal in the former and a weak positive (perhaps even a negative) signal in the latter. Yet, the Covington approach would not be able to distinguish between these two cases and would bias the model in favour of long-form videos (which generate longer watch times simply because they’re longer).
A solution to duration bias, proposed by Zhan et al. (2022) from Kuaishou, is quantile-based watch-time prediction.
The key idea is to bucket all videos into duration quantiles, and then bucket all watch times within a duration bucket into quantiles as well. For example, with 10 quantiles, such an assignment could look like this:
(training example 1)
video duration = 120min --> video quantile 10
watch duration = 10s --> watch quantile 1
(training example 2)
video duration = 10s --> video quantile 1
watch duration = 10s --> watch quantile 10
...
By translating all time intervals into quantiles, the model learns that 10s is “high” in the latter example but “low” in the former, or so the authors hypothesize. At training time, we provide the model with the video's duration quantile and task it with predicting the watch quantile. At inference time, we simply rank all videos by their predicted watch quantiles, which are now de-confounded from the video duration itself.
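As a rough sketch of this bucketing step (my own illustration with pandas and synthetic data; the paper's exact binning may differ), the two quantile assignments could be computed like this:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical interaction log: durations from 10s shorts to 2h movies,
# with watch time at most the video duration.
df = pd.DataFrame({"duration": rng.uniform(10, 7200, size=10_000)})
df["watch_time"] = df["duration"] * rng.beta(2, 5, size=len(df))

# Step 1: bucket videos into 10 duration quantiles.
df["duration_q"] = pd.qcut(df["duration"], q=10, labels=False)

# Step 2: within each duration bucket, bucket watch times into 10
# quantiles; this is the label the model is trained to predict.
df["watch_q"] = df.groupby("duration_q")["watch_time"].transform(
    lambda s: pd.qcut(s, q=10, labels=False, duplicates="drop")
)

# A 10s watch now lands in a low quantile for long videos and a high
# quantile for short ones, de-confounding watch time from duration.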
And indeed, this approach appears to work: in online A/B tests, the authors report a 0.5% improvement over the weighted logistic regression baseline.
The results show that removing duration bias can be a powerful approach on platforms that serve both long-form and short-form videos. Perhaps counter-intuitively, removing bias in favour of long videos improves overall user watch times.
3 — Position bias
Position bias occurs when higher-ranked items garner more engagement solely because of their position, not their content quality. Techniques like rank randomization (serving a small slice of traffic with randomly shuffled results) and intervention harvesting offer remedies.
Particularly problematic is that position bias will always make our models look better on paper than they actually are. Our models may be slowly degrading in quality, but we wouldn't know until it's too late (and users have churned away). It is therefore important, when working with recommender systems, to monitor multiple quality metrics, including metrics that quantify user retention and the diversity of recommendations.
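As an illustration of the rank-randomization recipe (a sketch under the assumption that a small slice of traffic serves randomly shuffled results; all names and numbers are made up), one can estimate per-position examination propensities and use them as inverse-propensity weights:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical click log from randomized traffic: since results were
# shuffled, CTR differences across positions reflect position alone.
log = pd.DataFrame({"position": rng.integers(1, 11, size=50_000)})
log["clicked"] = rng.random(len(log)) < 0.3 / log["position"]

# Examination propensity per position, normalized to position 1.
propensity = log.groupby("position")["clicked"].mean()
propensity = propensity / propensity.max()

# On regular traffic, weight each click by the inverse propensity of
# the position it was shown at, so deep-position clicks count for more.
ips_label = log["clicked"] / log["position"].map(propensity)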
4 — Popularity bias
Popularity bias refers to the tendency of a model to give higher rankings to items that are more popular overall (because they've been rated by more users), rather than basing rankings on their actual quality or relevance for a particular user. This can lead to a distorted ranking, where less popular or niche items that could be a better fit for the user's preferences are not given adequate consideration.
One remedy is to correct the predicted logit for each user/video pair by subtracting the log of the video's popularity:
logit(u,v) <-- logit(u,v) - log(P(v))
where P(v) is the popularity of video v, i.e. its overall probability of being watched in the training data.
Of course, the right-hand side is equivalent to:
log[ odds(u,v)/P(v) ]
In other words, the predicted odds for a user/video pair are simply normalized by the video's popularity: extremely high odds from popular videos count as much as moderately high odds from not-so-popular videos. And that's the entire magic.
Indeed, the magic appears to work: in online A/B tests, the authors find a 0.37% improvement in overall user engagement with the de-biased ranking model.
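As a minimal sketch of the correction itself (function name and the serving-time placement are my assumptions; in practice the adjustment is typically applied during training), it is essentially a one-liner:

import numpy as np

def debias_logits(logits: np.ndarray, video_counts: np.ndarray) -> np.ndarray:
    # Subtract log-popularity from raw logits, i.e. divide the odds by P(v).
    p_v = video_counts / video_counts.sum()  # empirical popularity P(v)
    return logits - np.log(p_v)

# Three candidate videos; the first is hugely popular.
logits = np.array([4.0, 2.5, 2.0])
counts = np.array([1_000_000, 5_000, 100])
ranking = np.argsort(-debias_logits(logits, counts))  # niche videos can now win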
5 — Single-interest bias
Suppose you watch mostly drama movies, but sometimes you like to watch a comedy, and from time to time a documentary. You have multiple interests, yet a ranking model trained to maximize your watch time may over-emphasize drama movies because that’s what you’re most likely to engage with. This is single-interest bias, the failure of a model to understand that users inherently have multiple interests and preferences.
To remove single-interest bias, a ranking model needs to be calibrated. Calibration simply means that, if you watch drama movies 80% of the time, then the model’s top 100 recommendations should include around 80 drama movies (and not 100).
Netflix’s Harald Steck (2018) demonstrates the benefits of model calibration with a simple post-processing re-ranking step that trades off relevance against calibration, which he quantifies with KL divergence scores between the user's historical interest distribution and that of the recommendations. The resulting movie recommendations are more diverse (in fact, as diverse as the actual user preferences) and result in improved overall watch times.
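A sketch of what such a calibrated re-ranker could look like (a greedy trade-off between relevance and a KL-based calibration penalty; the exact objective, weighting, and function names are my assumptions, not Netflix's published code):

import numpy as np

def kl(p, q, eps=1e-8):
    # KL divergence between two (possibly unnormalized) distributions.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def calibrated_rerank(scores, genres, target_dist, k=100, lam=0.5):
    # Greedily pick k items, trading off relevance (scores) against how
    # closely the genre mix of the list matches the user's historical
    # genre distribution (target_dist).
    chosen, counts = [], np.zeros(len(target_dist))
    candidates = set(range(len(scores)))
    for _ in range(min(k, len(scores))):
        best, best_obj = None, -np.inf
        for i in candidates:
            trial = counts.copy()
            trial[genres[i]] += 1
            obj = (1 - lam) * scores[i] - lam * kl(target_dist, trial)
            if obj > best_obj:
                best, best_obj = i, obj
        chosen.append(best)
        counts[genres[best]] += 1
        candidates.remove(best)
    return chosen

# E.g. for a user who watches 80% drama (genre 0) and 20% comedy (genre 1):
# calibrated_rerank(scores, genres, target_dist=[0.8, 0.2], k=100)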
Takeaways
1. Clickbait Challenge: Addressed by weighted logistic regression, prioritizing content with higher expected watch times over sensational clickbait.
2. Duration Dilemma: Quantile-based watch-time prediction mitigates bias toward longer videos, showing a 0.5% improvement over previous methods.
3. Position Pitfall: Techniques like rank randomization counter position bias, ensuring recommendations reflect user preferences, not just rank.
4. Popularity Predicament: Normalizing predicted odds by video popularity combats popularity bias, leading to a 0.37% improvement in overall user engagement.
5. Diverse User Preferences: Calibrated re-ranking acknowledges users’ multiple interests, resulting in more diverse and satisfying content suggestions.