Building the World’s Greatest Recommender System Part 21: Caching to Avoid Repeated Work
Every time we use a 12-trillion-parameter deep learning recommendation model (DLRM) to match users with recommended content, we should ask ourselves, “how can we avoid doing this again?” If a user just left their feed to check an email, do we really need to run the entire multi-stage ranking process when they come back, making requests to multiple machine learning models such as a Two Towers neural network retrieval model and a Multi-Task Multi-Label ranking model, all over again?
The short answer is “No!” Now, let’s understand how we avoid this repeated work through caching. Caching is a well-established technique for improving system performance by storing and reusing previously computed results.
In recommendation systems, a standard cache saves an ordered list of recommended items for a user. When the user returns, the system can serve these cached recommendations instead of generating an entirely new ordered list, reducing computational load and latency. Under the hood, this standard approach requires managing cache invalidation, discarding cached data once it has become too old (stale). It also requires managing cache consistency, in which the nodes of a distributed cache must all be updated to reflect the agreed-upon data in the “source of truth,” usually a database. While perhaps intriguing to some, the challenges of distributed caching are not unique to real-time machine learning systems and are generally abstracted (hidden) away from machine learning by most common caches (e.g., Redis or Meta’s TAO).
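For concreteness, here is a minimal sketch of the standard approach, assuming Redis as the cache; the recs:{user_id} key scheme and the 15-minute TTL are made-up placeholders standing in for a real invalidation policy:

```python
import json
import redis  # assumes a running Redis instance; any key-value store with TTLs works similarly

r = redis.Redis(host="localhost", port=6379)

CACHE_TTL_SECONDS = 15 * 60  # hypothetical staleness budget


def cache_recommendations(user_id: str, ranked_item_ids: list[str]) -> None:
    # Store the ordered list under a per-user key; Redis expires it automatically after the TTL.
    r.setex(f"recs:{user_id}", CACHE_TTL_SECONDS, json.dumps(ranked_item_ids))


def get_cached_recommendations(user_id: str) -> list[str] | None:
    # Returns None on a cache miss (or after TTL-based invalidation), signalling a full re-rank.
    raw = r.get(f"recs:{user_id}")
    return json.loads(raw) if raw is not None else None
```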
However, for recommender systems, traditional caching has its drawbacks. The primary issue is the staleness of cached data, which can hurt user engagement. When cached recommendations become outdated, they may no longer align with the user's current interests, leading to decreased user satisfaction. Case in point: a user may have been happy with Halloween ads before and during Halloween, but if a cache serves them Halloween ads after Halloween ends, they may be very dissatisfied.
To address this issue, we can take advantage of a smart caching system. A smart cache would store not only the items but also their ranking scores, and it would use a lightweight adjuster model that refreshes cached ranking scores before they are served, ensuring that recommendations remain relevant.
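One simple way to represent a smart-cache entry is to keep the score and its timestamp alongside the item, so the age of each score is always available to the adjuster; the field names below are purely illustrative:

```python
import time
from dataclasses import dataclass, field


@dataclass
class CachedRecommendation:
    user_id: str
    item_id: str
    stale_score: float  # ranking score produced by the full ranking model at cache time
    scored_at: float = field(default_factory=time.time)  # when the score was computed

    def age_seconds(self, now: float | None = None) -> float:
        # Time elapsed since scoring; one of the adjuster model's inputs.
        return (now if now is not None else time.time()) - self.scored_at
```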
Delving into the adjuster model: it would predict the fresh ranking score for a cached item from the stale score, the time elapsed since the score was cached, and standard model features. It would need to be significantly lighter and faster than the full ranking model, perhaps using gradient boosted regression trees (or potentially a lightweight neural network), allowing for substantial reductions in computational cost and latency without sacrificing recommendation quality.
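A rough sketch of such an adjuster using scikit-learn’s gradient boosted trees is shown below; the eight-column feature layout, the random placeholder training data, and the toy “relevance decays with age” labels are all assumptions standing in for real logged data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row describes one cached item: [stale_score, seconds_since_scored, ...standard features].
# Placeholder data for illustration; in practice the labels are fresh scores from the full model.
rng = np.random.default_rng(0)
X_train = rng.random((10_000, 8))
y_train = X_train[:, 0] * np.exp(-X_train[:, 1])  # toy label: relevance decays with age

adjuster = GradientBoostingRegressor(n_estimators=200, max_depth=4, learning_rate=0.05)
adjuster.fit(X_train, y_train)


def refresh_score(stale_score: float, age_seconds: float, other_features: list[float]) -> float:
    # Predict what the full ranking model would score this item now, at a fraction of the cost.
    # other_features must supply the remaining 6 columns used at training time.
    row = np.array([[stale_score, age_seconds, *other_features]])
    return float(adjuster.predict(row)[0])
```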
To keep the adjuster model’s outputs relevant, we could also enlist the full model’s help by training the adjuster with knowledge distillation. Knowledge distillation is the process of transferring knowledge from a large model to a smaller, more efficient one. This technique promises the best of both worlds: maintaining high accuracy while reducing compute cost, and therefore latency, since we would not want to wait for another heavyweight model to run on the cache contents. We should note that the smaller model used at inference time will never be as “expressive,” or capable, as the larger model, no matter how much distillation we do. Nevertheless, by periodically sending a subset of cached items to the primary model for fresh scoring, the system can collect the data needed to train the adjuster to greater accuracy.
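A sketch of that sampling loop might look like the following, where full_ranker.score is a stand-in for the production ranking model’s scoring interface and the entries are shaped like the cache-entry sketch above:

```python
import random


def collect_distillation_samples(cache_entries, full_ranker, sample_rate: float = 0.01):
    """Re-score a small random subset of cached items with the full (teacher) model
    so the adjuster (student) has fresh labels to train on."""
    samples = []
    for entry in cache_entries:  # entries shaped like the CachedRecommendation sketch above
        if random.random() < sample_rate:
            fresh_score = full_ranker.score(entry.user_id, entry.item_id)  # hypothetical teacher API
            samples.append({
                "stale_score": entry.stale_score,
                "age_seconds": entry.age_seconds(),
                "label": fresh_score,  # the value the adjuster learns to predict
            })
    return samples
```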
Smart caching not only addresses immediate performance needs but also offers flexibility to adapt to varying system conditions. By tuning parameters such as the maximum time of validity for cached data, the system can dynamically balance capacity and latency against accuracy and, therefore, user satisfaction. An innovative approach such as smart caching thus represents a significant advancement in the optimization of recommendation systems. By integrating a smart caching system with an adjuster model, we gain immediate compute and latency improvements, and we may also pave the way for integrating more sophisticated models and features in the future, with latency constraints lifted.
If you benefited from this post, please share so it can help others.