Machine Learning 10: 'Recommendation System'
Why do we care about Recommendation Systems?
The answer depends on whose perspective you take. For companies like Amazon, Spotify and Netflix, the goal is to generate more revenue and drive more engagement on their platforms, which in turn grows their marketplace. For the people using Amazon, Spotify and Netflix, it means saving time: items that match their interests, and items that other people have rated highly, show up in their suggestions so that they don't have to search for them. This is the essence of Recommendation Systems, also called Recommendation Engines.
Conceptually, Recommendation Systems (or Recommendation Engines) use two types of recommendation approach:
1. Collaborative filtering (CF)
2. Content-based filtering (CBF)
Collaborative Filtering
Collaborative filtering is one of the earliest forms of recommendation system. The earliest of these algorithms are known as neighborhood-based or memory-based algorithms. Variants that use machine learning or statistical models are referred to as model-based algorithms. The basic idea of collaborative filtering is that, given a large database of rating profiles recording what individual users rated or purchased, we can impute or predict ratings for items they have not yet rated or purchased. Those predictions form the basis of recommendation scores or top-N recommended items.
User-based collaborative filtering is a memory-based method that works under the assumption that users with similar tastes will rate items similarly. Therefore, the missing ratings for a user can be predicted by finding other similar users (a neighbourhood). Within the neighbourhood, we can aggregate the neighbours' ratings on items unknown to the user as the basis for a prediction.
An inverted approach to nearest neighbors based recommendations is item-based collaborative filtering. Instead of finding the most similar users to each individual, an algorithm assesses the similarities between the items that are correlated in their ratings or purchase profile amongst all users.
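As a rough illustration of the item-based idea, the sketch below inverts a toy ratings table into per-item vectors, measures cosine similarity between items over the users who rated both, and predicts a user's rating for an unseen item as a similarity-weighted average of that user's own ratings. The data, the function names (`item_vectors`, `cosine_sim`, `predict_ibcf`) and the choice of cosine similarity are assumptions made for this example, not something prescribed by the text above.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def item_vectors(ratings):
    """Invert user -> {item: rating} into item -> {user: rating}."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, rating in user_ratings.items():
            items.setdefault(item, {})[user] = rating
    return items

def cosine_sim(a, b):
    """Cosine similarity restricted to the keys present in both vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = sqrt(sum(a[k] ** 2 for k in common))
    norm_b = sqrt(sum(b[k] ** 2 for k in common))
    return dot / (norm_a * norm_b)

def predict_ibcf(ratings, user, item):
    """Predict a rating as a similarity-weighted average of the user's own ratings."""
    items = item_vectors(ratings)
    numerator = denominator = 0.0
    for rated_item, rating in ratings[user].items():
        sim = cosine_sim(items[item], items[rated_item])
        numerator += sim * rating
        denominator += sim
    return numerator / denominator if denominator else None

ibcf_prediction = predict_ibcf(ratings, "alice", "D")
```

Because item-item similarities depend only on the ratings matrix and not on the user being served, they can be precomputed once and reused for every user, which is one practical reason the item-based variant is often preferred.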
An additional starter article for learning more about collaborative filtering can be found at https://recommender-systems.org/collaborative-filtering/
How the UBCF algorithm works
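In outline, user-based collaborative filtering (UBCF) compares the target user with every other user, keeps the k most similar users who have rated the item in question, and combines their ratings, weighted by similarity. The following is a minimal sketch of those steps under stated assumptions: the toy ratings, the function names (`cosine_sim`, `predict_ubcf`), cosine similarity as the similarity measure, and k=2 are all illustrative choices.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_ubcf(ratings, user, item, k=2):
    """Similarity-weighted average of the k nearest neighbours' ratings for item."""
    # Step 1: similarity to every other user who has rated the item.
    neighbours = [
        (cosine_sim(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    # Step 2: keep the k most similar users (the neighbourhood).
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbours[:k]
    # Step 3: aggregate their ratings, weighted by similarity.
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total_sim

ubcf_prediction = predict_ubcf(ratings, "alice", "D")
```

In this toy data alice's ratings are closer to bob's than to carol's, so bob's rating of item D pulls the prediction towards his score. Production systems usually mean-centre ratings first to correct for users who rate systematically high or low; that refinement is omitted here for brevity.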
Strengths & Weaknesses of Neighborhood Methods
Strengths: simple to implement, and recommendations are easy to explain to the user. Being transparent about why an item was recommended can greatly increase the user's trust in the recommendation.
Weaknesses: these algorithms do not work well on very sparse ratings matrices. Additionally, they are computationally expensive, as the entire user database needs to be processed to form recommendations. They also cannot work from a cold start, since a new user has no historical profile or ratings for the algorithm to start from.
Data Requirements: a user ratings profile, containing items they’ve rated/clicked/purchased. A "rating" can be defined however it fits the business use case.
Content-based filtering (CBF)
The content-based filtering (CBF) recommender is broken into three components:
1. A model class, TFIDFModel.
2. A model provider, TFIDFModelProvider, that computes TF-IDF vectors for items.
3. A scorer/recommender class that uses the precomputed model to compute user-personalized scores for items.
TF-IDF Recommender with Unweighted Profiles
The first step is to compute the unit-normalized TF-IDF vector for each item in the data set. The resulting model is a mapping from item IDs to TF-IDF vectors, each normalized to unit length. The user's profile vector is the sum of the vectors of the items the user rated positively. The heart of the recommendation process is the score method of the item scorer, TFIDFItemScorer, which scores each item using cosine similarity: the score for an item is the cosine between that item's tag vector and the user's profile vector.
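The pipeline described above can be sketched in a few lines of Python. This is an illustration only, not the TFIDFModel / TFIDFModelProvider / TFIDFItemScorer classes themselves: the tag data, the function names (`build_tfidf`, `build_profile`, `cosine`), and the particular TF-IDF formula (raw term count times log(N / document frequency)) are assumptions chosen for the example.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical toy data: item -> list of tags (repeats count as term frequency)
item_tags = {
    "m1": ["space", "action", "space"],
    "m2": ["space", "drama"],
    "m3": ["romance", "drama"],
    "m4": ["action", "space"],
}

def build_tfidf(item_tags):
    """Return item -> unit-normalized TF-IDF vector (tag -> weight)."""
    n_items = len(item_tags)
    # Document frequency: number of items each tag appears on.
    doc_freq = Counter(tag for tags in item_tags.values() for tag in set(tags))
    model = {}
    for item, tags in item_tags.items():
        term_freq = Counter(tags)
        vec = {t: c * log(n_items / doc_freq[t]) for t, c in term_freq.items()}
        norm = sqrt(sum(w * w for w in vec.values()))
        model[item] = {t: w / norm for t, w in vec.items()} if norm else vec
    return model

def build_profile(liked_items, model):
    """Unweighted profile: sum of the vectors of positively rated items."""
    profile = {}
    for item in liked_items:
        for tag, weight in model[item].items():
            profile[tag] = profile.get(tag, 0.0) + weight
    return profile

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

model = build_tfidf(item_tags)
profile = build_profile(["m1"], model)  # the user liked m1 only
scores = {item: cosine(profile, vec) for item, vec in model.items()}
```

With a profile built from m1 alone, m4 (which shares both of m1's tags) scores highest among the unseen items, m2 (which shares only the common tag "space") scores lower, and m3 (no shared tags) scores zero. A real recommender would also drop items the user has already rated before ranking.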
Weighted User Profile
In this variant, rather than just summing the vectors for all positively-rated items, a weighted sum of the item vectors is computed for all rated items, with weights being based on the user's rating.
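A sketch of the weighted profile follows. The text says only that the weights are based on the user's rating; the specific weight used here, the rating minus the user's mean rating, is an assumption (a common choice, since it lets below-average ratings push the profile away from an item's tags). The item vectors, ratings and function names are likewise hypothetical.

```python
from math import sqrt

# Hypothetical unit-length item vectors (tag -> weight), e.g. from a TF-IDF model
item_vectors = {
    "m1": {"space": 0.6, "action": 0.8},
    "m2": {"drama": 1.0},
    "m3": {"action": 1.0},
    "m4": {"drama": 0.6, "romance": 0.8},
}

# Hypothetical ratings for one user on a 1-5 scale
user_ratings = {"m1": 5.0, "m2": 1.0}

def weighted_profile(user_ratings, item_vectors):
    """Sum item vectors over ALL rated items, weighted by (rating - user mean)."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    profile = {}
    for item, rating in user_ratings.items():
        weight = rating - mean  # assumed weighting scheme
        for tag, value in item_vectors[item].items():
            profile[tag] = profile.get(tag, 0.0) + weight * value
    return profile

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

weighted = weighted_profile(user_ratings, item_vectors)
weighted_scores = {
    item: cosine(weighted, vec)
    for item, vec in item_vectors.items()
    if item not in user_ratings  # score only unseen items
}
```

Here the user's low rating of m2 gives "drama" a negative weight in the profile, so m4 receives a negative score while m3, which shares the "action" tag with the highly rated m1, receives a positive one. The unweighted variant cannot express this kind of dislike because it ignores every item that was not rated positively.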
More Algorithms to Learn
Exercises
For this week's practice, build a recommendation system on each of these Kaggle datasets:
The Movies Dataset
Santander Product Recommendation