Movie Recommendation System via Machine Learning
Mahdi Karami
Looking for Opportunities in Numerical Simulations, Software Development, ML, and Data Science
Note! The full version of the article and the implementation of the algorithm in R (programming language) are available on GitHub.
Introduction
A recommendation system is a technology that uses machine learning algorithms to suggest items to users based on their ratings and preferences. These systems analyze users' behavior, preferences, and rating history, as well as information about the items themselves, to propose personalized recommendations.
Movie recommendation systems are commonly used by movie streaming companies, such as Amazon and Netflix, to catch their users' tastes and suggest the corresponding movies and series to them. The main technical issue about the recommendation systems is data sparsity, indicating that among a large list of movies and users, most of the possible movie-user pairs do not exist. It means that only a small portion of users rate a specific movie. A user only rates a very small portion of movies as well.
Database Inspection
A database of 10M movie ratings has been provided by MovieLens which includes the rating from 70,000 users to 10,000 movies. The ratings range from 0.5 (The worst) to 5 (the best), with increments of 0.5. The plots below display the distribution of ratings received by movies, as well as those given by users.
There are about 20 different genres that describe the content of movies. Each movie can be classified into multiple genres.
Methodology
Similar to any machine learning task, the available database is divided into two parts: train set (90%) and test set (10%), where the former contributes to building the predicting model while the latter is used to estimate the accuracy of the model.
领英推荐
A linear model can be assumed to estimate the rating value as follows.
The terms on the right-hand side of the formula respectively indicate the average rating of the database, the biased rating for a specific movie, the bias of each user, the bias related to the genres, and the error of estimation compared to the actual rating. The least-square estimation would be engaged to find the unknown coefficients in order to minimize the summation of errors.
Regularization
Regularization is a technique widely used in machine learning algorithms to prevent overfitting by penalizing large coefficients created by predicting models.
To apply the regularization method, a part of the train set should be separated (called the validation set). It does not contribute to building the model and is engaged to estimate the RMSE. The optimum value of the regularization parameter (lambda) will minimize the RMSE (root-mean-squared error).
Results
The results of the linear model with different options (i.e., considered terms in the modeling formula) are shown in the table and figure below.
Note! The full version of the article and the implementation of the algorithm in R (programming language) are available on GitHub.