Movie Recommendation System via Machine Learning

Movie Recommendation System via Machine Learning

Note! The full version of the article and the implementation of the algorithm in R (programming language) are available on GitHub.

Introduction

A recommendation system is a technology that uses machine learning algorithms to suggest items to users based on their ratings and preferences. These systems analyze users' behavior, preferences, and rating history, as well as information about the items themselves, to propose personalized recommendations.

Movie recommendation systems are commonly used by movie streaming companies, such as Amazon and Netflix, to catch their users' tastes and suggest the corresponding movies and series to them. The main technical issue about the recommendation systems is data sparsity, indicating that among a large list of movies and users, most of the possible movie-user pairs do not exist. It means that only a small portion of users rate a specific movie. A user only rates a very small portion of movies as well.

Database Inspection

A database of 10M movie ratings has been provided by MovieLens which includes the rating from 70,000 users to 10,000 movies. The ratings range from 0.5 (The worst) to 5 (the best), with increments of 0.5. The plots below display the distribution of ratings received by movies, as well as those given by users.

No alt text provided for this image
Number of Ratings Received by Movies


No alt text provided for this image
Number of Ratings Given by Users

There are about 20 different genres that describe the content of movies. Each movie can be classified into multiple genres.

No alt text provided for this image
Number of Movies in each Genre

Methodology

Similar to any machine learning task, the available database is divided into two parts: train set (90%) and test set (10%), where the former contributes to building the predicting model while the latter is used to estimate the accuracy of the model.

A linear model can be assumed to estimate the rating value as follows.

No alt text provided for this image
Linear Model

The terms on the right-hand side of the formula respectively indicate the average rating of the database, the biased rating for a specific movie, the bias of each user, the bias related to the genres, and the error of estimation compared to the actual rating. The least-square estimation would be engaged to find the unknown coefficients in order to minimize the summation of errors.

Regularization

Regularization is a technique widely used in machine learning algorithms to prevent overfitting by penalizing large coefficients created by predicting models.

No alt text provided for this image
Regularization Formula

To apply the regularization method, a part of the train set should be separated (called the validation set). It does not contribute to building the model and is engaged to estimate the RMSE. The optimum value of the regularization parameter (lambda) will minimize the RMSE (root-mean-squared error).

Results

The results of the linear model with different options (i.e., considered terms in the modeling formula) are shown in the table and figure below.

No alt text provided for this image
RMSE of Train and Test Sets for Different Modeling Options
No alt text provided for this image
comparison of the Actual Ratings with those Resulted from Different Modeling Options

Note! The full version of the article and the implementation of the algorithm in R (programming language) are available on GitHub.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了