Understanding Rotten Tomatoes's Weird Scoring System
Introduction
Rotten Tomatoes has a very strange way of aggregating movie scores. A movie's Rotten Tomatoes (RT) score isn't a straight average of all the review scores. Rather, the website counts a movie's positive reviews as a proportion of all its reviews. For instance, if a movie has five reviews, four of which are positive, that movie has an RT score of (4/5) = 80%, regardless of the actual scores. This can have odd results. Let's consider two hypothetical movies:
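As a rough illustration in R, here are two made-up sets of review scores; treating anything at or above 6/10 as a "positive" review is just an assumption for the sake of the example:

```r
# Two hypothetical movies, each with five reviews scored out of 10.
movie_1 <- c(6, 6, 6, 6, 6)     # all mildly positive
movie_2 <- c(10, 10, 10, 10, 4) # four raves and one pan

# Treat >= 6/10 as a "positive" review (an assumption for illustration).
rt_score  <- function(scores) mean(scores >= 6) * 100
avg_score <- function(scores) mean(scores / 10) * 100

rt_score(movie_1)   # 100% RT score
avg_score(movie_1)  # 60% average score
rt_score(movie_2)   # 80% RT score
avg_score(movie_2)  # 88% average score
```

Movie 1 ends up with a perfect RT score but only a 60% average, while movie 2 has the lower RT score despite the much higher average.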
Which of movies 1 or 2 is "better", and is the RT score actually a good way to measure consensus? I don't have an answer to these questions. Rather, for this project, I wanted to answer the following questions:
For this project, I downloaded a Rotten Tomatoes movies and reviews dataset from Kaggle, did some data cleaning in SQL, and plotted graphs using R's ggplot2 package.
Data Preparation
In order to calculate a movie's average review score, I first needed to address three key problems:
I can think of two ways to address the third problem, neither of which is ideal. One is to use a machine learning algorithm that treats the reviews that do have scores as training data and tries to impute scores onto the unscored reviews based on recurring words in each review's summary. I didn't do this because the algorithm could pick up a number of hidden biases. For example, if, by complete coincidence, reviews with higher scores were more likely to have "dog" in their summary, the algorithm would reflect that, and I would have no way of knowing. Given how many summaries there are and how many words each one contains, spurious patterns like that are very likely to occur.
To deal with reviews without a score, I instead just chose to filter them out, not including them in any calculations or other analyses. This isn't an ideal solution either. About 27% of reviews didn't have a score. For most projects, I'm very reluctant to filter out that much data. But this was the simplest solution I had available.
For each of the plots below, I calculate a movie's average score and RT score after excluding reviews with no score, so my numbers may not exactly match those on the Rotten Tomatoes website.
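The actual cleaning happened in SQL, but here is a minimal sketch of the same idea in R with dplyr, assuming a reviews data frame with hypothetical movie_title, review_score (already normalized to a 0–1 scale), and is_fresh columns:

```r
library(dplyr)

# Assumed input: one row per review, with review_score already converted
# to a 0-1 scale and is_fresh marking whether RT counted it as positive.
movie_scores <- reviews %>%
  filter(!is.na(review_score)) %>%        # drop the ~27% of reviews with no score
  group_by(movie_title) %>%
  summarise(
    n_reviews = n(),
    avg_score = mean(review_score) * 100, # average review score, as a percentage
    rt_score  = mean(is_fresh) * 100      # share of positive reviews
  ) %>%
  ungroup()
```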
Discussion
With all of those issues addressed, I read my data into R and plotted a scatter plot comparing average scores to RT scores. For this scatter plot, I wanted to focus on movies with at least 100 reviews, since those are the movies I figured people would actually care about. Additionally, while sifting through the data, I realized that Rotten Tomatoes sometimes incorrectly labels negative reviews as positive, and vice versa. Those errors can throw calculations off and create some wildly erroneous data points. Filtering for movies with a lot of reviews minimizes their impact. With all that said, I have the scatter plot below:
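A ggplot2 sketch of that scatter plot, continuing with the hypothetical movie_scores frame from above, looks something like this:

```r
library(dplyr)
library(ggplot2)

movie_scores %>%
  filter(n_reviews >= 100) %>%   # keep only widely reviewed movies
  ggplot(aes(x = avg_score, y = rt_score)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Average review score (%)",
    y = "Rotten Tomatoes score (%)",
    title = "RT score vs. average review score (movies with 100+ reviews)"
  )
```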
Average scores and RT scores are tightly correlated. However, the scatter plot does not follow a completely linear pattern. Instead, as average scores climb toward 100%, RT scores keep increasing, but more slowly, so the relationship flattens out at the top. Since the association seems weaker at higher scores, I decided to zoom in on average scores over 60%. I also divided the graph into four sections:
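One way to express a four-way split like this is a case_when over the two scores. The sketch below reads the split as a 2x2 grid on RT score and average score, which is only one plausible interpretation, and the cutoff values are placeholders rather than the exact thresholds used for the graphs:

```r
library(dplyr)

# Placeholder cutoffs; not the actual dividing lines used in the plots.
rt_cut  <- 90
avg_cut <- 80

movie_categories <- movie_scores %>%
  filter(n_reviews >= 100, avg_score > 60) %>%
  mutate(category = case_when(
    rt_score >= rt_cut & avg_score >= avg_cut ~ "Universally loved",
    rt_score >= rt_cut & avg_score <  avg_cut ~ "Universally liked",
    rt_score <  rt_cut & avg_score >= avg_cut ~ "Loved by some",
    TRUE                                      ~ "Not as well liked"
  ))
```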
Some quick observations:
This last observation may have been due to how I chose to filter the data and my choice to focus only on the highest-rated movies.
Next, I wanted to get an idea of what kinds of movies ended up in each category. The answer depends on where I placed the cutoff points: for movies close to the center, subtle changes in the cutoffs could push them into different categories. In any case, I first looked at the top ten movies in the "universally loved" category, sorted by average score. The highest-rated movie was Seven Samurai, released in 1956. In fact, nine out of the ten movies were released before 2000. The fact that Seven Samurai is number one is again probably due to how I filtered the movies and reviews; different filtering conditions would likely lead to a different result. Still, it's interesting to see the top ten list dominated by older movies.
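Pulling a top-ten list like that out of the hypothetical movie_categories frame is a short dplyr chain, sketched here:

```r
# Top ten "universally loved" movies by average score (hypothetical frame).
movie_categories %>%
  filter(category == "Universally loved") %>%
  arrange(desc(avg_score)) %>%
  slice_head(n = 10) %>%
  select(movie_title, avg_score, rt_score, n_reviews)
```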
Older movies are in fact over-represented in the "universally loved" category. While just 3% of the movies covered in this analysis were released before the year 2000, among universally loved movies, over 6% were released before 2000. These numbers are again likely shaped by how I filtered the data: I kept only movies with more than 100 reviews. Without that filter, there probably would have been a greater share of older movies in all four categories, since older movies tend to have fewer reviews, which I will show soon. Still, the fact that the top 10 consists mostly of older movies suggests that they do tend to review better than newer ones, even accounting for these biases.
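A sketch of that comparison, assuming a hypothetical release_year column has been joined onto the categorized movies:

```r
# Share of pre-2000 movies among everything in the analysis...
movie_categories %>%
  summarise(pct_pre_2000 = mean(release_year < 2000) * 100)

# ...versus among the "universally loved" movies only.
movie_categories %>%
  filter(category == "Universally loved") %>%
  summarise(pct_pre_2000 = mean(release_year < 2000) * 100)
```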
To demonstrate this, I looked at two bar graphs. The first shows the average number of reviews movies received each year. Note that for these next few bar graphs, I did not filter for movies with more than 100 reviews. There is a clear upward trend, especially around the year 2000.
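A sketch of that first bar graph, using the unfiltered per-movie summary and assuming the same hypothetical release_year column has been joined onto movie_scores:

```r
# No 100-review filter here; movie_scores is the unfiltered per-movie summary.
movie_scores %>%
  group_by(release_year) %>%
  summarise(avg_reviews = mean(n_reviews)) %>%
  ggplot(aes(x = release_year, y = avg_reviews)) +
  geom_col() +
  labs(x = "Release year", y = "Average number of reviews per movie")
```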
Next, I have a bar graph of the proportion of movies that were universally loved each year. There is a clear downward trend. These two graphs show that even though there are fewer reviews for older movies, older movies do clearly have both higher RT scores and average scores.
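And a sketch of the second bar graph, reusing the placeholder cutoffs from earlier to flag "universally loved" movies:

```r
# Share of each year's movies clearing the placeholder "universally loved"
# cutoffs; again no 100-review filter.
movie_scores %>%
  group_by(release_year) %>%
  summarise(pct_loved = mean(rt_score >= rt_cut & avg_score >= avg_cut) * 100) %>%
  ggplot(aes(x = release_year, y = pct_loved)) +
  geom_col() +
  labs(x = "Release year", y = "% of movies universally loved")
```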
Next, I will look at how the proportion of movies in each of the four categories has changed over time. For this area graph, I once again filtered for movies with more than 100 reviews.
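A sketch of the area graph, back on the hypothetical 100+ review subset (movie_categories), with tidyr filling in any missing year/category combinations:

```r
library(tidyr)

# Share of each category by release year, shown as a filled area chart.
movie_categories %>%
  count(release_year, category) %>%
  complete(release_year, category, fill = list(n = 0)) %>%
  ggplot(aes(x = release_year, y = n, fill = category)) +
  geom_area(position = "fill") +
  labs(x = "Release year", y = "Share of movies", fill = "Category")
```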
As I showed earlier, the "universally loved" category has shrunk over time. In its place, the "universally liked" and "not as well liked" categories have grown. Reviewers tend to give newer movies more negative reviews, and they tend to give newer movies lower scores even when those movies have high RT scores. Does this mean that movies have gotten worse over time? I don't think so. Instead, I think reviewers simply have a tendency to rate older movies more highly while being more critical of newer ones.
To wrap up, I wanted to see if some genres are over-represented in each category. For reference, I will show the bar graph from before again. It serves as a good baseline for how movies overall are distributed.
Most genres I checked shared a similar pattern to the one I show above. I could not find a single genre where the "loved by some" category is larger than the "universally liked" category. Overall, the "loved by some" category is the smallest, regardless of release year or genre. But I may have been able to get a more equal distribution with different cutoffs.
Different genres tend to differ by the relative size of the "universally loved" category. For instance, comedy movies are less likely to be universally loved, while the other three categories look similar to the overall distribution. In contrast, drama movies are more likely to be universally loved.
By far the highest-rated genre, though, is classics. Classics is the only genre with no movies in the "not as well liked" category.
This result lines up with the findings from earlier that older movies tend to be rated higher.
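A sketch of how these per-genre breakdowns can be tallied, assuming a hypothetical comma-separated genres column joined onto the categorized movies:

```r
# Category mix for a single genre; genres is an assumed comma-separated column.
category_mix <- function(df, genre) {
  df %>%
    filter(grepl(genre, genres, fixed = TRUE)) %>%
    count(category) %>%
    mutate(share = round(n / sum(n), 2))
}

category_mix(movie_categories, "Comedy")
category_mix(movie_categories, "Drama")
category_mix(movie_categories, "Classics")
```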
TL;DR
For this project, I was interested in seeing how movies' RT scores compare to their average review scores. It turns out that the two are tightly correlated, though they follow more of an S-shaped curve than a straight line. There are very few movies with a high RT score and a low average score, or vice versa.
Reviewers tend to rate older movies more highly, and they tend to favor certain genres, such as classics and drama, over others. Over time, even among the highest-rated movies, average scores have drifted lower even when RT scores remain similar.
How My Own Decisions Impacted the Results
Overall, the main takeaway I have from this project is how much decisions about how to filter data and handle missing or erroneous values can impact a project's findings. I hope people don't come away from this project thinking that Seven Samurai is the most critically acclaimed movie, because that result would likely change if I made slightly different decisions about filtering for movies with a certain number of reviews. I'm pretty sure the only reason the "loved by some" category is so much smaller than the others is where I placed the cutoffs.
I tried to justify all the decisions I made in filtering the data, but it's still important to keep in mind how those decisions can impact the results. A lot of these decisions ultimately come down to convenience. I ultimately decided against the machine learning idea because I don't have experience in that domain. I pointed out a potential issue with that approach, but maybe someone more knowledgeable could have resolved that issue.
Ultimately, I don't think Rotten Tomatoes's strange scoring system is a great way to measure consensus. A movie's RT score tends to tell a very similar story to its average score, but it strips out key information about how much each reviewer liked or disliked the movie. I suspect that part of the reason Rotten Tomatoes uses this scoring system is convenience: if Rotten Tomatoes tried to calculate an average score, it would run into the same issue I did with reviews that don't give a score.
Thanks for reading. I have linked my GitHub with my SQL and R code below, as well as the original dataset.
My GitHub: https://github.com/Zach-Nabavian/Rotten-Tomatoes-Movie-Score-Analysis
Original Dataset: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset