Understanding Rotten Tomatoes's Weird Scoring System
Introduction
Rotten Tomatoes has a very strange way of aggregating movie scores. A movie's Rotten Tomatoes (RT) score isn't a straight average of all the review scores. Rather, the website counts a movie's positive reviews as a proportion of all its reviews. For instance, if a movie has five reviews, four of which are positive, that movie has an RT score of (4/5) = 80%, regardless of the actual scores. This can have odd results. Let's consider two hypothetical movies:
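As a rough illustration in R, here are two made-up sets of review scores; treating anything at or above 6/10 as a "positive" review is just an assumption for the sake of the example:

```r
# Two hypothetical movies, each with five reviews scored out of 10.
movie_1 <- c(6, 6, 6, 6, 6)     # all mildly positive
movie_2 <- c(10, 10, 10, 10, 4) # four raves and one pan

# Treat >= 6/10 as a "positive" review (an assumption for illustration).
rt_score  <- function(scores) mean(scores >= 6) * 100
avg_score <- function(scores) mean(scores / 10) * 100

rt_score(movie_1)   # 100% RT score
avg_score(movie_1)  # 60% average score
rt_score(movie_2)   # 80% RT score
avg_score(movie_2)  # 88% average score
```

Movie 1 ends up with a perfect RT score but only a 60% average, while movie 2 has the lower RT score despite the much higher average.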
Which of movies 1 or 2 is "better", and is the RT score actually a good way to measure consensus? I don't have an answer to these questions. Rather, for this project, I wanted to answer the following questions:
For this project, I downloaded a Rotten Tomatoes movies and reviews dataset from Kaggle, did some data cleaning in SQL, and plotted graphs using R's ggplot2 package.
Data Preparation
In order to calculate a movie's average review score, I first needed to address three key problems:
I can think of two ways to address the third problem, neither of which is ideal. One is to use a machine learning algorithm that treats the reviews that do have scores as training data and tries to impute scores onto the unscored reviews based on recurring words in each review's summary. I didn't do this because the algorithm could pick up a number of hidden biases. For example, if, by complete coincidence, reviews with higher scores were more likely to have "dog" in their summary, the algorithm would reflect that, and I would have no way of knowing. Given how many summaries there are and how many words each one contains, spurious patterns like that are very likely to occur.
To deal with reviews without a score, I instead just chose to filter them out, not including them in any calculations or other analyses. This isn't an ideal solution either. About 27% of reviews didn't have a score. For most projects, I'm very reluctant to filter out that much data. But this was the simplest solution I had available.
For each of the plots below, I calculate a movie's average score and RT score after excluding reviews with no score, so my numbers may not exactly match those on the Rotten Tomatoes website.
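The actual cleaning happened in SQL, but here is a minimal sketch of the same idea in R with dplyr, assuming a reviews data frame with hypothetical movie_title, review_score (already normalized to a 0–1 scale), and is_fresh columns:

```r
library(dplyr)

# Assumed input: one row per review, with review_score already converted
# to a 0-1 scale and is_fresh marking whether RT counted it as positive.
movie_scores <- reviews %>%
  filter(!is.na(review_score)) %>%        # drop the ~27% of reviews with no score
  group_by(movie_title) %>%
  summarise(
    n_reviews = n(),
    avg_score = mean(review_score) * 100, # average review score, as a percentage
    rt_score  = mean(is_fresh) * 100      # share of positive reviews
  ) %>%
  ungroup()
```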
Discussion
With all of those issues addressed, I read my data into R and plotted a scatter plot comparing average scores to RT scores. For this scatter plot, I wanted to focus on movies with at least 100 reviews, since those are the movies I figured people would actually care about. Additionally, while sifting through the data, I realized that Rotten Tomatoes sometimes incorrectly labels negative reviews as positive, and vice versa. Those errors can throw calculations off and create some wildly erroneous data points. Filtering for movies with a lot of reviews minimizes their impact. With all that said, I have the scatter plot below:
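A ggplot2 sketch of that scatter plot, continuing with the hypothetical movie_scores frame from above, looks something like this:

```r
library(dplyr)
library(ggplot2)

movie_scores %>%
  filter(n_reviews >= 100) %>%   # keep only widely reviewed movies
  ggplot(aes(x = avg_score, y = rt_score)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Average review score (%)",
    y = "Rotten Tomatoes score (%)",
    title = "RT score vs. average review score (movies with 100+ reviews)"
  )
```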
Average scores and RT scores are tightly correlated. However, the scatter plot does not follow a completely linear pattern. Instead, as average scores climb toward 100%, RT scores keep increasing, but more slowly, so the relationship flattens out at the top. Since the association seems weaker at higher scores, I decided to zoom in on average scores over 60%. I also divided the graph into four sections:
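One way to express a four-way split like this is a case_when over the two scores. The sketch below reads the split as a 2x2 grid on RT score and average score, which is only one plausible interpretation, and the cutoff values are placeholders rather than the exact thresholds used for the graphs:

```r
library(dplyr)

# Placeholder cutoffs; not the actual dividing lines used in the plots.
rt_cut  <- 90
avg_cut <- 80

movie_categories <- movie_scores %>%
  filter(n_reviews >= 100, avg_score > 60) %>%
  mutate(category = case_when(
    rt_score >= rt_cut & avg_score >= avg_cut ~ "Universally loved",
    rt_score >= rt_cut & avg_score <  avg_cut ~ "Universally liked",
    rt_score <  rt_cut & avg_score >= avg_cut ~ "Loved by some",
    TRUE                                      ~ "Not as well liked"
  ))
```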
Some quick observations:
This last observation may have been due to how I chose to filter the data and my choice to focus only on the highest-rated movies.
Next, I wanted to get an idea of what kinds of movies ended up in each category. The answer depends on where I placed the cutoff points: for movies close to the center, subtle changes in the cutoffs could push them into different categories. In any case, I first looked at the top ten movies in the "universally loved" category, sorted by average score. The highest-rated movie was Seven Samurai, released in 1956. In fact, nine out of the ten movies were released before 2000. The fact that Seven Samurai is number one is again probably due to how I filtered the movies and reviews; different filtering conditions would likely lead to a different result. Still, it's interesting to see the top ten list dominated by older movies.
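Pulling a top-ten list like that out of the hypothetical movie_categories frame is a short dplyr chain, sketched here:

```r
# Top ten "universally loved" movies by average score (hypothetical frame).
movie_categories %>%
  filter(category == "Universally loved") %>%
  arrange(desc(avg_score)) %>%
  slice_head(n = 10) %>%
  select(movie_title, avg_score, rt_score, n_reviews)
```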
Older movies are in fact over-represented in the "universally loved" category. While just 3% of the movies covered in this analysis were released before the year 2000, among universally loved movies, over 6% were released before 2000. These numbers are again likely shaped by how I filtered the data: I kept only movies with more than 100 reviews. Without that filter, there probably would have been a greater share of older movies in all four categories, since older movies tend to have fewer reviews, which I will show soon. Still, the fact that the top 10 consists mostly of older movies suggests that they do tend to review better than newer ones, even accounting for these biases.
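A sketch of that comparison, assuming a hypothetical release_year column has been joined onto the categorized movies:

```r
# Share of pre-2000 movies among everything in the analysis...
movie_categories %>%
  summarise(pct_pre_2000 = mean(release_year < 2000) * 100)

# ...versus among the "universally loved" movies only.
movie_categories %>%
  filter(category == "Universally loved") %>%
  summarise(pct_pre_2000 = mean(release_year < 2000) * 100)
```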
To demonstrate this, I looked at two bar graphs. The first shows the average number of reviews movies received each year. Note that for these next few bar graphs, I did not filter for movies with more than 100 reviews. There is a clear upward trend, especially around the year 2000.
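A sketch of that first bar graph, using the unfiltered per-movie summary and assuming the same hypothetical release_year column has been joined onto movie_scores:

```r
# No 100-review filter here; movie_scores is the unfiltered per-movie summary.
movie_scores %>%
  group_by(release_year) %>%
  summarise(avg_reviews = mean(n_reviews)) %>%
  ggplot(aes(x = release_year, y = avg_reviews)) +
  geom_col() +
  labs(x = "Release year", y = "Average number of reviews per movie")
```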
Next, I have a bar graph of the proportion of movies that were universally loved each year. There is a clear downward trend. These two graphs show that even though there are fewer reviews for older movies, older movies do clearly have both higher RT scores and average scores.
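And a sketch of the second bar graph, reusing the placeholder cutoffs from earlier to flag "universally loved" movies:

```r
# Share of each year's movies clearing the placeholder "universally loved"
# cutoffs; again no 100-review filter.
movie_scores %>%
  group_by(release_year) %>%
  summarise(pct_loved = mean(rt_score >= rt_cut & avg_score >= avg_cut) * 100) %>%
  ggplot(aes(x = release_year, y = pct_loved)) +
  geom_col() +
  labs(x = "Release year", y = "% of movies universally loved")
```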
Next, I will look at how the proportion of movies in each of the four categories has changed over time. For this area graph, I once again filtered for movies with more than 100 reviews.
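A sketch of the area graph, back on the hypothetical 100+ review subset (movie_categories), with tidyr filling in any missing year/category combinations:

```r
library(tidyr)

# Share of each category by release year, shown as a filled area chart.
movie_categories %>%
  count(release_year, category) %>%
  complete(release_year, category, fill = list(n = 0)) %>%
  ggplot(aes(x = release_year, y = n, fill = category)) +
  geom_area(position = "fill") +
  labs(x = "Release year", y = "Share of movies", fill = "Category")
```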
As I showed earlier, the "universally loved" category has shrunk over time. In its place, the "universally liked" and "not as well liked" categories have grown. Reviewers tend to give newer movies more negative reviews, and they tend to give newer movies lower scores even when those movies have high RT scores. Does this mean that movies have gotten worse over time? I don't think so. Instead, I think reviewers simply have a tendency to rate older movies more highly while being more critical of newer ones.
To wrap up, I wanted to see if some genres are over-represented in each category. For reference, I will show the bar graph from before again. It serves as a good baseline for how movies overall are distributed.
Most genres I checked shared a similar pattern to the one I show above. I could not find a single genre where the "loved by some" category is larger than the "universally liked" category. Overall, the "loved by some" category is the smallest, regardless of release year or genre. But I may have been able to get a more equal distribution with different cutoffs.
Different genres tend to differ by the relative size of the "universally loved" category. For instance, comedy movies are less likely to be universally loved, while the other three categories look similar to the overall distribution. In contrast, drama movies are more likely to be universally loved.
By far the highest-rated genre, though, is classics. Classics is the only genre with no movies in the "not as well liked" category.
This result lines up with the findings from earlier that older movies tend to be rated higher.
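A sketch of how these per-genre breakdowns can be tallied, assuming a hypothetical comma-separated genres column joined onto the categorized movies:

```r
# Category mix for a single genre; genres is an assumed comma-separated column.
category_mix <- function(df, genre) {
  df %>%
    filter(grepl(genre, genres, fixed = TRUE)) %>%
    count(category) %>%
    mutate(share = round(n / sum(n), 2))
}

category_mix(movie_categories, "Comedy")
category_mix(movie_categories, "Drama")
category_mix(movie_categories, "Classics")
```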
TL;DR
For this project, I was interested in seeing how movies' RT scores compare to their average review scores. It turns out that the two are tightly correlated, though they follow more of an S-shaped curve than a straight line. There are very few movies with a high RT score and a low average score, or vice versa.
Reviewers tend to rate older movies more highly, and they tend to favor certain genres, such as classics and drama, over others. Over time, even among the highest-rated movies, average scores have drifted lower even when RT scores remain similar.
How My Own Decisions Impacted the Results
Overall, the main takeaway I have from this project is how much decisions about how to filter data and handle missing or erroneous values can impact a project's findings. I hope people don't come away from this project thinking that Seven Samurai is the most critically acclaimed movie, because that result would likely change if I made slightly different decisions about filtering for movies with a certain number of reviews. I'm pretty sure the only reason the "loved by some" category is so much smaller than the others is where I placed the cutoffs.
I tried to justify all the decisions I made in filtering the data, but it's still important to keep in mind how those decisions can impact the results. A lot of these decisions ultimately come down to convenience. I ultimately decided against the machine learning idea because I don't have experience in that domain. I pointed out a potential issue with that approach, but maybe someone more knowledgeable could have resolved that issue.
Ultimately, I don't think Rotten Tomatoes's strange scoring system is a great way to measure consensus. A movie's RT score tends to tell a very similar story to its average score, but it strips out key information about how much each reviewer liked or disliked the movie. I suspect that part of the reason Rotten Tomatoes uses this scoring system is convenience: if Rotten Tomatoes tried to calculate an average score, it would run into the same issue I did with reviews that don't give a score.
Thanks for reading. I have linked my GitHub with my SQL and R code below, as well as the original dataset.
My GitHub: https://github.com/Zach-Nabavian/Rotten-Tomatoes-Movie-Score-Analysis
Original Dataset: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset