
Using Multiple Regression To Examine What Variables Are Most Correlated With A Movie’s Box Office Success

Austin Wolff

Market Analyst @ BiggerPockets

Published: October 14, 2021


Tools Used: Python, Jupyter Notebook, Pandas, Dask, SciPy, Matplotlib, Seaborn, Scikit-learn, DataPrep

When asked how movie executives know whether a film will make money, Hollywood legend William Goldman once said,

“Nobody knows anything.”

But surely there are independent variables that are more correlated with box office than others. I was personally motivated to find this out: I’m a filmmaker too, and I’d much rather make money than lose money.

So my goal was to analyze the Pearson Correlation Coefficients to determine just how correlated each independent variable is with a movie’s gross revenue, in hopes of one day replicating the box office success of movies I love.

Getting The Data

The most comprehensive movie database on the internet is probably the aptly named Internet Movie Database, or IMDb.com. IMDb offers a relational dataset as downloadable zip files on its website, updated every day.

The files were millions of rows long, so in order to wrangle the data in Jupyter, I imported Dask to use parallel processing. I merged the “relational dataframes” based on their primary keys, then I double-checked a few rows.

# Testing to see if the DataFrame also holds my own IMDb credits!

imdb_all_data.loc[imdb_all_data["primaryName"] == "Austin James Wolff"].compute()

After I merged the tables, I converted the Dask DataFrame back into Pandas to continue the analysis.
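As a rough illustration of what merging on keys means here, the sketch below joins three tiny, made-up tables on `tconst` (IMDb's title identifier) and `nconst` (its person identifier), the key columns used in IMDb's public dataset files. The rows and the `join_credits` helper are hypothetical, and plain Python lists stand in for the Dask DataFrames used in the actual notebook.

```python
# Toy stand-ins for three IMDb tables. All rows are made up for illustration.
titles = [
    {"tconst": "tt0000001", "primaryTitle": "Example Movie A"},
    {"tconst": "tt0000002", "primaryTitle": "Example Movie B"},
]
names = [
    {"nconst": "nm0000001", "primaryName": "Example Actor"},
    {"nconst": "nm0000002", "primaryName": "Example Director"},
]
# The "principals" table links people to titles through both keys.
principals = [
    {"tconst": "tt0000001", "nconst": "nm0000001", "category": "actor"},
    {"tconst": "tt0000001", "nconst": "nm0000002", "category": "director"},
    {"tconst": "tt0000002", "nconst": "nm0000001", "category": "actor"},
]


def join_credits(titles, names, principals):
    """Inner-join the three tables: one output row per (title, person) credit."""
    title_by_key = {row["tconst"]: row for row in titles}
    name_by_key = {row["nconst"]: row for row in names}
    joined = []
    for credit in principals:
        title = title_by_key.get(credit["tconst"])
        person = name_by_key.get(credit["nconst"])
        if title is None or person is None:
            continue  # inner join: skip credits with a missing key on either side
        joined.append({**credit, **title, **person})
    return joined


all_data = join_credits(titles, names, principals)

# Same kind of spot check as above: pull every credit for one person.
actor_rows = [row for row in all_data if row["primaryName"] == "Example Actor"]
print([row["primaryTitle"] for row in actor_rows])
```

Building a dictionary per key column keeps each lookup cheap, which is the same idea a DataFrame merge relies on when it matches rows by key.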

Limitations To The Dataset

IMDb’s dataset, while massive, was still missing many variables I needed, such as Rating, Score, and Budget. So I pulled a public dataset from Kaggle which contained the variables I needed, but only for movies from the years 1980–2020.

Data Wrangling and Feature Engineering

First, I removed all non-US movies. Then I changed the type of the release date column to a datetime with Pandas and some help from regular expressions (this column was quite dirty). Then I created a column for “profit” (gross revenue - budget).
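A minimal sketch of those wrangling steps follows. It assumes the release-date strings look like `June 13, 1980 (United States)`, and the field names `released`, `country`, `gross`, and `budget` are assumptions too. The actual notebook used Pandas; this version uses only the standard library so the logic is easy to follow, and every row is invented.

```python
import re
from datetime import datetime

# Matches "Month D, YYYY" at the start of the string, ignoring any trailing "(Country)".
DATE_PATTERN = re.compile(r"^([A-Za-z]+ \d{1,2}, \d{4})")


def parse_release_date(raw):
    """Return a datetime for strings like 'June 13, 1980 (United States)', else None."""
    match = DATE_PATTERN.match(raw.strip())
    if match is None:
        return None  # unparseable entries are left for manual review
    # %B expects English month names under the default C/English locale.
    return datetime.strptime(match.group(1), "%B %d, %Y")


# Invented rows, with dollar amounts chosen only to make the arithmetic obvious.
movies = [
    {"name": "Film A", "country": "United States",
     "released": "June 13, 1980 (United States)", "gross": 50_000_000, "budget": 20_000_000},
    {"name": "Film B", "country": "France",
     "released": "May 2, 1995 (France)", "gross": 8_000_000, "budget": 10_000_000},
    {"name": "Film C", "country": "United States",
     "released": "1987 (United States)", "gross": 5_000_000, "budget": 9_000_000},
]

# Step 1: keep US movies only.
us_movies = [m for m in movies if m["country"] == "United States"]

# Steps 2 and 3: parse the release date and derive profit = gross - budget.
for movie in us_movies:
    movie["release_date"] = parse_release_date(movie["released"])
    movie["profit"] = movie["gross"] - movie["budget"]

for movie in us_movies:
    print(movie["name"], movie["release_date"], movie["profit"])
```

Returning `None` for dates that do not match the pattern, rather than guessing, makes the dirty rows easy to count and inspect afterwards.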

Now came the interesting part. It is common knowledge in the industry that your movie is more likely to make more money if you have “star power,” a qualitative measure of the popularity of your top actors. This can also apply, to some extent, to the popularity of the director (think Christopher Nolan or Quentin Tarantino).

I wanted to quantify this “bankability.” So I adapted an algorithm from my colleague Ravi Gupta and made some slight modifications[1]. (You can find his algorithm here. His entire post is quite brilliant and I highly recommend it!)

I created a new DataFrame with just actors and directors, and mapped each person to the top 4 movies they were known for, along with each movie’s gross revenue and profit. Then I took the average of the gross and profit for each actor and director.

For each movie, I added the Average Gross of the top-billed actor to the Average Gross of the director to get “Total Gross Bankability.”

I chose to add the actor’s and the director’s mean bankability together to account for people wanting to see a movie because of a particular actor and not necessarily the director, and vice versa.

I repeated the same process for profit. Now we have two new “bankability” coefficients for each movie we can use in our data analysis to test for correlation.
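Here is a minimal sketch of that calculation. It is one reading of the steps described above, not the notebook's code: the `known_for_gross` mapping, both helper names, the fallback of 0 for an unknown person, and all dollar figures are hypothetical.

```python
from statistics import mean

# Hypothetical "known for" lists: person -> gross revenue of up to 4 movies.
known_for_gross = {
    "Lead Actor": [100_000_000, 200_000_000, 300_000_000, 400_000_000],
    "Director": [50_000_000, 150_000_000],  # fewer than 4 titles is allowed
}


def average_gross(person, known_for, top_n=4):
    """Average gross over the (at most) top_n movies a person is known for."""
    grosses = known_for.get(person, [])[:top_n]
    # Assumption: a person with no known titles contributes 0 bankability.
    return mean(grosses) if grosses else 0.0


def total_gross_bankability(lead_actor, director, known_for):
    """Sum of the lead actor's and the director's average gross."""
    return average_gross(lead_actor, known_for) + average_gross(director, known_for)


score = total_gross_bankability("Lead Actor", "Director", known_for_gross)
print(f"Total Gross Bankability: {score:,.0f}")
```

The profit version is the same calculation with each movie's profit in place of its gross.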

This is a very simple equation, and quantifying “bankability” for an actor, director, and “total star power” for a movie is certainly worth exploring in the future.

At this point, we had all the features necessary for our analysis.

Statistical Methods Used

First, I thought it necessary to remove outliers, as there are a handful of movies with box offices of over $1 billion that would heavily skew our model.

A few James Cameron movies (Titanic, Avatar) and Marvel "Tentpole Movies" (The Avengers, Avengers: Infinity War, Avengers: Endgame) would be considered outliers.


# Current gross mean, including outliers

gross_mean = df_movies['gross'].mean().round(2)
print('Gross Mean:', "{:,}".format(gross_mean))

# Current gross standard deviation

gross_std = df_movies['gross'].std().round(2)
print('Gross Standard Deviation:', "{:,}".format(gross_std))

Gross Mean: 82,468,351.51

Gross Standard Deviation: 168,082,426.51

Avengers Endgame Box Office: $2.789 Billion

Removing Outliers

I classified outliers as movies with box offices more than 3 standard deviations away from the mean.


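import numpy as np
from scipy import stats

# Calculate the absolute Z-score of each movie's gross
df_movies['Z_Score'] = np.abs(stats.zscore(df_movies["gross"]))

# Keep movies within 3 standard deviations and take an independent copy
df_less_outliers = df_movies[df_movies['Z_Score'] < 3].copy()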
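For readers who want to see what the z-score filter does under the hood, here is a small standard-library sketch on invented grosses (in millions of dollars). `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, so `statistics.pstdev` is the matching function; Pandas' `.std()` defaults to the sample standard deviation, which is why the two can differ slightly.

```python
from statistics import mean, pstdev

# Invented grosses in millions of dollars; the last value mimics a tentpole outlier.
grosses = [5, 12, 20, 33, 41, 48, 55, 60, 72, 80,
           95, 101, 120, 150, 180, 210, 240, 270, 300, 2789]

mu = mean(grosses)
sigma = pstdev(grosses)  # population standard deviation, like zscore's default

# Absolute z-score: how many standard deviations each gross sits from the mean.
z_scores = [abs(g - mu) / sigma for g in grosses]

# Keep only the movies within 3 standard deviations of the mean.
kept = [g for g, z in zip(grosses, z_scores) if z < 3]

print("Removed:", [g for g in grosses if g not in kept])
print("Mean before and after:", round(mu, 2), round(mean(kept), 2))
```

Note that the mean and standard deviation used for the cutoff are themselves inflated by the outliers, so a 3-sigma rule is a fairly lenient filter on skewed data like box office grosses.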

Now let’s calculate the new mean and standard deviation.


gross_mean_less_3sigma = df_less_outliers['gross'].mean().round(2)
print('Gross Mean:', "{:,}".format(gross_mean_less_3sigma))

gross_std_less_3sigma = df_less_outliers['gross'].std().round(2)
print('Gross Standard Deviation:', "{:,}".format(gross_std_less_3sigma))

Gross Mean: 63,085,873.0

Gross Standard Deviation: 98,445,947.42


The mean gross decreased by about $20 million, and the standard deviation of gross decreased by about $70 million. Let’s look at our histogram again.

Common domain knowledge states that most movies make little to no money, so at first glance this graph appears to accurately represent the filmmaking industry.

Now let’s return to the question of the article: What variables drive box office the most?

My Hypotheses

Because this was initially an EDA, there are multiple hypotheses. First, I must test the statistical significance of the linear relationship between each variable and the gross.

We’ll start by defining a null and alternative hypothesis for “budget,” the variable I expect to correlate most with box office. The null hypothesis is that the slope of the linear relationship between a movie’s budget and its box office is 0, meaning the two are not statistically significantly associated. I use an alpha of 0.05.

H₀: β₁ = 0

H₁: β₁ ≠ 0

The alternative hypothesis is that the slope of the linear relationship between budget and box office is not 0, meaning the two are statistically significantly associated.

We’ll use linear regression:


from statsmodels.formula.api import ols
model = ols("gross ~ budget", data=df_clean_movies).fit()
print(model.summary())

R-squared: 0.466

Adj. R-squared: 0.466

Slope: 2.11

P-value: 0.000

Because the P-Value is less than the alpha (p < 0.05) we reject the null hypothesis and can conclude there is a statistically significant association between a movie’s budget and box office.
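To make the summary figures less of a black box, the sketch below computes a slope, R-squared, and the slope's t-statistic by hand for a one-predictor regression. The five budget and gross values are invented (in millions of dollars), and `simple_ols` is a hypothetical helper; `statsmodels` derives its p-value from the same t-statistic using a t-distribution with n - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return key summary values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r_squared = s_xy ** 2 / (s_xx * s_yy)

    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residual_ss = s_yy - slope * s_xy
    slope_se = sqrt(residual_ss / (n - 2) / s_xx)
    t_stat = slope / slope_se  # tests the null hypothesis that the slope is 0
    return slope, intercept, r_squared, t_stat


budgets = [10, 20, 30, 40, 50]   # invented, in $ millions
grosses = [20, 40, 50, 40, 50]   # invented, in $ millions

slope, intercept, r_squared, t_stat = simple_ols(budgets, grosses)
print(f"slope={slope:.2f} intercept={intercept:.2f} "
      f"R^2={r_squared:.2f} t={t_stat:.2f}")
```

A slope of 2.11, as in the article's output, would read as roughly $2.11 of gross for each additional $1 of budget, on average, within the range of the data.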

But all we did was conclude something that is common knowledge. What I’m looking for is the measure of correlation for each variable (we can test for statistically significant association afterwards).


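# This will show us the correlation of our numeric variables with gross
# (select_dtypes avoids errors from text columns in newer Pandas versions)

df_clean_movies.select_dtypes("number").corr().sort_values(by="gross", ascending=False)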

From this table, we can see that “profit” has the highest correlation to “gross” (unsurprisingly) with a coefficient of 0.96.

The next highest is “budget” with a correlation coefficient of 0.68.

The third highest is “Total_Gross_Bankability”, our engineered feature to represent the “total star power” of the actor and director, with a correlation coefficient of 0.45.
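For reference, the Pearson coefficient behind each cell of that table can be sketched in a few lines. The column values here are invented, and `pearson_r` is a hypothetical helper rather than anything from the notebook; Pandas' `DataFrame.corr()` computes the same Pearson statistic by default.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by both spreads."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


# Invented columns: budget and gross in $ millions, runtime in minutes.
columns = {
    "budget": [10, 20, 30, 40, 50],
    "runtime": [95, 120, 88, 131, 110],
    "gross": [20, 40, 50, 40, 50],
}

# Correlate every other column with gross and sort from strongest to weakest.
correlations = {
    name: pearson_r(values, columns["gross"])
    for name, values in columns.items()
    if name != "gross"
}
for name, r in sorted(correlations.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {r:.2f}")
```

In a one-predictor regression, R-squared is the square of this coefficient, so a budget correlation of 0.68 lines up with the R-squared of 0.466 reported earlier (0.68² ≈ 0.46).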

Let’s test the statistical significance of the relationship between “gross” and “Total_Gross_Bankability” while controlling for budget, with an alpha = 0.05.


model2 = ols("gross ~ budget + Total_Gross_Bankability", data=df_clean_movies).fit()
print(model2.summary())

R-squared: 0.501 (an increase)

Adj. R-squared: 0.500 (an increase)

Slope: 0.09

P-value: 0.000

Because the P-Value is less than the alpha (p < 0.05) we reject the null hypothesis and can conclude there is a statistically significant association between a movie’s “Total Gross Bankability” and box office, while controlling for budget.
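Adding a predictor can never lower plain R-squared, which is why the adjusted figure is worth reading alongside it. A minimal sketch of the standard adjustment follows; the sample size `n = 4000` is a placeholder, since the article does not state how many movies were in `df_clean_movies`, and `adjusted_r_squared` is a hypothetical helper name.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty


# R-squared values are the ones reported above; n is a hypothetical sample size.
n = 4000
print(round(adjusted_r_squared(0.466, n, n_predictors=1), 3))  # budget only
print(round(adjusted_r_squared(0.501, n, n_predictors=2), 3))  # budget + bankability
```

With thousands of rows and only two predictors the penalty is tiny, so an increase in adjusted R-squared from one model to the next suggests the bankability feature adds real explanatory power and is not just an extra column.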

Results

This analysis concludes that, aside from profit (which is derived from gross), the variable with the highest correlation and a statistically significant association with a movie’s box office is its budget. This is not that surprising to me.

What is surprising, however, is this analysis also concludes that there is a statistically significant association between box office and “star power” bankability (titled “Total_Gross_Bankability” in the DataFrame).

Remember, “Total Gross Bankability” was a feature engineered by taking the average box office of the 4 most popular movies an actor/director was known for on their IMDb page. Because the correlation coefficient was 0.45, it might be worth exploring in the future.

Conclusion

Budget and “Total Gross Bankability” (a quantification of a movie’s “star power”) were statistically significantly associated with a movie’s box office. Budget had a relatively strong correlation (0.68), and “Total Gross Bankability” had a moderate correlation (0.45).

The year, runtime, and score had relatively weak correlation with box office.

Limitations of my analysis include missing values in some rows, which either had to be dropped or ignored. I also did not measure the correlation of categorical variables, only quantitative ones. The dataset also only included movies released in the USA between 1980–2020, and excluded any movies released by streaming companies (like Netflix originals, since their business model is not built around “box office”).

Further research could use logistic regression to examine how the thousands of production companies and tens of thousands of actors relate to box office success, and could look for a better algorithm to quantify the “star power” of a movie (the ability to draw an audience because of a particular actor or director).

If we had access to the marketing data, I would be particularly interested in examining the correlation between box office and spend on different marketing mediums, such as OOH billboards, digital display ads, YouTube Ad commercials, social media posts, getting actors on talk shows, and more. This is certainly an area for further research.

You can check out my GitHub here. For convenience, you can see the entire notebook referenced above here. To see more Data Science projects, follow me on GitHub or connect with me on LinkedIn!

[1] Gupta, Ravi. Predicting Movie Profitability and Risk at the Pre-production Phase. Medium. Retrieved from https://towardsdatascience.com/predicting-movie-profitability-and-risk-at-the-pre-production-phase-2288505b4aec
