Using Multiple Regression To Examine What Variables Are Most Correlated With A Movie’s Box Office Success
Tools Used: Python, Jupyter Notebook, Pandas, Dask, SciPy, Matplotlib, Seaborn, Scikit-learn, DataPrep
When asked how movie executives know whether a film will make money, Hollywood legend William Goldman once said,
“Nobody knows anything.”
But surely there are independent variables that are more correlated with box office than others. I was personally motivated to find this out: I’m a filmmaker too, and I’d much rather make money than lose money.
So my goal was to analyze the Pearson Correlation Coefficients to determine just how correlated each independent variable is with a movie’s gross revenue, in hopes of one day replicating the box office success of movies I love.
Getting The Data
The most comprehensive movie database on the internet is probably the aptly named Internet Movie Database, or IMDb.com. They have a relational database accessible by downloadable zip files on their website, which is updated every day.
The files were millions of rows long, so in order to wrangle the data in Jupyter, I imported Dask to use parallel processing. I merged the “relational dataframes” based on their primary keys, then I double-checked a few rows.
# Testing to see if the DataFrame also holds my own IMDb credits!
imdb_all_data.loc[imdb_all_data["primaryName"] == "Austin James Wolff"].compute()
After I merged the tables, I converted the Dask DataFrame back into Pandas to continue the analysis.
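The merge itself is not shown in this article, so here is a minimal sketch of what joining the relational tables on their keys might look like. It uses tiny in-memory Pandas frames as stand-ins for the IMDb files. The key columns (nconst for people, tconst for titles) follow IMDb's published schema, but the rows are invented and the exact tables I merged are an assumption here. Dask DataFrames expose a merge method with the same on/how arguments, evaluated lazily until .compute() is called.

```python
import pandas as pd

# Toy stand-ins for three IMDb tables (rows are invented for illustration).
names = pd.DataFrame({
    "nconst": ["nm001", "nm002"],
    "primaryName": ["Actor A", "Director B"],
})
principals = pd.DataFrame({
    "tconst": ["tt01", "tt01", "tt02"],
    "nconst": ["nm001", "nm002", "nm001"],
    "category": ["actor", "director", "actor"],
})
titles = pd.DataFrame({
    "tconst": ["tt01", "tt02"],
    "primaryTitle": ["Movie One", "Movie Two"],
})

# Join credits to people on the person key, then to titles on the title key.
credits = principals.merge(names, on="nconst", how="left")
imdb_all_data = credits.merge(titles, on="tconst", how="left")

# Spot-check one person's rows, similar to the check shown earlier.
actor_a_rows = imdb_all_data.loc[imdb_all_data["primaryName"] == "Actor A"]
print(actor_a_rows[["primaryName", "primaryTitle", "category"]])
```

A left join from the credits table keeps every credit even if a name or title lookup fails, which makes missing keys easy to spot as NaN values.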
Limitations To The Dataset
IMDb’s dataset, while massive, was still missing many variables I needed, such as Rating, Score, and Budget. So I pulled a public dataset from Kaggle which contained the variables I needed, but only for movies from the years 1980–2020.
Data Wrangling and Feature Engineering
First, I removed all non-US movies. Then I changed the type of the release date column to a datetime with Pandas and some help from regular expressions (this column was quite dirty). Then I created a column for “profit” (gross revenue - budget).
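As a rough sketch of these three steps, the snippet below filters to US movies, extracts a date from a messy text column with a regular expression, and derives profit. The column names (country, released, budget, gross) and the "Month D, YYYY (Country)" date layout are assumptions for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows; column names and the date layout are assumed, values are invented.
df = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C"],
    "country": ["United States", "United Kingdom", "United States"],
    "released": [
        "June 13, 1980 (United States)",
        "July 2, 1980 (United Kingdom)",
        "1981 (United States)",  # dirty row: no month or day
    ],
    "budget": [19_000_000.0, 4_500_000.0, 3_500_000.0],
    "gross": [46_998_772.0, 58_853_106.0, 83_453_539.0],
})

# Keep US movies only.
df_us = df[df["country"] == "United States"].copy()

# Pull out a "Month D, YYYY" date with a regex; rows without one become NaT.
date_text = df_us["released"].str.extract(r"([A-Za-z]+ \d{1,2}, \d{4})", expand=False)
df_us["released"] = pd.to_datetime(date_text, format="%B %d, %Y", errors="coerce")

# Profit as defined above: gross revenue minus budget.
df_us["profit"] = df_us["gross"] - df_us["budget"]
print(df_us[["name", "released", "profit"]])
```

Using errors="coerce" turns unparseable dates into NaT instead of raising, so dirty rows can be counted and handled separately.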
Now came the interesting part. It is common knowledge in the industry that your movie is more likely to make more money if you have “star power,” a qualitative measure of the popularity of your top actors. This can also apply, to some extent, to the popularity of the director (think Christopher Nolan or Quentin Tarantino).
I wanted to quantify this “bankability.” So I adapted an algorithm from my colleague Ravi Gupta and made some slight modifications [1]. (His entire post is quite brilliant and I highly recommend it!)
I created a new DataFrame with just actors and directors, and mapped each person to the top 4 movies they were known for, along with each movie’s gross revenue and profit. Then I took the average of the gross and profit for each actor and director.
For each movie, I added the top-billed actor’s Average Gross to the director’s Average Gross to get “Total Gross Bankability.”
I decided to add the actor’s mean bankability to the director’s to account for people wanting to see a movie because of a particular actor and not necessarily the director, and vice versa.
I repeated the same process for profit. Now we have two new “bankability” coefficients for each movie we can use in our data analysis to test for correlation.
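Here is a minimal sketch of the gross version of that calculation, not the exact notebook code and not Gupta's original algorithm. It assumes each person's "known for" titles arrive as a comma-separated string of title IDs (as in IMDb's knownForTitles column) and that a gross figure is available per title. All names and numbers are invented.

```python
import pandas as pd

# Hypothetical gross per title and each person's four "known for" titles.
title_gross = pd.Series(
    {"tt1": 100.0, "tt2": 200.0, "tt3": 300.0, "tt4": 400.0, "tt5": 40.0}
)
known_for = pd.DataFrame({
    "person": ["Actor A", "Director B"],
    "knownForTitles": ["tt1,tt2,tt3,tt4", "tt2,tt3,tt4,tt5"],
})

# One row per (person, title), then look up each title's gross.
exploded = known_for.assign(
    tconst=known_for["knownForTitles"].str.split(",")
).explode("tconst")
exploded["gross"] = exploded["tconst"].map(title_gross)

# Average gross across the titles each person is known for.
avg_gross = exploded.groupby("person")["gross"].mean()

# Total Gross Bankability = top-billed actor's average + director's average.
movies = pd.DataFrame(
    {"name": ["Film X"], "star": ["Actor A"], "director": ["Director B"]}
)
movies["Total_Gross_Bankability"] = (
    movies["star"].map(avg_gross) + movies["director"].map(avg_gross)
)
print(movies)
```

In this toy case Actor A averages 250 and Director B averages 235, so Film X gets a Total Gross Bankability of 485. Swapping the gross lookup for a profit lookup gives the profit version.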
This is a very simple equation, and quantifying “bankability” for an actor, director, and “total star power” for a movie is certainly worth exploring in the future.
At this point, we had all the features necessary for our analysis.
Statistical Methods Used
First, I thought it necessary to remove outliers, as a handful of movies with box offices of over $1 billion would otherwise skew the model.
A few James Cameron movies (Titanic, Avatar) and Marvel "Tentpole Movies" (The Avengers, Avengers: Infinity War, Avengers: Endgame) would be considered outliers.
# Current gross mean, including outliers
gross_mean = df_movies['gross'].mean().round(2)
print('Gross Mean:', "{:,}".format(gross_mean))
# Current gross standard deviation
gross_std = df_movies['gross'].std().round(2)
print('Gross Standard Deviation:', "{:,}".format(gross_std))
Gross Mean: 82,468,351.51
Gross Standard Deviation: 168,082,426.51
Avengers Endgame Box Office: $2.789 Billion
Removing Outliers
I classified outliers as movies with box offices more than 3 standard deviations away from the mean.
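# Imports needed for this step
import copy
import numpy as np
from scipy import stats
# Calculate the absolute Z-Score of each movie's gross
df_movies['Z_Score'] = np.abs(stats.zscore(df_movies["gross"]))
# Keep movies within 3 standard deviations of the mean and make a deepcopy
df_less_outliers = copy.deepcopy(df_movies[df_movies['Z_Score'] < 3])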
Now let’s calculate the new mean and standard deviation.
gross_mean_less_3sigma = df_less_outliers['gross'].mean().round(2)
print('Gross Mean:', "{:,}".format(gross_mean_less_3sigma))
gross_std_less_3sigma = df_less_outliers['gross'].std().round(2)
print('Gross Standard Deviation:', "{:,}".format(gross_std_less_3sigma))
Gross Mean: 63,085,873.0
Gross Standard Deviation: 98,445,947.42
The gross mean decreased by about 20 Million, and the Gross Standard Deviation decreased by about 70 Million. Let’s look at our histogram again.
Common domain knowledge states that most movies make little to no money, so at first glance this graph appears to accurately represent the filmmaking industry.
Now let’s return to the question of the article: What variables drive box office the most?
My Hypotheses
Because this was initially an EDA, there are multiple hypotheses. First, I must test the statistical significance of the linear relationship between each variable and the gross.
We’ll start by defining a null and alternative hypothesis for “budget,” the variable I expect to correlate most with box office. The null hypothesis is that the slope of the linear relationship between a movie’s budget and its box office is 0, meaning the two are not linearly associated. I’ll test this at a significance level of alpha = 0.05.
H₀: β₁ = 0
H₁: β₁ ≠ 0
The alternative hypothesis is that the slope of the linear relationship between budget and box office is not 0, meaning the two are statistically significantly associated with one another.
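For reference, the model and test statistic behind these hypotheses can be written out as follows. This is the standard notation for simple linear regression. The intercept, the error term, and the standard error symbol are my additions here, not something taken from the notebook.

```latex
\text{gross}_i = \beta_0 + \beta_1\,\text{budget}_i + \varepsilon_i,
\qquad
t = \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)}
```

The null hypothesis is rejected when the two-sided p-value of t falls below alpha = 0.05.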
We’ll use linear regression:
from statsmodels.formula.api import ols
model = ols("gross ~ budget", data=df_clean_movies).fit()
print(model.summary())
R-squared: 0.466
Adj. R-squared: 0.466
Slope: 2.11
P-value: 0.000
Because the P-Value is less than the alpha (p < 0.05) we reject the null hypothesis and can conclude there is a statistically significant association between a movie’s budget and box office.
But all we did was conclude something that is common knowledge. What I’m looking for is the measure of correlation for each variable (we can test for statistically significant association afterwards).
# This will show us correlation of our variables
df_clean_movies.corr().sort_values(by="gross", ascending=False)
From this table, we can see that “profit” has the highest correlation to “gross” (unsurprisingly) with a coefficient of 0.96.
The next highest is “budget” with a correlation coefficient of 0.68.
The third highest is “Total_Gross_Bankability”, our engineered feature to represent the “total star power” of the actor and director, with a correlation coefficient of 0.45.
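If you want each correlation coefficient alongside a p-value in one pass, SciPy's pearsonr returns both. The sketch below runs it on synthetic data, so the numbers it prints are not the ones from my dataset, and the column names are placeholders.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data standing in for the cleaned movie table (values are made up).
rng = np.random.default_rng(0)
n = 200
budget = rng.uniform(1e6, 2e8, size=n)
gross = 2.0 * budget + rng.normal(0, 5e7, size=n)
runtime = rng.uniform(80, 180, size=n)  # unrelated to gross by construction
df = pd.DataFrame({"gross": gross, "budget": budget, "runtime": runtime})

# Pearson r and two-sided p-value of each feature against gross.
rows = []
for col in df.columns.drop("gross"):
    r, p = stats.pearsonr(df[col], df["gross"])
    rows.append({"feature": col, "pearson_r": r, "p_value": p})

result = (
    pd.DataFrame(rows)
    .sort_values("pearson_r", ascending=False)
    .reset_index(drop=True)
)
print(result)
```

The pearson_r column matches what DataFrame.corr() reports for the same pair of columns, since both compute the Pearson coefficient by default.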
Let’s test the statistical significance of the relationship between “gross” and “Total_Gross_Bankability” while taking budget into account, with an alpha = 0.05.
model2 = ols("gross ~ budget + Total_Gross_Bankability", data=df_clean_movies).fit()
print(model2.summary())
R-squared: 0.501 (an increase)
Adj. R-squared: 0.500 (an increase)
Slope: 0.09
P-value: 0.000
Because the P-Value is less than the alpha (p < 0.05) we reject the null hypothesis and can conclude there is a statistically significant association between a movie’s “Total Gross Bankability” and box office, while taking budget into account.
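Rather than reading these numbers off the printed summary, they can be pulled from the fitted statsmodels results directly. The following sketch compares the budget-only model with the two-variable model on synthetic data. The data is generated for illustration, so the output will not match the figures reported above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (made up) in which both predictors contribute to gross.
rng = np.random.default_rng(1)
n = 300
budget = rng.uniform(1e6, 2e8, size=n)
bankability = rng.uniform(0, 5e8, size=n)
gross = 2.0 * budget + 0.1 * bankability + rng.normal(0, 2e7, size=n)
df = pd.DataFrame({
    "gross": gross,
    "budget": budget,
    "Total_Gross_Bankability": bankability,
})

# Fit the budget-only model and the model that adds bankability.
m1 = ols("gross ~ budget", data=df).fit()
m2 = ols("gross ~ budget + Total_Gross_Bankability", data=df).fit()

# Pull the headline numbers straight from the fitted results.
for label, m in [("budget only", m1), ("budget + bankability", m2)]:
    print(label, "R2:", round(m.rsquared, 3), "Adj. R2:", round(m.rsquared_adj, 3))
print("Slope:", m2.params["Total_Gross_Bankability"])
print("P-value:", m2.pvalues["Total_Gross_Bankability"])
```

Plain R-squared never decreases when a predictor is added, so the adjusted R-squared and the coefficient's p-value are the more informative signals that the new variable is pulling its weight.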
Results
Setting aside profit (which is derived directly from gross), this analysis concludes that the variable with the highest correlation and a statistically significant association to a movie’s box office is its budget. This is not that surprising to me.
What is surprising, however, is this analysis also concludes that there is a statistically significant association between box office and “star power” bankability (titled “Total_Gross_Bankability” in the DataFrame).
Remember, “Total Gross Bankability” was a feature engineered by taking the average box office of the 4 most popular movies an actor/director was known for on their IMDb page. Because the correlation coefficient was 0.45, it might be worth exploring in the future.
Conclusion
Budget and “Total Gross Bankability” (a quantification of a movie’s “star power”) were statistically significantly associated with a movie’s box office. Budget had a relatively strong correlation (0.68), and “Total Gross Bankability” had a moderate correlation (0.45).
The year, runtime, and score had relatively weak correlation with box office.
Limitations of my analysis include missing data in all rows, which either had to be dropped or ignored. I also did not measure the correlation of categorical variables, only quantitative. The dataset also only included movies released in the USA between 1980–2020, and excluded any movies released by streaming companies (like Netflix-originals, since their business model is not built around “box office”).
Further research could use logistic regression to examine how the thousands of production companies and tens of thousands of actors relate to box office success. It could also look for a better algorithm to measure the “star power” of a movie (the ability to get an audience to see a movie because it features a particular actor or is made by a particular director).
If we had access to the marketing data, I would be particularly interested in examining the correlation between box office and spend on different marketing mediums, such as OOH billboards, digital display ads, YouTube Ad commercials, social media posts, getting actors on talk shows, and more. This is certainly an area for further research.
You can check out my GitHub here. For convenience, you can see the entire notebook referenced above here. To see more Data Science projects, follow me on GitHub or connect with me on LinkedIn!
[1] Gupta, Ravi. Predicting Movie Profitability and Risk at the Pre-production Phase. Medium. Retrieved from https://towardsdatascience.com/predicting-movie-profitability-and-risk-at-the-pre-production-phase-2288505b4aec