Building Predictive Models (Random Forest, XGBoost, and Grid Search) To Predict A Movie's Box Office
This is a follow-up to my previous article, where I found the variables most correlated with a movie's box office success. Recently, I tested numerous machine learning models to learn which model performed the best when predicting our target (a movie's box office). I will detail which models I used and their associated results below.
Cleaning The Data
But first, I had to clean and organize my data. I pulled data from IMDb's open database, which they update once per day. The dataset was millions of rows long, so I had to use Dask to help my poor MacBook Air process the data.
I narrowed the dataset down to only movies released in the US between 1980-2020, then merged it with another dataset from TMDB, which contained details regarding budget and box office that IMDb's dataset didn't have.
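To give a sense of that step, here is a rough sketch of the loading and merging code (the file names and column names are illustrative, and the US-release filter, which uses IMDb's title.akas file, is omitted for brevity):
import dask.dataframe as dd
import pandas as pd

# IMDb's dumps are millions of rows, so Dask reads them lazily in partitions
basics = dd.read_csv('title.basics.tsv', sep='\t', dtype=str, na_values='\\N')

# Keep only feature films released between 1980 and 2020
movies = basics[basics['titleType'] == 'movie']
years = movies['startYear'].astype(float)
movies = movies[(years >= 1980) & (years <= 2020)]

# Collapse to a regular pandas DataFrame once it is small enough,
# then merge in TMDB's budget and revenue details
movies = movies.compute()
tmdb = pd.read_csv('tmdb_movies.csv')  # hypothetical export of the TMDB dataset
df = movies.merge(tmdb, left_on='primaryTitle', right_on='title', how='inner')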
Like most machine learning projects, getting, cleaning, and preparing the data took about 80% of the total project time.
Feature Engineering
Based on my previous article, I knew there was a correlation between a movie's box office and the previous box office success of the actors and director attached to it. Here is the simple equation I used to quantify that:
Total Gross Bankability = (avg. gross of the star actor's top 4 movies) + (avg. gross of the director's top 4 movies)
In other words, I calculated the average box office of the top 4 movies for every actor and director, then added the star actor's average and the director's average together for each movie.
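In pandas, the calculation looks roughly like this (a sketch; the column names star_actor, director, and gross are illustrative):
def top_4_average(df, person_col):
    # Average gross of each person's four highest-grossing movies
    return (
        df.sort_values('gross', ascending=False)
          .groupby(person_col)['gross']
          .apply(lambda s: s.head(4).mean())
    )

actor_avg = top_4_average(df, 'star_actor')
director_avg = top_4_average(df, 'director')

# Sum the two averages for each movie to get the feature
df['Total_Gross_Bankability'] = (
    df['star_actor'].map(actor_avg) + df['director'].map(director_avg)
)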
Data Leakage
The "Total Gross Bankability" is the only feature that has possible data leakage, as it takes the box office of an actor's previous top movies. However, this can incorporate target data into our feature.
To remove the leakage, the average gross for an actor should be calculated only over movies released before the target movie (after all, if we were to use this in the real world, we wouldn't have access to the box office of an actor's future movies).
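Here is a rough sketch of how that leakage-free version could be computed (the column names release_date and gross are illustrative):
import pandas as pd

def bankability_before(df, person_col):
    # Avg. gross of a person's top 4 movies released *before* each row's movie
    values = []
    for _, row in df.iterrows():
        prior = df[(df[person_col] == row[person_col]) &
                   (df['release_date'] < row['release_date'])]
        top_4 = prior.nlargest(4, 'gross')['gross']
        values.append(top_4.mean() if len(top_4) else 0.0)
    return pd.Series(values, index=df.index)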
Finding Our Baseline With Dummy Regression
After splitting the data into training and testing sets, I started with a baseline.
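The split itself is standard; a minimal sketch, assuming a feature matrix X and a target y (the column name gross is illustrative):
from sklearn.model_selection import train_test_split

X = df.drop(columns='gross')   # all feature columns
y = df['gross']                # target: box office gross

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
With the data split, the baseline pipeline: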
from category_encoders import OneHotEncoder
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import make_pipeline

model_baseline = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    DummyRegressor()
)
Then I used cross validation to help evaluate the metrics.
from sklearn.model_selection import cross_validate

scoring = ['neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'r2']

cv_baseline = cross_validate(
    model_baseline,
    X_train,
    y_train,
    scoring=scoring,
    cv=5
)
cv_mae_baseline = -cv_baseline['test_neg_mean_absolute_error'].mean()
cv_mape_baseline = -cv_baseline['test_neg_mean_absolute_percentage_error'].mean()
cv_r2_baseline = cv_baseline['test_r2'].mean()
print('Baseline MAE:', '{:,}'.format(cv_mae_baseline.round(2)))
print('Baseline MAPE:', '{:,}'.format(cv_mape_baseline.round(6)))
print('Baseline R2 score:', '{:,}'.format(cv_r2_baseline.round(6)))
Below were the printed results:
Baseline MAE: 93,947,168.94
Baseline MAPE: 282.409473
Baseline R2 score: -0.000994
Due to the high variance of the target column, I decided my key metric would be the R2 score rather than MAE, since R2 is easier to interpret when the target spans several orders of magnitude. However, I kept printing the mean absolute error and mean absolute percentage error for reference. Now that I had my baseline, it was time to try other well-known machine learning models.
Using Linear Regression
I made my linear model and reused most of the code above with some slight variations.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

model_lr_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    LinearRegression()
)
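Every model below is scored with the same cross-validation code as the baseline, so in practice it is worth wrapping that code in a small helper like this (the function name evaluate is mine, not from the notebook):
def evaluate(model, label):
    # Cross-validate a pipeline and print MAE, MAPE, and R2, like the baseline block above
    cv = cross_validate(
        model,
        X_train,
        y_train,
        scoring=scoring,
        cv=5
    )
    print(label, 'MAE:', '{:,}'.format((-cv['test_neg_mean_absolute_error'].mean()).round(2)))
    print(label, 'MAPE:', '{:,}'.format((-cv['test_neg_mean_absolute_percentage_error'].mean()).round(6)))
    print(label, 'R2 score:', '{:,}'.format(cv['test_r2'].mean().round(6)))

evaluate(model_lr_1, 'Linear Regression (1)')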
After cross-validating and scoring, below were the printed results:
Linear Regression (1) MAE: 455,712,698.77
Linear Regression (1) MAPE: 178.76018
Linear Regression (1) R2 score: -14,205.471755
Linear Regression performed worse than the baseline! I tried again, this time scaling my data, to see if that made a difference.
from sklearn.preprocessing import StandardScaler

model_lr_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LinearRegression()
)
Printed results:
Linear Regression (2) MAE: 4.039487079675694e+17
Linear Regression (2) MAPE: 3,207,677,317,315.1016
Linear Regression (2) R2 score: -3.033014419394398e+22
Scaling the data made things much worse. This is likely because of the extreme variance in the data: most movies make relatively little money, while a handful gross over 1,000 times as much as others.
Next, I tried Ridge Regression.
Using Ridge Regression
Unfortunately, with Ridge the data has to be scaled, since the regularization penalty is sensitive to the scale of each feature.
from sklearn.linear_model import Ridge

model_ridge_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    Ridge()
)
But fortunately, it made quite a positive difference. Here are the printed results:
Ridge Regression (1) MAE: 64,164,252.9
Ridge Regression (1) MAPE: 130.469053
Ridge Regression (1) R2 score: 0.5957777
I tried a few different alpha values for Ridge, but it didn't make too much of a difference.
model_ridge_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    Ridge(alpha=100)
)
Here were the new scores:
Ridge Regression (2) MAE: 63,796,418.02
Ridge Regression (2) MAPE: 128.287068
Ridge Regression (2) R2 score: 0.595994
It barely made a difference. I could've used Grid Search to find the optimal hyperparameters, but I wanted to try a tree-based model first.
I had a hunch that tree-based models would outperform the linear models, because box office success is not necessarily a linear function of the inputs, and there are interactions at play between factors like story, the star power of the movie, and word-of-mouth advertising. The results below proved my hunch right.
Using Random Forest
I built the first model using an Ordinal Encoder for my categorical variables.
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor

model_forest_1 = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor()
)
Here were the scores:
Random Forest (1) MAE: 45,806,089.2
Random Forest (1) MAPE: 63.284248
Random Forest (1) R2 score: 0.6850748
The scores were already much better, and I didn't even tune any hyperparameters yet! Next I tried One Hot Encoding to see if that made a difference.
model_forest_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor()
)
The scores:
Random Forest (2) MAE: 44,759,313.8
Random Forest (2) MAPE: 48.350349
Random Forest (2) R2 score: 0.6904746
One Hot Encoding gave better metrics than Ordinal, so I stuck with that.
Using Grid Search
To find the optimal hyperparameters, I used GridSearchCV() on "model_forest_2". Instead of testing every parameter across a fine-grained range, I searched over a small grid of parameter extremes to lower computation time, without leaving things to chance the way Random Search would.
from sklearn.model_selection import GridSearchCV

params = {
    "simpleimputer__strategy": ['mean', 'median'],
    "randomforestregressor__n_estimators": [75, 100, 200],
    "randomforestregressor__max_depth": [None, 100],
    "randomforestregressor__min_samples_leaf": [1, 0.1],
}

model_grid_rf = GridSearchCV(
    model_forest_2,
    param_grid=params,
    n_jobs=-1,
    cv=5,
    # verbose=3
)
model_grid_rf.fit(X_train, y_train)
print(model_grid_rf.best_params_)
Here were the printed hyperparameters:
{'randomforestregressor__max_depth': None, 'randomforestregressor__min_samples_leaf': 1, 'randomforestregressor__n_estimators': 100, 'simpleimputer__strategy': 'mean'}
The best hyperparameters ended up just being the default values. For clarity, I built a third model with the hyperparameters stated explicitly:
model_forest_3 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor(max_depth=None,
                          min_samples_leaf=1,
                          n_estimators=100)
)
Here were the printed results:
Random Forest (3) MAE: 44,809,586.4
Random Forest (3) MAPE: 53.607505
Random Forest (3) R2 score: 0.6880779
All three scores were slightly worse. This is most likely due to the randomness inherent in Random Forest: re-running the model gives a slightly different result every time. Regardless, to be conservative, I stuck with these values for reporting.
XGBoost
Then I tried XGBoost. I tuned the hyperparameters using GridSearchCV:
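The grid search below references a pipeline named model_xgb_1. A minimal sketch of it, assuming it mirrors the Random Forest pipeline above:
from xgboost import XGBRegressor

model_xgb_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    XGBRegressor()
)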
xgb_params = {
    "xgbregressor__max_depth": [3, 6, 10],
    "xgbregressor__learning_rate": [0.01, 0.1, 0.3],
    "xgbregressor__n_estimators": [100, 500, 1000],
    "xgbregressor__colsample_bytree": [0.1, 0.5, 1]
}

xgb_grid_1 = GridSearchCV(
    model_xgb_1,
    param_grid=xgb_params,
    n_jobs=-1,
    scoring='r2',
    cv=2,
    verbose=3
)
xgb_grid_1.fit(X_train, y_train)
Then I printed the results:
XGBoost (2) MAE: 45,012,162.4
XGBoost (2) MAPE: 56.752657
XGBoost (2) R2 score: 0.700572
The MAE and MAPE scores were a little worse, but the R2 score was better. So I used this model on the testing data and compared it to the Random Forest model:
Random Forest R2 (Test) = 0.68125
XGBoost R2 (Test) = 0.7136502
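Roughly how that comparison could be computed (a sketch; it uses the One-Hot-Encoded Random Forest pipeline and the best estimator from the XGBoost grid search):
model_forest_2.fit(X_train, y_train)
print('Random Forest R2 (Test) =', model_forest_2.score(X_test, y_test))
print('XGBoost R2 (Test) =', xgb_grid_1.best_estimator_.score(X_test, y_test))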
XGBoost performed the best! This is not surprising, as it's one of the most common models used to win Kaggle competitions.
Results
While Random Forest beat the linear regression models, XGBoost still outperformed them all. This is yet another example of the "out-of-the-box" adaptability of the XGBoost model.
For those curious what the top features were, here is a "feature importances" graph, pulled from the Random Forest model:
As you can see, "budget" was by far the most important feature, with "Total_Gross_Bankability" coming in second.
Further Research
Limitations of my analysis include missing data in many rows, which either had to be dropped or ignored. The dataset also only included movies released in the USA between 1980 and 2020, and it excluded movies released by streaming companies (like Netflix originals, since their business model is not built around box office).
Areas for further research include calculating the Total Gross Bankability using only movies released before the one we are trying to predict, log-transforming the box office to see whether that improves the R2 score (a quick sketch of this idea follows below), and removing movies classified as outliers in a box plot (those with box offices more than 1.5 times the interquartile range beyond the quartiles).
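The log-transform idea, for example, could be tested with scikit-learn's TransformedTargetRegressor (a sketch, reusing the Random Forest pipeline from above):
import numpy as np
from sklearn.compose import TransformedTargetRegressor

model_forest_log = TransformedTargetRegressor(
    regressor=model_forest_2,   # the One-Hot-Encoded Random Forest pipeline
    func=np.log1p,              # train on log(1 + box office)
    inverse_func=np.expm1       # predict back on the original dollar scale
)
model_forest_log.fit(X_train, y_train)
print('Log-target R2 (Test):', model_forest_log.score(X_test, y_test))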
You can check out my GitHub here. For convenience, you can see the entire notebook referenced above here. To see more Data Science projects, follow me on GitHub or connect with me on LinkedIn!