Building Predictive Models (Random Forest, XGBoost, and Grid Search) To Predict A Movie's Box Office
This is a follow-up to my previous article, where I found the variables most correlated with a movie's box office success. Recently, I tested numerous machine learning models to learn which model performed the best when predicting our target (a movie's box office). I will detail which models I used and their associated results below.
Cleaning The Data
But first, I had to clean and organize my data. I pulled data from IMDb's open database, which they update once per day. The dataset was millions of rows long, so I had to use Dask to help my poor MacBook Air process the data.
I narrowed the dataset down to only movies released in the US between 1980-2020, then merged it with another dataset from TMDB, which contained details regarding budget and box office that IMDb's dataset didn't have.
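To give a sense of that step, here is a rough sketch of the loading and merging code (the file names and column names are illustrative, and the US-release filter, which uses IMDb's title.akas file, is omitted for brevity):
import dask.dataframe as dd
import pandas as pd

# IMDb's dumps are millions of rows, so Dask reads them lazily in partitions
basics = dd.read_csv('title.basics.tsv', sep='\t', dtype=str, na_values='\\N')

# Keep only feature films released between 1980 and 2020
movies = basics[basics['titleType'] == 'movie']
years = movies['startYear'].astype(float)
movies = movies[(years >= 1980) & (years <= 2020)]

# Collapse to a regular pandas DataFrame once it is small enough,
# then merge in TMDB's budget and revenue details
movies = movies.compute()
tmdb = pd.read_csv('tmdb_movies.csv')  # hypothetical export of the TMDB dataset
df = movies.merge(tmdb, left_on='primaryTitle', right_on='title', how='inner')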
Like most machine learning projects, getting, cleaning, and preparing the data took about 80% of the total project time.
Feature Engineering
Based on my previous article, I knew there was a correlation between a movie's box office and the previous box office success of the actors and director attached to it. Here is the simple equation I used to quantify that:
Total Gross Bankability = (avg. gross of the star actor's top 4 movies) + (avg. gross of the director's top 4 movies)
In other words, I calculated the average box office of the top 4 movies for every actor and director, then added the star actor's average and the director's average together for each movie.
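In pandas, the calculation looks roughly like this (a sketch; the column names star_actor, director, and gross are illustrative):
def top_4_average(df, person_col):
    # Average gross of each person's four highest-grossing movies
    return (
        df.sort_values('gross', ascending=False)
          .groupby(person_col)['gross']
          .apply(lambda s: s.head(4).mean())
    )

actor_avg = top_4_average(df, 'star_actor')
director_avg = top_4_average(df, 'director')

# Sum the two averages for each movie to get the feature
df['Total_Gross_Bankability'] = (
    df['star_actor'].map(actor_avg) + df['director'].map(director_avg)
)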
Data Leakage
The "Total Gross Bankability" is the only feature that has possible data leakage, as it takes the box office of an actor's previous top movies. However, this can incorporate target data into our feature.
To remove the leakage, the average gross for an actor should be calculated only over movies released before the target movie (after all, if we were to use this in the real world, we wouldn't have access to the box office of an actor's future movies).
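Here is a rough sketch of how that leakage-free version could be computed (the column names release_date and gross are illustrative):
import pandas as pd

def bankability_before(df, person_col):
    # Avg. gross of a person's top 4 movies released *before* each row's movie
    values = []
    for _, row in df.iterrows():
        prior = df[(df[person_col] == row[person_col]) &
                   (df['release_date'] < row['release_date'])]
        top_4 = prior.nlargest(4, 'gross')['gross']
        values.append(top_4.mean() if len(top_4) else 0.0)
    return pd.Series(values, index=df.index)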
Finding Our Baseline With Dummy Regression
After splitting the data into training and testing sets, I started with a baseline.
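The split itself is standard; a minimal sketch, assuming a feature matrix X and a target y (the column name gross is illustrative):
from sklearn.model_selection import train_test_split

X = df.drop(columns='gross')   # all feature columns
y = df['gross']                # target: box office gross

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
With the data split, the baseline pipeline: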
from category_encoders import OneHotEncoder
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import make_pipeline

model_baseline = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    DummyRegressor()
)
Then I used cross validation to help evaluate the metrics.
from sklearn.model_selection import cross_validate

scoring = ['neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'r2']

cv_baseline = cross_validate(
    model_baseline,
    X_train,
    y_train,
    scoring=scoring,
    cv=5
)
cv_mae_baseline = -cv_baseline['test_neg_mean_absolute_error'].mean()
cv_mape_baseline = -cv_baseline['test_neg_mean_absolute_percentage_error'].mean()
cv_r2_baseline = cv_baseline['test_r2'].mean()
print('Baseline MAE:', '{:,}'.format(cv_mae_baseline.round(2)))
print('Baseline MAPE:', '{:,}'.format(cv_mape_baseline.round(6)))
print('Baseline R2 score:', '{:,}'.format(cv_r2_baseline.round(6)))
Below were the printed results:
Baseline MAE: 93,947,168.94
Baseline MAPE: 282.409473
Baseline R2 score: -0.000994
Due to the high variance of the target column, I decided my key metric would be the R2 score rather than MAE, since R2 is easier to interpret when the target spans several orders of magnitude. However, I kept printing the mean absolute error and mean absolute percentage error for reference. Now that I had my baseline, it was time to try other well-known machine learning models.
Using Linear Regression
I made my linear model and reused most of the code above with some slight variations.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

model_lr_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    LinearRegression()
)
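Every model below is scored with the same cross-validation code as the baseline, so in practice it is worth wrapping that code in a small helper like this (the function name evaluate is mine, not from the notebook):
def evaluate(model, label):
    # Cross-validate a pipeline and print MAE, MAPE, and R2, like the baseline block above
    cv = cross_validate(
        model,
        X_train,
        y_train,
        scoring=scoring,
        cv=5
    )
    print(label, 'MAE:', '{:,}'.format((-cv['test_neg_mean_absolute_error'].mean()).round(2)))
    print(label, 'MAPE:', '{:,}'.format((-cv['test_neg_mean_absolute_percentage_error'].mean()).round(6)))
    print(label, 'R2 score:', '{:,}'.format(cv['test_r2'].mean().round(6)))

evaluate(model_lr_1, 'Linear Regression (1)')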
After cross-validating and scoring, below were the printed results:
Linear Regression (1) MAE: 455,712,698.77
Linear Regression (1) MAPE: 178.76018
Linear Regression (1) R2 score: -14,205.471755
Linear Regression performed worse than the baseline! I tried again, this time scaling my data, to see if that made a difference.
from sklearn.preprocessing import StandardScaler

model_lr_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LinearRegression()
)
Printed results:
Linear Regression (2) MAE: 4.039487079675694e+17
Linear Regression (2) MAPE: 3,207,677,317,315.1016
Linear Regression (2) R2 score: -3.033014419394398e+22
Scaling the data made things much worse. This is likely because of the extreme variance in the data: most movies make relatively little money, while a handful gross over 1,000 times as much as others.
Next, I tried Ridge Regression.
Using Ridge Regression
Unfortunately, with Ridge the data has to be scaled, since the regularization penalty is sensitive to the scale of each feature.
from sklearn.linear_model import Ridge

model_ridge_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    Ridge()
)
But fortunately, it made quite a positive difference. Here are the printed results:
Ridge Regression (1) MAE: 64,164,252.9
Ridge Regression (1) MAPE: 130.469053
Ridge Regression (1) R2 score: 0.5957777
I tried a few different alpha values for Ridge, but it didn't make too much of a difference.
model_ridge_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    Ridge(alpha=100)
)
Here were the new scores:
Ridge Regression (2) MAE: 63,796,418.02
Ridge Regression (2) MAPE: 128.287068
Ridge Regression (2) R2 score: 0.595994
It barely made a difference. I could've used Grid Search to find the optimal hyperparameters, but I wanted to try a tree-based model first.
I had a hunch that tree-based models would outperform the linear models, because box office success is not necessarily a linear function of the inputs, and there are interactions at play between factors like story, the star power of the movie, and word-of-mouth advertising. The results below proved my hunch right.
Using Random Forest
I built the first model using an Ordinal Encoder for my categorical variables.
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor

model_forest_1 = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor()
)
Here were the scores:
Random Forest (1) MAE: 45,806,089.2
Random Forest (1) MAPE: 63.284248
Random Forest (1) R2 score: 0.6850748
The scores were already much better, and I didn't even tune any hyperparameters yet! Next I tried One Hot Encoding to see if that made a difference.
model_forest_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor()
)
The scores:
Random Forest (2) MAE: 44,759,313.8
Random Forest (2) MAPE: 48.350349
Random Forest (2) R2 score: 0.6904746
One Hot Encoding gave better metrics than Ordinal, so I stuck with that.
Using Grid Search
To find the optimal hyperparameters, I used GridSearchCV() on "model_forest_2". Instead of testing every parameter across a fine-grained range, I searched over a small grid of parameter extremes to lower computation time, without leaving things to chance the way Random Search would.
from sklearn.model_selection import GridSearchCV

params = {
    "simpleimputer__strategy": ['mean', 'median'],
    "randomforestregressor__n_estimators": [75, 100, 200],
    "randomforestregressor__max_depth": [None, 100],
    "randomforestregressor__min_samples_leaf": [1, 0.1],
}

model_grid_rf = GridSearchCV(
    model_forest_2,
    param_grid=params,
    n_jobs=-1,
    cv=5,
    # verbose=3
)
model_grid_rf.fit(X_train, y_train)
print(model_grid_rf.best_params_)
Here were the printed hyperparameters:
{'randomforestregressor__max_depth': None, 'randomforestregressor__min_samples_leaf': 1, 'randomforestregressor__n_estimators': 100, 'simpleimputer__strategy': 'mean'}
The best hyperparameters ended up just being the default values. For clarity, I built a third model with the hyperparameters stated explicitly:
model_forest_3 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor(max_depth=None,
                          min_samples_leaf=1,
                          n_estimators=100)
)
Here were the printed results:
Random Forest (3) MAE: 44,809,586.4
Random Forest (3) MAPE: 53.607505
Random Forest (3) R2 score: 0.6880779
All three scores were slightly worse. This is most likely due to the randomness inherent in Random Forest: re-running the model gives a slightly different result every time. Regardless, to be conservative, I stuck with these values for reporting.
XGBoost
Then I tried XGBoost. I tuned the hyperparameters using GridSearchCV:
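The grid search below references a pipeline named model_xgb_1. A minimal sketch of it, assuming it mirrors the Random Forest pipeline above:
from xgboost import XGBRegressor

model_xgb_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    XGBRegressor()
)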
xgb_params = {
    "xgbregressor__max_depth": [3, 6, 10],
    "xgbregressor__learning_rate": [0.01, 0.1, 0.3],
    "xgbregressor__n_estimators": [100, 500, 1000],
    "xgbregressor__colsample_bytree": [0.1, 0.5, 1]
}

xgb_grid_1 = GridSearchCV(
    model_xgb_1,
    param_grid=xgb_params,
    n_jobs=-1,
    scoring='r2',
    cv=2,
    verbose=3
)
xgb_grid_1.fit(X_train, y_train)
Then I printed the results:
XGBoost (2) MAE: 45,012,162.4
XGBoost (2) MAPE: 56.752657
XGBoost (2) R2 score: 0.700572
The MAE and MAPE scores were a little worse, but the R2 score was better. So I used this model on the testing data and compared it to the Random Forest model:
Random Forest R2 (Test) = 0.68125
XGBoost R2 (Test) = 0.7136502
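Roughly how that comparison could be computed (a sketch; it uses the One-Hot-Encoded Random Forest pipeline and the best estimator from the XGBoost grid search):
model_forest_2.fit(X_train, y_train)
print('Random Forest R2 (Test) =', model_forest_2.score(X_test, y_test))
print('XGBoost R2 (Test) =', xgb_grid_1.best_estimator_.score(X_test, y_test))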
XGBoost performed the best! This is not surprising, as it's one of the most common models used to win Kaggle competitions.
Results
While Random Forest beat the linear regression models, XGBoost still outperformed them all. This is yet another example of the "out-of-the-box" adaptability of the XGBoost model.
For those curious what the top features were, here is a "feature importances" graph, pulled from the Random Forest model:
As you can see, "budget" was by far the most important feature, with "Total_Gross_Bankability" coming in second.
Further Research
Limitations of my analysis include missing data in many rows, which either had to be dropped or ignored. The dataset also only included movies released in the USA between 1980 and 2020, and it excluded movies released by streaming companies (like Netflix originals, since their business model is not built around box office).
Areas for further research include calculating the Total Gross Bankability using only movies released before the one we are trying to predict, log-transforming the box office to see whether that improves the R2 score (a quick sketch of this idea follows below), and removing movies classified as outliers in a box plot (those with box offices more than 1.5 times the interquartile range beyond the quartiles).
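The log-transform idea, for example, could be tested with scikit-learn's TransformedTargetRegressor (a sketch, reusing the Random Forest pipeline from above):
import numpy as np
from sklearn.compose import TransformedTargetRegressor

model_forest_log = TransformedTargetRegressor(
    regressor=model_forest_2,   # the One-Hot-Encoded Random Forest pipeline
    func=np.log1p,              # train on log(1 + box office)
    inverse_func=np.expm1       # predict back on the original dollar scale
)
model_forest_log.fit(X_train, y_train)
print('Log-target R2 (Test):', model_forest_log.score(X_test, y_test))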
You can check out my GitHub here. For convenience, you can see the entire notebook referenced above here. To see more Data Science projects, follow me on GitHub or connect with me on LinkedIn!