Building Predictive Models (Random Forest, XGBoost, and Grid Search) To Predict A Movie's Box Office

This is a follow-up to my previous article, where I found the variables most correlated with a movie's box office success. Recently, I tested several machine learning models to see which performed best at predicting our target (a movie's box office). Below, I detail the models I used and their results.

Cleaning The Data

But first, I had to clean and organize my data. I pulled data from IMDb's open database, which they update once per day. The dataset was millions of rows long, so I had to use Dask to help my poor MacBook Air process the data.

I narrowed the dataset down to only movies released in the US between 1980-2020, then merged it with another dataset from TMDB, which contained details regarding budget and box office that IMDb's dataset didn't have.
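For context, here is a minimal sketch of what that step might look like with Dask. The file paths and the TMDB export's columns are my assumptions; the IMDb column names (titleType, startYear, region, tconst) follow IMDb's published schema.

import dask.dataframe as dd
import pandas as pd

# IMDb title basics: one row per title
basics = dd.read_csv(
    'title.basics.tsv.gz',
    sep='\t',
    dtype=str,
    na_values='\\N',     # IMDb marks missing values with \N
    blocksize=None,      # gzipped files cannot be split into parallel blocks
)

# Keep only feature films released between 1980 and 2020
movies = basics[basics['titleType'] == 'movie']
movies['startYear'] = dd.to_numeric(movies['startYear'], errors='coerce')
movies = movies[movies['startYear'].between(1980, 2020)]

# Use title.akas to keep titles released in the US
akas = dd.read_csv('title.akas.tsv.gz', sep='\t', dtype=str,
                   na_values='\\N', blocksize=None)
us_ids = akas.loc[akas['region'] == 'US', 'titleId'].unique().compute()
movies = movies[movies['tconst'].isin(list(us_ids))].compute()

# Merge in budget and box office from a TMDB export keyed on the IMDb id
tmdb = pd.read_csv('tmdb_movies.csv')  # assumed to contain imdb_id, budget, revenue
df = movies.merge(tmdb, left_on='tconst', right_on='imdb_id', how='inner')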

Like most machine learning projects, getting, cleaning, and preparing the data took about 80% of the total project time.

Feature Engineering

Based on my previous article, I knew there was a correlation between a movie's box office and the previous box office success of its actors and director. Here is the simple equation I used to quantify that relationship:

Avg. Gross (person) = (sum of the grosses of that person's top 4 movies) / 4

Total Gross Bankability = Avg. Gross (star actor) + Avg. Gross (director)

I simply calculated the average box office of the top 4 movies for every actor and director, then added those two numbers together (the avg. gross for the star actor and the avg. gross for the director) for each movie to get the "Total Gross Bankability."
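In code, that calculation might look like the following sketch; the column names ('star', 'director', 'gross') are hypothetical:

import pandas as pd

def avg_top4_gross(df: pd.DataFrame, person_col: str) -> pd.Series:
    """Average box office of each person's four highest-grossing movies."""
    return df.groupby(person_col)['gross'].apply(lambda g: g.nlargest(4).mean())

# Map each movie's star and director to their top-4 averages, then sum
actor_avg = avg_top4_gross(df, 'star')
director_avg = avg_top4_gross(df, 'director')

df['Total_Gross_Bankability'] = (
    df['star'].map(actor_avg) + df['director'].map(director_avg)
)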

Data Leakage

The "Total Gross Bankability" is the only feature that has possible data leakage, as it takes the box office of an actor's previous top movies. However, this can incorporate target data into our feature.

To remove the data leakage, it would be better to calculate the avg. gross for an actor using only movies released before the target movie (after all, if we used this model in the real world, we wouldn't have access to the box office of an actor's future movies).
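A leakage-free version might look like this sketch (again with hypothetical column names 'star', 'release_date', and 'gross'): sort by release date, then average only the top-4 grosses of each actor's earlier movies.

import pandas as pd

def prior_top4_avg(grosses: pd.Series) -> pd.Series:
    """For each movie, the mean of the top-4 grosses of *earlier* movies only."""
    out = []
    for i in range(len(grosses)):
        prior = grosses.iloc[:i]  # grosses of movies released before this one
        out.append(prior.nlargest(4).mean() if len(prior) else float('nan'))
    return pd.Series(out, index=grosses.index)

df = df.sort_values('release_date')
df['star_prior_avg'] = df.groupby('star')['gross'].transform(prior_top4_avg)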

Finding Our Baseline With Dummy Regression

After splitting the data into training and testing sets, I started with my baseline.
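A minimal sketch of the split (the 80/20 ratio and seed here are my assumptions, not necessarily what the notebook used):

from sklearn.model_selection import train_test_split

# Hold out a portion of the movies for final testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

And the baseline itself: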

from category_encoders import OneHotEncoder
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import make_pipeline

model_baseline = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    DummyRegressor()
)

Then I used 5-fold cross-validation to evaluate the metrics.


from sklearn.model_selection import cross_validate

scoring = ['neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'r2']

cv_baseline = cross_validate(
    model_baseline,
    X_train,
    y_train,
    scoring=scoring,
    cv=5
)

# sklearn returns negated errors, so flip the sign back
cv_mae_baseline = -cv_baseline['test_neg_mean_absolute_error'].mean()
cv_mape_baseline = -cv_baseline['test_neg_mean_absolute_percentage_error'].mean()
cv_r2_baseline = cv_baseline['test_r2'].mean()

print('Baseline MAE:', '{:,}'.format(cv_mae_baseline.round(2)))
print('Baseline MAPE:', '{:,}'.format(cv_mape_baseline.round(6)))
print('Baseline R2 score:', '{:,}'.format(cv_r2_baseline.round(6)))

Below were the printed results:


Baseline MAE: 93,947,168.94
Baseline MAPE: 282.409473
Baseline R2 score: -0.000994
        

Due to the high variance of the target column, I decided my key metric would be the R2 score rather than MAE. However, I kept printing the mean absolute error and mean absolute percentage error for reference. Now that I had my baseline, it was time to try other well-known machine learning models.

Using Linear Regression

I made my linear model and reused most of the code above with some slight variations.


from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

model_lr_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    LinearRegression()
)

After cross-validating and scoring, below were the printed results:


Linear Regression (1) MAE: 455,712,698.77
Linear Regression (1) MAPE: 178.76018
Linear Regression (1) R2 score: -14,205.471755
        

Linear Regression performed worse than the baseline! I tried again, this time scaling my data, to see if that made a difference.


from sklearn.preprocessing import StandardScaler

model_lr_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LinearRegression()
)

Printed results:


Linear Regression (2) MAE: 4.039487079675694e+17
Linear Regression (2) MAPE: 3,207,677,317,315.1016
Linear Regression (2) R2 score: -3.033014419394398e+22
        

Scaling the data made things much worse. This is likely because of the huge variance in the data: most movies make relatively little money, while a few make over 1,000 times as much as others.

Next, I tried Ridge Regression.

Using Ridge Regression

Unfortunately, Ridge requires scaled data, since its L2 penalty is sensitive to the scale of each feature.


from sklearn.linear_model import Ridge

model_ridge_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    Ridge()
)

But fortunately, it made quite a positive difference. Here are the printed results:


Ridge Regression (1) MAE: 64,164,252.9
Ridge Regression (1) MAPE: 130.469053
Ridge Regression (1) R2 score: 0.5957777
        

I tried a few different alpha values for Ridge, but it didn't make too much of a difference.


model_ridge_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    Ridge(alpha=100)  # much stronger regularization than the default alpha=1.0
)

Here were the new scores:


Ridge Regression (2) MAE: 63,796,418.02
Ridge Regression (2) MAPE: 128.287068
Ridge Regression (2) R2 score: 0.595994
        

It barely made a difference. I could have used Grid Search to find the optimal hyperparameters, but I wanted to try a tree-based model first.

I had a hunch that tree-based models would outperform the linear models, since a big box office is not necessarily a linear function of the inputs; there are interactions at play like story, the star power of the movie, and word-of-mouth advertising. The results below proved my hunch right.

Using Random Forest

I built the first model using an Ordinal Encoder for my categorical variables.


from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor

model_forest_1 = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor()
)

Here were the scores:


Random Forest (1) MAE: 45,806,089.2
Random Forest (1) MAPE: 63.284248
Random Forest (1) R2 score: 0.6850748
        

The scores were already much better, and I hadn't even tuned any hyperparameters yet! Next, I tried One Hot Encoding to see if that made a difference.


model_forest_2 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor()
)
        

The scores:


Random Forest (2) MAE: 44,759,313.8
Random Forest (2) MAPE: 48.350349
Random Forest (2) R2 score: 0.6904746
        

One Hot Encoding gave better metrics than Ordinal, so I stuck with that.

Using Grid Search

To find the optimal hyperparameters, I used GridSearchCV() on "model_forest_2." Instead of testing every parameter across an iterative range, I searched over parameter extremes, which lowers computation time without leaving things to chance the way Randomized Search would.


from sklearn.model_selection import GridSearchCV

params = {
    "simpleimputer__strategy": ['mean', 'median'],
    "randomforestregressor__n_estimators": [75, 100, 200],
    "randomforestregressor__max_depth": [None, 100],
    "randomforestregressor__min_samples_leaf": [1, 0.1],
}

model_grid_rf = GridSearchCV(
    model_forest_2,
    param_grid=params,
    n_jobs=-1,
    cv=5,
    # verbose=3
)

model_grid_rf.fit(X_train, y_train)

print(model_grid_rf.best_params_)

Here were the printed hyperparameters:


{'randomforestregressor__max_depth': None, 'randomforestregressor__min_samples_leaf': 1, 'randomforestregressor__n_estimators': 100, 'simpleimputer__strategy': 'mean'}
        

The best hyperparameters ended up just being the default values. For clarity, I built a third model with the hyperparameters stated explicitly:


model_forest_3 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor(max_depth=None,
                          min_samples_leaf=1,
                          n_estimators=100)
)
        

Here were the printed results:


Random Forest (3) MAE: 44,809,586.4
Random Forest (3) MAPE: 53.607505
Random Forest (3) R2 score: 0.6880779
        

All 3 scores were a little worse. This is most likely due to the randomness inherent in a random forest: without a fixed seed, each run can give a slightly different result. Regardless, to be conservative, I stuck with these values for reporting.
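If run-to-run stability matters, fixing the seed removes that variation. A minimal sketch (the model name and seed value here are my own, not from the original notebook):

# A fixed random_state makes repeated fits reproduce the same scores
model_forest_seeded = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestRegressor(n_estimators=100, random_state=42)
)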

XGBoost

Then I tried XGBoost, tuning its hyperparameters with GridSearchCV. The grid below references a base pipeline, model_xgb_1, which isn't shown but presumably mirrors the earlier ones, ending in an XGBRegressor. A minimal sketch of that pipeline, then the grid search:
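from xgboost import XGBRegressor

# Assumed shape of model_xgb_1, inferred from the "xgbregressor__"
# parameter prefixes in the grid below
model_xgb_1 = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    XGBRegressor()
)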


xgb_params = {
    "xgbregressor__max_depth": [3, 6, 10],
    "xgbregressor__learning_rate": [0.01, 0.1, 0.3],
    "xgbregressor__n_estimators": [100, 500, 1000],
    "xgbregressor__colsample_bytree": [0.1, 0.5, 1]
}

xgb_grid_1 = GridSearchCV(model_xgb_1,
                          param_grid=xgb_params,
                          n_jobs=-1,
                          scoring='r2',
                          cv=2,
                          verbose=3)

xgb_grid_1.fit(X_train, y_train)

Then I printed the results:


XGBoost (2) MAE: 45,012,162.4
XGBoost (2) MAPE: 56.752657
XGBoost (2) R2 score: 0.700572
        

The MAE and MAPE scores were a little worse, but the R2 score was better. So I used this model on the testing data and compared it to the Random Forest model (a sketch of that comparison is below):
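# A sketch of the test-set comparison; score() returns R2 for regressors,
# and GridSearchCV scores with its refit best estimator.
# Assumes model_forest_3 was already fit on the training data.
rf_test_r2 = model_forest_3.score(X_test, y_test)
xgb_test_r2 = xgb_grid_1.score(X_test, y_test)

print('Random Forest R2 (Test) =', rf_test_r2)
print('XGBoost R2 (Test) =', xgb_test_r2)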


Random Forest R2 (Test) = 0.68125

XGBoost R2 (Test) = 0.7136502
        

XGBoost performed the best! This is not surprising, as it is among the most common models used to win Kaggle competitions.

Results

While Random Forest beat the Linear Regression models, XGBoost outperformed them all. This is yet another example of the "out-of-the-box" adaptability of XGBoost.

For those curious what the top features were, here is a "feature importances" graph, pulled from the Random Forest model:

[Feature importances bar chart from the Random Forest model]

As you can see, "budget" was by far the most important feature, with "Total_Gross_Bankability" coming in second. (A sketch of how to pull these importances out of the fitted pipeline follows.)
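import matplotlib.pyplot as plt
import pandas as pd

# Step names follow make_pipeline's lowercased-class convention
encoder = model_forest_3.named_steps['onehotencoder']
forest = model_forest_3.named_steps['randomforestregressor']

# category_encoders transformers return DataFrames, so the encoded column
# names line up with the forest's importances (assuming the imputer
# dropped no columns)
feature_names = encoder.transform(X_train).columns
importances = pd.Series(forest.feature_importances_, index=feature_names)

importances.sort_values().tail(10).plot(kind='barh')
plt.xlabel('Importance')
plt.show()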

Further Research

Limitations of my analysis include missing data in many rows, which either had to be dropped or imputed. The dataset also only included movies released in the USA between 1980–2020, and excluded movies released by streaming companies (like Netflix originals, since their business model is not built around box office).

Areas for further research include calculating the Total Gross Bankability using only movies released before the one we are trying to predict, log-transforming the box office to test whether that improves the R2 score (a sketch of that idea follows), and removing movies classified as outliers in a box plot (those with box offices more than 1.5 times the interquartile range above the third quartile).
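As a sketch of the log-transform idea, scikit-learn's TransformedTargetRegressor can wrap the existing pipeline; wiring it to model_forest_3 here is my assumption about how one might test it:

import numpy as np
from sklearn.compose import TransformedTargetRegressor

# Train on log(1 + box office) so a few blockbusters don't dominate the
# loss; predictions are automatically mapped back to dollars
model_log = TransformedTargetRegressor(
    regressor=model_forest_3,
    func=np.log1p,
    inverse_func=np.expm1
)
model_log.fit(X_train, y_train)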

You can check out my GitHub here. For convenience, you can see the entire notebook referenced above here. To see more Data Science projects, follow me on GitHub or connect with me on LinkedIn!
