Predictive Modelling (Part 3/3): The Dark and White Ensemble
In my previous posts, Predictive Modelling Part 1 and Part 2, I introduced how simple it is to design and deploy a machine learning predictive model with just two lines of R code, discussed the pros and cons of different predictive model algorithms, and showed how the quality of your data and the choice of algorithm affect your prediction accuracy. There is one more stage in the analytical pipeline, one I would call the 'artistic' side of data science, that, when applied appropriately, can squeeze a lot more insight from your fountain of data and boost accuracy. Behold the holy grail: the model ensemble technique.
A single predictive model carries biases, variability, and over/under-sampling issues that it cannot fully capture, so different models yield different results and fluctuating accuracy. Any predictive model's accuracy can shift with the slightest change in anything: training data size, number of features/measurements, model parameters/settings, or even the random seed.
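To make the seed-sensitivity point concrete, here is a minimal sketch (not from the original post; it uses the built-in `mtcars` data and a plain logistic regression, with a hypothetical `acc_for_seed` helper) showing that the exact same model class gives different accuracies under different random train/test splits:

```r
# Fit the same logistic regression on a random 22-row training split of
# mtcars and report test accuracy; only the seed changes between calls.
acc_for_seed = function(seed) {
  set.seed(seed)
  idx  = sample(nrow(mtcars), 22)                       # random split
  fit  = glm(am ~ mpg + wt, data = mtcars[idx, ], family = binomial)
  pred = predict(fit, mtcars[-idx, ], type = "response") > 0.5
  mean(pred == (mtcars$am[-idx] == 1))                  # test accuracy
}

acc_for_seed(1)   # one accuracy
acc_for_seed(2)   # usually a different accuracy, same model class
```

Nothing about the model changed between the two calls; only the random split did, and that alone moves the accuracy number.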
The essence of the ensemble technique is to account for and capture all the variability, biases, and correlations left unexplained by feature engineering, by finding a legitimate way of combining the different models and tuning the combination into its most generalized form, one that will fit well to any data point.
"Among the vast spectrum of ensemble techniques, I would classify them into two major groups: White Ensemble and Dark Ensemble techniques."
The Dark ensemble technique, as the name suggests, is a form of experimentation by trial and error: a simple ensemble method that just takes weighted or unweighted averages, rank averages, or votes across all the single-model predictions. You are simply guessing which model(s) are more or less accurate and assigning larger weights to those that perform better; if you are lucky, you hit the right spot and jackpot the most accurate predictions.
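The three simple combiners named above can be sketched in a few lines of base R. This is an illustration, not code from the post: `p1`, `p2`, `p3` are hypothetical probability vectors, one per model.

```r
# Hypothetical probability outputs from three binary classifiers.
p1 = c(0.9, 0.2, 0.6); p2 = c(0.8, 0.4, 0.7); p3 = c(0.7, 0.1, 0.4)

# 1. Weighted averaging: guessed weights, chosen to sum to 1.
pred.wavg = 0.5 * p1 + 0.3 * p2 + 0.2 * p3

# 2. Rank averaging: average each model's ranks instead of raw scores,
#    useful when models output probabilities on different scales.
pred.rank = (rank(p1) + rank(p2) + rank(p3)) / 3

# 3. Majority voting on thresholded class labels.
votes     = (p1 > 0.5) + (p2 > 0.5) + (p3 > 0.5)
pred.vote = as.integer(votes >= 2)
```

All three need nothing but the stored prediction vectors, which is why this method costs almost nothing once the single models are trained.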
The White ensemble technique, on the other hand, is evidence-based experimentation that can detect which model(s) predict better on particular subsets of data points or properties. It is a higher-order model that predicts the result from the outputs of the lower-order models. This process can repeat itself over several stages, and the accuracy at each stage is likely to improve as it progresses up the pipeline. You can find examples in the Kaggle Home Depot challenge winners' solution (Figure 1) and the Homesite Quote challenge solution (Figure 1).
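A minimal two-stage stacking sketch, to make the "higher-order model" idea concrete. Everything here is a hypothetical setup in base R (simulated data, two logistic regressions as level-1 learners, a third as the level-2 combiner); real stacked solutions use many more models and folds, but the out-of-fold mechanics are the same:

```r
set.seed(42)
n = 200
d = data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y = as.integer(d$x1 + d$x2 + rnorm(n) > 0)      # simulated binary target

folds = sample(rep(1:5, length.out = n))          # 5-fold assignment
oof1 = oof2 = numeric(n)                          # out-of-fold predictions
for (k in 1:5) {
  tr = folds != k
  # Level-1 (lower-order) learners, trained only on the other folds
  # so their predictions for fold k are honest.
  m1 = glm(y ~ x1,      data = d[tr, ], family = binomial)
  m2 = glm(y ~ x1 + x2, data = d[tr, ], family = binomial)
  oof1[!tr] = predict(m1, d[!tr, ], type = "response")
  oof2[!tr] = predict(m2, d[!tr, ], type = "response")
}

# Level-2 (higher-order) model learns how to combine the level-1 outputs.
stacker    = glm(d$y ~ oof1 + oof2, family = binomial)
pred.stack = predict(stacker, type = "response")
```

The key design choice is that the level-2 model only ever sees out-of-fold predictions; if it trained on in-fold predictions it would simply learn to trust whichever level-1 model overfits hardest.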
In the following section I will show an example of how to perform a Dark Ensemble using the adult census data from the previous posts.
As in the previous post, let's load the data, build/train a model, make predictions on the test set, and assess the accuracy of a single model.
> library(pROC); library(xgboost)
> # wrap the training/test splits in xgb.DMatrix objects
> dtrain = xgb.DMatrix(data.matrix(dat[tr.idx,]), label = y.tr)
> dtest = xgb.DMatrix(data.matrix(dat[-tr.idx,]))
> model1 = xgb.train(data = dtrain, objective = "binary:logistic"
,eta = 0.1, nrounds = 1000, eval_metric = "auc", print_every_n = 50
,watchlist = list(val = dtrain), verbose = 1)
> pred.xgb = predict(model1, dtest)
> roc(y.te, pred.xgb)$auc
The AUC for model1 is 92.13%. Now let's build a couple more models by tweaking some parameters, assigning a different random seed, and mixing in the random forest algorithm.
> set.seed(779)
> library(randomForest)
> # random forest on the same split; keep the class-1 probability column
> model2 = randomForest(data.matrix(dat[tr.idx,])
,as.factor(y.tr), ntree = 200, mtry = 3)
> pred.rf = predict(model2, data.matrix(dat[-tr.idx,]), type = "prob")[,2]
> roc(y.te, pred.rf)$auc
> set.seed(121)
> # a second xgboost model: higher learning rate, column subsampling
> model3 = xgb.train(data = dtrain, objective = "binary:logistic"
,eta = 0.2, nrounds = 800, eval_metric = "auc", print_every_n = 50
,colsample_bytree = 0.7, watchlist = list(val = dtrain), verbose = 1)
> pred.xgb2 = predict(model3, dtest)
> roc(y.te, pred.xgb2)$auc
> # dark ensemble: weighted average of the three prediction vectors
> pred.avg = 0.5 * pred.xgb + 0.2 * pred.rf + 0.3 * pred.xgb2
> roc(y.te, pred.avg)$auc
In this example we take the weighted average of the three models, with arbitrarily chosen weights (0.5, 0.2, 0.3). The final predictions come out slightly more accurate than any single model from this dark ensemble method. As mentioned earlier, this method is hard to tune because the space of possible weights is infinite, but in most cases it will give you an added boost to your accuracy in the simplest possible way. A white ensemble will take your accuracy further, but designing and setting up a multi-stage stacked model is a much longer road.
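Rather than guessing a single weight triple, one can scan a coarse grid and keep whichever weights score best on a held-out validation set. The sketch below is an illustration with simulated stand-in vectors (`y.val`, `p1`, `p2`, `p3` are hypothetical, and the rank-based `auc` helper is mine, included so the snippet needs no packages); in practice you would plug in the real `pred.xgb`, `pred.rf`, `pred.xgb2`, and labels:

```r
# Mann-Whitney (rank-sum) form of AUC, to avoid any package dependency.
auc = function(y, p) {
  r = rank(p); n1 = sum(y == 1); n0 = length(y) - n1
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(7)
y.val = rbinom(100, 1, 0.5)                     # stand-in validation labels
p1 = runif(100); p2 = runif(100); p3 = runif(100)   # stand-in predictions

# Coarse grid search over weight triples that sum to 1.
best = list(auc = -1, w = NULL)
for (w1 in seq(0, 1, 0.1)) for (w2 in seq(0, 1 - w1, 0.1)) {
  w3 = 1 - w1 - w2
  a  = auc(y.val, w1 * p1 + w2 * p2 + w3 * p3)
  if (a > best$auc) best = list(auc = a, w = c(w1, w2, w3))
}
best$w   # the grid point with the highest validation AUC
```

This turns dark-ensemble guesswork into a cheap search, though the weights should always be scored on data the base models did not train on, or the search will just reward overfitting.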
In summary, it is worth running a few different prediction models (generalized linear model, random forest, gradient boosting, decision trees, SVM, neural networks, etc.) to test which model fits your data, and remember to store those predictions for a dark ensemble. If your data quality is not good enough to attain an acceptable accuracy, you might want to consider adopting the white ensemble approach.