Predictive Modelling (Part 2/3): Multiple Models & Accuracy Assessments
Predictive modelling using open-source R can be achieved by writing just 2 lines of code, as shown in my previous post. The model that was implemented is a recent state-of-the-art machine learning algorithm called XGBoost (eXtreme Gradient Boosting), a close relative of the gradient boosting machine (GBM) but with much better control of over-fitting thanks to stronger regularization. It has gained enormous popularity among data scientists worldwide, thanks in large part to the Otto Kaggle challenge and to its author. The ideas behind predictive modelling and machine learning algorithms are the same, except that predictive modelling takes the form of supervised learning, as opposed to unsupervised learning. There are other classical supervised machine learning models as well, including generalized linear models, random forests, decision trees and K-Nearest-Neighbours (KNN). With this myriad of supervised predictive models, which model is best suited to answer a particular business question? Which model(s) will provide the best predictions, and how do we interpret those predictions?
You will find answers to these questions, with illustrations, in the next few sections.
The same adult census data set will be used as our sample data to test different predictive models for predicting 2 income groups (<=50k, >50k) from the demographics of working adults. In the previous post (Part 1), the data set was partitioned into an 80% train set and a 20% test set. Let's build and train the XGBoost prediction model first:
> library(pROC);library(xgboost)
> # tr.idx, y.tr and y.te were created when the data was partitioned in Part 1
> dtrain=xgb.DMatrix(data.matrix(dat[tr.idx,]),label=y.tr)
> dtest=xgb.DMatrix(data.matrix(dat[-tr.idx,]))
> model1=xgb.train(data=dtrain,objective="binary:logistic"
,eta=0.1,nrounds=1000,eval_metric="auc",print_every_n=50
,watchlist=list(val=dtrain),verbose=1)
> pred.xgb=predict(model1,dtest)  # predicted probabilities for the 20% test set
> roc(y.te, pred.xgb)$auc         # AUROC on the held-out test set
There are many different model evaluation metrics, but for binary classification (two outcomes), the evaluation metric is typically one of the following: Area Under the Receiver Operating Characteristic curve (AUROC), Classification Rate, Confusion Matrix, Logarithmic Loss, or Precision and Recall. As shown here, the AUROC score for this model is 92.13%. Let's build a few more models and assess their AUROC scores:
> library(randomForest)
> # labels must be a factor for classification; type="prob" returns class probabilities
> model2=randomForest(data.matrix(dat[tr.idx,])
,as.factor(y.tr),ntree=200,mtry=3)
> pred.rf=predict(model2,data.matrix(dat[-tr.idx,]),type="prob")[,2]  # probability of the positive class
> roc(y.te, pred.rf)$auc
And the AUROC score for the Random Forest method is 90.93%. Not too far off from the XGBoost algorithm! Last but not least, let's switch gears to the Generalized Linear Model (glmnet) algorithm:
> library(caret)
> # caret tunes the elastic-net penalty (alpha, lambda) using its default resampling
> model3=train(data.matrix(dat[tr.idx,]),y.tr,method="glmnet")
> pred.glmnet=predict(model3,data.matrix(dat[-tr.idx,]))
> roc(y.te, pred.glmnet)$auc
It looks like the glmnet model trails the other two by a fair margin, with an AUROC score of 83.56%!
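To make the comparison easier to see, the three ROC curves can be overlaid on one plot. This is a minimal sketch using pROC objects built from the predictions above (pred.xgb, pred.rf, pred.glmnet and the observed labels y.te); the colours and legend text are just illustrative:
> roc.xgb=roc(y.te, pred.xgb)
> roc.rf=roc(y.te, pred.rf)
> roc.glmnet=roc(y.te, pred.glmnet)
> plot(roc.xgb, col="red")           # plot the first curve, then add the others
> lines(roc.rf, col="blue")
> lines(roc.glmnet, col="darkgreen")
> legend("bottomright", legend=c("xgboost","random forest","glmnet"),
col=c("red","blue","darkgreen"), lty=1)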
So what does the AUROC score mean for each of those models? How does it shape the user's perceived model performance? It is not simple arithmetic where a 90% AUROC score translates into 90% correct predictions!
Data scientists who cannot clearly articulate to stakeholders what a prediction model's accuracy means are likely to end up facing more questions and confusion. Now, let's take a closer look at what these 80%~90% accuracy figures mean under AUROC evaluation.
In order to explain AUROC effectively, we must first grasp the basics of statistical hypothesis errors (type I and type II errors), which many of us learned during our graduate programs.
1) Type I error: False Positive (FP), a situation where the model predicts an event that did not happen (e.g. a health screening predicts that a person has cancer, but it turns out to be a false alarm)
2) Type II error: False Negative (FN), a situation where the model fails to pick up an event (e.g. the health screening reports that a person is completely cancer-free, but a detailed review later shows that the person does have cancer). Both error types show up as the off-diagonal cells of a confusion matrix, as sketched right after this list.
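As a concrete illustration (a minimal sketch, not taken from the original code), the xgboost probabilities from earlier can be turned into a confusion matrix at an assumed 0.5 cutoff, assuming y.te is coded 0/1 with 1 as the positive class (>50k); the two off-diagonal cells are exactly the type I and type II errors:
> pred.class=ifelse(pred.xgb > 0.5, 1, 0)       # assumed 0.5 cutoff for illustration
> table(observed=y.te, predicted=pred.class)    # rows = observed, columns = predicted
> # cell (observed=0, predicted=1) counts false positives (type I errors)
> # cell (observed=1, predicted=0) counts false negatives (type II errors)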
In contrast to FP and FN, there are True Positives (TP) and True Negatives (TN), where the algorithm correctly predicts an event or a non-event; the ultimate goal for a prediction algorithm is to produce 100% TP and TN. Taking the concept one step further, we need to know the corresponding rates, defined as:
TPR (True Positive Rate) = TP / (TP + FN), the rate at which the algorithm picks up actual events, i.e. the hit rate, given the number of observed events. Let's say that in a given cohort of 1,000 individuals, 10 have cancer. A cancer prediction algorithm might flag 15 individuals as having cancer, but only 8 of them correctly. Thus the hit rate, a.k.a. Recall (TPR), is 8 / (8 + 2) = 0.8, an 80% hit rate. The ideal hit rate, as we all know, is 100%, which means that if I tweak my algorithm to be very sensitive and flag any possibility of a person having cancer, I am more likely to reach a 100% hit rate. My friend, you are correct, but your algorithm might just scare off everyone who takes your test once they realise how likely it is that your model is simply too 'pessimistic'.
On the other hand, FPR (False Positive Rate) = FP / (FP + TN), the rate at which the algorithm predicts a false hit, given the number of non-events. In the previous example, out of 15 cancer predictions only 8 were correct and 7 were wrong; the population contains 990 individuals without cancer, of whom 7 were wrongly flagged and 983 were correctly cleared, so the FPR is 7 / (7 + 983) = 7/990, roughly 0.007, or 0.7%. Notice that this measure works in the opposite direction to the previous rate, and the aim is to get it as close to 0% as possible. Again, you might be pondering that if you make your model less sensitive to detecting cancer, you will drive this rate towards 0% and make your customers happy and optimistic by telling them they don't have cancer. However, you may have made them too happy too soon when they later discover that they do have cancer and the test failed to pick it up. I call this the 'optimistic' index.
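For a quick sanity check, the same arithmetic can be written out in R. The counts below are the illustrative cancer-cohort numbers from the example above, not from the census data:
> TP=8; FN=2; FP=7; TN=983   # 10 true cancers, 15 flagged, 8 flagged correctly, 983 correctly cleared
> TP/(TP+FN)                 # TPR (hit rate / Recall) = 0.8
> FP/(FP+TN)                 # FPR = 7/990, roughly 0.007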
Now that we understand what FPR and TPR are and how they relate to each other, plotting them on a graph looks something like this:
A perfect classification would have a TPR of 1 and an FPR of 0, shown at the top left-hand corner of the graph. Stepping back to the predictions from our model: the generated probabilities range from 0 to 1, and the million-dollar question (the one that will make you appear intelligent and knowledgeable) is, "What does a prediction of 0.5 mean as opposed to 0.8, 0.99 or 0.1? How do I differentiate a probability of 0.65 from one of 0.85? How do I relate an AUROC score of 0.92 to all the probabilities that this model generated?" We covered the definitions of TPR and FPR earlier; to calculate the FPR and TPR needed to plot a point on the ROC curve, we must convert those probabilities back to 0s and 1s.
Following our previous example, to calculate the TPR and FPR and plot a point on the ROC curve, we need to set a threshold that turns the probabilities into 0s and 1s. Let's take 0.5 as our cutoff: we then have 5,110 predictions for income group <=50k and 1,402 predictions for income group >50k. The calculated TPR and FPR are 0.658 and 0.072. If you place these coordinates on the ROC curve, the point sits somewhere towards the top left (close to the perfect model). Repeat this calculation for TPR and FPR over different cutoff values, 0.01, 0.02, 0.03, ..., 0.99, and you will have about 99 points on the ROC curve. Connect those points and you obtain a curve that rises steeply and then flattens out towards the top right. The area under this curve is the AUROC.
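The whole procedure can be written out in a few lines of R. This is a minimal sketch (not the code behind pROC), assuming y.te holds the 0/1 test labels and pred.xgb the probabilities from earlier; the area is approximated with a simple trapezoidal rule over the cutoff grid, so it should land close to, but not exactly on, the roc() result above:
> cutoffs=seq(0.01, 0.99, by=0.01)   # the ~99 cutoff values from the text
> tpr=sapply(cutoffs, function(k) sum(pred.xgb > k & y.te == 1) / sum(y.te == 1))
> fpr=sapply(cutoffs, function(k) sum(pred.xgb > k & y.te == 0) / sum(y.te == 0))
> plot(fpr, tpr, type="l", xlab="FPR", ylab="TPR")  # the hand-rolled ROC curve
> abline(0, 1, lty=2)                               # random-guess diagonal for reference
> # add the (1,1) and (0,0) end points, then approximate the area with the trapezoidal rule
> fpr=c(1, fpr, 0); tpr=c(1, tpr, 0)
> sum(-diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)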