Predictive Modelling (Part 2/3): Multiple Models & Accuracy Assessments
Predictive modelling using open-source R can be achieved by writing just 2 lines of code, as shown in my previous post. The model that was implemented is a recent state-of-the-art machine learning algorithm called XGBoost (eXtreme Gradient Boosting), a close relative of the gradient boosting machine (GBM) but with much better control of over-fitting thanks to stronger regularization. It has gained enormous popularity among data scientists worldwide, thanks in large part to the Otto Kaggle challenge and to its author. The ideas behind predictive modelling and machine learning algorithms are the same, except that predictive modelling takes the form of supervised learning, as opposed to unsupervised learning. There are other classical supervised machine learning models as well, including generalized linear models, random forests, decision trees and K-Nearest-Neighbours (KNN). With this myriad of supervised predictive models, which model is best suited to answer a particular business question? Which model(s) will provide the best predictions, and how do we interpret those predictions?
You will find answers to these questions, with illustrations, in the next few sections.
The same adult census data set will be used as our sample data to test different predictive models for predicting 2 income groups (<=50k, >50k) from the demographics of working adults. In the previous post (Part 1), the data set was partitioned into an 80% train set and a 20% test set. Let's build and train the XGBoost prediction model first:
> library(pROC);library(xgboost)
> # tr.idx, y.tr and y.te were created when the data was partitioned in Part 1
> dtrain=xgb.DMatrix(data.matrix(dat[tr.idx,]),label=y.tr)
> dtest=xgb.DMatrix(data.matrix(dat[-tr.idx,]))
> model1=xgb.train(data=dtrain,objective="binary:logistic"
,eta=0.1,nrounds=1000,eval_metric="auc",print_every_n=50
,watchlist=list(val=dtrain),verbose=1)
> pred.xgb=predict(model1,dtest)  # predicted probabilities for the 20% test set
> roc(y.te, pred.xgb)$auc         # AUROC on the held-out test set
There are many different model evaluation metrics, but for binary classification (two outcomes), the evaluation metric is typically one of the following: Area Under the Receiver Operating Characteristic curve (AUROC), Classification Rate, Confusion Matrix, Logarithmic Loss, or Precision and Recall. As shown here, the AUROC score for this model is 92.13%. Let's build a few more models and assess their AUROC scores:
> library(randomForest)
> # labels must be a factor for classification; type="prob" returns class probabilities
> model2=randomForest(data.matrix(dat[tr.idx,])
,as.factor(y.tr),ntree=200,mtry=3)
> pred.rf=predict(model2,data.matrix(dat[-tr.idx,]),type="prob")[,2]  # probability of the positive class
> roc(y.te, pred.rf)$auc
And the AUROC score for the Random Forest method is 90.93%. Not too far off from the XGBoost algorithm! Last but not least, let's switch gears to the Generalized Linear Model (glmnet) algorithm:
> library(caret)
> # caret tunes the elastic-net penalty (alpha, lambda) using its default resampling
> model3=train(data.matrix(dat[tr.idx,]),y.tr,method="glmnet")
> pred.glmnet=predict(model3,data.matrix(dat[-tr.idx,]))
> roc(y.te, pred.glmnet)$auc
It looks like the glmnet model trails the other two by a fair margin, with an AUROC score of 83.56%!
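To make the comparison easier to see, the three ROC curves can be overlaid on one plot. This is a minimal sketch using pROC objects built from the predictions above (pred.xgb, pred.rf, pred.glmnet and the observed labels y.te); the colours and legend text are just illustrative:
> roc.xgb=roc(y.te, pred.xgb)
> roc.rf=roc(y.te, pred.rf)
> roc.glmnet=roc(y.te, pred.glmnet)
> plot(roc.xgb, col="red")           # plot the first curve, then add the others
> lines(roc.rf, col="blue")
> lines(roc.glmnet, col="darkgreen")
> legend("bottomright", legend=c("xgboost","random forest","glmnet"),
col=c("red","blue","darkgreen"), lty=1)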
So what does the AUROC score mean for each of those models? How does it shape the user's perceived model performance? It is not simple arithmetic where a 90% AUROC score translates into 90% correct predictions!
Data scientists who cannot clearly articulate to stakeholders what a prediction model's accuracy means are likely to end up facing more questions and confusion. Now, let's take a closer look at what these 80%~90% accuracy figures mean under AUROC evaluation.
In order to explain AUROC effectively, we must first grasp the basics of statistical hypothesis errors (type I and type II errors), which many of us learned during our graduate programs.
1) Type I error: False Positive (FP), a situation where the model predicts an event that did not happen (e.g. a health screening predicts that a person has cancer, but it turns out to be a false alarm)
2) Type II error: False Negative (FN), a situation where the model fails to pick up an event (e.g. the health screening reports that a person is completely cancer-free, but a detailed review later shows that the person does have cancer). Both error types show up as the off-diagonal cells of a confusion matrix, as sketched right after this list.
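As a concrete illustration (a minimal sketch, not taken from the original code), the xgboost probabilities from earlier can be turned into a confusion matrix at an assumed 0.5 cutoff, assuming y.te is coded 0/1 with 1 as the positive class (>50k); the two off-diagonal cells are exactly the type I and type II errors:
> pred.class=ifelse(pred.xgb > 0.5, 1, 0)       # assumed 0.5 cutoff for illustration
> table(observed=y.te, predicted=pred.class)    # rows = observed, columns = predicted
> # cell (observed=0, predicted=1) counts false positives (type I errors)
> # cell (observed=1, predicted=0) counts false negatives (type II errors)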
In contrast to FP and FN, there are True Positives (TP) and True Negatives (TN), where the algorithm correctly predicts an event or a non-event; the ultimate goal for a prediction algorithm is to produce 100% TP and TN. Taking the concept one step further, we need to know the corresponding rates, defined as:
TPR (True Positive Rate) = TP / (TP + FN), the rate at which the algorithm picks up actual events, i.e. the hit rate, given the number of observed events. Let's say that in a given cohort of 1,000 individuals, 10 have cancer. A cancer prediction algorithm might flag 15 individuals as having cancer, but only 8 of them correctly. Thus the hit rate, a.k.a. Recall (TPR), is 8 / (8 + 2) = 0.8, an 80% hit rate. The ideal hit rate, as we all know, is 100%, which means that if I tweak my algorithm to be very sensitive and flag any possibility of a person having cancer, I am more likely to reach a 100% hit rate. My friend, you are correct, but your algorithm might just scare off everyone who takes your test once they realise how likely it is that your model is simply too 'pessimistic'.
On the other hand, FPR (False Positive Rate) = FP / (FP + TN), the rate at which the algorithm predicts a false hit, given the number of non-events. In the previous example, out of 15 cancer predictions only 8 were correct and 7 were wrong; the population contains 990 individuals without cancer, of whom 7 were wrongly flagged and 983 were correctly cleared, so the FPR is 7 / (7 + 983) = 7/990, roughly 0.007, or 0.7%. Notice that this measure works in the opposite direction to the previous rate, and the aim is to get it as close to 0% as possible. Again, you might be pondering that if you make your model less sensitive to detecting cancer, you will drive this rate towards 0% and make your customers happy and optimistic by telling them they don't have cancer. However, you may have made them too happy too soon when they later discover that they do have cancer and the test failed to pick it up. I call this the 'optimistic' index.
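For a quick sanity check, the same arithmetic can be written out in R. The counts below are the illustrative cancer-cohort numbers from the example above, not from the census data:
> TP=8; FN=2; FP=7; TN=983   # 10 true cancers, 15 flagged, 8 flagged correctly, 983 correctly cleared
> TP/(TP+FN)                 # TPR (hit rate / Recall) = 0.8
> FP/(FP+TN)                 # FPR = 7/990, roughly 0.007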
Now that we understand what FPR and TPR are and how they relate to each other, plotting them on a graph looks something like this:
A perfect classification would have a TPR of 1 and an FPR of 0, shown at the top left-hand corner of the graph. Stepping back to the predictions from our model: the generated probabilities range from 0 to 1, and the million-dollar question (the one that will make you appear intelligent and knowledgeable) is, "What does a prediction of 0.5 mean as opposed to 0.8, 0.99 or 0.1? How do I differentiate a probability of 0.65 from one of 0.85? How do I relate an AUROC score of 0.92 to all the probabilities that this model generated?" We covered the definitions of TPR and FPR earlier; to calculate the FPR and TPR needed to plot a point on the ROC curve, we must convert those probabilities back to 0s and 1s.
Following our previous example, to calculate the TPR and FPR and plot a point on the ROC curve, we need to set a threshold that turns the probabilities into 0s and 1s. Let's take 0.5 as our cutoff: we then have 5,110 predictions for income group <=50k and 1,402 predictions for income group >50k. The calculated TPR and FPR are 0.658 and 0.072. If you place these coordinates on the ROC curve, the point sits somewhere towards the top left (close to the perfect model). Repeat this calculation for TPR and FPR over different cutoff values, 0.01, 0.02, 0.03, ..., 0.99, and you will have about 99 points on the ROC curve. Connect those points and you obtain a curve that rises steeply and then flattens out towards the top right. The area under this curve is the AUROC.
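The whole procedure can be written out in a few lines of R. This is a minimal sketch (not the code behind pROC), assuming y.te holds the 0/1 test labels and pred.xgb the probabilities from earlier; the area is approximated with a simple trapezoidal rule over the cutoff grid, so it should land close to, but not exactly on, the roc() result above:
> cutoffs=seq(0.01, 0.99, by=0.01)   # the ~99 cutoff values from the text
> tpr=sapply(cutoffs, function(k) sum(pred.xgb > k & y.te == 1) / sum(y.te == 1))
> fpr=sapply(cutoffs, function(k) sum(pred.xgb > k & y.te == 0) / sum(y.te == 0))
> plot(fpr, tpr, type="l", xlab="FPR", ylab="TPR")  # the hand-rolled ROC curve
> abline(0, 1, lty=2)                               # random-guess diagonal for reference
> # add the (1,1) and (0,0) end points, then approximate the area with the trapezoidal rule
> fpr=c(1, fpr, 0); tpr=c(1, tpr, 0)
> sum(-diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)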