[Image: Charles Xavier in X-Men: First Class (credit: hypersonic55.wordpress.com)]

Predictive Modelling (Part 1/3): In 2 lines of R code

Predictive modelling is often seen as something of a 'black box', a 'dark art', 'magic', or a 'fortune-telling' trick. A data scientist who could design an algorithm capable of 100% accurate predictions would be assured of red-carpet treatment by companies; it would be like a Marvel superhero, Charles Xavier, turning into reality. Predictive models are therefore often misconstrued as extremely complex algorithms that demand a superhuman ability to write the code behind them. There might have been some truth to that assumption a decade ago, when there were no pre-built functions and packages available for building and training predictive models. Over the past few years, however, statisticians, scientists, and engineers have put tremendous R&D effort into creating novel methods for statistical analysis and modelling, driven largely by the biomedical industry's demand for breakthroughs in drug discovery and cures for emerging cancers and diseases. To move rapidly in drug testing, with biomedical R&D at full throttle and terabytes of genetic and molecular-level DNA and RNA data to digest, mathematically complex statistical models have had to be packaged into manageable functions and modules that can be called and deployed efficiently. In the remaining sections of this article, I will give an example of how easy it is to implement an efficient predictive model, xgboost (eXtreme Gradient Boosting), in R.

The sample dataset for this demo can be downloaded from UCI: Adult Data Set. It contains demographic information for a group of working adults, and the predictive task is to determine whether a person makes more than $50,000 a year.
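
If you prefer to fetch the file from within R, a one-off download sketch is shown below; the URL points at the standard UCI machine-learning repository location and may change over time:

> download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data","adult.data")   # saves the raw csv as adult.data in the working directory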

The first few lines of R code below perform the minimal ETL (extract, transform, load) and initialization. The code to implement the predictive model is a single line, plus one more line to generate the predictions. Let's do some initialization and load the dataset:

> library(xgboost); library(caret); library(dplyr); library(AUC)
> options(scipen=10); set.seed(794)
> dat=read.table("adult.data",stringsAsFactors=F,header=F,sep=',')   # adult.data ships without a header row
> colnames(dat)=c("age","workclass","fnlwgt","edu","edu-num","marital","occ","relation","race","sex","cap-gain","cap-loss","hours-pw","country","income-rng")
> dat=data.frame(dat); str(dat)   # data.frame() converts the hyphens to dots, e.g. "income-rng" becomes "income.rng"

From the summary above, we have 6 integer variables and 9 categorical (character) variables. The variable "income.rng" contains only 2 categories, less than or equal to 50K and greater than 50K; it will later be converted into a binary class and will be the target we want to predict. Next, prepare summary statistics for the data:

> dfinfo=data.frame(featname=colnames(dat))
> dfinfo$feat.class=sapply(dat,function(x) class(x))
> dfinfo$ndistinct=sapply(dat,function(x) n_distinct(x))
> dfinfo$na=sapply(dat,function(x) sum(is.na(x)))
> y=as.numeric(as.factor(dat$income.rng))   # 1 = "<=50K", 2 = ">50K"
> table(y,dat$income.rng);dfinfo[1:10,]
> dat$income.rng=NULL;dfinfo=dfinfo[-15,]   # drop the target from the feature set and from the summary table

The table above shows the first 10 variables and very brief summary statistics for each variable/feature. Next, let's partition the pre-processed dataset into a train set and a test set: 80% for building and training the predictive model and 20% for testing the accuracy of the built model:

> tr.idx=createDataPartition(y,p=0.8)$Resample1
> newdat=dat
> for (f in dfinfo$featname[which(dfinfo$feat.class=="character")]) {
   newdat[[f]]=as.numeric(as.factor(newdat[[f]]))}   # label-encode every character column
> y.tr=y[tr.idx]-1;y.te=y[-tr.idx]-1   # shift the labels to 0/1 as xgboost expects
> dtrain=xgb.DMatrix(data.matrix(newdat[tr.idx,]),label=y.tr)
> dtest=xgb.DMatrix(data.matrix(newdat[-tr.idx,]))

Once the pre-processed dataset has been partitioned into a train set and a test set, we can write the single line of code that calls the gradient boosted tree model:

> model1=xgb.train(data=dtrain,objective="binary:logistic"
,eta=0.1,nrounds=1000,eval_metric="auc",print.every.n=50
,watchlist=list(val=dtrain),verbose=1)   # note: newer xgboost releases name this argument print_every_n

We are running 1,000 boosting iterations to train the predictive model, and notice that this training step takes less than a minute to complete on an Intel Core i7 laptop. The model's accuracy on the training set (AUC evaluation) is close to 99%!
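
If you would rather read the training AUC programmatically than off the console log, a minimal sketch is shown below; it assumes a reasonably recent xgboost release where the fitted booster stores the watchlist evaluations in an evaluation_log data.table (older versions exposed this differently):

> tail(model1$evaluation_log)                 # iter and val_auc columns for the watchlist named "val"
> max(model1$evaluation_log$val_auc)          # best AUC observed on the training watchlist

Let's deploy our model now on our test dataset: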

> pred1=predict(model1,dtest)                 # predicted probabilities for the held-out 20%
> auc(roc(pred1,as.factor(y.te)))             # AUC package: roc(predictions, labels)

and the AUC we achieve on the held-out test set is...

Voila! An amazing 92.1%! At this point you might be thinking: hmm, not too shabby, my AUC is 92.1%, only 7.9% short of becoming Charles Xavier. Well, hold on to that thought, because this evaluation metric (AUC) does not mean that 92.1% of the predictions are correct. It is calculated from the False Positive Rate (the Type I error rate, i.e. 1 - specificity) and the True Positive Rate (sensitivity). AUC/ROC/sensitivity/specificity/TPR/FPR is a separate topic that will be discussed in a later post.
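
To see the distinction concretely, we can threshold the predicted probabilities and tabulate a confusion matrix on the test set; a quick sketch (the 0.5 cut-off is an arbitrary choice for illustration):

> pred.class=as.numeric(pred1>0.5)            # hard 0/1 predictions at a 0.5 cut-off
> table(predicted=pred.class,actual=y.te)     # confusion matrix on the held-out test set
> mean(pred.class==y.te)                      # raw accuracy, which is not the same quantity as AUC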

Last but not least, to summarize what is going on with the dataset and why the predictive model works, we can explore the feature importance plot:

> f.names=colnames(data.matrix(newdat[tr.idx,]))
> impt=xgb.importance(feature_names=f.names,model=model1)
> xgb.plot.importance(impt)

The top 3 most important features are relation, capital gain, and education number. If a detailed exploratory data analysis (a separate discussion in itself) were performed, the data would show that "Husband" in the relation feature is strongly associated with our target, and if a chi-square test were performed, the p-value would very likely fall below the 0.05 significance level. If we fit a logistic regression of the target on capital gain and education number, those two features would appear significant as well. New features, for example the average education number for each relation category or the average capital gain per education level, might help the predictive model learn earlier in the iterations and thus improve performance. This technique is known as feature engineering, a widely adopted practice among top-notch data scientists that requires profound knowledge and experience in data science.
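
A quick sketch of those checks, run against the original (un-encoded) data frame; the column names cap.gain, edu.num, and relation assume the dot-converted names produced by data.frame() earlier, and the derived feature is only one illustrative example:

> ybin=y-1                                                        # 0 = "<=50K", 1 = ">50K"
> chisq.test(table(dat$relation,ybin))                            # association between relation and the target
> summary(glm(ybin~cap.gain+edu.num,data=dat,family=binomial))   # both coefficients should show up as significant
> newdat$avg.edu.by.rel=ave(newdat$edu.num,newdat$relation)       # example engineered feature (in practice, create it before the train/test split): mean edu.num per relation group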

@Clive, the target class is indeed imbalanced at roughly 75%/25% (which is usually the case for real data; you won't get a 50%/50% split for, say, churn/attrition rates). However, the createDataPartition() function in the caret package partitions the data in a stratified manner, so that the ~75%/25% class frequency is preserved in both the training and test sets. This concept is critical in any cancer/disease research where allele frequencies at every locus in a genome are examined based on case/control status. Thank you all for appreciating the post; I will write a sequel, and in my next post I will talk about assessing prediction performance.
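
A quick way to verify the stratification is to compare the class proportions on each side of the split; a small sketch using the objects defined above:

> prop.table(table(y.tr))   # class frequencies in the 80% training partition
> prop.table(table(y.te))   # class frequencies in the 20% test partition; both should sit near 75%/25%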

Clive Coelho

Staff Data Engineer at CalAmp

8 yr

Nice. However, the data has a class imbalance problem. After fixing it the results would change.

Alan Grogan

Senior executive | Accenture, Avanade, KBS | Data & AI business leader | Intelligent Data & Information Platform Services

8 yr

We need more of this content on LinkedIn and far less of the ridiculous maths quizzes.
