Predictive Modelling (Part 1/3): In 2 lines of R code
Predictive modelling is often seen as something of a 'black box', a 'dark art', 'magic', or a 'fortune-telling' trick. A data scientist who could design an algorithm delivering 100% accurate predictions would be assured a red-carpet welcome by companies; it would be like a Marvel superhero, Charles Xavier, stepping into reality. Predictive modelling is therefore often misconstrued as a family of extremely complex algorithms that demand superhuman coding ability. There may have been some truth to that assumption a decade ago, when there were no pre-built functions or packages for building and training predictive models. Over the past few years, however, statisticians, scientists, and engineers have put tremendous R&D effort into novel methods for statistical analysis and modelling, driven largely by the biomedical industry's demand for breakthroughs in drug discovery and in cures for emerging cancers and diseases. To move rapidly in drug testing while digesting terabytes of genetic and molecular-level DNA and RNA data, mathematically complex statistical models have had to be packaged into manageable functions and modules that can be called and deployed efficiently. In the remaining sections of this article, I will show how easy it is to implement an efficient predictive model, xgboost (eXtreme Gradient Boosting), in R.
The sample dataset for this demo can be downloaded from the UCI Machine Learning Repository: the Adult Data Set. It contains demographic information on a group of working adults, and the predictive task is to determine whether a person makes more than $50,000 a year.
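If you prefer to pull the file straight into your working directory from within R, here is a minimal sketch (the URL below assumes the usual layout of the UCI repository and may need adjusting if the file has moved):
> download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                destfile="adult.data")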
The first few lines of R code below perform the minimal ETL (extract, transform, load) and initialization. The code implementing the predictive model itself is a single line, plus one more line of code to generate predictions. Let's do some initialization and load the dataset:
> library(xgboost); library(caret); library(dplyr); library(AUC)
> options(scipen=10); set.seed(794)
> # the UCI file has no header row, so the column names are assigned manually below
> dat=read.table("adult.data",stringsAsFactors=F,header=F,sep=',')
> colnames(dat)=c("age","workclass","fnlwgt","edu","edu-num","marital","occ","relation","race","sex","cap-gain","cap-loss","hours-pw","country","income-rng")
> dat=data.frame(dat); str(dat)   # data.frame() cleans the hyphenated names, e.g. "income-rng" becomes income.rng
From the summary above, we have 6 integer variables and 9 categorical (character) variables. The variable "income.rng" contains only two categories, "<=50K" and ">50K"; it will later be converted into a binary class and is the target we want to predict. Next, prepare some summary statistics for the data:
> # per-feature summary: R class, number of distinct values, and missing-value count
> dfinfo=data.frame(featname=colnames(dat))
> dfinfo$feat.class=sapply(dat,function(x) class(x))
> dfinfo$ndistinct=sapply(dat,function(x) n_distinct(x))
> dfinfo$na=sapply(dat,function(x) sum(is.na(x)))
> # encode the target: "<=50K" becomes 1 and ">50K" becomes 2 (shifted to 0/1 later)
> y=as.numeric(as.factor(dat$income.rng))
> table(y,dat$income.rng); dfinfo[1:10,]
> dat$income.rng=NULL; dfinfo=dfinfo[-15,]   # drop the target from the feature set
The table above shows the first 10 variables and some very brief summary statistics for each feature. Next, let's partition the pre-processed dataset into a train set and a test set: 80% for building and training the predictive model and 20% for testing the accuracy of the fitted model:
> # stratified 80/20 split of the row indices
> tr.idx=createDataPartition(y,p=0.8)$Resample1
> newdat=dat
> # convert each character feature to integer codes via its factor levels
> for (f in dfinfo$featname[which(dfinfo$feat.class=="character")]) {
newdat[[f]]=as.numeric(as.factor(newdat[[f]]))}
> # shift the target from 1/2 to 0/1 as required by binary:logistic
> y.tr=y[tr.idx]-1; y.te=y[-tr.idx]-1
> dtrain=xgb.DMatrix(data.matrix(newdat[tr.idx,]),label=y.tr)
> dtest=xgb.DMatrix(data.matrix(newdat[-tr.idx,]))
Once the pre-processed dataset has been partitioned into train and test sets, we can write the single line of code that trains the gradient boosted tree model:
> # train for 1,000 boosting rounds; the watchlist reports the AUC on the training data every 50 rounds
> model1=xgb.train(data=dtrain,objective="binary:logistic"
,eta=0.1,nrounds=1000,eval_metric="auc",print.every.n=50
,watchlist=list(val=dtrain),verbose=1)
We run 1,000 boosting iterations, and this training step takes less than a minute to complete on an Intel Core i7 laptop. The AUC reported by the watchlist (which here evaluates on the training data itself) is close to 99%!
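If you want to reproduce that watchlist figure yourself, here is a quick sanity-check sketch; it simply re-scores the training matrix using the objects defined above:
> pred.tr=predict(model1,dtrain)      # predicted probabilities for the training rows
> auc(roc(pred.tr,as.factor(y.tr)))   # should be very close to the last printed val-auc
Bear in mind that this is an AUC measured on the very data the model was trained on. Let's deploy our model now on our test dataset: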
> pred1=predict(model1,dtest)
> auc(roc(pred1,as.factor(y.te)))
and the AUC we achieved is,
Voila! An impressive 92.1%! At this point you might be thinking: hmm, not too shabby, my AUC is 92.1%, only 7.9% away from becoming Charles Xavier. Hold on to that thought, because this evaluation metric (AUC) does not mean that 92.1% of the predictions are correct. It is calculated from the trade-off between the False Positive Rate (the Type I error rate, which equals 1 minus specificity) and the True Positive Rate (sensitivity). AUC, ROC, sensitivity, specificity, TPR, and FPR are a separate topic that will be discussed in a later post.
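To make the distinction concrete, here is a small sketch; the 0.5 cut-off is an arbitrary choice of mine rather than something the model prescribes. It turns the test-set probabilities into hard classes and computes accuracy, sensitivity, and specificity at that single threshold:
> pred.class=ifelse(pred1>0.5,1,0)                # hard 0/1 predictions at a 0.5 threshold
> cm=table(predicted=pred.class,actual=y.te); cm  # confusion matrix
> sum(diag(cm))/sum(cm)    # accuracy at this threshold
> cm[2,2]/sum(cm[,2])      # sensitivity (TPR)
> cm[1,1]/sum(cm[,1])      # specificity, i.e. 1 - FPR
The accuracy obtained here will generally differ from the 92.1% figure, because the AUC summarizes the TPR/FPR trade-off across all possible thresholds rather than at one fixed cut-off.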
Last but not least, to summarize what is going on in the dataset and why the predictive model works, we can explore the feature importance plot:
> f.names=colnames(data.matrix(newdat[tr.idx,]))
> impt=xgb.importance(feature_names=f.names,model=model1)
> xgb.plot.importance(impt)
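The same ranking can also be read off as a table rather than a plot, using the impt object from above (xgb.importance sorts the features by their Gain):
> impt[1:3,]   # top 3 features by Gain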
The top 3 most important features are relation, capital gain, and education number. If a detailed exploratory data analysis (a separate discussion in itself) were performed, the data would show that the "Husband" category of the relation feature is strongly associated with our target, and a chi-square test would most likely return a p-value well below 0.05. If we fit a logistic regression of the target on capital gain and education number, those two features would appear significant as well. New features, for example the average education level within each relation category or the average capital gain per education level, might help the predictive model learn earlier in the iterations and thus improve performance; a rough sketch of such features is given below. This technique is known as the art of feature engineering, a concept widely adopted among top-notch data scientists, and one that requires deep knowledge and experience in data science.
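As an illustration only, here is a rough sketch of how such aggregate features could be added with dplyr. The new column names (avg.edu.by.relation, avg.capgain.by.edu) are hypothetical names of my own; the source columns follow the cleaned names in newdat (relation, edu.num, cap.gain):
> newdat2 = newdat %>%
    group_by(relation) %>% mutate(avg.edu.by.relation=mean(edu.num)) %>%   # mean education per relation category
    ungroup() %>%
    group_by(edu.num) %>% mutate(avg.capgain.by.edu=mean(cap.gain)) %>%    # mean capital gain per education level
    ungroup() %>% data.frame()
To judge whether such features actually help, you would rebuild dtrain and dtest from newdat2 and retrain the model before comparing AUCs.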
@Clive, the target class is indeed imbalanced at roughly 75%/25%, which is usually the case for real data (you won't get a 50%/50% split for, say, churn/attrition rates). The createDataPartition() function in the caret package partitions the data in a stratified manner, so that the ~75% majority-class frequency is preserved in both the training and the test set. This concept is critical in cancer/disease research, where the allele frequencies at every locus in a genome are examined by case/control status. Thank you all for appreciating the post; in the sequel I will talk about assessing a prediction's performance.
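You can verify the stratification yourself with the y.tr and y.te vectors created earlier; both splits should show roughly the same class proportions:
> prop.table(table(y.tr)); prop.table(table(y.te))   # class ratios in the train and test sets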