[Image: Charles Xavier in X-Men: First Class (credit: hypersonic55.wordpress.com)]

Predictive Modelling (Part 1/3): In 2 lines of R code

Predictive modelling is often seen as something of a 'black box', a 'dark art', 'magic', or a 'fortune-telling' trick. A data scientist who could design an algorithm capable of 100% accurate predictions would be assured of red-carpet treatment by companies; it would be like a Marvel superhero, Charles Xavier, turning into reality. Predictive models are therefore often misconstrued as extremely complex algorithms that demand a superhuman ability to write the code behind them. There might have been some truth to that assumption a decade ago, when there were no pre-built functions and packages available for building and training predictive models. Over the past few years, however, statisticians, scientists, and engineers have put tremendous R&D effort into creating novel methods for statistical analysis and modelling, driven largely by the biomedical industry's demand for breakthroughs in drug discovery and cures for emerging cancers and diseases. To move rapidly in drug testing, with biomedical R&D at full throttle and terabytes of genetic and molecular-level DNA and RNA data to digest, mathematically complex statistical models have had to be packaged into manageable functions and modules that can be called and deployed efficiently. In the remaining sections of this article, I will give an example of how easy it is to implement an efficient predictive model, xgboost (eXtreme Gradient Boosting), in R.

The sample dataset for this demo can be downloaded from UCI: Adult Data Set. It contains demographic information for a group of working adults, and the predictive task is to determine whether a person makes more than $50,000 a year.
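
If you prefer to fetch the file from within R, a one-off download sketch is shown below; the URL points at the standard UCI machine-learning repository location and may change over time:

> download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data","adult.data")   # saves the raw csv as adult.data in the working directory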

The first few lines of R code below perform the minimal ETL (extract, transform, load) and initialization. The code to implement the predictive model is a single line, plus one more line to generate the predictions. Let's do some initialization and load the dataset:

> library(xgboost); library(caret); library(dplyr); library(AUC)
> options(scipen=10); set.seed(794)
> dat=read.table("adult.data",stringsAsFactors=F,header=F,sep=',')   # adult.data ships without a header row
> colnames(dat)=c("age","workclass","fnlwgt","edu","edu-num","marital","occ","relation","race","sex","cap-gain","cap-loss","hours-pw","country","income-rng")
> dat=data.frame(dat); str(dat)   # data.frame() converts the hyphens to dots, e.g. "income-rng" becomes "income.rng"

From the summary above, we have 6 integer variables and 9 categorical (character) variables. The variable "income.rng" contains only 2 categories, less than or equal to 50K and greater than 50K; it will later be converted into a binary class and will be the target we want to predict. Next, prepare summary statistics for the data:

> dfinfo=data.frame(featname=colnames(dat))
> dfinfo$feat.class=sapply(dat,function(x) class(x))
> dfinfo$ndistinct=sapply(dat,function(x) n_distinct(x))
> dfinfo$na=sapply(dat,function(x) sum(is.na(x)))
> y=as.numeric(as.factor(dat$income.rng))   # 1 = "<=50K", 2 = ">50K"
> table(y,dat$income.rng);dfinfo[1:10,]
> dat$income.rng=NULL;dfinfo=dfinfo[-15,]   # drop the target from the feature set and from the summary table

The table above shows the first 10 variables and very brief summary statistics for each variable/feature. Next, let's partition the pre-processed dataset into a train set and a test set: 80% for building and training the predictive model and 20% for testing the accuracy of the built model:

> tr.idx=createDataPartition(y,p=0.8)$Resample1
> newdat=dat
> for (f in dfinfo$featname[which(dfinfo$feat.class=="character")]) {
   newdat[[f]]=as.numeric(as.factor(newdat[[f]]))}   # label-encode every character column
> y.tr=y[tr.idx]-1;y.te=y[-tr.idx]-1   # shift the labels to 0/1 as xgboost expects
> dtrain=xgb.DMatrix(data.matrix(newdat[tr.idx,]),label=y.tr)
> dtest=xgb.DMatrix(data.matrix(newdat[-tr.idx,]))

Once the pre-processed dataset has been partitioned into a train set and a test set, we can write the single line of code that calls the gradient boosted tree model:

> model1=xgb.train(data=dtrain,objective="binary:logistic"
,eta=0.1,nrounds=1000,eval_metric="auc",print.every.n=50
,watchlist=list(val=dtrain),verbose=1)   # note: newer xgboost releases name this argument print_every_n

We are running 1,000 boosting iterations to train the predictive model, and notice that this training step takes less than a minute to complete on an Intel Core i7 laptop. The model's accuracy on the training set (AUC evaluation) is close to 99%!
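
If you would rather read the training AUC programmatically than off the console log, a minimal sketch is shown below; it assumes a reasonably recent xgboost release where the fitted booster stores the watchlist evaluations in an evaluation_log data.table (older versions exposed this differently):

> tail(model1$evaluation_log)                 # iter and val_auc columns for the watchlist named "val"
> max(model1$evaluation_log$val_auc)          # best AUC observed on the training watchlist

Let's deploy our model now on our test dataset: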

> pred1=predict(model1,dtest)                 # predicted probabilities for the held-out 20%
> auc(roc(pred1,as.factor(y.te)))             # AUC package: roc(predictions, labels)

and the AUC we achieve on the held-out test set is...

Voila! An amazing 92.1%! At this point you might be thinking: hmm, not too shabby, my AUC is 92.1%, only 7.9% short of becoming Charles Xavier. Well, hold on to that thought, because this evaluation metric (AUC) does not mean that 92.1% of the predictions are correct. It is calculated from the False Positive Rate (the Type I error rate, i.e. 1 - specificity) and the True Positive Rate (sensitivity). AUC/ROC/sensitivity/specificity/TPR/FPR is a separate topic that will be discussed in a later post.
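
To see the distinction concretely, we can threshold the predicted probabilities and tabulate a confusion matrix on the test set; a quick sketch (the 0.5 cut-off is an arbitrary choice for illustration):

> pred.class=as.numeric(pred1>0.5)            # hard 0/1 predictions at a 0.5 cut-off
> table(predicted=pred.class,actual=y.te)     # confusion matrix on the held-out test set
> mean(pred.class==y.te)                      # raw accuracy, which is not the same quantity as AUC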

Last but not least, to summarize what is going on with the dataset and why the predictive model works, we can explore the feature importance plot:

> f.names=colnames(data.matrix(newdat[tr.idx,]))
> impt=xgb.importance(feature_names=f.names,model=model1)
> xgb.plot.importance(impt)

The top 3 most important features are relation, capital gain, and education number. If a detailed exploratory data analysis (a separate discussion in itself) were performed, the data would show that "Husband" in the relation feature is strongly associated with our target, and if a chi-square test were performed, the p-value would very likely fall below the 0.05 significance level. If we fit a logistic regression of the target on capital gain and education number, those two features would appear significant as well. New features, for example the average education number for each relation category or the average capital gain per education level, might help the predictive model learn earlier in the iterations and thus improve performance. This technique is known as feature engineering, a widely adopted practice among top-notch data scientists that requires profound knowledge and experience in data science.
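
A quick sketch of those checks, run against the original (un-encoded) data frame; the column names cap.gain, edu.num, and relation assume the dot-converted names produced by data.frame() earlier, and the derived feature is only one illustrative example:

> ybin=y-1                                                        # 0 = "<=50K", 1 = ">50K"
> chisq.test(table(dat$relation,ybin))                            # association between relation and the target
> summary(glm(ybin~cap.gain+edu.num,data=dat,family=binomial))   # both coefficients should show up as significant
> newdat$avg.edu.by.rel=ave(newdat$edu.num,newdat$relation)       # example engineered feature (in practice, create it before the train/test split): mean edu.num per relation group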

@Clive, the target class is indeed imbalanced at roughly 75%/25% (which is usually the case for real data; you won't get a 50%/50% split for, say, churn/attrition rates). However, the createDataPartition() function in the caret package partitions the data in a stratified manner, so that the ~75%/25% class frequency is preserved in both the training and test sets. This concept is critical in any cancer/disease research where allele frequencies at every locus in a genome are examined based on case/control status. Thank you all for appreciating the post; I will write a sequel, and in my next post I will talk about assessing prediction performance.
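
A quick way to verify the stratification is to compare the class proportions on each side of the split; a small sketch using the objects defined above:

> prop.table(table(y.tr))   # class frequencies in the 80% training partition
> prop.table(table(y.te))   # class frequencies in the 20% test partition; both should sit near 75%/25%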

Clive Coelho

Staff Data Engineer at CalAmp

8 yr

Nice. However, the data has a class imbalance problem. After fixing it the results would change.

Alan Grogan

Senior executive | Accenture, Avanade, KBS | Data & AI business leader | Intelligent Data & Information Platform Services

8 yr

We need more of this content on LinkedIn and far less of the ridiculous maths quizzes.
