Predictive Modelling (Part 1/3): In 2 lines of R code
Predictive modelling is often seen as something of a 'black box', a 'dark art', 'magic', or a 'fortune-telling' trick. A data scientist who could design an algorithm delivering 100% accurate predictions would be assured a red-carpet welcome by companies; it would be like a Marvel superhero, Charles Xavier, stepping into reality. Predictive modelling is therefore often misconstrued as a family of extremely complex algorithms that demand superhuman coding ability. There may have been some truth to that assumption a decade ago, when there were no pre-built functions or packages for building and training predictive models. Over the past few years, however, statisticians, scientists, and engineers have put tremendous R&D effort into novel methods for statistical analysis and modelling, driven largely by the biomedical industry's demand for breakthroughs in drug discovery and in cures for emerging cancers and diseases. To move rapidly in drug testing while digesting terabytes of genetic and molecular-level DNA and RNA data, mathematically complex statistical models have had to be packaged into manageable functions and modules that can be called and deployed efficiently. In the remaining sections of this article, I will show how easy it is to implement an efficient predictive model, xgboost (eXtreme Gradient Boosting), in R.
The sample dataset for this demo can be downloaded from the UCI Machine Learning Repository: the Adult Data Set. It contains demographic information on a group of working adults, and the predictive task is to determine whether a person makes more than $50,000 a year.
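If you prefer to pull the file straight into your working directory from within R, here is a minimal sketch (the URL below assumes the usual layout of the UCI repository and may need adjusting if the file has moved):
> download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                destfile="adult.data")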
The first few lines of R code below perform the minimal ETL (extract, transform, load) and initialization. The code implementing the predictive model itself is a single line, plus one more line of code to generate predictions. Let's do some initialization and load the dataset:
> library(xgboost); library(caret); library(dplyr); library(AUC)
> options(scipen=10); set.seed(794)
> # the UCI file has no header row, so the column names are assigned manually below
> dat=read.table("adult.data",stringsAsFactors=F,header=F,sep=',')
> colnames(dat)=c("age","workclass","fnlwgt","edu","edu-num","marital","occ","relation","race","sex","cap-gain","cap-loss","hours-pw","country","income-rng")
> dat=data.frame(dat); str(dat)   # data.frame() cleans the hyphenated names, e.g. "income-rng" becomes income.rng
From the summary above, we have 6 integer variables and 9 categorical (character) variables. The variable "income.rng" contains only two categories, "<=50K" and ">50K"; it will later be converted into a binary class and is the target we want to predict. Next, prepare some summary statistics for the data:
> # per-feature summary: R class, number of distinct values, and missing-value count
> dfinfo=data.frame(featname=colnames(dat))
> dfinfo$feat.class=sapply(dat,function(x) class(x))
> dfinfo$ndistinct=sapply(dat,function(x) n_distinct(x))
> dfinfo$na=sapply(dat,function(x) sum(is.na(x)))
> # encode the target: "<=50K" becomes 1 and ">50K" becomes 2 (shifted to 0/1 later)
> y=as.numeric(as.factor(dat$income.rng))
> table(y,dat$income.rng); dfinfo[1:10,]
> dat$income.rng=NULL; dfinfo=dfinfo[-15,]   # drop the target from the feature set
The table above shows the first 10 variables and some very brief summary statistics for each feature. Next, let's partition the pre-processed dataset into a train set and a test set: 80% for building and training the predictive model and 20% for testing the accuracy of the fitted model:
> # stratified 80/20 split of the row indices
> tr.idx=createDataPartition(y,p=0.8)$Resample1
> newdat=dat
> # convert each character feature to integer codes via its factor levels
> for (f in dfinfo$featname[which(dfinfo$feat.class=="character")]) {
newdat[[f]]=as.numeric(as.factor(newdat[[f]]))}
> # shift the target from 1/2 to 0/1 as required by binary:logistic
> y.tr=y[tr.idx]-1; y.te=y[-tr.idx]-1
> dtrain=xgb.DMatrix(data.matrix(newdat[tr.idx,]),label=y.tr)
> dtest=xgb.DMatrix(data.matrix(newdat[-tr.idx,]))
Once the pre-processed dataset has been partitioned into train and test sets, we can write the single line of code that trains the gradient boosted tree model:
> # train for 1,000 boosting rounds; the watchlist reports the AUC on the training data every 50 rounds
> model1=xgb.train(data=dtrain,objective="binary:logistic"
,eta=0.1,nrounds=1000,eval_metric="auc",print.every.n=50
,watchlist=list(val=dtrain),verbose=1)
We run 1,000 boosting iterations, and this training step takes less than a minute to complete on an Intel Core i7 laptop. The AUC reported by the watchlist (which here evaluates on the training data itself) is close to 99%!
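If you want to reproduce that watchlist figure yourself, here is a quick sanity-check sketch; it simply re-scores the training matrix using the objects defined above:
> pred.tr=predict(model1,dtrain)      # predicted probabilities for the training rows
> auc(roc(pred.tr,as.factor(y.tr)))   # should be very close to the last printed val-auc
Bear in mind that this is an AUC measured on the very data the model was trained on. Let's deploy our model now on our test dataset: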
> pred1=predict(model1,dtest)
> auc(roc(pred1,as.factor(y.te)))
and the AUC we achieved is,
Voila! An impressive 92.1%! At this point you might be thinking: hmm, not too shabby, my AUC is 92.1%, only 7.9% away from becoming Charles Xavier. Hold on to that thought, because this evaluation metric (AUC) does not mean that 92.1% of the predictions are correct. It is calculated from the trade-off between the False Positive Rate (the Type I error rate, which equals 1 minus specificity) and the True Positive Rate (sensitivity). AUC, ROC, sensitivity, specificity, TPR, and FPR are a separate topic that will be discussed in a later post.
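To make the distinction concrete, here is a small sketch; the 0.5 cut-off is an arbitrary choice of mine rather than something the model prescribes. It turns the test-set probabilities into hard classes and computes accuracy, sensitivity, and specificity at that single threshold:
> pred.class=ifelse(pred1>0.5,1,0)                # hard 0/1 predictions at a 0.5 threshold
> cm=table(predicted=pred.class,actual=y.te); cm  # confusion matrix
> sum(diag(cm))/sum(cm)    # accuracy at this threshold
> cm[2,2]/sum(cm[,2])      # sensitivity (TPR)
> cm[1,1]/sum(cm[,1])      # specificity, i.e. 1 - FPR
The accuracy obtained here will generally differ from the 92.1% figure, because the AUC summarizes the TPR/FPR trade-off across all possible thresholds rather than at one fixed cut-off.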
Last but not least, to summarize what is going on in the dataset and why the predictive model works, we can explore the feature importance plot:
> f.names=colnames(data.matrix(newdat[tr.idx,]))
> impt=xgb.importance(feature_names=f.names,model=model1)
> xgb.plot.importance(impt)
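The same ranking can also be read off as a table rather than a plot, using the impt object from above (xgb.importance sorts the features by their Gain):
> impt[1:3,]   # top 3 features by Gain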
The top 3 most important features are relation, capital gain, and education number. If a detailed exploratory data analysis (a separate discussion in itself) were performed, the data would show that the "Husband" category of the relation feature is strongly associated with our target, and a chi-square test would most likely return a p-value well below 0.05. If we fit a logistic regression of the target on capital gain and education number, those two features would appear significant as well. New features, for example the average education level within each relation category or the average capital gain per education level, might help the predictive model learn earlier in the iterations and thus improve performance; a rough sketch of such features is given below. This technique is known as the art of feature engineering, a concept widely adopted among top-notch data scientists, and one that requires deep knowledge and experience in data science.
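As an illustration only, here is a rough sketch of how such aggregate features could be added with dplyr. The new column names (avg.edu.by.relation, avg.capgain.by.edu) are hypothetical names of my own; the source columns follow the cleaned names in newdat (relation, edu.num, cap.gain):
> newdat2 = newdat %>%
    group_by(relation) %>% mutate(avg.edu.by.relation=mean(edu.num)) %>%   # mean education per relation category
    ungroup() %>%
    group_by(edu.num) %>% mutate(avg.capgain.by.edu=mean(cap.gain)) %>%    # mean capital gain per education level
    ungroup() %>% data.frame()
To judge whether such features actually help, you would rebuild dtrain and dtest from newdat2 and retrain the model before comparing AUCs.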
@Clive, the target class is indeed imbalanced at roughly 75%/25%, which is usually the case for real data (you won't get a 50%/50% split for, say, churn/attrition rates). The createDataPartition() function in the caret package partitions the data in a stratified manner, so that the ~75% majority-class frequency is preserved in both the training and the test set. This concept is critical in cancer/disease research, where the allele frequencies at every locus in a genome are examined by case/control status. Thank you all for appreciating the post; in the sequel I will talk about assessing a prediction's performance.
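You can verify the stratification yourself with the y.tr and y.te vectors created earlier; both splits should show roughly the same class proportions:
> prop.table(table(y.tr)); prop.table(table(y.te))   # class ratios in the train and test sets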