How to find the most important variables in R

Find the predictor variables that contribute most significantly to a response variable

Selecting the predictor variables that explain the major part of the variance in the response variable is key to identifying and building high-performing models.
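All of the snippets below assume a data frame called data whose response column is named Target and which has n columns, with Target in the last column. As a minimal sketch, the setup might look like this (mtcars is only a stand-in for your own data):

data <- mtcars # replace with your own data frame
names(data)[names(data) == "mpg"] <- "Target" # the response column is called Target throughout
data <- data[, c(setdiff(names(data), "Target"), "Target")] # move Target to the last column
n <- ncol(data) # n is the total number of columns, so data[, n] is the response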

1. Random Forest Method

Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.

library(caret)

library(randomForest)

library(varImp)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters

varImp(regressor) # get variable importance, based on mean decrease in accuracy

varImp(regressor, conditional = TRUE) # conditional=TRUE adjusts for correlations between predictors

varimpAUC(regressor) # more robust towards class imbalance.
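The randomForest package itself also reports the importance scores directly, which is a handy cross-check of the caret output:

importance(regressor) # importance scores per predictor (accuracy/MSE-based and node-impurity-based)
varImpPlot(regressor) # dot plot of both importance measures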

 2. xgboost Method

library(caret)

library(xgboost)

regressor <- train(Target ~ ., data = data, method = "xgbTree", trControl = trainControl("cv", number = 10), scale = TRUE)

varImp(regressor)

3. Relative Importance Method

Using calc.relimp() from the relaimpo package, the relative importance of the variables fed into an lm() model can be expressed as relative percentages.

library(relaimpo)

regressor <- lm(Target ~ ., data = data) # fit the lm() model

relImportance <- calc.relimp(regressor, type = "lmg", rela = TRUE) # calculate relative importance scaled to 100

sort(relImportance$lmg, decreasing=TRUE) # relative importance
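If you also want confidence intervals for these shares, relaimpo can bootstrap them; a short sketch (b = 1000 replications is an arbitrary choice here):

bootResults <- boot.relimp(regressor, b = 1000, type = "lmg", rela = TRUE) # bootstrap the lmg shares
booteval.relimp(bootResults, sort = TRUE) # bootstrapped confidence intervals for the relative importances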

4. MARS (earth package) Method

The earth package estimates variable importance from generalized cross-validation (GCV), the number of model subsets in which the variable occurs (nsubsets), and the residual sum of squares (RSS).

library(earth)

regressor <- earth(Target ~ ., data = data) # build the MARS model

ev <- evimp(regressor) # estimate variable importance

plot(ev)

5. Step-wise Regression Method

If you have a large number of predictors, split the data into chunks of about 10 predictors, with each chunk also holding the response variable.

base.mod <- lm(Target ~ 1, data = data) # base intercept-only model

all.mod <- lm(Target ~ ., data = data) # full model with all predictors

stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 1, steps = 1000) # perform stepwise selection in both directions

shortlistedVars <- names(unlist(stepMod[[1]])) # get the shortlisted variables

shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove the intercept

The output may include individual levels of categorical variables, since stepwise selection is a linear-regression-based technique.

If you have a large number of predictor variables, the above code may need to be placed in a loop that runs stepwise selection on sequential chunks of predictors; the shortlisted variables can be accumulated at the end of each iteration for further analysis (a sketch of such a loop follows the list below). This can be a very effective method if you want to:

·  Be highly selective about discarding valuable predictor variables.

·  Build multiple models on the response variable.
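A minimal sketch of such a loop, assuming the data/Target setup described at the top and a chunk size of 10 predictors:

predictorNames <- setdiff(names(data), "Target")
chunks <- split(predictorNames, ceiling(seq_along(predictorNames) / 10)) # chunks of up to 10 predictors
shortlisted <- c()

for (chunk in chunks) {
  chunkData <- data[, c(chunk, "Target")] # each chunk also holds the response
  base.mod <- lm(Target ~ 1, data = chunkData)
  all.mod <- lm(Target ~ ., data = chunkData)
  stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 0)
  vars <- names(unlist(stepMod[[1]]))
  shortlisted <- union(shortlisted, vars[!vars %in% "(Intercept)"]) # accumulate shortlisted variables
}

shortlisted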

6. Boruta Method

The ‘Boruta’ method can be used to decide if a variable is important or not.

library(Boruta)

# Decide if a variable is important or not using Boruta

boruta_output <- Boruta(Target ~ ., data = data, doTrace = 2) # perform Boruta search

boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables

# for faster calculation (classification only)

library(rFerns)

boruta.train <- Boruta(factor(Target) ~ ., data = data, doTrace = 2, getImp = getImpFerns, holdHistory = FALSE)
boruta.train
 
boruta_signif <- names(boruta.train$finalDecision[boruta.train$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
 
boruta_signif

getSelectedAttributes(boruta.train, withTentative = FALSE) # confirmed attributes only

boruta.df <- attStats(boruta.train) # importance statistics for each attribute

print(boruta.df)
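Boruta objects also have a plot method that shows each attribute's importance distribution against the shadow attributes:

plot(boruta_output, cex.axis = 0.7, las = 2, xlab = "", main = "Boruta variable importance")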

7. Information value and Weight of evidence Method

library(devtools)

library(woe)

library(riv)

iv_df <- iv.mult(data, y="Target", summary=TRUE, verbose=TRUE)

iv <- iv.mult(data, y="Target", summary=FALSE, verbose=TRUE)

iv_df

iv.plot.summary(iv_df) # Plot information value summary

Calculate the weight-of-evidence variables:

data_iv <- iv.replace.woe(data, iv, verbose=TRUE) # add woe variables to original data frame.

The newly created WOE variables can then be used in place of the original factor variables.
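A rough sketch of that swap, assuming iv.replace.woe names the new columns with a "_woe" suffix (check the actual column names in your data_iv before relying on this):

woe_cols <- grep("_woe$", names(data_iv), value = TRUE) # WOE columns (assumed naming convention)
orig_cols <- sub("_woe$", "", woe_cols) # the original factor columns they replace
data_woe <- data_iv[, setdiff(names(data_iv), orig_cols)] # keep the WOE versions, drop the originals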

8. Learning Vector Quantization (LVQ) Method

library(caret)
control <- trainControl(method="repeatedcv", number=10, repeats=3)

# train the model

regressor <- train(Target ~ ., data = data, method = "lvq", preProcess = "scale", trControl = control) # note: LVQ is a classification method, so Target must be a factor

# estimate variable importance

importance <- varImp(regressor, scale=FALSE)
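The resulting importance object can be printed and plotted directly:

print(importance) # variable importance for each class
plot(importance) # dot plot of the LVQ variable importance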

9. Recursive Feature Elimination (RFE) Method

library(caret)

# define the control using a random forest selection function

control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm

results <- rfe(data[, 1:(n-1)], data[, n], sizes = c(1:8), rfeControl = control) # predictors in columns 1..(n-1), response in column n

# summarize the results
print(results)

# list the chosen features
predictors(results)

# plot the results
plot(results, type=c("g", "o"))

10. DALEX Method

library(randomForest)

library(DALEX)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters


# Variable importance with DALEX

explained_rf <- explain(regressor, data = data, y = data$Target)



# Get the variable importances

varimps <- variable_dropout(explained_rf, type = 'raw')



print(varimps)

plot(varimps)

11. VITA

library(vita)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters

pimp.varImp.reg <- PIMP(data[, names(data) != "Target"], data$Target, regressor, S = 10, parallel = TRUE) # permutation importance: X = predictors only, y = response
pimp.varImp.reg

pimp.varImp.reg$VarImp # permutation-based importance scores

sort(pimp.varImp.reg$VarImp, decreasing = TRUE)
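The vita package can also test these permutation importances for significance with PimpTest(); a brief sketch:

pimp.test <- PimpTest(pimp.varImp.reg) # test the PIMP scores against the permutation null
summary(pimp.test, pless = 0.1) # p-values per predictor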

12. Genetic Algorithm

library(caret)

# Define control function

ga_ctrl <- gafsControl(functions = rfGA, # another option is `caretGA`
                       method = "cv",
                       repeats = 3)

# Genetic Algorithm feature selection

ga_obj <- gafs(x = data[, 1:(n-1)],
               y = data[, n],
               iters = 3, # normally much higher (100+)
               gafsControl = ga_ctrl)



ga_obj

# Optimal variables

ga_obj$optVariables

13. Simulated Annealing

library(caret)

# Define control function

sa_ctrl <- safsControl(functions = rfSA,
                       method = "repeatedcv",
                       repeats = 3,
                       improve = 5) # number of iterations without improvement before a reset



# Simulated Annealing Feature Selection

set.seed(100)

sa_obj <- safs(x = data[, 1:(n-1)],
               y = data[, n],
               safsControl = sa_ctrl)



sa_obj

# Optimal variables

print(sa_obj$optVariables)


14. Correlation Method

library(caret)

# calculate correlation matrix

correlationMatrix <- cor(data[, 1:(n-1)]) # predictors must be numeric

# summarize the correlation matrix

print(correlationMatrix)

# find attributes that are highly correlated (a cutoff of 0.5 is used below; 0.75 is also a common choice)

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes

print(highlyCorrelated)
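To actually drop the flagged predictors (the indices refer to the first n-1 columns used to build the correlation matrix), e.g.:

data_reduced <- if (length(highlyCorrelated) > 0) data[, -highlyCorrelated] else data # remove the flagged predictors (guard against an empty index)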