How to find the most important variables in R

Find the predictor variables that contribute most significantly to a response variable

Selecting the predictor variables that explain the major part of the variance in the response variable is key to identifying and building high-performing models.
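All of the snippets below assume a data frame called data whose response column is named Target and which has n columns, with Target in the last column. As a minimal sketch, the setup might look like this (mtcars is only a stand-in for your own data):

data <- mtcars # replace with your own data frame
names(data)[names(data) == "mpg"] <- "Target" # the response column is called Target throughout
data <- data[, c(setdiff(names(data), "Target"), "Target")] # move Target to the last column
n <- ncol(data) # n is the total number of columns, so data[, n] is the response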

1. Random Forest Method

Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.

library(caret)

library(randomForest)

library(varImp)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters

varImp(regressor) # get variable importance, based on mean decrease in accuracy

varImp(regressor, conditional = TRUE) # conditional=TRUE adjusts for correlations between predictors

varimpAUC(regressor) # more robust towards class imbalance.
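The randomForest package itself also reports the importance scores directly, which is a handy cross-check of the caret output:

importance(regressor) # importance scores per predictor (accuracy/MSE-based and node-impurity-based)
varImpPlot(regressor) # dot plot of both importance measures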

 2. xgboost Method

library(caret)

library(xgboost)

regressor <- train(Target ~ ., data = data, method = "xgbTree", trControl = trainControl("cv", number = 10), scale = TRUE)

varImp(regressor)

3. Relative Importance Method

Using calc.relimp() from the relaimpo package, the relative importance of the variables fed into an lm() model can be expressed as relative percentages.

library(relaimpo)

regressor <- lm(Target ~ ., data = data) # fit the lm() model

relImportance <- calc.relimp(regressor, type = "lmg", rela = TRUE) # calculate relative importance scaled to 100

sort(relImportance$lmg, decreasing=TRUE) # relative importance
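If you also want confidence intervals for these shares, relaimpo can bootstrap them; a short sketch (b = 1000 replications is an arbitrary choice here):

bootResults <- boot.relimp(regressor, b = 1000, type = "lmg", rela = TRUE) # bootstrap the lmg shares
booteval.relimp(bootResults, sort = TRUE) # bootstrapped confidence intervals for the relative importances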

4. MARS (earth package) Method

The earth package estimates variable importance from generalized cross-validation (GCV), the number of model subsets in which the variable occurs (nsubsets), and the residual sum of squares (RSS).

library(earth)

regressor <- earth(Target ~ ., data = data) # build the MARS model

ev <- evimp(regressor) # estimate variable importance

plot(ev)

5. Step-wise Regression Method

If you have a large number of predictors, split the data into chunks of about 10 predictors, with each chunk also holding the response variable.

base.mod <- lm(Target ~ 1, data = data) # base intercept-only model

all.mod <- lm(Target ~ ., data = data) # full model with all predictors

stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 1, steps = 1000) # perform stepwise selection in both directions

shortlistedVars <- names(unlist(stepMod[[1]])) # get the shortlisted variables

shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove the intercept

The output may include individual levels of categorical variables, since stepwise selection is a linear-regression-based technique.

If you have a large number of predictor variables, the above code may need to be placed in a loop that runs stepwise selection on sequential chunks of predictors; the shortlisted variables can be accumulated at the end of each iteration for further analysis (a sketch of such a loop follows the list below). This can be a very effective method if you want to:

·  Be highly selective about discarding valuable predictor variables.

·  Build multiple models on the response variable.
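A minimal sketch of such a loop, assuming the data/Target setup described at the top and a chunk size of 10 predictors:

predictorNames <- setdiff(names(data), "Target")
chunks <- split(predictorNames, ceiling(seq_along(predictorNames) / 10)) # chunks of up to 10 predictors
shortlisted <- c()

for (chunk in chunks) {
  chunkData <- data[, c(chunk, "Target")] # each chunk also holds the response
  base.mod <- lm(Target ~ 1, data = chunkData)
  all.mod <- lm(Target ~ ., data = chunkData)
  stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 0)
  vars <- names(unlist(stepMod[[1]]))
  shortlisted <- union(shortlisted, vars[!vars %in% "(Intercept)"]) # accumulate shortlisted variables
}

shortlisted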

6. Boruta Method

The ‘Boruta’ method can be used to decide if a variable is important or not.

library(Boruta)

# Decide if a variable is important or not using Boruta

boruta_output <- Boruta(Target ~ ., data = data, doTrace = 2) # perform Boruta search

boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables

# for faster calculation (classification only)

library(rFerns)

boruta.train <- Boruta(factor(Target) ~ ., data = data, doTrace = 2, getImp = getImpFerns, holdHistory = FALSE)
boruta.train
 
boruta_signif <- names(boruta.train$finalDecision[boruta.train$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
 
boruta_signif

getSelectedAttributes(boruta.train, withTentative = FALSE) # confirmed attributes only

boruta.df <- attStats(boruta.train) # importance statistics for each attribute

print(boruta.df)
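Boruta objects also have a plot method that shows each attribute's importance distribution against the shadow attributes:

plot(boruta_output, cex.axis = 0.7, las = 2, xlab = "", main = "Boruta variable importance")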

7. Information value and Weight of evidence Method

library(devtools)

library(woe)

library(riv)

iv_df <- iv.mult(data, y="Target", summary=TRUE, verbose=TRUE)

iv <- iv.mult(data, y="Target", summary=FALSE, verbose=TRUE)

iv_df

iv.plot.summary(iv_df) # Plot information value summary

Calculate the weight-of-evidence variables:

data_iv <- iv.replace.woe(data, iv, verbose=TRUE) # add woe variables to original data frame.

The newly created WOE variables can then be used in place of the original factor variables.
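A rough sketch of that swap, assuming iv.replace.woe names the new columns with a "_woe" suffix (check the actual column names in your data_iv before relying on this):

woe_cols <- grep("_woe$", names(data_iv), value = TRUE) # WOE columns (assumed naming convention)
orig_cols <- sub("_woe$", "", woe_cols) # the original factor columns they replace
data_woe <- data_iv[, setdiff(names(data_iv), orig_cols)] # keep the WOE versions, drop the originals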

8. Learning Vector Quantization (LVQ) Method

library(caret)
control <- trainControl(method="repeatedcv", number=10, repeats=3)

# train the model

regressor <- train(Target ~ ., data = data, method = "lvq", preProcess = "scale", trControl = control) # note: LVQ is a classification method, so Target must be a factor

# estimate variable importance

importance <- varImp(regressor, scale=FALSE)
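The resulting importance object can be printed and plotted directly:

print(importance) # variable importance for each class
plot(importance) # dot plot of the LVQ variable importance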

9. Recursive Feature Elimination (RFE) Method

library(caret)

# define the control using a random forest selection function

control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm

results <- rfe(data[, 1:(n-1)], data[, n], sizes = c(1:8), rfeControl = control) # predictors in columns 1..(n-1), response in column n

# summarize the results
print(results)

# list the chosen features
predictors(results)

# plot the results
plot(results, type=c("g", "o"))

10. DALEX Method

library(randomForest)

library(DALEX)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters


# Variable importance with DALEX

explained_rf <- explain(regressor, data = data, y = data$Target)



# Get the variable importances

varimps <- variable_dropout(explained_rf, type = 'raw')



print(varimps)

plot(varimps)

11. VITA

library(vita)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters

pimp.varImp.reg <- PIMP(data[, names(data) != "Target"], data$Target, regressor, S = 10, parallel = TRUE) # permutation importance: X = predictors only, y = response
pimp.varImp.reg

pimp.varImp.reg$VarImp # permutation-based importance scores

sort(pimp.varImp.reg$VarImp, decreasing = TRUE)
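The vita package can also test these permutation importances for significance with PimpTest(); a brief sketch:

pimp.test <- PimpTest(pimp.varImp.reg) # test the PIMP scores against the permutation null
summary(pimp.test, pless = 0.1) # p-values per predictor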

12. Genetic Algorithm

library(caret)

# Define control function

ga_ctrl <- gafsControl(functions = rfGA, # another option is `caretGA`
                       method = "cv",
                       repeats = 3)

# Genetic Algorithm feature selection

ga_obj <- gafs(x = data[, 1:(n-1)],
               y = data[, n],
               iters = 3, # normally much higher (100+)
               gafsControl = ga_ctrl)



ga_obj

# Optimal variables

ga_obj$optVariables

13. Simulated Annealing

library(caret)

# Define control function

sa_ctrl <- safsControl(functions = rfSA,
                       method = "repeatedcv",
                       repeats = 3,
                       improve = 5) # number of iterations without improvement before a reset



# Simulated Annealing Feature Selection

set.seed(100)

sa_obj <- safs(x = data[, 1:(n-1)],
               y = data[, n],
               safsControl = sa_ctrl)



sa_obj

# Optimal variables

print(sa_obj$optVariables)


14. Correlation Method

library(caret)

# calculate correlation matrix

correlationMatrix <- cor(data[, 1:(n-1)]) # predictors must be numeric

# summarize the correlation matrix

print(correlationMatrix)

# find attributes that are highly correlated (a cutoff of 0.5 is used below; 0.75 is also a common choice)

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes

print(highlyCorrelated)
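To actually drop the flagged predictors (the indices refer to the first n-1 columns used to build the correlation matrix), e.g.:

data_reduced <- if (length(highlyCorrelated) > 0) data[, -highlyCorrelated] else data # remove the flagged predictors (guard against an empty index)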