How to find the most important variables in R
Find the variables that contribute most significantly to a response variable
Selecting the predictor variables that explain the major part of the variance in the response variable is key to identifying and building high-performing models.
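All of the snippets below assume a data frame called data whose response column is named Target, with the remaining columns as candidate predictors and n = ncol(data). A minimal, hypothetical setup along these lines (here recycling the built-in mtcars data) makes them runnable:
# Hypothetical example data: mtcars with mpg renamed to Target and moved to the last column
data <- mtcars
names(data)[names(data) == "mpg"] <- "Target"
data <- data[, c(setdiff(names(data), "Target"), "Target")] # predictors first, response last
n <- ncol(data) # index-based snippets use data[, 1:(n-1)] for predictors and data[, n] for the response
Note that a few of the methods (the rFerns-based Boruta call, LVQ, and information value / weight of evidence) expect a categorical response.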
1. Random Forest Method
Random forest can be very effective at finding a set of predictors that best explains the variance in the response variable.
library(caret)
library(randomForest)
library(varImp)
regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit the random forest with default parameters
varImp(regressor) # get variable importance, based on mean decrease in accuracy
varImp(regressor, conditional = TRUE) # conditional = TRUE adjusts for correlations between predictors
varimpAUC(regressor) # more robust towards class imbalance
2. xgboost Method
library(caret)
library(xgboost)
regressor <- train(Target ~ ., data = data, method = "xgbTree", trControl = trainControl("cv", number = 10), scale = TRUE)
varImp(regressor)
3. Relative Importance Method
Using calc.relimp() from the relaimpo package, the relative importance of the variables fed into an lm() model can be determined as a relative percentage.
library(relaimpo)
regressor <- lm(Target ~ . , data = data) # fit lm() model
relImportance <- calc.relimp(regressor, type = "lmg", rela = TRUE) # calculate relative importance scaled to 100
sort(relImportance$lmg, decreasing=TRUE) # relative importance
4. MARS (earth package) Method
The earth package implements variable importance based on generalized cross-validation (GCV), the number of subset models in which the variable occurs (nsubsets), and the residual sum of squares (RSS).
library(earth)
regressor <- earth(Target ~ . , data = data) # build model
ev <- evimp(regressor) # estimate variable importance
plot(ev)
5. Step-wise Regression Method
If you have a large number of predictors, split the data into chunks of 10 predictors, with each chunk also holding the response variable.
base.mod <- lm(Target ~ 1 , data = data) # base intercept-only model
all.mod <- lm(Target ~ . , data = data) # full model with all predictors
stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 1, steps = 1000) # perform step-wise algorithm
shortlistedVars <- names(unlist(stepMod[[1]])) # get the shortlisted variables
shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove intercept
The output might include levels within categorical variables, since stepwise selection is a linear-regression-based technique.
If you have a large number of predictor variables, the above code may need to be placed in a loop that runs stepwise selection on sequential chunks of predictors, accumulating the shortlisted variables at the end of each iteration (a sketch of such a loop follows the list below). This can be a very effective method if you want to:
· Be highly selective about discarding valuable predictor variables.
· Build multiple models on the response variable.
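A minimal sketch of that chunking loop, assuming the response column is named Target; the chunk size of 10 and the use of union() to accumulate results are illustrative choices:
predictors <- setdiff(names(data), "Target")
chunks <- split(predictors, ceiling(seq_along(predictors) / 10)) # chunks of up to 10 predictors
shortlisted <- c()
for (vars in chunks) {
  chunk_data <- data[, c(vars, "Target")]
  base.mod <- lm(Target ~ 1, data = chunk_data) # intercept-only model for this chunk
  all.mod <- lm(Target ~ ., data = chunk_data) # full model for this chunk
  stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 0)
  keptVars <- names(unlist(stepMod[[1]])) # variables retained in this chunk
  shortlisted <- union(shortlisted, keptVars[!keptVars %in% "(Intercept)"]) # accumulate, dropping the intercept
}
shortlisted # candidate variables for a final stepwise pass or further analysis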
6. Boruta Method
The ‘Boruta’ method can be used to decide if a variable is important or not.
library(Boruta)
# Decide if a variable is important or not using Boruta
boruta_output <- Boruta(Target ~ . , data = data, doTrace = 2) # perform Boruta search
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
# for faster calculation (classification only)
library(rFerns)
boruta.train <- Boruta(factor(Target) ~ ., data = data, doTrace = 2, getImp = getImpFerns, holdHistory = FALSE)
boruta.train
boruta_signif <- names(boruta.train$finalDecision[boruta.train$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
boruta_signif
# extract and summarize directly from the Boruta object
getSelectedAttributes(boruta.train, withTentative = FALSE)
boruta.df <- attStats(boruta.train)
print(boruta.df)
7. Information Value and Weight of Evidence Method
library(devtools)
library(woe)
library(riv)
iv_df <- iv.mult(data, y="Target", summary=TRUE, verbose=TRUE)
iv <- iv.mult(data, y="Target", summary=FALSE, verbose=TRUE)
iv_df
iv.plot.summary(iv_df) # Plot information value summary
Calculate the weight of evidence variables:
data_iv <- iv.replace.woe(data, iv, verbose=TRUE) # add woe variables to the original data frame
The newly created woe variables can alternatively be used in place of the original factor variables.
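A hedged sketch of that substitution, assuming iv.replace.woe() names the new columns with a "_woe" suffix (check names(data_iv) in your own run) and that Target is binary, as information value requires:
# Hypothetical follow-up: model on the woe-encoded columns instead of the raw factors
woe_vars <- grep("_woe$", names(data_iv), value = TRUE) # assumed naming convention; verify with names(data_iv)
model_woe <- glm(Target ~ ., data = data_iv[, c(woe_vars, "Target")], family = binomial) # assumes a binary Target
summary(model_woe)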
8. Learning Vector Quantization (LVQ) Method
library(caret)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
regressor <- train(Target ~ ., data = data, method = "lvq", preProcess = "scale", trControl = control)
# estimate variable importance
importance <- varImp(regressor, scale=FALSE)
9. Recursive Feature Elimination (RFE) Method
library(caret)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(data[, 1:(n-1)], data[, n], sizes = c(1:8), rfeControl = control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g", "o"))
10. DALEX Method
library(randomForest)
library(DALEX)
regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit the random forest with default parameters
# Variable importance with DALEX
explained_rf <- explain(regressor, data = data, y = data$Target)
# Get the variable importances
varimps = variable_dropout(explained_rf, type='raw')
print(varimps)
plot(varimps)
11. VITA
library(vita)
regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit the random forest with default parameters
pimp.varImp.reg <- PIMP(data[, 1:(n-1)], data$Target, regressor, S = 10, parallel = TRUE) # X holds the predictors only
pimp.varImp.reg
pimp.varImp.reg$VarImp
sort(pimp.varImp.reg$VarImp, decreasing = TRUE)
12. Genetic Algorithm
library(caret)
# Define control function
ga_ctrl <- gafsControl(functions = rfGA, # another option is `caretGA`
                       method = "cv",
                       repeats = 3)
# Genetic Algorithm feature selection
ga_obj <- gafs(x = data[, 1:(n-1)],
               y = data[, n],
               iters = 3, # normally much higher (100+)
               gafsControl = ga_ctrl)
ga_obj
# Optimal variables
ga_obj$optVariables
13. Simulated Annealing
library(caret)
# Define control function
sa_ctrl <- safsControl(functions = rfSA,
                       method = "repeatedcv",
                       repeats = 3,
                       improve = 5) # number of iterations without improvement before a reset
# Simulated Annealing Feature Selection
set.seed(100)
sa_obj <- safs(x = data[, 1:(n-1)],
               y = data[, n],
               safsControl = sa_ctrl)
sa_obj
# Optimal variables
print(sa_obj$optVariables)
14. Correlation Method
library(caret)
# calculate correlation matrix
correlationMatrix <- cor(data[, 1:(n-1)])
# summarize the correlation matrix
print(correlationMatrix)
# find attributes that are highly correlated (cutoff = 0.5 here; 0.75 is another common choice)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
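One common follow-up is to drop the flagged predictor columns before modelling; a minimal sketch (dropping, rather than combining, correlated predictors is an illustrative choice):
# drop the highly correlated predictors (indexes refer to the predictor columns 1:(n-1))
if (length(highlyCorrelated) > 0) {
  data_reduced <- data[, -highlyCorrelated]
} else {
  data_reduced <- data
}
dim(data_reduced) # remaining columns, including Target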