How important is that variable?
Andrés Gutiérrez
ECLAC Regional Adviser on Social Statistics - Vicepresident of the International Association of Survey Statisticians (2023 - 2025) - Elected Member of the International Statistical Institute
When building a model that includes explanatory variables related to the phenomenon of interest, one question arises: which auxiliary variables have the most impact on the response? This thread is not concerned with significance testing; rather, with knowing the ranking and importance of each variable that influences the response. Many methods are available for answering this question. In this discussion, we will focus on isolating the variables from their measurement units (standardization), a straightforward approach. For illustration purposes only, let's assume a linear model structure with two explanatory variables.
If you assume that the model is true, you can determine the impact of a variable x on the response y by stripping the measurement units from the variables. One method is to fit the model using standardized variables (both explanatory and response) and compare the regression coefficients directly. Another approach is to use the expression imp_j = |b_j| * sd(x_j) / sd(y), where b_j is the estimated coefficient of x_j.
As an illustration, suppose we have a model y = -500 x1 + 50 x2 + e, where both explanatory variables share the same scale. In this case, the relative importance of the first and second variables can be computed as approximately 0.9 for x1 and 0.1 for x2, by dividing the absolute value of each coefficient by the sum of the absolute values of all coefficients.
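As a quick sanity check, this ratio can be computed directly in R from the illustrative coefficients above (a minimal sketch; the vector `betas` simply hard-codes the two hypothetical slopes):

```r
# Coefficients from the illustrative model y = -500*x1 + 50*x2 + e
betas <- c(x1 = -500, x2 = 50)

# Relative importance: each absolute coefficient over the sum of absolute coefficients
rel_imp <- abs(betas) / sum(abs(betas))
round(rel_imp, 2)
#   x1   x2
# 0.91 0.09
```

Note that this simple coefficient ratio is a fair comparison only because x1 and x2 are assumed to be on the same scale; otherwise the standard deviations must enter, as in the expression above.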
To perform this analysis in R, you can use the following code that generates a dataset of size n with two independent variables (x1 and x2) and one dependent variable (y). It then fits a linear model to the data using an intercept-free specification. Next, the two aforementioned methods are used to compute relative importance scores for each variable:
# Set sample size and generate data
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)
# Fit linear model without intercept term
model <- lm(y ~ 0 + x1 + x2)
### Method 1: Standardized betas ###
# Extract the standard errors of the coefficient estimates from the fitted model object
sd.betas <- summary(model)$coefficients[, "Std. Error"]
betas <- coef(model)
imp <- abs(betas) / sd.betas # absolute value of each beta divided by its standard error gives an importance measure for each variable
imp <- imp / sum(imp) # divide by the total to get relative importance scores
imp # display results
# Alternatively, calculate relative importance directly using the formula based on
# unstandardized coefficients and variable standard deviations
imp1 <- abs(model$coefficients[1] * sd(x1) / sd(y))
imp2 <- abs(model$coefficients[2] * sd(x2) / sd(y))
rel_imp_1_to_2 <- imp1 / (imp1 + imp2)
rel_imp_2_to_1 <- imp2 / (imp1 + imp2)
rel_imp_1_to_2
rel_imp_2_to_1
### Method 2: Standardized variables ###
# Fit a new linear model using standardized variables (scaled to mean = 0 and SD = 1)
model_std_vars <- lm(I(scale(y)) ~ I(scale(x1)) + I(scale(x2)))
summary(model_std_vars) # print summary statistics
b_std <- abs(coef(model_std_vars)[-1]) # drop the (near-zero) intercept before comparing
b_std / sum(b_std) # relative importance as the ratio of each absolute coefficient to their sum
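As a cross-check (a sketch, not part of the original analysis; the seed and object names are chosen here for illustration), the two approaches can be compared side by side: the coefficient of each predictor in the standardized-variable fit should match |b_j| * sd(x_j) / sd(y) from the unstandardized fit, up to simulation noise:

```r
set.seed(123) # arbitrary seed for reproducibility
n  <- 10000
x1 <- runif(n)
x2 <- runif(n)
y  <- -500 * x1 + 50 * x2 + rnorm(n)

fit  <- lm(y ~ 0 + x1 + x2)                 # unstandardized, intercept-free fit
fitz <- lm(scale(y) ~ scale(x1) + scale(x2)) # fit on standardized variables

# Formula-based standardized betas from the unstandardized fit
b_formula <- abs(coef(fit)) * c(sd(x1), sd(x2)) / sd(y)

# Standardized-variable betas (intercept dropped; it is essentially zero)
b_scaled <- abs(coef(fitz)[-1])

# The two vectors agree to within simulation error
round(b_formula, 3)
round(b_scaled, 3)
```

Either vector, normalized by its sum, reproduces the roughly 0.9 versus 0.1 split discussed above.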