How important is that variable?

When building a model that includes explanatory variables related to the phenomenon of interest, one question arises: which auxiliary variables have the most impact on the response? This discussion is not concerned with significance testing; rather, it is about ranking the variables that influence the response by their importance. Many methods are available for answering this question. Here we will focus on isolating units from variables, a straightforward approach. For illustration, let's assume a linear model with two explanatory variables.

y = β1·x1 + β2·x2 + e

If you assume that the model is true, you can determine the impact of each variable x on the response y by isolating units from variables. One method fits the model on standardized variables (both explanatory and response) and compares the regression coefficients directly. Another approach uses this expression:

imp_j = |β_j| · sd(x_j) / sd(y)

As an illustration, suppose we have the model y = -500 x1 + 50 x2 + e, where both explanatory variables are on the same scale. The relative importance of each variable can then be computed by dividing the absolute value of its coefficient by the sum of the absolute values of all coefficients: |-500| / 550 ≈ 0.9 for x1, and |50| / 550 ≈ 0.1 for x2.
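That back-of-the-envelope calculation can be checked in a couple of lines of R (a quick sketch using only the coefficients from the example model):

```r
betas <- c(x1 = -500, x2 = 50)          # coefficients from the example model
rel_imp <- abs(betas) / sum(abs(betas)) # normalize the absolute coefficients
round(rel_imp, 3)                       # x1: 0.909, x2: 0.091
```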

To perform this analysis in R, you can use the following code. It generates a dataset of size n with two independent variables (x1 and x2) and one dependent variable (y), then fits a linear model without an intercept. Finally, the two methods described above are used to compute relative importance scores for each variable:

  • Method 1 works from the coefficient estimates in the fitted model object. The importance measure for each variable is the absolute value of its estimate divided by its standard error (its |t| statistic), normalized so that the scores sum to one. Alternatively, we can calculate relative importance directly from the unstandardized coefficients and the standard deviations of the variables, using the expression above.
  • Method 2 fits a new linear model after scaling (centering and normalizing to unit variance) both the explanatory variables and the response. This yields standardized regression coefficients that can be compared across predictors; here again, we normalize them so that they sum to one.

# Set sample size and generate data
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)

# Fit linear model without intercept term
model <- lm(y ~ 0 + x1 + x2)

### Method 1: Standardized betas ###

# Coefficient estimates and their standard errors from the fitted model object
betas    <- coef(model)
se.betas <- summary(model)$coefficients[, 2]

# Absolute value of each estimate divided by its standard error (the |t| statistic)
# gives an importance measure for each variable
imp <- abs(betas) / se.betas
imp <- imp / sum(imp) # divide by the total to get relative importance scores

imp # display results


# Alternatively, calculate relative importance directly from the
# unstandardized coefficients and the variable standard deviations
imp1 <- abs(coef(model)[1] * sd(x1) / sd(y))
imp2 <- abs(coef(model)[2] * sd(x2) / sd(y))

rel_imp_1_to_2 <- imp1 / (imp1 + imp2)
rel_imp_2_to_1 <- imp2 / (imp1 + imp2)

rel_imp_1_to_2
rel_imp_2_to_1

### Method 2: Standardized variables ###

# Fit a new linear model on standardized variables (scaled to mean = 0, SD = 1);
# the intercept is dropped since all variables are centered
model_std_vars <- lm(scale(y) ~ 0 + scale(x1) + scale(x2))

summary(model_std_vars) # print summary statistics

# Compute relative importance scores as the ratio of each absolute coefficient to their sum
abs(coef(model_std_vars)) / sum(abs(coef(model_std_vars)))
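As a sanity check, the two methods can be run side by side in one self-contained script (a sketch; the seed is an assumption, since the article does not fix one):

```r
set.seed(123) # seed chosen for reproducibility; the article sets none
n  <- 10000
x1 <- runif(n)
x2 <- runif(n)
y  <- -500 * x1 + 50 * x2 + rnorm(n)

# Method 1: rescale the unstandardized coefficients by sd(x)/sd(y)
fit  <- lm(y ~ 0 + x1 + x2)
imp1 <- abs(coef(fit)) * c(sd(x1), sd(x2)) / sd(y)
imp1 <- imp1 / sum(imp1)

# Method 2: refit on standardized variables and normalize the coefficients
fit_std <- lm(scale(y) ~ 0 + scale(x1) + scale(x2))
imp2    <- abs(coef(fit_std)) / sum(abs(coef(fit_std)))

round(rbind(method1 = imp1, method2 = unname(imp2)), 3)
# both rows land very close to the theoretical 0.909 / 0.091 split
```

Because x1 and x2 share the same uniform scale, both methods recover the same ranking that the simple coefficient ratio 500/550 predicts.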
