10 - Multiple Regression in SAS with PROC REG, PROC GLM and PROC PLM

This time we will look at the relationship between a continuous response variable and multiple continuous predictor variables.

The Multiple Linear Regression Model

In a multiple linear regression model with two predictors, the relationship can be expressed as:


Y = β0 + β1X1 + β2X2 + ε

Here:

  • Y is the response variable.
  • X1 and X2 are the predictor variables.
  • β0 is the intercept, representing the expected value of Y when both X1 and X2 are 0.
  • β1 is the slope coefficient for X1, showing how much Y changes for a one-unit increase in X1, holding X2 constant.
  • β2 is the slope coefficient for X2, showing how much Y changes for a one-unit increase in X2, holding X1 constant.
  • ε is the error term, accounting for the variability in Y not explained by X1 and X2.

The model fits a plane in three-dimensional space (since there are two predictors), and this plane is used to predict the response variable Y based on the values of X1 and X2. Each coefficient represents the partial effect of one predictor variable, adjusting for the others.
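As a quick illustration with made-up coefficients (these numbers are not from the body fat data): if β0 = 10, β1 = 2, and β2 = −1, then a subject with X1 = 3 and X2 = 5 has a predicted value of Y = 10 + 2(3) − 1(5) = 11. Increasing X1 to 4 while holding X2 at 5 raises the prediction by exactly β1 = 2, to 13.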

The slopes in a multiple regression model have a slightly different meaning compared to the slope in a simple regression with just one predictor. The regression coefficients, or slopes, represent the average change in Y for a one-unit increase in each predictor variable. Specifically:

  • β1 is the average change in Y for a one-unit increase in X1, while keeping X2 constant and

  • β2 is the average change in Y for a one-unit increase in X2, while keeping X1 constant.

If there is no relationship between Y, X1, and X2, meaning the slopes β1 and β2 are both zero, the model becomes a flat plane at the level where Y equals β0.

When the slopes β1 and β2 equal 0, the model is a horizontal plane passing through the level where Y equals β0.

When there is a linear relationship between Y, X1, and X2, the model forms an inclined plane. In this scenario, X1, X2, or both have an influence on Y, causing the plane to slope.

When the slopes β1 and β2 are not both 0, the model is an inclined plane.

In a multiple regression model, the response variable Y is expressed as a linear combination of k predictor variables, X1 through Xk. To explore the relationship between the predictors and the response, a k-dimensional surface is used for prediction. The model involves k+1 parameters, which include the slopes (regression coefficients) and the intercept.
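Written out with the same notation as the two-predictor case, the general model is:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε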

You can also extend linear regression to capture non-linear relationships by incorporating polynomial terms, such as squared or cubed versions of the predictors, or by adding interaction terms. For example, a polynomial model with X1, X1², X2, and X2² is still considered a linear model, even though the predictors have exponents.

This is because the model remains linear in the parameters, meaning the coefficients of the terms are still linear in form.
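As a quick sketch (this particular model is only illustrative and isn't fit anywhere below), PROC GLM lets you write polynomial and interaction effects directly in the MODEL statement:

proc glm data=stat1.bodyfat2;
    /* squared terms and an interaction; the model is still linear in the betas */
    model PctBodyFat2 = Abdomen Abdomen*Abdomen Chest Chest*Chest Abdomen*Chest;
run;
quit;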

Hypothesis Testing for Multiple Regression

The hypotheses for multiple regression are similar to those for simple linear regression. The null hypothesis states that the multiple regression model does not provide a better fit to the data than the baseline model (a flat regression plane with no slope).

Essentially, it means all the slope coefficients are equal to zero, implying that the predictor variables do not account for a significant portion of the variation in the response variable.

Conversely, the alternative hypothesis suggests that the regression model does offer a better fit than the baseline, indicating that at least one slope coefficient differs from zero, meaning that at least one predictor variable significantly explains the variation in the response variable.

For the multiple regression analysis to be valid, four key assumptions must hold true:

  • the response variable, Y, should be accurately represented by a linear function of the predictor variables, Xs;
  • the random error term, ε, must follow a normal distribution with a mean of zero;
  • the variance of ε, σ2, must be constant (homoscedasticity);
  • and the errors must be independent of each other.

Assumptions of Multiple Linear Regression

Multiple Linear Regression versus Simple Linear Regression

Why opt for multiple linear regression instead of conducting a series of simple linear regressions? The key benefit is that multiple regression allows you to assess the relationship between a predictor and the response while accounting for all other predictors in the model. Sometimes hidden relationships emerge, or a previously strong connection weakens once other variables are factored in. This method enables you to evaluate whether a relationship exists between the response variable and several predictors at once. Additionally, you can test for interactions, similar to what is done in ANOVA.

However, there are drawbacks. As the number of predictors increases, interpreting the model becomes more complex. For example, with one response variable and seven potential predictors, each predictor is either in or out of the model, giving 2^7 − 1 = 127 possible models with at least one predictor. This added complexity can make it harder to interpret the results and to select the best model.


Later, we’ll discuss strategies for selecting the "best" model, which often depends on the goals of the analysis and domain knowledge. Despite these challenges, the benefits of using multiple regression over a series of simple regressions far outweigh the downsides. In real-world situations, the response variable typically depends on multiple factors that may interact.

When should you use multiple regression? It’s a powerful tool for both exploratory analysis and prediction. In exploratory analysis, you develop a model to test the statistical significance of the coefficients and assess whether a relationship exists between the response and predictor variables. For instance, does an increase in police officers reduce crime rates? Here, the goal is to understand the relationship rather than predict crime rates. When interpreting the coefficients, you’ll consider their magnitude and direction.

On the other hand, when using multiple regression for prediction, the focus shifts to the model's predictive accuracy. For example, to estimate body fat percentage, you might create a model using skin-fold measurements. In this case, the significance of individual coefficients is less important than the model’s ability to predict future values. You might choose a model with some non-significant terms if it improves prediction. Multiple regression can serve both exploratory and predictive purposes effectively.

Adjusted R-Square

Let's say a previous linear regression analysis found that Abdomen plays a significant role in explaining PctBodyFat2. Remember, the R-squared value indicates the proportion of variation in the response variable that the independent variables account for. When the R-squared value is close to 0, it means the independent variables explain little of the variability in the data. Conversely, when it's close to 1, it means the independent variables explain a substantial portion of the variability.

In this case, the R-squared value of 0.0642 suggests that Abdomen accounts for about 6.4% of the variation in PctBodyFat2. Since this percentage is relatively low, you might consider adding another variable to improve the model. However, be cautious because while R-squared always increases or stays the same when you add more variables, maximizing it isn’t the only goal.

To choose a better model, you can compare adjusted R-squared values, which, unlike R-squared, take into account both the number of variables in the model and how well it fits the data. The formula for adjusted R-squared involves the number of observations (n), an indicator for whether the model includes an intercept (i), and the number of parameters in the model (p).
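For reference, the formula behind the ADJRSQ value that PROC REG reports is:

adjusted R² = 1 − ( (n − i)(1 − R²) ) / (n − p)

where i equals 1 if the model includes an intercept and 0 otherwise.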

Think of adjusted R-squared as a “penalized” version of R-squared, where adding more parameters increases the penalty. It only increases if the added terms enhance the model enough to justify the increased complexity.

Exploring the Data

First, let's explore the data and set up a macro variable that lists the interval variables.

proc sql number;
describe table STAT1.bodyfat2;
select * from STAT1.bodyfat2(obs=5);
quit;

%let interval = PctBodyFat1 PctBodyFat2 Density Age Weight Height
    			 Adioposity FatFreeWt Neck Chest Abdomen Hip Thigh 
    			 Knee Ankle Biceps Forearm Wrist;        

Here are the first five rows of the STAT1.bodyfat2 data.


First five rows of the STAT1.bodyfat2 data


List of variables in the STAT1.bodyfat2 table

Next, let's get some descriptive statistics.

proc means data=STAT1.bodyfat2 
            min max mean median var std nway maxdec=2;
    var &interval;
title 'Descriptive Statistics of Interval Variables in BodyFat2 Dataset';
run;

title;        


Next, let's look at the correlations between the variables.

proc corr data=stat1.bodyfat2;
var &interval;
run;        

Based on the correlation table, we select Abdomen and Chest as predictors that are strongly correlated with PctBodyFat2.
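If you prefer a more compact view, a WITH statement (a hedged variation on the program above, not part of the original output) limits the table to the correlation of each variable with the response, and the RANK option orders the results by absolute correlation:

proc corr data=stat1.bodyfat2 rank;
    var &interval;          /* candidate predictors; PctBodyFat2 correlates with itself at r = 1 */
    with PctBodyFat2;
run;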

Lastly, we use PROC UNIVARIATE to look at the extreme values of these three variables. The NEXTROBS=10 option requests the 10 most extreme observations from each end of the distribution.

proc univariate data=stat1.bodyfat2 nextrobs=10;
var PctBodyFat2 Chest Abdomen;
run;        


Extreme observations from PctBodyFat2


Extreme observations from Chest


Extreme observations from Abdomen

Fitting a Multiple Linear Regression Model Using PROC REG

In this program, we'll use PROC REG to run a linear regression model with two predictors. Then, we'll apply PROC GLM to fit the same model again, which will allow us to display some additional plots that PROC REG doesn't offer. We'll also save our results in an item store and use PROC PLM for further analysis.

In the PROC REG step, the MODEL statement designates PctBodyFat2 as the response variable, with Chest and Abdomen as predictors.

ods graphics on;

proc reg data=stat1.bodyfat2 ;
    model PctBodyFat2 = Chest Abdomen;
    title "Model with Chest and Abdomen";
run;
quit;        

The ANOVA table reveals that the model is statistically significant at the 0.05 level.

The ANOVA table

The R-square value of 0.6728 indicates that about 67% of the variation in PctBodyFat2 is explained by the two predictors, Chest and Abdomen. In the earlier simple regression, Chest alone accounted for just 49.37% of the variation. But is the higher R-square value due to a better model, or simply because we added another predictor?

R-square value when there are two predictor variables, Chest and Abdomen.
R-square value when there is one predictor variable, Chest.

To address this, we can compare the adjusted R-square values. The simpler model with Chest had an adjusted R-square of 0.4917, while the multiple regression model has an adjusted R-square of 0.6702, indicating that adding Abdomen improved the model without unnecessarily increasing complexity.

Parameter estimates when Chest and Abdomen are the predictor variables.

Now, let's examine the Parameter Estimates tables. In the earlier analysis, Chest showed a significant correlation with PctBodyFat2. However, when Abdomen is added to the model, the Chest estimate changes from 0.69 in the simple model to −0.26 here, and it is still statistically significant. This happens because the estimates for each predictor adjust to account for the presence of the other variable.

Parameter estimates when Chest was the only predictor variable.

While Abdomen remains a significant predictor of PctBodyFat2, Chest's estimate changes after controlling for Abdomen. This suggests that the two predictors are correlated. Before deciding which variables belong in the model, it is better to wait until more predictors have been considered.

This step demonstrates how adding predictors can affect the significance and estimates of the parameters, offering a more refined understanding of the data.

Chest and Abdomen are highly correlated.

The residual plots provide evidence of constant variance, and the Q-Q plot suggests normally distributed errors. While Abdomen does show a few outliers, the overall assumptions seem reasonable.


Model with Chest and Abdomen, Dependent Variable: PctBodyFat2

Next, we see the residuals plotted against the predictor variables. Patterns in these plots are indications of an inadequate model.

Residuals plotted against the predictor variables.

Running the same model in PROC GLM, we generate a contour plot that visually represents how well the model predicts PctBodyFat2. This plot uses colors to show predicted values, with dots representing the actual data. While PROC GLM and PROC REG yield similar statistical results, the graphical capabilities of PROC GLM provide more insight into the model's performance. Note, however, that PROC GLM doesn't report an adjusted R-square value.


Contour plot visually representing how well the model predicts PctBodyFat2

proc glm data=STAT1.bodyfat2 
         plots(only)=(contourfit);
    model PctBodyFat2 = Abdomen Chest;
    store out=multiple;
    title "Model with Chest and Abdomen";
run;
quit;        

Finally, using PROC PLM, we create additional visualizations like contour plots and slice plots to further explore the relationships between predictors and the response variable. These tools help visualize the effect of predictors at different levels, allowing for more intuitive interpretation of the results.

Ultimately, to move forward with a more complex model, our choice will depend on the research objective and subject-matter expertise, and we’ll explore model selection techniques to identify the best candidate models.

Next, we'll utilize PROC PLM to handle the item store generated by PROC GLM and produce additional visualizations. The EFFECTPLOT statement provides a way to display the fitted model with various customization options. Using the EFFECTPLOT option CONTOUR, we can create a contour plot that shows predicted values based on two continuous variables. We'll place Chest on the Y-axis and Abdomen on the X-axis.

The SLICEFIT option, on the other hand, draws curves of predicted values against a continuous variable, grouped by the levels of another variable. In this case, we'll look at the effect of Abdomen at different levels of Chest, with Chest values ranging from 75 to 150 in steps of 25.

proc plm restore=multiple plots=all;
    effectplot contour (y=Chest x=Abdomen);
    effectplot slicefit(x=Abdomen sliceby=Chest=75 to 150 by 25);
run; 

title;        

Take note that the contour plot lines may be positioned differently compared to those in the PROC GLM plot. Since the item store doesn’t include the raw data, PROC PLM can only show predicted values rather than actual observed data. While the PROC GLM contour plot is more informative, when the raw data isn’t available, PROC PLM provides a good overview of the relationships between the predictor variables and predicted outcomes.


The final plot, a slice plot, offers another way to visualize the two-predictor regression model. It shows PctBodyFat2 against Abdomen, with the regression lines reflecting different levels of Chest as specified in the code. These tools give you multiple ways to effectively communicate and understand your regression results.
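Because the item store contains the fitted model, it can also be used later to score new observations without refitting. Here is a minimal sketch, assuming a hypothetical data set work.new_men that contains Chest and Abdomen values:

proc plm restore=multiple;
    /* predicted values plus confidence limits for the mean, written to work.scored */
    score data=work.new_men out=work.scored predicted lclm uclm;
run;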



We’ve built a multiple regression model using two predictors, but where do we go from here? Next, we can incorporate the remaining predictors into a more comprehensive multiple regression model. With 16 predictors in total, there are numerous potential models to examine. As we’ve observed, the significance and coefficients of each predictor can vary based on which other predictors are included. So how do we determine the most appropriate model to proceed with? The answer depends largely on our research objectives and domain expertise. Fortunately, there are tools available to help narrow down the models to a more manageable set of options.

Handling Missing Values

If you have missing values for some subjects in a multiple regression, PROC GLM by default uses listwise deletion, meaning it will exclude all subjects with missing values in any of the predictor or response variables involved in the model. This can result in a loss of data and potentially reduce the power of your analysis.

However, there are alternative approaches to handle missing data more effectively:

  1. Imputation: You can impute the missing values using techniques like mean substitution, regression imputation, or multiple imputation (e.g., using PROC MI in SAS for multiple imputation and PROC MIANALYZE to combine results).
  2. Using PROC MIXED: If your data involves repeated measures or hierarchical structure, PROC MIXED can handle missing data under the assumption that it is missing at random (MAR). This method uses maximum likelihood estimation, which doesn't require complete data.
  3. Handling Missing Data in PROC GLM: If you want to use PROC GLM, you’ll need to impute the missing values beforehand because PROC GLM does not have built-in support for missing data beyond listwise deletion.


Recommendation:

For your analysis, consider:

  • If the missingness is minimal, you might accept the listwise deletion approach.
  • If the missingness is substantial, use PROC MI to perform multiple imputations, followed by PROC GLM for analysis.


Here's an example of multiple imputation:

proc mi data=stat1.bodyfat2 out=imputed_data nimpute=5 seed=12345;
   var Abdomen Chest PctBodyFat2; /* Variables with potential missing data */
run;

proc glm data=imputed_data;
   by _Imputation_;   /* fit the model separately to each of the 5 imputed data sets */
   model PctBodyFat2 = Abdomen Chest;
   ods output ParameterEstimates=glm_parms;   /* save estimates for later pooling */
   title "Analysis with Imputed Data";
run;
quit;
        

This approach helps maintain the integrity of your dataset while still leveraging PROC GLM for the regression analysis.
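To pool the results across the five imputations, PROC MIANALYZE can read the parameter estimates saved by the ODS OUTPUT statement in the GLM step above (glm_parms is the data set named there). This is a sketch of the usual MI/MIANALYZE workflow:

proc mianalyze parms=glm_parms;
    modeleffects Intercept Abdomen Chest;
    title "Combined Estimates Across Imputations";
run;

title;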
