9 - Two-Way ANOVA Using PROC GLM and Interactions
GÖKHAN YAZGAN
PL-300 Microsoft Certified Power BI Data Analyst Associate | Global SAS Certified Specialist: Base Programming Using SAS 9.4
When we have two categorical predictor variables, each with multiple groups, we use two-way ANOVA. We don't run a separate one-way ANOVA for each variable because we need to account for possible interactions between the categorical predictors. For example, if variable m has 4 levels and variable n has 5 levels, we should examine 4 * 5 = 20 possible combinations, or treatment groups, of the two predictors.
Applying the Two-Way ANOVA Model
Here is the model that takes possible interactions between the predictor variables into account.
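In standard textbook notation (the symbols below are an assumption, not the article's own), a two-way ANOVA model with interaction can be sketched as:

```latex
\[
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
\]
```

Here $Y_{ijk}$ is the response of the $k$-th observation in level $i$ of the first factor and level $j$ of the second, $\mu$ is the overall mean, $\alpha_i$ and $\beta_j$ are the main effects, $(\alpha\beta)_{ij}$ is the interaction effect, and $\varepsilon_{ijk}$ is the random error.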
Let's look at an example from the data set STAT1.drug. We want to analyze patient blood pressure, considering the effects of four drug dosages and three heart diseases.
1. First let us see the list of variables and what our data looks like:
/* Write the column definitions to the log, then print the first five rows. */
proc sql number;
    describe table STAT1.drug;
    select * from STAT1.drug(obs=5);
quit;
We want to examine the relationships between BloodP (Response-Continuous Variable) and DrugDose and Disease (Categorical Predictor Variables).
2. Let's see how many levels each grouping variable has.
/* List each distinct DrugDose-Disease combination, skipping missing values. */
PROC SQL NUMBER;
    SELECT DISTINCT DrugDose, Disease
    FROM STAT1.drug
    WHERE DrugDose IS NOT NULL AND Disease IS NOT NULL;
QUIT;
After running this code we see that there are four levels in DrugDose and three levels in Disease with 4 * 3 = 12 treatment groups.
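As a cross-check on the level counts, the NLEVELS option of PROC FREQ reports the number of distinct levels per variable directly. This is a minimal sketch that assumes the same STAT1.drug data set; the second TABLES statement is an optional extra that shows how many patients fall into each of the 12 cells.

```sas
/* Report the number of distinct levels of each classification variable. */
proc freq data=STAT1.drug nlevels;
    /* NOPRINT suppresses the one-way frequency tables; the NLevels table still prints. */
    tables DrugDose Disease / noprint;
    /* Cell counts for every Disease-by-DrugDose combination. */
    tables Disease*DrugDose / nopercent norow nocol;
run;
```

Unequal cell counts in the cross-tabulation would indicate an unbalanced design, which matters later when comparing Type I and Type III sums of squares.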
3. Let's review some modelling terms
4. Let's apply the two-way ANOVA model to our data set.
We'll fit the model to the STAT1.drug data set to assess the effect of Drug Dose and Disease Type on Blood Pressure.
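As with one-way ANOVA, there are three assumptions: the observations are independent, the errors are normally distributed, and the error variances are equal across all treatment groups.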
The null hypothesis for two-way ANOVA without interaction is that none of the effects in the model are statistically different. That is, no differences exist among the group means of Drug Dose and Disease Type.
When testing main effects only, we're looking for differences in Blood Pressure means among the four Drug Dose levels, or among the three Disease Type levels. In a model with interactions, the null hypothesis is that no differences exist among the 12 different combinations of Drug Dose and Disease Type.
Performing a Two-Way ANOVA Using PROC GLM
First we start by exploring the data with the PROC MEANS procedure in SAS.
ods graphics off;
/* NWAY keeps only the rows for the full DrugDose-by-Disease crossing. */
proc means data=STAT1.drug
           mean var std nway;
    class DrugDose Disease;
    var BloodP;
    title 'Descriptive Statistics of Blood Pressure';
run;
Here are the results.
We use the SGPLOT procedure to plot the mean Blood Pressure by DrugDose in a vertical line chart, with one line per Disease Type. The MARKERS option adds data point markers to the chart.
proc sgplot data=STAT1.drug;
    vline DrugDose / group=Disease stat=mean response=BloodP markers;
    format DrugDose dosefmt.;
run;
We see that for Disease A, increasing the Drug Dose lowers Blood Pressure, whereas for Disease B it has the opposite effect. This is only one possible illustration of an interaction, but any non-parallel lines indicate an interaction.
We'll use PROC GLM to first test only the main effects of DrugDose and Disease. Later, we'll incorporate the interaction suggested by our plot.
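The SGPLOT step applies the dosefmt. format, which is not defined anywhere in this article, so the step fails unless the format already exists in the session. A minimal sketch of such a format follows. The numeric codes 1 to 4 and the '50 mg' label are assumptions; only Placebo, 100 mg and 200 mg are named in the text.

```sas
/* Hypothetical definition of dosefmt.; the codes 1-4 and the '50 mg' label are assumed. */
proc format;
    value dosefmt
        1 = 'Placebo'
        2 = '50 mg'
        3 = '100 mg'
        4 = '200 mg';
run;
```

Run this once before any step that uses `format DrugDose dosefmt.;`. Because the PROC GLM steps specify ORDER=INTERNAL, the levels are still ordered by the unformatted values rather than alphabetically by label.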
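ods graphics on;
/* Main-effects model; ORDER=INTERNAL sorts levels by their unformatted values. */
proc glm data=STAT1.drug order=internal;
    class DrugDose Disease;
    model BloodP = Disease DrugDose;
    /* Tukey-adjusted pairwise comparisons of the DrugDose least squares means. */
    lsmeans DrugDose / diff adjust=tukey;
    title "Model with Disease and DrugDose as Predictors";
    format DrugDose dosefmt.;
run;
quit;
title;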
The model has 5 degrees of freedom: the two factors have 4 + 3 = 7 levels in total, and we subtract one for each factor, so 7 - 2 = 5. The statistically significant p-value indicates that not all group means are equal for at least one of the predictors, but it doesn't indicate which factor is responsible or which means differ.
We can determine which factor is responsible by looking at the table showing tests of individual effects. The R-square value, 0.184375, indicates that approximately 18% of the variability in Blood Pressure is explained by the two categorical predictors.
Let's look at the tables of individual effect tests based on Type I and Type III sums of squares. In the Type I table, each effect is tested sequentially and is adjusted for all effects listed before it. In other words, the order of the effects matters. The model specification determines the order in this table.
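Written out as a formula, with $a = 4$ Drug Dose levels and $b = 3$ Disease levels (the symbols $a$ and $b$ are introduced here only for illustration), the model degrees of freedom for the main-effects model are:

```latex
\[
df_{\text{model}} = (a - 1) + (b - 1) = (4 - 1) + (3 - 1) = 3 + 2 = 5
\]
```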
The test of Disease is an unadjusted test, because there are no other terms above it, whereas the Drug Dose test adjusts for the Disease, which appears before it. The test for Drug Dose asks whether the Drug Dose can explain the leftover variation in Blood Pressure after Disease has explained as much of the Blood Pressure variation as possible.
Typically, only Type III sums of squares tables are interpreted and reported for ANOVA. Type I sums of squares are more useful in, say, polynomial regression models, when we want to understand how higher-order terms sequentially benefit the model.
Unlike Type I sums of squares, the Type III values are not generally additive, and the values do not necessarily sum to the model sums of squares. In the Type III table, all listed effects are adjusted for all other effects in the table, so order is not important.
The Type III sum of squares for a variable, also called the partial sum of squares, is the increase in the model sum of squares due to adding the variable to a model that already contains all the other variables.
Judging from the p-values in the Type III sums of squares table, there seem to be no significant differences across levels of Drug Dose, with a p-value of 0.9425, but there are significant differences across the Disease variable, with a p-value less than .0001. That is, even after you control for the effects of Drug Dose, the Disease variable still explains significant differences in Blood Pressure.
Our earlier plot raised a suspicion of interaction, so let's find out.
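One way to see the order dependence of Type I tests is to refit the main-effects model with the two effects swapped. The sketch below assumes the same STAT1.drug data set; the SS1 and SS3 options on the MODEL statement request only the Type I and Type III tables.

```sas
/* Same main-effects model, but with DrugDose listed before Disease. */
proc glm data=STAT1.drug order=internal;
    class DrugDose Disease;
    /* SS1 and SS3 limit the output to the Type I and Type III tables. */
    model BloodP = DrugDose Disease / ss1 ss3;
    title "Main-Effects Model with the Effect Order Reversed";
run;
quit;
title;
```

If the design is unbalanced, the Type I rows should differ from those of the original ordering, while the Type III rows should be unchanged because each effect is adjusted for all the others.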
When we take the interaction into account, the MODEL statement includes the interaction term Disease*DrugDose. We used an asterisk to specify the interaction effect, but we could instead use a vertical bar between the main effects to specify the full factorial model. The DIFF option in the LSMEANS statement computes and compares least squares means of the model effects.
Including the interaction term in the LSMEANS statement provides the least squares means of all 12 groups of the crossed factors. We added the SLICE= option to slice the interaction effect by the different levels of Disease. Each slice will have one Disease level and will show the Drug Dose effect across that slice.
The STORE statement saves the results in an item store named interact so that we can perform further analysis post model fitting.
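ods graphics on;
/* Two-way ANOVA with interaction; request only the interaction plot. */
proc glm data=STAT1.drug
         order=internal
         plots(only)=intplot;
    class DrugDose Disease;
    model BloodP = Disease DrugDose Disease*DrugDose;
    /* Compare the 12 cell means and test DrugDose within each Disease level. */
    lsmeans Disease*DrugDose / diff slice=Disease;
    format DrugDose dosefmt.;
    /* Save the fitted model for later use in PROC PLM. */
    store out=interact;
    title "Model with Disease and Drug Dose as Interacting Predictors";
run;
quit;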
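For comparison, here is a minimal sketch of the vertical-bar form of the same model. It assumes the same data set and omits the LSMEANS, FORMAT and STORE statements to stay short.

```sas
/* Disease|DrugDose expands to Disease DrugDose Disease*DrugDose. */
proc glm data=STAT1.drug order=internal;
    class DrugDose Disease;
    model BloodP = Disease|DrugDose;
    title "Factorial Model Specified with Bar Notation";
run;
quit;
title;
```

With only two factors the saving is small, but the bar notation becomes more convenient as the number of crossed factors grows.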
The model now has 11 degrees of freedom: three for the main effect of Drug Dose, two for the main effect of Disease, and 3 times 2, or six, for the interaction term. The overall model is statistically significant.
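The same count can be written as a formula, with $a = 4$ Drug Dose levels and $b = 3$ Disease levels (symbols introduced here only for illustration):

```latex
\[
df_{\text{model}} = (a - 1) + (b - 1) + (a - 1)(b - 1) = 3 + 2 + 6 = 11 = ab - 1
\]
```

The last equality shows that the interaction model uses one degree of freedom per treatment group, minus one for the overall mean: 12 - 1 = 11.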
The R-square, 0.347918, tells us that this model explains approximately 35% of the variability in Blood Pressure. This is an improvement from the 18% that was explained by the model with only main effects.
What about the interaction term? Is it significant, or should it be removed? In this case, the p-value, less than 0.0001, indicates that the interaction effect is statistically significant at the .05 alpha level.
This means that the effect of Drug Dose differs at different levels of Disease, and vice versa. Given the significance of the interaction, it should stay in the model. To maintain model hierarchy, all effects contained within significant interactions should also remain in the model, regardless of their p-value.
The interaction model reflects the data more accurately than the main effects model. In the previous main effects model, it seemed that Drug Dose wasn't related to Blood Pressure. Now we can see that it is related, but in a more complex way.
Drug Dose is important, but only for some Diseases, and we see that only through the interaction. So, what does the significant interaction mean? Let's dissect the Drug Dose crossed with Disease interaction in three ways.
Let's look at the interaction plot for Blood Pressure, a line plot overlaid with all the observations of the data set.
This plot shows that the effect of the Drug Dose on patients' Blood Pressure differs by type of Disease.
The least squares means table displays the mean Blood Pressure for every combination of Drug Dose and Disease.
The matrix displays p-values for every comparison between the means.
For example, if we want to test the null hypothesis that the mean Blood Pressure of patients given Placebo with Disease A is equal to the mean Blood Pressure of patients given the 100 mg Drug Dose with Disease A, we would compare mean 1 to mean 7. The means are 1.33 and -26.23, respectively. The p-value for this comparison is p=0.0012, which indicates a statistically significant difference. These tables were produced by the DIFF option in the LSMEANS statement.
We can also make sense of the interaction by looking at tests of simple effects, which were requested through the SLICE= option. These tests compare the means for one factor at a particular level of the other factor. Let's focus on the slice analysis of this model. The displayed tests are of Drug Dose within each slice, or level of Disease. The first p-value looks at the homogeneity of means within Disease group A across all the levels of Drug Dose.
The p-value for the Disease C slice shows that there is no significant difference in the Blood Pressure means across Drug Dose when Disease is C. There is a statistically significant Drug Dose effect for patients with Diseases A and B.
This table supports our finding from visually interpreting the interaction plot. That is, the Blood Pressure of patients might be affected by the interaction of Disease and the Drug Dose a patient is subjected to. The note below the table reminds you that these p-values are not adjusted for multiple tests. Later we'll use the item store created in this step to adjust for multiple comparison tests.
Let's check the log to verify that the item store was created. We see that the results were saved in the temporary item store, work.interact.
The STORE Statement
The STORE statement in SAS/STAT procedures saves model fit information into an item store, which allows for future access without re-running the model or needing the original data. It can be used with various procedures like GENMOD, LOGISTIC, MIXED, and others. This is useful when analysis takes a long time or when data access is limited. Once saved, you can use PROC PLM to perform further tests and analyses on the stored model. The syntax is: STORE <OUT=>item-store-name </ LABEL='label'>. If a store already exists with the same name, it gets replaced.
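As a minimal sketch of that syntax, the step below refits the interaction model, attaches a label to the item store, and then uses PROC PLM to list what was stored. The label text is a hypothetical example, and the step overwrites any existing item store named interact.

```sas
/* Fit the interaction model and save it with a descriptive label. */
proc glm data=STAT1.drug order=internal;
    class DrugDose Disease;
    model BloodP = Disease DrugDose Disease*DrugDose;
    /* The label text is illustrative only. */
    store out=interact / label='Two-way ANOVA of BloodP with interaction';
run;
quit;

/* Restore the item store and display the model effects it contains. */
proc plm restore=interact;
    show effects;
run;
```

A one-level name such as interact is written to the WORK library, so it disappears when the SAS session ends.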
Performing Post-Processing Analysis Using PROC PLM
In the previous two-way ANOVA, we observed a larger Drug Dose effect for patients with Diseases A and B, but the p-values weren't adjusted for multiple tests. We'll use PROC PLM to access the item store and make the adjustments without refitting the ANOVA model.
We ran the previous PROC GLM step earlier and saved the results in a temporary item store. We're in the same SAS session, so the item store is still available.
In the PROC PLM statement, the RESTORE= option specifies the item store, interact. The PLOTS= option produces all the available ODS plots for the statements that we include in the step. This includes an effect plot by default, but we've added an EFFECTPLOT statement to request interaction plots sliced by disease, and the CLM option to request confidence limits for the means. The SLICE statement requests tables for the interaction term, Disease crossed with Drug Dose, sliced by the different levels of disease. The ADJUST=TUKEY option will adjust the p-values for multiple comparison tests. Remember that the SLICEBY= syntax in the SLICE statement is different from SLICE= in the LSMEANS statement. Let's run PROC PLM.
/* Post-fitting analysis on the stored model; no refit is needed. */
proc plm restore=interact plots=all;
    /* Tukey-adjusted DrugDose comparisons within each Disease level. */
    slice Disease*DrugDose / sliceby=Disease adjust=tukey;
    /* Interaction plot with confidence limits for the means. */
    effectplot interaction(sliceby=Disease) / clm;
run;
title;
The Store Information table describes the item store, including its name, location, the data set from which it was created, the procedure that was used to create it, the response and class variables, and the model effects.
Then we see a class level information table and a series of F tests, least squares means, and diffograms. There's one set for each disease slice. The diffograms were produced because we specified PLOTS=ALL and a SLICE statement.
Let's look at the first slice, where the disease is A. In the overall F test for Drug Dose by Disease, the p-value is the same as in the PROC GLM results.
The least squares means table shows all the pairwise comparisons of Drug Dose within the Disease level, A. Here we get the unadjusted p-values as well as the Tukey adjusted p-values.
The diffogram shows that, when we hold Disease constant at A, the only significant differences among the Blood Pressure means are for the Placebo vs. 100 mg and Placebo vs. 200 mg comparisons.
Let's look at the analysis for Disease B. Looking at the pairwise comparisons of Drug Dose within Disease level B, we see that the only statistically significant pairwise comparisons are Placebo vs. 100 mg and Placebo vs. 200 mg. Here we should read the Adj P column, which holds the Tukey-Kramer adjusted p-values for multiple comparisons.
The blue lines in the corresponding diffogram indicate a significant difference in the mean Blood Pressure of patients with Disease B for the Placebo vs. 100 mg and Placebo vs. 200 mg comparisons. For these patients, the estimated mean differences in Blood Pressure for the two comparisons are -32.92 and -31.36, respectively.
What about Disease C? The least squares means table shows all the pairwise comparisons of Drug Dose within Disease level C. The adjusted p-values show that there are no significant differences between drug doses.
The diffogram also shows there are no significant differences among Blood Pressure means when we hold Disease constant at C.
Finally, the fit plot, produced by the EFFECTPLOT statement with the CLM option, includes the confidence limits for the means.
We can see larger confidence intervals for the Blood Pressure means that were based on small sample sizes. We can also clearly see the effects of different Drug Doses on different Diseases.