5 - One-Way Anova in SAS
GÖKHAN YAZGAN
PL-300 Microsoft Certified Power BI Data Analyst Associate | Global SAS Certified Specialist: Base Programming Using SAS 9.4
What Does One-Way Mean
In "One-way ANOVA," the term "one-way" indicates that only a single independent variable (factor) is used to test differences between groups. This means that the test is concerned with determining whether there is a statistically significant difference between the levels or categories of just one factor.
For example, if you're conducting a drug trial with three different doses (0 mg, 50 mg, 100 mg), you would use a "one-way ANOVA" to compare the effects of these doses. The "one-way" term signifies that only one factor (drug dose) is being examined.
In other words, we examine how the different groups (levels) of one independent (predictor) variable affect the dependent (response) variable. A study may well involve more than one predictor, but for now we examine just one of them.
When Do We Use One-Way ANOVA and Not a Two-Sample t-Test
A categorical predictor variable does not always have exactly two groups. When it has more than two and we want to test whether the means of these groups are equal, we use one-way ANOVA instead of the two-sample t-test.
We want to determine whether one or more groups of the predictor variable differ significantly from the others.
The ANOVA Hypothesis
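The hypotheses tested by a one-way ANOVA can be written compactly as below. This is a sketch in standard notation that does not appear in the original text: k is assumed to be the number of groups and \mu_i the population mean of group i.

```latex
\[
\begin{aligned}
H_0 &: \mu_1 = \mu_2 = \dots = \mu_k \\
H_a &: \text{at least one } \mu_i \text{ differs from the others}
\end{aligned}
\]
```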
Splitting Variation: Within-Groups Variance vs. Between-Groups Variance
In ANOVA, the objective is to assess whether the means of different groups are significantly different from each other. This is done by breaking down the overall variation in the response variable (measured by the Total Sum of Squares) into two parts: within-groups variation and between-groups variation.
Within-groups variance measures how much the data points in each group differ from their respective group mean. Between-groups variance, on the other hand, measures the differences between the group means themselves.
By comparing these two sources of variation, we can determine whether the differences between the group means are large enough to reject the null hypothesis, which states that all group means are equal.
If the within-group variability is large and the group means are close to each other, the data are consistent with the null hypothesis, which implies that the group means are not significantly different. If the between-group variability is large relative to the within-group variability, we have evidence against the null hypothesis.
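The split of the total variation described above can be sketched in formulas. The notation is an assumption made for illustration: Y_{ij} is observation j in group i, n_i is the size of group i, k is the number of groups, \bar{Y}_{i\cdot} is the mean of group i, and \bar{Y}_{\cdot\cdot} is the overall mean. SST, SSM, and SSE stand for the Total, Model (between-groups), and Error (within-groups) Sums of Squares.

```latex
\[
\begin{aligned}
SST &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{Y}_{\cdot\cdot} \right)^2
      && \text{(total variation)} \\
SSM &= \sum_{i=1}^{k} n_i \left( \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \right)^2
      && \text{(between groups)} \\
SSE &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{Y}_{i\cdot} \right)^2
      && \text{(within groups)} \\
SST &= SSM + SSE
\end{aligned}
\]
```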
Coefficient of Determination
The coefficient of determination (R-square) shows how well the independent variables explain the changes in the dependent variable. It is the proportion of the total variability (Total Sum of Squares) that is accounted for by the model (Model Sum of Squares).
The R-square value is close to 0 if the predictor variables do not explain much variability in the data, and close to 1 if the predictor variables explain a relatively large proportion of variability in the data.
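As a formula, the coefficient of determination can be sketched as follows. SSM, SSE, and SST are assumed to denote the Model, Error, and Total Sums of Squares, with SST = SSM + SSE.

```latex
\[
R^2 = \frac{SSM}{SST} = 1 - \frac{SSE}{SST}, \qquad 0 \le R^2 \le 1
\]
```

For instance, if SSM were 30 and SST were 120, R-square would be 0.25, meaning the predictor accounts for a quarter of the variability in the response. These numbers are purely illustrative and are not taken from the analysis in this article.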
F Statistic and Critical Values
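After we calculate the Model and Error Sums of Squares, we can evaluate our hypothesis by building the Analysis of Variance table. Dividing each sum of squares by its degrees of freedom gives a mean square, and the F statistic is the ratio of the Model Mean Square to the Error Mean Square.
If the p-value is less than the alpha level, we can conclude that the group means are significantly different.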
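The quantities in the Analysis of Variance table can be sketched as follows. The notation is an assumption for illustration: k is the number of groups, N is the total number of observations used, and SSM and SSE are the Model and Error Sums of Squares.

```latex
\[
\begin{aligned}
MSM &= \frac{SSM}{k - 1}, \qquad MSE = \frac{SSE}{N - k}, \\
F   &= \frac{MSM}{MSE}, \\
p   &= P\left( F_{k-1,\,N-k} \ge F \right)
\end{aligned}
\]
```

Under the null hypothesis the F statistic follows an F distribution with k - 1 and N - k degrees of freedom, so a large F (and hence a small p-value) indicates that the between-groups variation is large relative to the within-groups variation.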
The ANOVA Model
ANOVA, which stands for Analysis of Variance, is a statistical method used to compare the means of three or more groups to determine if there are statistically significant differences between them. The main goal of ANOVA is to identify whether the variation in the data is due to the differences between the group means or due to random variation within the groups.
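The relationship between the response and predictor variables in ANOVA can be represented by a mathematical model in which each observation equals an overall mean, plus the effect of its group, plus a random error. The residual for an observation is the difference between its observed value and its fitted value, which here is its group mean.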
Residuals always sum to 0, regardless of the number of observations. It's impossible to know the sum of the squared residuals without knowing the value of each residual.
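The one-way ANOVA model and the residual property mentioned above can be sketched as follows. The symbols are assumptions for illustration: Y_{ij} is observation j in group i, \mu is the overall mean, \tau_i is the effect of group i, \varepsilon_{ij} is the random error, and \bar{Y}_{i\cdot} is the sample mean of group i.

```latex
\[
\begin{aligned}
Y_{ij} &= \mu + \tau_i + \varepsilon_{ij} \\
e_{ij} &= Y_{ij} - \hat{Y}_{ij} = Y_{ij} - \bar{Y}_{i\cdot} \\
\sum_{j=1}^{n_i} e_{ij} &= \sum_{j=1}^{n_i} Y_{ij} - n_i \bar{Y}_{i\cdot} = 0
\quad \text{for every group } i
\end{aligned}
\]
```

Because the residuals sum to zero within each group, they also sum to zero overall. The sum of squared residuals, by contrast, depends on the individual values.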
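Assumptions of ANOVA
ANOVA assumes that the observations are independent, that the residuals are normally distributed, and that the variance is constant (equal) across the groups.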
Performing a One-Way ANOVA Using PROC GLM in SAS
We want to see whether the different smoking status groups have an effect on age of death under 60: is there a difference?
So let's look at the sashelp.heart table and write its list of variables to the log.
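/* DESCRIBE TABLE writes the table definition (variable list) to the log */
proc sql number;
    describe table sashelp.heart;
    /* Print the first 10 rows; NUMBER adds a row-number column */
    select * from sashelp.heart(obs=10);
quit;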
We are now ready to perform a one-way ANOVA using PROC GLM. We want to see whether the different groups of Smoking Status have equal Age At Death means.
H0: The means of all Smoking Status levels are equal.
Ha: At least one of them is not equal.
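ods graphics on;
/* Fit the one-way ANOVA and request the diagnostics panel */
proc glm data=SASHELP.HEART plots=diagnostics;
    class Smoking_Status;
    model AgeAtDeath = Smoking_Status;
    /* Levene's test for homogeneity of variance across groups */
    means Smoking_Status / hovtest=levene;
    title "One-Way ANOVA with Smoking Status as Predictor";
run;
quit;
title;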
Let's review the results. The first table specifies the number of levels and the values of the class variable. The second table shows the number of observations read and the number of observations used in the model. If any row has missing data for a predictor or response variable, SAS drops that row from the analysis.
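To see in advance how many rows would be dropped for missing values, one option is to count the missing values of the two analysis variables before fitting the model. The following is a minimal sketch that uses only the variables already named in this article; it is a suggested check, not part of the original analysis.

```sas
/* Count non-missing (N) and missing (NMISS) values of the response */
proc means data=sashelp.heart n nmiss;
    var AgeAtDeath;
run;

/* Tabulate the predictor; the MISSING option shows missing values as their own level */
proc freq data=sashelp.heart;
    tables Smoking_Status / missing;
run;
```

Rows that are missing either AgeAtDeath or Smoking_Status explain the gap between the number of observations read and the number used.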
The second part of the output includes all the necessary information to test whether the treatment means are equal.
Since the p-value is less than 0.05 (p < .0001), you reject the null hypothesis of no difference between the means. Evidence suggests that at least one AgeAtDeath mean is different across the five levels of smoking status.
Let's move on to the next tables below.
Before we can rely on the p-value for our model, we need to evaluate the underlying assumptions. To do this, let's examine the diagnostic plots. The plot in the upper left panel displays the residuals plotted against the fitted values from the ANOVA model.
Essentially, we're looking for a random scatter within each group. Any patterns or trends in this plot could suggest model misspecification. To assess the normality assumption, we should examine the residual histogram and the Q-Q plot, located at the bottom left and middle left, respectively. The histogram appears approximately symmetric, and the data points in the Q-Q plot closely follow the diagonal reference line. Both plots support the assumption of normally distributed errors.
The other plots can be used to further assess the assumptions and to identify possible outliers. The default plot created by this code is a box plot. In the box plots, potential outliers are evident in all groups except Very Heavy, but the variability appears similar for all five levels of Smoking Status.
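The visual normality check described above can be supplemented by saving the residuals and examining them directly. The following is a sketch under stated assumptions: the data set name work.glm_resid and the variable names resid and fitted are hypothetical names chosen for illustration, and this step is not part of the original analysis.

```sas
/* Refit the model without printed output and save residuals and fitted values */
proc glm data=sashelp.heart noprint;
    class Smoking_Status;
    model AgeAtDeath = Smoking_Status;
    output out=work.glm_resid r=resid p=fitted;  /* hypothetical names */
run;
quit;

/* NORMAL requests normality tests; the plots mirror the diagnostics panel */
proc univariate data=work.glm_resid normal;
    var resid;
    histogram resid / normal;
    qqplot resid / normal(mu=est sigma=est);
run;
```

With a large number of observations, formal normality tests can flag small departures that matter little in practice, so the histogram and Q-Q plot usually remain the main guide.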
In the next table, we can check the assumption of equal variances. The output in the Levene's Test for Homogeneity of AgeAtDeath Variance table is the result of the HOVTEST option in the MEANS statement. The null hypothesis is that the variances are equal for all Smoking Status groups.
The p-value of 0.1341 is not smaller than our alpha level of 0.05, and therefore we do not reject the null hypothesis. There is no evidence that the variances within the Smoking Status groups are different.
Since the model assumptions of independence, normal residuals, and constant variance appear to be satisfied, we can trust our analysis results and conclude that there are statistically significant differences in Age At Death among individuals with different Smoking Status.
At this stage, we can determine that at least one group among Non-Smoker, Light, Moderate, Heavy, and Very Heavy Smoking Status differs from the others. However, to identify which specific group or groups are different, we'll perform multiple comparisons using ANOVA post hoc tests in the next article.
Class Statement
The CLASS statement generates a set of design variables, often called dummy variables, that capture the information within any categorical variables. Linear regression is then conducted on these design variables. ANOVA can essentially be seen as linear regression on dummy variables, with the main difference lying in the interpretation of the model.
Even if categorical variables are represented by numerical values like 1, 2, or 3, the CLASS statement instructs SAS to create design variables to represent these categories. If a numerically coded categorical variable is not included in the CLASS statement, PROC GLM would treat it as a continuous variable during regression analysis.
In PROC GLM, the number of design variables created corresponds to the number of levels of the CLASS variable. For instance, if a variable like Smoking Status has five levels, five design variables will be generated. Each design variable acts as a binary indicator for belonging to a specific level of the CLASS variable. Each observation in the dataset will be assigned values for all five of these new variables in PROC GLM.
However, in this parameterization scheme, the fifth design variable is always redundant when the other four are present. For example, if you know that Smoking Status is not level 1, 2, 3, or 4, then it must be level 5, so there is no need for a fifth variable to indicate that. Because the design variables are read sequentially, it is the fifth (last) design variable that is considered redundant.
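The redundancy described above can be sketched algebraically. The names D_1 through D_5 (the five design variables for Smoking Status) and \beta_0 through \beta_4 (the regression coefficients) are assumptions made for illustration.

```latex
\[
\begin{aligned}
D_1 + D_2 + D_3 + D_4 + D_5 &= 1
   \quad \text{for every observation} \\
D_5 &= 1 - \left( D_1 + D_2 + D_3 + D_4 \right) \\
Y &= \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \varepsilon
\end{aligned}
\]
```

Since the five indicators always add up to the intercept column, the fifth carries no new information. In the fitted equation, the intercept then plays the role of the mean of the last level, and each remaining coefficient is the difference between its level's mean and that of the last level.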
Note: If you want to see the regression equation estimates for the design variables, you can add the SOLUTION option to the MODEL statement in PROC GLM.
proc glm data=SASHELP.HEART plots=diagnostics;
    class Smoking_Status;
    /* SOLUTION prints the parameter estimates for the design variables */
    model AgeAtDeath = Smoking_Status / solution;
    means Smoking_Status / hovtest=levene;
    title "One-Way ANOVA with Smoking Status as Predictor";
run;
quit;
title;