5 - One-Way Anova in SAS

What Does One-Way Mean

In "One-way ANOVA," the term "one-way" indicates that only a single independent variable (factor) is used to test differences between groups. This means that the test is concerned with determining whether there is a statistically significant difference between the levels or categories of just one factor.

For example, if you're conducting a drug trial with three different doses (0 mg, 50 mg, 100 mg), you would use a "one-way ANOVA" to compare the effects of these doses. The "one-way" term signifies that only one factor (drug dose) is being examined.

In other words, we examine how the different groups (levels) of a single independent (predictor) variable affect the dependent (response) variable. Real analyses often involve more than one predictor, but for now we examine just one.


When Do We Use One-Way ANOVA Instead of a Two-Sample t-Test

A categorical predictor variable does not always have exactly two groups. When it has more than two and we want to test whether the means of these groups are all equal, we use one-way ANOVA instead of the two-sample t-test.

We want to determine whether at least one group of the predictor variable differs significantly from the others.


The Anova Hypothesis

  • Null Hypothesis (H0): All group means are equal. In other words, there is no significant difference between the means of the different groups.
  • Alternative Hypothesis (Ha): At least one group mean is different. This implies that there is a significant difference between at least one pair of group means.
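
The two hypotheses above can be written compactly as follows. This is a sketch in standard notation, not taken from the original figure: g (the number of groups) and the population means are assumed symbols, and the `align*` environment requires the amsmath package.

```latex
\begin{align*}
H_0 &: \mu_1 = \mu_2 = \dots = \mu_g \\
H_a &: \mu_i \neq \mu_j \quad \text{for at least one pair } i \neq j
\end{align*}
```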




Splitting Variation: Within-Groups Variance vs. Between-Groups Variance

In ANOVA, the objective is to assess whether the means of different groups are significantly different from each other. This is done by breaking down the overall variation in the response variable (measured by the Total Sum of Squares) into two parts:

  1. Between Group Variation (shown in the ANOVA table as the Model Sum of Squares) – This reflects how much the group means differ from the overall mean.
  2. Within Group Variation (shown as the Error Sum of Squares) – This reflects how much the individual data points within each group differ from their group mean.

By comparing these two sources of variation, we can determine whether the differences between the group means are large enough to reject the null hypothesis, which states that all group means are equal. Comparing the sources of variability enables us to evaluate the null hypothesis.

Within-groups variance measures how much the data points in each group differ from their own group mean. Between-groups variance, on the other hand, measures the differences between the group means themselves.

Within-Groups Variance vs. Between-Groups Variance


Relationship Between the Sums of Squares

  • Total Sum of Squares (SST) is the sum of the squared differences between each observed value and the overall mean. It is the sum of the model and error sums of squares.
  • Model Sum of Squares (SSM) is the weighted sum of the squared differences between the mean for each group and the overall mean.
  • Error Sum of Squares (SSE) is the sum of the squared differences between each observed value and the mean for its group.
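
The three definitions above can be sketched as formulas. The notation is an assumption made for illustration: y_ik is observation k in group i, n_i is the size of group i, g is the number of groups, and a dot in a subscript marks the index that has been averaged over. The weights in SSM are the group sizes n_i. The `align*` environment requires the amsmath package.

```latex
\begin{align*}
SST &= \sum_{i=1}^{g} \sum_{k=1}^{n_i} \left( y_{ik} - \bar{y}_{\cdot\cdot} \right)^2 \\
SSM &= \sum_{i=1}^{g} n_i \left( \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot} \right)^2 \\
SSE &= \sum_{i=1}^{g} \sum_{k=1}^{n_i} \left( y_{ik} - \bar{y}_{i\cdot} \right)^2 \\
SST &= SSM + SSE
\end{align*}
```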

Comparing the sources of variability enables us to evaluate the null hypothesis. If the within-group variability is large and the group means are close to each other, the data are consistent with the null hypothesis, which states that the means of the groups are not significantly different.


Coefficient of Determination

The coefficient of determination shows how well the independent variables explain the changes in the dependent variable.

R-square = SSM / SST: the weighted sum of the squared differences between the mean for each group and the overall mean, divided by the sum of the squared differences between each observed value and the overall mean.

Where:

  • Model Sum of Squares (SSM) measures the amount of variation in the dependent variable that is explained by the independent variables in the model.
  • Total Sum of Squares (SST) measures the total variation in the dependent variable.

The R-square value is close to 0 if the predictor variables do not explain much variability in the data, and close to 1 if the predictor variables explain a relatively large proportion of variability in the data.

  • R-square is commonly used in regression analysis to evaluate the goodness-of-fit of a model. It helps determine how well the independent variables predict the dependent variable.
  • However, a high R-square does not necessarily mean the model is the best; other factors like overfitting and the number of predictors also need to be considered.
  • Whether a given R-square counts as large or small depends on the context.

F Statistic and Critical Values

After we calculate the Model and Error Sums of Squares, we can evaluate our hypothesis by building the Analysis of Variance table.

Table shows how to calculate ANOVA table

If the p-value is less than the alpha level, we can conclude that at least one group mean is significantly different from the others.
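
The original table image is not available here, so the following is a sketch of the standard one-way ANOVA table layout rather than a copy of it. The symbols are assumptions: g is the number of groups and N is the total number of observations used. The p-value is the probability that an F distribution with (g - 1, N - g) degrees of freedom exceeds the observed F value.

```latex
\begin{tabular}{lllll}
Source & DF & Sum of Squares & Mean Square & F Value \\
\hline
Model & $g - 1$ & $SSM$ & $MSM = SSM / (g - 1)$ & $F = MSM / MSE$ \\
Error & $N - g$ & $SSE$ & $MSE = SSE / (N - g)$ & \\
Corrected Total & $N - 1$ & $SST$ & & \\
\end{tabular}
```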

The ANOVA Model

ANOVA, which stands for Analysis of Variance, is a statistical method used to compare the means of three or more groups to determine if there are statistically significant differences between them. The main goal of ANOVA is to identify whether the variation in the data is due to the differences between the group means or due to random variation within the groups.

Building the following mathematical model is a way to represent the relationship between the response and predictor variables in ANOVA:
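
The original model figure is not recoverable, so this is a sketch of the usual effects form of the one-way ANOVA model. The symbols other than the error term are assumptions: Y_ik is the response for observation k in group i, mu is the overall mean, and tau_i is the effect of group i.

```latex
\[
Y_{ik} = \mu + \tau_i + \varepsilon_{ik}
\]
```

Under the null hypothesis every tau_i is zero, so each observation varies around the same overall mean.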



Components of ANOVA Model

Residuals always sum to 0, regardless of the number of observations. It's impossible to know the sum of the squared residuals without knowing the value of each residual.
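
A short derivation shows why the residuals sum to zero. It is a sketch under assumed notation: the residual e_ik is the observed value minus its group mean, and n_i is the number of observations in group i. The sum is zero within every group, so it is zero overall. The `align*` environment requires the amsmath package.

```latex
\begin{align*}
e_{ik} &= y_{ik} - \bar{y}_{i\cdot} \\
\sum_{k=1}^{n_i} e_{ik} &= \sum_{k=1}^{n_i} y_{ik} - n_i \bar{y}_{i\cdot}
  = n_i \bar{y}_{i\cdot} - n_i \bar{y}_{i\cdot} = 0
\end{align*}
```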

How each value is evaluated in ANOVA Model

Assumptions of ANOVA

  • Independent observations. The error terms (εik) are uncorrelated.
  • Error terms are normally distributed. Histograms of the residuals and Q-Q Plots are used to check this assumption.
  • Error terms have equal variances across treatments or groups. PROC GLM is used to conduct a test of equal variances to verify this assumption.


The null hypothesis is that variances are equal for all populations.


Residuals must have equal variances across groups.


Performing a One-Way ANOVA Using PROC GLM in SAS

We want to see whether smoking status has an effect on age at death: do the smoking status groups differ?

So let's look at the sashelp.heart table and write its list of variables to the log.

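proc sql number;
    /* Write the table definition (variable names and types) to the log */
    describe table sashelp.heart;
    /* Print the first 10 rows; the NUMBER option adds a row-number column */
    select * from sashelp.heart(obs=10);
quit;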


First ten observations and list of variables in SASHELP.HEART
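
Before fitting the model, it can help to see how many deaths fall into each smoking group and what the group means look like. The following is a minimal sketch using PROC MEANS; the choice of statistics and the MAXDEC= setting are illustrative, not part of the original analysis.

```sas
/* Sketch: per-group count, mean and spread of age at death.           */
/* By default, rows with a missing Smoking_Status are excluded from    */
/* the CLASS groups, and N counts only non-missing AgeAtDeath values.  */
proc means data=sashelp.heart n mean std min max maxdec=2;
    class Smoking_Status;
    var AgeAtDeath;
run;
```

The number of rows in this output is the number of groups being compared, which is why a two-sample t-test is not enough here.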

We are now ready to perform a one-way ANOVA test using PROC GLM. We want to see whether the different groups of Smoking Status have equal Age At Death means.


H0: The Age At Death means of all Smoking Status levels are equal.

Ha: At least one of them is not equal.


  • In the PROC GLM statement, we specify the SASHELP.HEART data set, and include the PLOTS=diagnostics option to produce a panel display of the diagnostic plots.
  • The CLASS statement indicates our categorical predictor variable, Smoking_Status.
  • In the MODEL statement, we specify the dependent and independent variables as indicated in the ANOVA model, AgeAtDeath=Smoking_Status.
  • The MEANS statement computes the unadjusted means, or arithmetic means, of the dependent variable AgeAtDeath for each level of the specified effect, Smoking_Status.
  • We can also use the MEANS statement to test the assumption of equal variances. To do so, we add the HOVTEST=levene option, which performs Levene's test for homogeneity of variances (Levene's test is also what HOVTEST performs by default). The null hypothesis is that all group variances are equal. If the resulting p-value of Levene's test is greater than the chosen significance level, typically 0.05, we fail to reject the null hypothesis and conclude that group variances are not statistically different.

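/* Enable ODS Graphics so PROC GLM can produce the diagnostic plots */
ods graphics on;

proc glm data=SASHELP.HEART plots=diagnostics;
    class Smoking_Status;                      /* categorical predictor */
    model AgeAtDeath=Smoking_Status;           /* response = predictor */
    means Smoking_Status / hovtest=levene;     /* group means and Levene's test */
    title "One-Way ANOVA with Smoking Status as Predictor";
run;
quit;

title;  /* clear the title */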

  • Normally distributed error terms will be checked with the residual histogram and the Q-Q plot.
  • Independence of observations will be checked with the plot of residuals versus predicted values, where we look for patterns or trends.
  • Levene's test of homogeneity will be used to assess constant variance.


Let's review the results. The first table specifies the number of levels and the values of the class variable. The second table shows the number of observations read and the number of observations used in the model. If any row has missing data for a predictor or response variable, SAS drops that row from the analysis.


The second part of the output includes all the necessary information to test whether the treatment means are equal.

Since the p-value is less than 0.05 (p < .0001), you reject the null hypothesis of no difference between the means. Evidence suggests that at least one AgeAtDeath mean is different among the five levels of smoking status.


  • Based on the R-square value (0.088), we can say Smoking Status explains about 9% of the variability in Age At Death.
  • The Root MSE (10.06) is the square root of the mean squared error from the ANOVA table above. It is an estimate of the common standard deviation within the treatment groups.
  • The coefficient of variation (14.24) expresses the Root MSE as a percentage of the mean: the Root MSE (10.06) divided by the Age At Death mean (70.60), times 100. It's a unitless measure that's useful for comparing the variability of two sets of data with different units of measurement.
  • The Age At Death mean (70.60) is the overall mean, ignoring smoking status.
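
The coefficient of variation can be checked by hand from the rounded values reported above. This is only a rough check: the inputs are rounded to two decimals, so the result may differ in the last digit from the 14.24 that SAS prints, presumably because SAS uses unrounded values.

```latex
\[
CV = 100 \times \frac{\text{Root MSE}}{\text{Mean}}
   \approx 100 \times \frac{10.06}{70.60} \approx 14.2
\]
```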

Let's move to the next tables below.

  • The Type I sums of squares give the sums of squares accounted for by adding effects to the model sequentially. However, for a one-way analysis of variance, only a single effect is included in the model. Therefore, the values in this table are an exact duplicate of the model line in the ANOVA table above.

Before we can rely on the p-value for our model, we need to evaluate the underlying assumptions. To do this, let's examine the diagnostic plots. The plot in the upper left panel displays the residuals plotted against the fitted values from the ANOVA model.

Essentially, we're looking for a random scatter within each group. Any patterns or trends in this plot could suggest model misspecification. To assess the normality assumption, we should examine the residual histogram and the Q-Q plot, located at the bottom left and middle left, respectively. The histogram appears approximately symmetric, and the data points in the Q-Q plot closely follow the diagonal reference line. Both plots support the assumption of normally distributed errors.
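
If you want to examine the residuals outside the diagnostics panel, one option is to save them to a data set and run PROC UNIVARIATE on them. The following is a minimal sketch, not part of the original analysis; the data set name work.anova_resid and the variable names resid and pred are hypothetical.

```sas
/* Refit the same model without printed output and save the residuals. */
/* work.anova_resid, resid and pred are hypothetical names.            */
proc glm data=sashelp.heart noprint;
    class Smoking_Status;
    model AgeAtDeath = Smoking_Status;
    output out=work.anova_resid r=resid p=pred;
run;
quit;

/* Normality checks on the saved residuals: formal tests (NORMAL),     */
/* a histogram with a fitted normal curve, and a Q-Q plot.             */
proc univariate data=work.anova_resid normal;
    var resid;
    histogram resid / normal;
    qqplot resid / normal(mu=est sigma=est);
run;
```

With a sample this large, formal normality tests can reject for small departures that do not matter in practice, so the plots are usually the more useful guide.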

The other plots can be used to further assess assumptions and also identify possible outliers. The default plot that was created with this code is a box plot. In the box plots, potential outliers are evident in all groups except for Very Heavy, but the variability appears similar for all five levels of Smoking Status.


In the next table we can check the assumption of equal variances. The output in the Levene's Test for Homogeneity of AgeAtDeath Variance table is the result of the HOVTEST option in the MEANS statement. The null hypothesis is that the variances are equal for all Smoking Status groups.

The p-value of 0.1341 is not smaller than our alpha level of 0.05, and therefore, we do not reject the null hypothesis. Evidence suggests that the variances within each group of Smoking Status are not statistically different.
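
Had Levene's test rejected the null hypothesis, one common fallback would be Welch's variance-weighted ANOVA, which PROC GLM provides through the WELCH option of the MEANS statement. The sketch below shows how that could be requested. It is not needed for this data set, since the equal-variance assumption was not rejected.

```sas
/* Sketch: request Welch's ANOVA alongside Levene's test.              */
/* Welch's test does not assume equal group variances.                 */
proc glm data=sashelp.heart;
    class Smoking_Status;
    model AgeAtDeath = Smoking_Status;
    means Smoking_Status / hovtest=levene welch;
run;
quit;
```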

Since the model assumptions of independence, normal residuals, and constant variance have been satisfied, we can confidently trust our analysis results and conclude that there are statistically significant differences in Age At Death among individuals with different Smoking Status.

At this stage, we can determine that at least one group among Non-Smoker, Light, Moderate, Heavy, and Very Heavy Smoking Status differs from the others. However, to identify which specific group or groups are different, we'll perform multiple comparisons using ANOVA post hoc tests in the next article.


Class Statement

The CLASS statement generates a set of design variables, often called dummy variables, that capture the information within any categorical variables. Linear regression is then conducted on these design variables. ANOVA can essentially be seen as linear regression on dummy variables, with the main difference lying in the interpretation of the model.

Even if categorical variables are represented by numerical values like 1, 2, or 3, the CLASS statement instructs SAS to create design variables to represent these categories. If a numerically coded categorical variable is not included in the CLASS statement, PROC GLM would treat it as a continuous variable during regression analysis.

In PROC GLM, the number of design variables created corresponds to the number of levels of the CLASS variable. For instance, if a variable like Smoking Status has five levels, five design variables will be generated. Each design variable acts as a binary indicator for belonging to a specific level of the CLASS variable. Each observation in the dataset will be assigned values for all five of these new variables in PROC GLM.

However, in this parameterization scheme, the fifth design variable is always redundant when the other four are present. For example, if you know that Smoking Status is not level 1, 2, 3, or 4, there is no need for a fifth variable to indicate that it is level 5. Because the design variables are read sequentially, it is the fifth (last) design variable that is treated as redundant.
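
To see the design variables themselves, one option is PROC GLMMOD, which builds the same kind of design matrix that PROC GLM uses and can write it to a data set. This is a minimal sketch; the output data set name work.design is hypothetical.

```sas
/* Sketch: write the design matrix for the one-way model to a data set. */
/* work.design is a hypothetical name.                                  */
proc glmmod data=sashelp.heart outdesign=work.design noprint;
    class Smoking_Status;
    model AgeAtDeath = Smoking_Status;
run;

/* The first column is the intercept; the remaining columns are the     */
/* 0/1 indicators, one per level of Smoking_Status.                     */
proc print data=work.design(obs=5);
run;
```

In each row exactly one indicator column equals 1, which is why the indicators always add up to the intercept column and one of them is redundant.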

Note: If you want to see the regression equation estimates for the design variables, you can add the SOLUTION option to the MODEL statement in PROC GLM.

proc glm data=SASHELP.HEART plots=diagnostics;
    class Smoking_Status;
    model AgeAtDeath=Smoking_Status / solution;
    means Smoking_Status / hovtest=levene;
    title "One-Way ANOVA with Smoking Status as Predictor";
run;
quit;

title;


