6 - ANOVA Post Hoc Tests
It all comes down to a trade-off between Type I and Type II error rates, depending on the context.

Post hoc tests, also known as multiple-comparison procedures, are used to identify which specific pairs of groups differ significantly from each other. Additionally, these tests help control the experiment-wise Type I error rate, ensuring that the probability of making at least one false-positive conclusion across all tests remains below the chosen alpha level, typically set at 0.05.

Multiple Comparison Methods

We compare group means pairwise to determine which groups differ from one another.

However, when you conduct a single statistical test at an α level of 0.05, there is a 5% chance of incorrectly rejecting the null hypothesis when it is actually true. So, without adjustment, the chance of making at least one Type I error grows as the number of compared groups increases.

Multiple comparisons can increase the Type I error rate for the experiment if not properly controlled with post hoc techniques. This means that without adjustments, the likelihood of incorrectly rejecting the null hypothesis when assessing differences in means will rise.

The comparisonwise error rate, or CER, is the probability of a Type I error on a single pairwise test. The experimentwise error rate, or EER, is the probability of making at least one Type I error when you perform the entire set of comparisons.

Assuming independent comparisons, the experimentwise error rate is EER = 1 − (1 − α)^nc, where α is the comparisonwise significance level and nc is the number of comparisons.

We need to use a method that controls the EER at a level like 0.05.
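To see why control is needed, the formula above is easy to evaluate numerically. A minimal Python sketch (illustration only, not part of the SAS workflow):

```python
# Experimentwise error rate (EER) for nc independent comparisons at level alpha:
#   EER = 1 - (1 - alpha) ** nc
def experimentwise_error_rate(alpha: float, nc: int) -> float:
    """Probability of at least one Type I error across nc independent tests."""
    return 1 - (1 - alpha) ** nc

# With alpha = 0.05 the EER climbs quickly as comparisons accumulate.
for nc in (1, 3, 10):
    print(nc, round(experimentwise_error_rate(0.05, nc), 4))
```

With 10 comparisons, the chance of at least one false positive is already about 40%, far above the nominal 0.05.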

Tukey's and Dunnett's Multiple Comparison Methods

Tukey's and Dunnett's Multiple Comparison Methods are statistical techniques used in the analysis of experimental data, particularly when comparing multiple groups. These methods help identify significant differences between group means but are applied in different contexts.

Tukey's Multiple Comparison Method (Tukey's HSD)

Tukey's HSD (Honestly Significant Difference) method is used to compare all possible pairs of group means after conducting an analysis of variance (ANOVA). It’s designed to identify significant differences between any two groups among several.

  • Application: Used when you want to compare all possible pairs of group means.
  • Purpose: To determine if there are statistically significant differences between the means of every possible pair of groups.
  • Features: Calculates p-values for all possible pairwise comparisons. Controls the Type I error rate, meaning it reduces the likelihood of falsely identifying a difference as significant. Works well under the assumption of equal variances across groups.

Dunnett's Multiple Comparison Method

Dunnett's Test is specifically used to compare each treatment group to a single control group. Unlike Tukey's method, it does not compare every possible pair of groups but focuses on differences between the control group and each treatment group.

  • Application: Used when comparisons are made between a control group and several treatment groups.
  • Purpose: To determine if there are significant differences between the control group and each of the other groups.
  • Features: Only involves comparisons between the control group and each of the other groups, which reduces the number of tests. Also controls the Type I error rate. Particularly suitable for studies where the primary interest is in comparing treatment groups against a control.

Differences

  • Number of Comparisons: Tukey's HSD tests all k(k − 1)/2 pairs of group means, while Dunnett's tests only the k − 1 comparisons against the control.
  • Use Cases: Tukey's HSD suits exploratory, all-pairs comparisons; Dunnett's suits designs where a single control group is the reference.
  • Error Rate Control: Both control the experimentwise error rate at α, but because Dunnett's makes fewer comparisons its adjustment is less severe.

In summary, the choice between Tukey's and Dunnett's methods depends on the study design and the specific comparisons of interest. Tukey's HSD is used when comparing all groups to each other, while Dunnett's Test is used when the primary interest is in comparing several groups to a single control group.

Both methods control the EER at no more than the α level.
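The key structural difference is how many comparisons each method must account for. A quick Python sketch of the counts (illustration only):

```python
def tukey_comparisons(k: int) -> int:
    """Number of comparisons Tukey's HSD makes: all pairs among k groups."""
    return k * (k - 1) // 2

def dunnett_comparisons(k: int) -> int:
    """Number of comparisons Dunnett's test makes: each treatment vs. one control."""
    return k - 1

# The gap widens as the number of groups grows, which is why Dunnett's
# adjustment is milder for the same pairs.
for k in (3, 5, 10):
    print(k, tukey_comparisons(k), dunnett_comparisons(k))
```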

Numerous other multiple comparison methods are available; they differ in how strictly they control the experimentwise error rate. Decreasing the Type I error rate raises the Type II error rate, meaning it lowers statistical power. In certain scenarios a Type I error is more detrimental than a Type II error, or vice versa.

Because the Tukey adjustment accounts for more pairwise comparisons than the Dunnett adjustment, Dunnett's method reports the same pairs with smaller adjusted p-values.



Situational Considerations:

  • In some studies, particularly in clinical trials, a Type 1 error can have severe consequences (e.g., incorrectly concluding that a treatment is effective). In these cases, methods that strongly control for Type 1 error, like Bonferroni or Holm’s method, are preferable.
  • In exploratory research, where the goal is to identify potential differences for further study, a higher Type 2 error might be acceptable, and methods like Tukey’s HSD or Dunnett’s test could be more appropriate.

Examples of Multiple Comparison Procedures:

  • Bonferroni Correction: Very conservative, controls Type I error strictly by dividing the alpha level by the number of comparisons.
  • Holm’s Method: A step-down procedure that is less conservative than Bonferroni but still controls the experimentwise error rate.
  • Tukey's HSD: Controls the experimentwise error rate for all pairwise comparisons.
  • Dunnett’s Test: Specifically controls Type I error for comparisons against a control group.
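To make the first two procedures concrete, here is a minimal pure-Python sketch; the raw p-values are hypothetical, chosen only to show the adjustments:

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of comparisons (cap at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down: the i-th smallest p-value is multiplied by (m - i),
    then monotonicity is enforced so adjusted p-values never decrease."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03]  # hypothetical unadjusted p-values
print([round(p, 4) for p in bonferroni(raw)])  # -> [0.03, 0.12, 0.09]
print([round(p, 4) for p in holm(raw)])        # -> [0.03, 0.06, 0.06]
```

Note that Holm's adjusted p-values are never larger than Bonferroni's, which is why it is the less conservative of the two while still controlling the EER.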

The choice of method depends on the study's context, the consequences of making errors, and the need to balance sensitivity (power) with the control of false positives.

There is a balance between Type I and Type II errors, and the importance of each error type can vary depending on the situation.

  • Type I Error: This occurs when we incorrectly conclude that there is a difference or effect when there isn’t one (i.e., a false positive). For example, believing that a plant compound cures cancer when it actually doesn’t.
  • Type II Error: This occurs when we fail to detect a real difference or effect (i.e., a false negative). For instance, not identifying an effective cancer treatment.

Key points:

  1. Balancing the Error Types: When you try to reduce Type I errors, you often increase the risk of Type II errors. This means that by trying to avoid false positives (Type I errors), you may lower your ability to detect true effects (statistical power). Statistical power is the test’s ability to correctly identify a true effect.
  2. Priorities Depending on the Situation:

  • Early Stages of Research: In fields like cancer research, it’s more important to have high power to detect an effective compound (avoiding Type II errors) in the early stages. At this point, even if some false positives occur, they can be weeded out in later, more rigorous testing.
  • Follow-Up Studies: Later, when these compounds are tested on patients, it becomes more crucial to minimize Type I errors to avoid recommending a treatment that doesn’t actually work. At this stage, the consequences of a false positive are much more serious, so reducing Type I errors is more critical.

In summary, it is important to understand the balance between error types in statistical analysis, and the significance of each type of error can vary depending on the stage of research or the specific context.

Diffograms and Control Plots

Diffograms can be utilized to visually determine whether the means of different group pairs differ statistically.

A control plot illustrates the least squares mean along with decision limits. It compares each treatment group to the control group using Dunnett's method.

Performing a Post Hoc Pairwise Comparison Using PROC GLM

Let's start by writing our PROC GLM code for a one-way ANOVA post hoc analysis of AgeAtDeath = Smoking_Status. The data set is SASHELP.HEART, we want control and diffogram plots, and our categorical predictor variable is Smoking_Status. We would normally request only the Tukey adjustment, but this time we also request Dunnett's to see the Non-smoker group's position as the control group.

In a previous article (number 5), a significant overall ANOVA result showed that at least one smoking status differs in mean AgeAtDeath. Let's use PROC GLM to determine which pairs are significantly different from each other.

ods graphics;

ods select lsmeans diff diffplot controlplot;
proc glm data=SASHELP.HEART 
         plots(only)=(diffplot(center) controlplot);
    class Smoking_Status;
    model AgeAtDeath=Smoking_Status;
    lsmeans Smoking_Status / pdiff=all 
                         adjust=tukey;
    lsmeans Smoking_Status / pdiff=control('Non-smoker') 
                         adjust=dunnett;
    title "Post-Hoc Analysis of ANOVA - Smoking Status as Predictor";
run;
quit;

title;        

The first table shows the means for each group, and each mean is assigned a number to refer to it in the next table. We can see that the average AgeAtDeath of patients with Non-Smoker Smoking Status is the highest, at approximately 73.76. Patients with Very Heavy Smoking Status have the lowest average AgeAtDeath, at approximately 65.41.


Means for each group

The second table shows the p-values from pairwise comparisons of all possible combinations of means. The nonsignificant pairwise differences are between the Heavy and Moderate groups and between the Light and Moderate groups. These p-values are adjusted using the Tukey method and are therefore larger than the unadjusted p-values for the same comparisons; in return, the experimentwise Type I error rate is held fixed at alpha (0.05).


P-values from pairwise comparisons of all possible combinations of means.

The comparisons of least squares means are also shown graphically in the diffogram. With n = 5 groups, n(n − 1)/2 = 5 × 4 / 2 = 10 comparisons are shown.
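The pair count is easy to verify by enumerating the pairs; a small Python illustration (group labels abbreviated from the full SASHELP.HEART Smoking_Status values):

```python
from itertools import combinations

# Smoking_Status levels, abbreviated
groups = ["Non-smoker", "Light", "Moderate", "Heavy", "Very Heavy"]

# Every unordered pair of distinct groups = one diffogram comparison
pairs = list(combinations(groups, 2))
print(len(pairs))  # n(n-1)/2 = 5*4/2 = 10 pairwise comparisons
```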


The blue solid lines denote significant differences between smoking status levels, because these confidence intervals for the difference do not cross the diagonal equivalence line. Red dashed lines indicate a non-significant difference between treatments.

Starting at the top, left to right: Very Heavy differs significantly from all other groups, most notably Non-smokers. Heavy differs significantly from Non-smokers, Light, and Very Heavy. Moderate differs significantly from the Non-smoker group, whereas the Light–Moderate and Moderate–Heavy mean differences are not significant.

Let's look at Dunnett's LSMEANS comparisons as well. In this case, every other smoking status level is compared to the Non-smoker group, and we can see that all of them are significantly different from the Non-smoker control level.


Non-Smoker group is the control group here.

The control plot corresponds to the tables that were summarized. The horizontal line is drawn at the least squares mean for Non-Smoker, which is 73.76. The other four means are represented by the ends of the vertical lines extending from the horizontal control line.

The blue areas are the non-significance zones; they vary in size because different comparisons involve different sample sizes, and smaller sample sizes require larger mean differences to reach statistical significance. This control plot shows that all the other groups are significantly different from the Non-smoker control group.

