A Deep Dive into ANOVA (part 2)

In part 1 of the ANOVA series, we covered the principles of one-way ANOVA together with its logic and implementation details. Part 2 discusses two-way ANOVA with interaction, which extends one-way ANOVA and provides a more nuanced understanding of the sources of variability within a dataset. In this article, we discuss the underlying model, notation, and derivations, and develop R code from scratch.


Model

The model for two factor ANOVA with interactions can be written as:

Two-way ANOVA with interaction model
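
In the usual effects notation, a model of this kind is commonly written as shown below. The symbols are assumptions chosen for illustration: a and b denote the numbers of levels of factors A and B, n is the number of replicates per cell, and the sum-to-zero constraints are one common identifiability convention rather than the only possible one.

```latex
\begin{align*}
y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
  \qquad i = 1,\dots,a,\quad j = 1,\dots,b,\quad k = 1,\dots,n, \\
\sum_{i=1}^{a} \alpha_i &= \sum_{j=1}^{b} \beta_j = 0,
  \qquad \sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0 .
\end{align*}
```

Here \mu is the overall mean, \alpha_i and \beta_j are the main effects of factors A and B, (\alpha\beta)_{ij} is the interaction effect of cell (i, j), and \varepsilon_{ijk} is the random error.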

Factors A and B are assumed to be fixed factors, i.e. independent variables with specific, predetermined levels whose effects are of primary interest to the researcher. Errors are assumed to be independent and normally distributed with mean zero and constant variance.

Distribution of error terms
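
One way to get a feel for these assumptions is to simulate data from such a model. The sketch below is illustrative only: the effect sizes, the seed, the error standard deviation `sigma`, and all variable names are made-up assumptions, not values from this article. Errors are drawn independently from a normal distribution with mean zero and constant variance.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

a <- 2; b <- 3; n <- 4           # levels of A, levels of B, replicates per cell
mu <- 10                         # overall mean (hypothetical value)
alpha <- c(-1, 1)                # main effects of A, summing to zero
beta <- c(-2, 0, 2)              # main effects of B, summing to zero

# Interaction effects, filled column by column: rows index A, columns index B.
# Every row and every column sums to zero.
alpha_beta <- matrix(c(0.5, -0.5, 0, 0, -0.5, 0.5), nrow = a, ncol = b)
sigma <- 1                       # error standard deviation (constant across cells)

# One row per observation (i, j, k)
sim <- expand.grid(k = seq_len(n), i = seq_len(a), j = seq_len(b))
sim$y <- mu + alpha[sim$i] + beta[sim$j] +
  alpha_beta[cbind(sim$i, sim$j)] +
  rnorm(nrow(sim), mean = 0, sd = sigma)

sim$A <- factor(sim$i)
sim$B <- factor(sim$j)
print(table(sim$A, sim$B))       # every cell holds n observations
```

Because the levels of A and B are fixed in advance and only the errors are random, rerunning the simulation with a different seed changes the noise but not the underlying effects.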

A schematic representation of the factorial experiment is presented below. For simplicity, every cell is assumed to contain the same number of replicates (repeated measurements), n.


Factorial experiment

Formulas

The formulas for the sums of squares below use dot notation: a dot in place of a subscript indicates averaging over that index. In ANOVA this gives the average at a specific level of one factor across all levels of the other factor, a cell average, or the overall average.

Dot notation
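
For a balanced layout, the averages in dot notation are conventionally defined as follows. The symbols are assumptions for illustration: y_{ijk} is the k-th observation at level i of factor A and level j of factor B, with a and b levels and n replicates per cell.

```latex
\begin{align*}
\bar{y}_{i..} &= \frac{1}{bn} \sum_{j=1}^{b} \sum_{k=1}^{n} y_{ijk}
  && \text{(mean at level $i$ of factor A)} \\
\bar{y}_{.j.} &= \frac{1}{an} \sum_{i=1}^{a} \sum_{k=1}^{n} y_{ijk}
  && \text{(mean at level $j$ of factor B)} \\
\bar{y}_{ij.} &= \frac{1}{n} \sum_{k=1}^{n} y_{ijk}
  && \text{(cell mean)} \\
\bar{y}_{...} &= \frac{1}{abn} \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} y_{ijk}
  && \text{(overall mean)}
\end{align*}
```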

Assuming an equal number of replicates (n) in every cell, the following formulas are used in ANOVA:

Sums of squares (corrected)
Degrees of freedom
Partitioning of total (corrected) sums of squares
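
For reference, the standard balanced-design expressions are sketched below. They match what the R implementation later in this article computes; the symbols (a and b levels, n replicates per cell, means written in dot notation) are assumed for illustration.

```latex
\begin{align*}
SS_T    &= \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (y_{ijk} - \bar{y}_{...})^2, & df_T &= abn - 1, \\
SS_A    &= bn \sum_{i=1}^{a} (\bar{y}_{i..} - \bar{y}_{...})^2, & df_A &= a - 1, \\
SS_B    &= an \sum_{j=1}^{b} (\bar{y}_{.j.} - \bar{y}_{...})^2, & df_B &= b - 1, \\
SS_{AB} &= n \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2, & df_{AB} &= (a - 1)(b - 1), \\
SS_E    &= \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij.})^2, & df_E &= ab(n - 1), \\[4pt]
SS_T    &= SS_A + SS_B + SS_{AB} + SS_E, & df_T &= df_A + df_B + df_{AB} + df_E .
\end{align*}
```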

Note that we only need to calculate four sums out of five; the fifth follows from the partitioning of the total sum of squares.

Hypotheses

A number of hypotheses can be tested. For example:

  1. H0: Means of factor A are equal
  2. H0: Means of factor B are equal
  3. H0: There is no interaction between factors A and B

Alternative hypotheses are:

  1. Ha: at least two means of factor A are not equal
  2. Ha: at least two means of factor B are not equal
  3. Ha: There is an interaction between factors A and B
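
In terms of model parameters, these hypotheses and the corresponding test statistics can be sketched as follows. The notation is assumed for illustration: \alpha_i, \beta_j and (\alpha\beta)_{ij} are the main and interaction effects under sum-to-zero constraints, each mean square is a sum of squares divided by its degrees of freedom, and the design is balanced with a and b levels and n replicates per cell.

```latex
\begin{align*}
H_0^{A}:\ & \alpha_1 = \dots = \alpha_a = 0,
  & F_A &= \frac{MS_A}{MS_E} \sim F_{\,a-1,\ ab(n-1)} \ \text{under } H_0^{A}, \\
H_0^{B}:\ & \beta_1 = \dots = \beta_b = 0,
  & F_B &= \frac{MS_B}{MS_E} \sim F_{\,b-1,\ ab(n-1)} \ \text{under } H_0^{B}, \\
H_0^{AB}:\ & (\alpha\beta)_{ij} = 0 \ \text{for all } i, j,
  & F_{AB} &= \frac{MS_{AB}}{MS_E} \sim F_{\,(a-1)(b-1),\ ab(n-1)} \ \text{under } H_0^{AB}.
\end{align*}
```

Each null hypothesis is rejected for large values of the corresponding F statistic, so the p-value is the upper-tail probability of the F distribution.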


Interaction

The interaction term in a two-way ANOVA reveals whether the influence of one independent variable on the dependent variable remains consistent across all levels of the other independent variable, and vice versa. A significant interaction term suggests that the combined effect of the two independent variables on the dependent variable is not additive. In other words, the impact of one variable on the dependent variable depends on the level of the other variable. For instance, a specific type of medicine may interact with gender, leading to different effects on males and females. The recommended course of action in the case of a significant interaction includes simple effects analysis and post hoc analysis, which will be discussed in part 3 of the ANOVA series.
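
For a 2 x 2 layout, non-additivity can be seen as a "difference of differences" of cell means. The sketch below uses the CO2 dataset that ships with R and is analysed later in this article; the variable names are illustrative assumptions. A contrast far from zero hints at an interaction, but whether it is statistically significant is decided by the F test, not by this descriptive summary.

```r
data(CO2)

# Cell means: rows are Type, columns are Treatment
cell_means <- with(CO2, tapply(uptake, list(Type, Treatment), mean))
print(round(cell_means, 2))

# Effect of chilling on uptake within each plant type
treatment_effect <- cell_means[, "chilled"] - cell_means[, "nonchilled"]
print(treatment_effect)

# Difference of differences: zero under a purely additive model
interaction_contrast <- unname(
  treatment_effect["Mississippi"] - treatment_effect["Quebec"]
)
print(interaction_contrast)
```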


Implementation

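Note that the code provided below calculates so-called Type I (sequential) sums of squares. Its formulas assume a balanced design (equal cell sizes), in which case Type I, Type II, and Type III sums of squares coincide. We will use the CO2 dataset (Carbon Dioxide Uptake in Grass Plants), which is balanced, to test our custom function.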
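
Type I sums of squares are sequential, so in general they depend on the order in which the terms enter the model. In a balanced design the order does not matter. The sketch below checks both points for the CO2 data using base R only; the object names are illustrative assumptions.

```r
data(CO2)

# Cell sizes: a balanced design has the same count in every cell
cell_counts <- table(CO2$Type, CO2$Treatment)
print(cell_counts)

# Fit the same model with the main effects entered in both orders
fit_ab <- anova(lm(uptake ~ Type * Treatment, data = CO2))
fit_ba <- anova(lm(uptake ~ Treatment * Type, data = CO2))

ss_ab <- setNames(fit_ab[["Sum Sq"]], rownames(fit_ab))
ss_ba <- setNames(fit_ba[["Sum Sq"]], rownames(fit_ba))

# Sequential sums of squares for each main effect agree across the two orders
same_type <- isTRUE(all.equal(ss_ab[["Type"]], ss_ba[["Type"]]))
same_treatment <- isTRUE(all.equal(ss_ab[["Treatment"]], ss_ba[["Treatment"]]))
print(c(Type = same_type, Treatment = same_treatment))
```

With unequal cell sizes the two orderings would typically give different main-effect sums of squares, and the choice among Type I, II, and III would start to matter.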

# Two-way ANOVA function with interaction
two_way_anova <- function(data, response_col, factor1_col, factor2_col) {
  # Extract data
  response <- data[[response_col]]
  factor1 <- data[[factor1_col]]
  factor2 <- data[[factor2_col]]
  
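  # Levels of factors, in the same order that tapply() and table() use below
  # (unique() returns order of appearance, which may differ from level order)
  levels_factor1 <- levels(as.factor(factor1))
  levels_factor2 <- levels(as.factor(factor2))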
  
  # Calculate means
  grand_mean <- mean(response)
  means_factor1 <- tapply(response, factor1, mean)
  means_factor2 <- tapply(response, factor2, mean)
  
  # preallocate
  means_interaction <- matrix(0, nrow = length(levels_factor1), 
                                 ncol = length(levels_factor2))
  
  # Cell means: rows index factor 1, columns index factor 2
  for (i in seq_along(levels_factor1)) {
    for (j in seq_along(levels_factor2)) {
      in_cell <- factor1 == levels_factor1[i] & factor2 == levels_factor2[j]
      means_interaction[i, j] <- mean(response[in_cell])
    }
  }
  
  # Calculate sums of squares
  ss_total <- sum((response - grand_mean)^2)
  ss_factor1 <- sum((means_factor1 - grand_mean)^2 * table(factor1))
  ss_factor2 <- sum((means_factor2 - grand_mean)^2 * table(factor2))
  
  # Cell counts, computed once (rows: factor 1, columns: factor 2)
  cell_counts <- table(factor1, factor2)
  
  # preallocate
  ss_interaction_mat <- matrix(0, nrow = length(levels_factor1), 
                                  ncol = length(levels_factor2))
  
  for (i in seq_along(levels_factor1)) {
    for (j in seq_along(levels_factor2)) {
      ss_interaction_mat[i, j] <- cell_counts[i, j] *
        (means_interaction[i, j] - means_factor1[i] - means_factor2[j] + grand_mean)^2
    }
  }
  
  ss_interaction <- sum(ss_interaction_mat)
  
  # Note that SST = SSA+SSB+SSAB+SSE
  ss_error <- ss_total - ss_factor1 - ss_factor2 - ss_interaction
  
  # Calculate degrees of freedom
  df_factor1 <- length(levels_factor1) - 1
  df_factor2 <- length(levels_factor2) - 1
  df_interaction <- df_factor1 * df_factor2
  df_error <- length(response) - (length(levels_factor1) * length(levels_factor2))
  
  # Calculate mean squares
  ms_factor1 <- ss_factor1 / df_factor1
  ms_factor2 <- ss_factor2 / df_factor2
  ms_interaction <- ss_interaction / df_interaction
  ms_error <- ss_error / df_error
  
  # Calculate F-statistics
  f_factor1 <- ms_factor1 / ms_error
  f_factor2 <- ms_factor2 / ms_error
  f_interaction <- ms_interaction / ms_error
  
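  # Calculate p-values (upper tail of the F distribution);
  # lower.tail = FALSE avoids the loss of precision of 1 - pf() for tiny p-values
  p_factor1 <- pf(f_factor1, df_factor1, df_error, lower.tail = FALSE)
  p_factor2 <- pf(f_factor2, df_factor2, df_error, lower.tail = FALSE)
  p_interaction <- pf(f_interaction, df_interaction, df_error, lower.tail = FALSE)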
  
  # Create results data frame
  results <- data.frame(
    Factor = c(factor1_col, factor2_col, "Interaction", "Error"),
    Df = c(df_factor1, df_factor2, df_interaction, df_error),
    SumSq = c(ss_factor1, ss_factor2, ss_interaction, ss_error),
    MeanSq = c(ms_factor1, ms_factor2, ms_interaction, ms_error),
    Fvalue = c(f_factor1, f_factor2, f_interaction, NA),
    Pval= c(p_factor1, p_factor2, p_interaction, NA)
  )
  
  return(results)
}

Test the custom function:

data(CO2)
result <- two_way_anova(CO2, "uptake", "Type", "Treatment")
print(result)

Custom results:

Two-way ANOVA with interaction table via custom function

Compare with the results produced by the built-in R function:

print(summary(aov(uptake ~ Type * Treatment, data = CO2)), digits = 8)

Two-way ANOVA with interaction via built-in function

As you can see, the results are identical.
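
The comparison can also be made programmatically rather than by eye. The sketch below is a compact, vectorized cross-check that is independent of the custom function above: it uses `ave()` to attach group means to each observation, computes the error sum of squares directly instead of by subtraction, and compares everything with `anova(lm(...))`. The variable names are illustrative assumptions, and the shortcut formulas rely on the design being balanced.

```r
data(CO2)

y <- CO2$uptake
m <- mean(y)                               # overall mean
m_a <- ave(y, CO2$Type)                    # mean of each observation's Type level
m_b <- ave(y, CO2$Treatment)               # mean of each observation's Treatment level
m_ab <- ave(y, CO2$Type, CO2$Treatment)    # mean of each observation's cell

# Summing over observations supplies the cell-size weights automatically
ss_manual <- c(
  Type = sum((m_a - m)^2),
  Treatment = sum((m_b - m)^2),
  Interaction = sum((m_ab - m_a - m_b + m)^2),
  Residuals = sum((y - m_ab)^2)
)

ss_builtin <- anova(lm(uptake ~ Type * Treatment, data = CO2))[["Sum Sq"]]

print(ss_manual)
print(all.equal(unname(ss_manual), ss_builtin))
```

Using `all.equal()` rather than `==` allows for the small floating-point differences that arise when the same quantity is computed along two different routes.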


Bonus

Try the following:

library("ggpubr")

ggline(CO2, x = "Treatment", y = "uptake", color = "Type",
       add = c("mean_se", "dotplot"),
       palette = c("red", "green"))

What do you observe?


Conclusion

In conclusion, two-way analysis of variance (ANOVA) with interaction captures both additive and non-additive effects of two categorical factors on a continuous response variable. A significant interaction effect indicates that the joint impact of the factors differs from the sum of their individual effects. The choice among Type I, Type II, and Type III sums of squares depends on the experimental design and the research objectives. The R code developed from scratch exposes the mathematics and logic behind the two-way ANOVA with interaction model.

In part 3 of the ANOVA series we will discuss various post hoc procedures.

Stay tuned!

