A Deep Dive into ANOVA (part 2)

Vadim Tyuryaev

Data Scientist | PhD Candidate in Statistics | Executive MBA candidate | ML & AI Expert | Digital Innovation Advocate | International Educator

Published: January 15, 2024

Part 1 of the ANOVA series covered the principles of one-way ANOVA, along with its implementation, logic, and specific details. Part 2 discusses two-way ANOVA with interaction, which extends one-way ANOVA and gives a more nuanced understanding of the sources of variability within a dataset. In this article, we will discuss the underlying model, notation, and derivations, and develop R code from scratch.

Model

The model for two factor ANOVA with interactions can be written as:

Factors A and B are assumed to be fixed factors, i.e. independent variables whose specific, predetermined levels are themselves of primary interest to the researchers. Errors are assumed to be normally distributed with mean zero and constant variance.
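
Using conventional symbols (assumed here, since the text does not name them), a common way to write this model is sketched below. Here y_ijk is the k-th observation at level i of factor A and level j of factor B, mu is the overall mean, alpha_i and beta_j are the main effects, (alpha beta)_ij is the interaction effect, and the independence of the errors is the usual additional assumption.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad i = 1,\dots,a,\; j = 1,\dots,b,\; k = 1,\dots,n
```

For fixed factors, the effects are usually identified through sum-to-zero constraints: the alpha_i sum to zero over i, the beta_j sum to zero over j, and the (alpha beta)_ij sum to zero over each index.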

A schematic representation of the factorial experiment is presented below; for simplicity, the number of repeated measurements is assumed to be the same in every cell and equal to n.

Formulas

The formulas for the sums of squares below use dot notation, in which a dot replaces a subscript that has been averaged over. In ANOVA this notation denotes the average response at a specific level of one factor across all levels of the other factor, a cell average, or the overall average.
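
As a sketch of the dot notation in conventional symbols (assumed here: a levels of factor A, b levels of factor B, n observations per cell, and y_ijk the k-th observation in cell (i, j)), the four kinds of averages can be written as:

```latex
\begin{aligned}
\bar{y}_{i..} &= \frac{1}{bn} \sum_{j=1}^{b} \sum_{k=1}^{n} y_{ijk}
  && \text{(level } i \text{ of A, averaged over B)} \\
\bar{y}_{.j.} &= \frac{1}{an} \sum_{i=1}^{a} \sum_{k=1}^{n} y_{ijk}
  && \text{(level } j \text{ of B, averaged over A)} \\
\bar{y}_{ij.} &= \frac{1}{n} \sum_{k=1}^{n} y_{ijk}
  && \text{(cell average)} \\
\bar{y}_{...} &= \frac{1}{abn} \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} y_{ijk}
  && \text{(overall average)}
\end{aligned}
```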

Assuming an equal number of repeated measures (n), the following formulas are used in ANOVA:

Partitioning of total (corrected) sums of squares
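
For a balanced design, the standard sums of squares and their partition can be sketched as follows. The symbols are conventional and assumed here: a and b are the numbers of levels of factors A and B, n is the number of observations per cell, y_ijk is the k-th observation in cell (i, j), and a dot marks a subscript that has been averaged over.

```latex
\begin{aligned}
SS_A    &= bn \sum_{i=1}^{a} \left(\bar{y}_{i..} - \bar{y}_{...}\right)^2 \\
SS_B    &= an \sum_{j=1}^{b} \left(\bar{y}_{.j.} - \bar{y}_{...}\right)^2 \\
SS_{AB} &= n \sum_{i=1}^{a} \sum_{j=1}^{b}
           \left(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}\right)^2 \\
SS_E    &= \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n}
           \left(y_{ijk} - \bar{y}_{ij.}\right)^2 \\
SS_T    &= \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n}
           \left(y_{ijk} - \bar{y}_{...}\right)^2
         = SS_A + SS_B + SS_{AB} + SS_E
\end{aligned}
```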

Note that only four of the five sums of squares need to be calculated directly; the fifth follows from the partition SST = SSA + SSB + SSAB + SSE.

Hypotheses

A number of hypotheses can be tested. For example:

H0: The means of all levels of factor A are equal
H0: The means of all levels of factor B are equal
H0: There is no interaction between factors A and B

Alternative hypotheses are:

Ha: At least two means of factor A are not equal
Ha: At least two means of factor B are not equal
Ha: There is an interaction between factors A and B
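
Each null hypothesis is tested with an F ratio of the corresponding mean square to the error mean square. A sketch in conventional notation (assumed here: a and b levels for factors A and B, n observations per cell, and MS denoting a sum of squares divided by its degrees of freedom):

```latex
\begin{aligned}
F_A    &= \frac{MS_A}{MS_E}
        = \frac{SS_A / (a-1)}{SS_E / \big(ab(n-1)\big)}
        \;\sim\; F_{\,a-1,\; ab(n-1)} \quad \text{under } H_0 \\
F_B    &= \frac{MS_B}{MS_E}
        = \frac{SS_B / (b-1)}{SS_E / \big(ab(n-1)\big)}
        \;\sim\; F_{\,b-1,\; ab(n-1)} \quad \text{under } H_0 \\
F_{AB} &= \frac{MS_{AB}}{MS_E}
        = \frac{SS_{AB} / \big((a-1)(b-1)\big)}{SS_E / \big(ab(n-1)\big)}
        \;\sim\; F_{\,(a-1)(b-1),\; ab(n-1)} \quad \text{under } H_0
\end{aligned}
```

A large F value, and hence a small upper-tail probability, counts as evidence against the corresponding null hypothesis.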

Interaction

The interaction term in a two-way ANOVA reveals whether the influence of one independent variable on the dependent variable remains consistent across all levels of the other independent variable, and vice versa. A significant interaction term suggests that the combined effect of the two independent variables on the dependent variable is not additive. In other words, the impact of one variable on the dependent variable depends on the level of the other variable. For instance, a specific medicine type may interact with gender, leading to different effects on males compared to females. When the interaction is significant, the recommended course of action includes running simple effects and post hoc analyses, which will be discussed in part 3 of the ANOVA series.
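
To make non-additivity concrete, here is a minimal sketch built on the medicine and gender illustration. The cell means are made-up numbers chosen only for illustration, and the object names are hypothetical. Under a purely additive model the effect of the drug would be the same for both genders, so the "difference of differences" would be zero.

```r
# Hypothetical cell means (made-up numbers, for illustration only)
cell_means <- matrix(
  c(10, 12,
    10, 18),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  medicine = c("placebo", "drug"))
)

# Effect of the drug within each gender (simple effects on the cell means)
drug_effect <- cell_means[, "drug"] - cell_means[, "placebo"]

# Interaction contrast: difference of differences; zero under additivity
interaction_contrast <- drug_effect["male"] - drug_effect["female"]

print(drug_effect)
print(interaction_contrast)
```

A non-zero contrast in the cell means is only a description; whether it is statistically significant is what the interaction F test assesses.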

Implementation

Note that the code provided below calculates the so-called Type I (sequential) sums of squares. We will use the CO2 dataset (Carbon Dioxide Uptake in Grass Plants), which ships with R, to test our custom function.
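
Type I sums of squares are sequential: each term is credited with the reduction in the residual sum of squares obtained when it is added to the model after the preceding terms. A minimal sketch of that idea, using nested lm() fits on CO2 (the object names fit0 to fit3, rss, and seq_ss are illustrative and not part of the custom function):

```r
data(CO2)

# Nested models, adding one term at a time in the order Type, Treatment, interaction
fit0 <- lm(uptake ~ 1, data = CO2)
fit1 <- lm(uptake ~ Type, data = CO2)
fit2 <- lm(uptake ~ Type + Treatment, data = CO2)
fit3 <- lm(uptake ~ Type * Treatment, data = CO2)

# deviance() of an lm fit is its residual sum of squares
rss <- sapply(list(fit0, fit1, fit2, fit3), deviance)

# Sequential (Type I) sums of squares: the drop in RSS at each step
seq_ss <- -diff(rss)
names(seq_ss) <- c("Type", "Treatment", "Type:Treatment")
print(seq_ss)

# anova() on the full model reports the same sequential sums of squares
print(anova(fit3))
```

In a balanced design the order of the terms does not change these values; in an unbalanced design it generally does.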

# Two-way ANOVA function with interaction
two_way_anova <- function(data, response_col, factor1_col, factor2_col) {
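  # Extract data; coerce the grouping columns to factors so that their levels
  # share one ordering with the output of tapply() and table() used below
  response <- data[[response_col]]
  factor1 <- factor(data[[factor1_col]])
  factor2 <- factor(data[[factor2_col]])
  
  # Levels of factors (in factor-level order, not order of appearance)
  levels_factor1 <- levels(factor1)
  levels_factor2 <- levels(factor2)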
  
  # Calculate means
  grand_mean <- mean(response)
  means_factor1 <- tapply(response, factor1, mean)
  means_factor2 <- tapply(response, factor2, mean)
  
  # preallocate
  means_interaction <- matrix(0, nrow = length(levels_factor1), 
                                 ncol = length(levels_factor2))
  
  # Cell means: one entry per (factor1 level, factor2 level) combination
  for (i in seq_along(levels_factor1)) {
    for (j in seq_along(levels_factor2)) {
      in_cell <- factor1 == levels_factor1[i] & factor2 == levels_factor2[j]
      means_interaction[i, j] <- mean(response[in_cell])
    }
  }
  
  # Calculate sums of squares
  ss_total <- sum((response - grand_mean)^2)
  ss_factor1 <- sum((means_factor1 - grand_mean)^2 * table(factor1))
  ss_factor2 <- sum((means_factor2 - grand_mean)^2 * table(factor2))
  
  # preallocate
  ss_interaction_mat <- matrix(0, nrow = length(levels_factor1), 
                                  ncol = length(levels_factor2))
  
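  # Cell counts, computed once rather than on every iteration
  cell_counts <- table(factor1, factor2)
  for (i in seq_along(levels_factor1)) {
    for (j in seq_along(levels_factor2)) {
      deviation <- means_interaction[i, j] - means_factor1[i] -
        means_factor2[j] + grand_mean
      ss_interaction_mat[i, j] <- cell_counts[i, j] * deviation^2
    }
  }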
  
  ss_interaction <- sum(ss_interaction_mat)
  
  # Error SS by subtraction, since SST = SSA + SSB + SSAB + SSE
  # (these formulas assume equal cell counts, i.e. a balanced design)
  ss_error <- ss_total - ss_factor1 - ss_factor2 - ss_interaction
  
  # Calculate degrees of freedom
  df_factor1 <- length(levels_factor1) - 1
  df_factor2 <- length(levels_factor2) - 1
  df_interaction <- df_factor1 * df_factor2
  df_error <- length(response) - (length(levels_factor1) * length(levels_factor2))
  
  # Calculate mean squares
  ms_factor1 <- ss_factor1 / df_factor1
  ms_factor2 <- ss_factor2 / df_factor2
  ms_interaction <- ss_interaction / df_interaction
  ms_error <- ss_error / df_error
  
  # Calculate F-statistics
  f_factor1 <- ms_factor1 / ms_error
  f_factor2 <- ms_factor2 / ms_error
  f_interaction <- ms_interaction / ms_error
  
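  # Calculate p-values (upper tail of the F distribution; lower.tail = FALSE
  # avoids the loss of precision of 1 - pf() for very small p-values)
  p_factor1 <- pf(f_factor1, df_factor1, df_error, lower.tail = FALSE)
  p_factor2 <- pf(f_factor2, df_factor2, df_error, lower.tail = FALSE)
  p_interaction <- pf(f_interaction, df_interaction, df_error, lower.tail = FALSE)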
  
  # Create results data frame
  results <- data.frame(
    Factor = c(factor1_col, factor2_col, "Interaction", "Error"),
    Df = c(df_factor1, df_factor2, df_interaction, df_error),
    SumSq = c(ss_factor1, ss_factor2, ss_interaction, ss_error),
    MeanSq = c(ms_factor1, ms_factor2, ms_interaction, ms_error),
    Fvalue = c(f_factor1, f_factor2, f_interaction, NA),
    Pval= c(p_factor1, p_factor2, p_interaction, NA)
  )
  
  return(results)
}

Test the custom function:

data(CO2)
result <- two_way_anova(CO2, "uptake", "Type", "Treatment")
print(result)
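
Since the formulas assume an equal number of observations per cell, a quick balance check on the data may be worthwhile before comparing results. A small sketch (the names cell_counts and is_balanced are illustrative):

```r
data(CO2)

# Number of observations in each Type x Treatment cell
cell_counts <- table(CO2$Type, CO2$Treatment)
print(cell_counts)

# The design is balanced when every cell holds the same number of observations
is_balanced <- length(unique(as.vector(cell_counts))) == 1
print(is_balanced)
```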

Custom results:

Two-way ANOVA with interaction table via custom function

Compare to the results produced by built-in R function:

print(summary(aov(uptake~Type*Treatment, data=CO2)), digits = 8)

Two-way ANOVA with interaction via built-in function

As you can see, the results are identical.
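
One further sanity check, sketched here under the assumption that the first element of summary(aov(...)) is the ANOVA table with a "Sum Sq" column, is to confirm that the sums of squares reported by aov() add up to the total corrected sum of squares (the names aov_table and ss_total are illustrative):

```r
data(CO2)

# ANOVA table from the built-in function, as a data frame
aov_table <- summary(aov(uptake ~ Type * Treatment, data = CO2))[[1]]

# Total corrected sum of squares computed directly from the response
ss_total <- sum((CO2$uptake - mean(CO2$uptake))^2)

# The partition SST = SSA + SSB + SSAB + SSE should hold up to rounding error
print(all.equal(sum(aov_table[["Sum Sq"]]), ss_total))
```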

Bonus

Try the following:

library("ggpubr")

ggline(CO2, x = "Treatment", y = "uptake", color = "Type",
       add = c("mean_se", "dotplot"),
       palette = c("red", "green"))

What do you observe?
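
If ggpubr is not installed, the quantities that the "mean_se" option summarizes can also be tabulated in base R. A minimal sketch (the helper std_error and the object names are illustrative, not part of ggpubr):

```r
data(CO2)

# Standard error of the mean for one group of observations
std_error <- function(x) sd(x) / sqrt(length(x))

# Mean and standard error of uptake in each Type x Treatment cell
groups <- list(Type = CO2$Type, Treatment = CO2$Treatment)
cell_mean <- tapply(CO2$uptake, groups, mean)
cell_se <- tapply(CO2$uptake, groups, std_error)

print(round(cell_mean, 2))
print(round(cell_se, 2))
```

Comparing how the cell means change from one treatment to the other within each type gives a numerical counterpart to the lines in the plot.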

Conclusion

In conclusion, two-way ANOVA with interaction describes how two categorical factors jointly relate to a continuous response variable, and it distinguishes additive from non-additive models. A significant interaction effect indicates that the joint impact of the factors is not simply the sum of their individual effects. Several types of sums of squares exist, including Type I, Type II, and Type III, and the choice among them hinges on the experimental design and research objectives. The R code developed from scratch above exposes the mathematics and logic behind the two-way ANOVA with interaction model.

In part 3 of the ANOVA series we will discuss various post hoc procedures.

Stay tuned!
