A Deep Dive into ANOVA (part 1)

Analysis of Variance (ANOVA) is a statistical method used to assess the equality of means across multiple groups. In this article we will focus on one-way ANOVA, which deals with situations where there is one independent variable with more than two levels (for comparisons between just two levels, refer to the t-test). We are going to delve into the theoretical foundations of one-way ANOVA, including the formulas, derivations and assumptions, and then implement it from scratch to analyze R's built-in mtcars dataset.


Formulas and Derivations

Consider a scenario with k groups.

Hypotheses (for groups $i = 1, \dots, k$):

Null hypothesis: $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ (all means are equal)
Alternative hypothesis: $H_a: \mu_i \neq \mu_j$ for at least one pair $(i, j)$ (at least two means are different)

The overall mean is defined as the average of all observations:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}$$

where $N = \sum_{i=1}^{k} n_i$ is the total number of observations and $n_i$ is the size of the $i$-th group.

The group mean for the $i$-th group is calculated as follows:

$$\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$$

The sum of squares between groups (SSB) is the sum of squared deviations of group means from the overall mean, weighted by the group sample sizes:

$$SSB = \sum_{i=1}^{k} n_i \left( \bar{x}_i - \bar{x} \right)^2$$

SSB indicates how much the group means differ from the overall mean.

The sum of squares within groups (SSW) is the sum of squared deviations of individual observations from their respective group means:

$$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_i \right)^2$$

SSW captures how much individual observations deviate from their respective group means. Together the two components partition the total sum of squares: $SST = SSB + SSW$.

Degrees of freedom (the maximum number of independent values that can vary) for SSB and SSW are:

$$df_{between} = k - 1, \qquad df_{within} = N - k$$

The mean squares (MS) for between groups and within groups are calculated as the sum of squares divided by their respective degrees of freedom:

$$MSB = \frac{SSB}{k - 1}, \qquad MSW = \frac{SSW}{N - k}$$

The F-statistic is the ratio of the mean squares:

$$F = \frac{MSB}{MSW}$$
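
To make the formulas concrete, here is a tiny worked sketch on made-up data: for groups (1, 2, 3) and (4, 5, 6), the overall mean is 3.5, the group means are 2 and 5, SSB = 3(2 - 3.5)^2 + 3(5 - 3.5)^2 = 13.5, SSW = 2 + 2 = 4, MSB = 13.5/1 = 13.5, MSW = 4/4 = 1, and so F = 13.5. The same numbers can be reproduced in R:

# Toy example: two made-up groups of three observations each
y <- c(1, 2, 3, 4, 5, 6)
g <- rep(c("a", "b"), each = 3)
summary(aov(y ~ factor(g)))  # reports SSB = 13.5, SSW = 4, F = 13.5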

The F-statistic is compared against the critical value from the F-distribution to determine statistical significance. The critical value is determined by the chosen significance level (alpha) and the degrees of freedom for SSB and SSW. The corresponding p-value is also obtained.
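
As a minimal sketch of this step (alpha, the degrees of freedom, and the F-statistic below are placeholder values, not taken from any particular dataset), the critical value and p-value come from R's qf() and pf():

alpha <- 0.05       # hypothetical significance level
df_between <- 2     # k - 1 (placeholder)
df_within <- 27     # N - k (placeholder)
F_stat <- 5.0       # placeholder F-statistic

# Critical value: reject H0 if F_stat exceeds it
qf(1 - alpha, df_between, df_within)

# p-value: upper-tail probability of the F distribution
pf(F_stat, df_between, df_within, lower.tail = FALSE)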

It is important to keep in mind that the p-value does not represent the probability that the result is due to chance, which is a very common misconception. What it represents is the probability that, given a chance model, results as extreme as the observed results could occur. In the context of ANOVA, the p-value is the probability of observing an F-statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.

$$p\text{-value} = P\left( F \geq F_{observed} \mid H_0 \right)$$

Assumptions:

  1. Normality: The data within each group should be approximately normally distributed. Violations of this assumption matter less with larger sample sizes due to the Central Limit Theorem.
  2. Homogeneity of Variances: The variances within each group should be roughly equal. This assumption, also known as homoscedasticity, is crucial for the validity of the F-test.
  3. Independence: Observations within each group should be independent of each other.
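
These assumptions can be checked before running the test. A minimal sketch using base R (shapiro.test() for per-group normality and bartlett.test() for equal variances; the mtcars variables anticipate the example later in this article):

# Normality: Shapiro-Wilk test within each group
tapply(mtcars$mpg, mtcars$gear, shapiro.test)

# Homogeneity of variances: Bartlett's test across groups
bartlett.test(mpg ~ factor(gear), data = mtcars)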


Why does ANOVA work?

According to Cochran's theorem, under the assumption of data normality the relevant quadratic forms (in our case, the sums of squares scaled by the common variance) are independent and Chi-squared distributed under the null hypothesis. The ratio of two independent Chi-squared variables, each divided by its degrees of freedom, follows an F distribution, which is exactly how the F-statistic above is constructed.
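
We can illustrate this empirically with a small simulation sketch (the number of groups, group size, and replication count below are arbitrary choices): draw all groups from the same normal distribution so that the null hypothesis holds, compute the F-statistic repeatedly, and compare its empirical distribution with the theoretical F density.

set.seed(42)
k <- 3      # number of groups
n <- 10     # observations per group
reps <- 10000

# Simulate the F-statistic under H0 (all groups share the same mean)
sim_F <- replicate(reps, {
  y <- rnorm(k * n)
  g <- factor(rep(1:k, each = n))
  summary(aov(y ~ g))[[1]][["F value"]][1]
})

# Histogram of simulated F-statistics with the F(k-1, N-k) density overlaid
hist(sim_F, breaks = 50, freq = FALSE, main = "Simulated F-statistic under H0")
curve(df(x, df1 = k - 1, df2 = k * n - k), add = TRUE, col = "red", lwd = 2)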


ANOVA in R from scratch

We will rely on the tapply() function, which applies a function over subsets of a vector (for example, to calculate the mean of each group):

# Function to perform one-way ANOVA
one_way_anova <- function(response, group) {
  # Data preparation
  data <- data.frame(response, group)
  unique_groups <- unique(group)
  
  # Calculate overall mean
  overall_mean <- mean(response)
  
  # Calculate sum of squares between groups (SSB)
  SSB <- sum(table(data$group) * (tapply(data$response, data$group, mean) - overall_mean)^2)
  
  # Calculate sum of squares within groups (SSW)
  SSW <- sum(tapply(data$response, data$group, function(x) sum((x - mean(x))^2)))
  
  # Degrees of freedom
  df_between <- length(unique_groups) - 1
  df_within <- length(response) - length(unique_groups)
  
  # Mean squares
  MS_between <- SSB / df_between
  MS_within <- SSW / df_within
  
  # F-statistic (named F_stat to avoid masking R's built-in F, an alias for FALSE)
  F_stat <- MS_between / MS_within
  
  # p-value: upper tail of the F distribution
  p_value <- pf(F_stat, df_between, df_within, lower.tail = FALSE)
  
  # Return results
  return(list(F_statistic = F_stat, p_value = p_value, df_between = df_between, df_within = df_within))
}        

Example using mtcars:

data(mtcars)
result_custom_anova <- one_way_anova(mtcars$mpg, mtcars$gear)
print(result_custom_anova)        

Result (statistically significant at the 1% level):

Custom ANOVA function run on the mtcars dataset (comparing mpg across gear levels)

Let's compare with the built-in aov() function. Note that the grouping variable (gear in our case) must be converted to a factor.

result_2 <- aov(mpg ~ factor(gear), data = mtcars)

summary(result_2)        
Built-in ANOVA results
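
A quick programmatic check (a sketch; the indexing relies on summary.aov() returning a list of ANOVA tables) confirms that the two sets of numbers match:

# Extract F and p from the built-in result and compare with the custom function
builtin <- summary(result_2)[[1]]
all.equal(result_custom_anova$F_statistic, builtin[["F value"]][1])
all.equal(result_custom_anova$p_value, builtin[["Pr(>F)"]][1])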

As you can see, the results are identical. Therefore, there does exist a difference in average mpg across different numbers of forward gears, which we can additionally verify visually.

library(ggplot2)

# Create a boxplot with different colors for each level of gears and add data points
ggplot(mtcars, aes(x = factor(gear), y = mpg, fill = factor(gear))) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2), alpha = 0.5) +  # Add data points
  scale_fill_manual(values = c("red", "green", "blue")) +  # Specify colors for each level
  labs(title = "Boxplot of MPG at Different Gears",
       x = "Gears",
       y = "MPG") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)  # Center the title
  )        
MPG for different number of forward gears

Conclusion

One limitation of analysis of variance (ANOVA) is that it doesn't explicitly identify which specific group means are different from each other. While ANOVA determines whether there are statistically significant differences in at least one pair of group means, it does not pinpoint the specific pairs that exhibit these differences. This limitation prompts the need for additional post hoc tests or pairwise comparisons.

Post hoc tests are employed after ANOVA to perform detailed pairwise comparisons between group means. These tests help identify which groups differ significantly from one another. Commonly used post hoc tests include Tukey's Honestly Significant Difference (HSD), Bonferroni correction, Scheffé's test, and others. Each of these tests has its own strengths and considerations, and the choice often depends on the specific characteristics of the data and the research question.
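
As a brief preview (a minimal sketch reusing the result_2 object fitted above), Tukey's HSD is available directly in base R for a fitted aov model:

# Pairwise comparisons of mean mpg between gear levels with adjusted p-values
TukeyHSD(result_2)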

In part 2 we will discuss two-way ANOVA, and in part 3 we will talk about post hoc tests.

Stay tuned!

