A Deep Dive into ANOVA (part 1)
Vadim Tyuryaev
Data Scientist | PhD Candidate in Statistics | Executive MBA Candidate | ML & AI Expert | Digital Innovation Advocate | International Educator
Analysis of Variance (ANOVA) is a statistical method used to assess the equality of means across multiple groups. In this article we focus on one-way ANOVA, which deals with situations where there is one independent variable with more than two levels (for a comparison between just two levels, refer to the t-test). We will delve into the theoretical foundations of one-way ANOVA, including the formulas, derivations, and assumptions, and then implement it from scratch to analyze R's built-in mtcars dataset.
Formulas and Derivations
Consider a scenario with k groups, where the i-th group contains nᵢ observations.
Hypotheses:

H₀: μ₁ = μ₂ = … = μₖ (all group means are equal)
H₁: μᵢ ≠ μⱼ for at least one pair i ≠ j, where i, j = 1, …, k
The overall mean is defined as the average of all observations:

x̄ = (1/N) Σᵢ Σⱼ xᵢⱼ

where xᵢⱼ is the j-th observation in the i-th group (j = 1, …, nᵢ) and N = n₁ + … + nₖ is the total number of observations.
The group mean for the i-th group is calculated as follows:

x̄ᵢ = (1/nᵢ) Σⱼ xᵢⱼ
The sum of squares between groups (SSB) is the sum of squared deviations of group means from the overall mean, weighted by the group sample sizes:

SSB = Σᵢ nᵢ (x̄ᵢ − x̄)²
The sum of squares within groups (SSW) is the sum of squared deviations of individual observations from their respective group means:

SSW = Σᵢ Σⱼ (xᵢⱼ − x̄ᵢ)²

Together these two components decompose the total sum of squares: SST = Σᵢ Σⱼ (xᵢⱼ − x̄)² = SSB + SSW.
Degrees of freedom (the maximum number of independent values that can vary) for SSB and SSW are:

df_between = k − 1 and df_within = N − k
The mean squares (MS) for between groups and within groups are calculated as the sums of squares divided by their respective degrees of freedom:

MSB = SSB / (k − 1) and MSW = SSW / (N − k)
The F-statistic is the ratio of the mean squares:

F = MSB / MSW
The F-statistic is compared against the critical value from the F distribution to determine statistical significance. The critical value is determined by the chosen significance level (alpha) and the degrees of freedom of SSB and SSW. The corresponding p-value is also obtained.
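In R, for instance, the critical value comes from qf() and the p-value from pf(). A minimal illustration (k = 3 groups and N = 32 observations match the mtcars example below; alpha = 0.05 and F = 10.9 are arbitrary numbers for demonstration):

# 5% critical value of the F distribution with 2 and 29 degrees of freedom
qf(0.95, df1 = 2, df2 = 29)

# p-value for a hypothetical observed F-statistic of 10.9
pf(10.9, df1 = 2, df2 = 29, lower.tail = FALSE)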
It is important to keep in mind that the p-value does not represent the probability that the result is due to chance, which is a very common misconception. What it actually represents is the probability that, under a chance model, results as extreme as the observed results could occur. In the context of ANOVA, the p-value is the probability of observing an F-statistic as extreme as, or more extreme than, the one calculated from the sample data under the null hypothesis.
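A small simulation can make this definition tangible. The sketch below is purely illustrative (three groups of ten standard normal observations and an arbitrary "observed" F of 4): we generate many datasets for which the null hypothesis is true, compute the F-statistic for each, and compare the fraction of simulated statistics at least as extreme as the observed one with the theoretical p-value.

set.seed(42)
k <- 3; n <- 10                        # three groups of ten observations each
F_obs <- 4                             # an arbitrary "observed" F-statistic
F_sim <- replicate(10000, {
  y <- rnorm(k * n)                    # H0 is true: every group mean equals 0
  g <- factor(rep(1:k, each = n))
  summary(aov(y ~ g))[[1]]$`F value`[1]
})
mean(F_sim >= F_obs)                             # empirical tail proportion
pf(F_obs, k - 1, k * n - k, lower.tail = FALSE)  # theoretical p-value

The two numbers agree closely, which is exactly what the definition above promises.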
Assumptions:

1. Independence: the observations are independent within and across groups.
2. Normality: the response in each group is (approximately) normally distributed.
3. Homogeneity of variance: all groups share a common variance σ².
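These assumptions can be probed informally in R. One common approach, sketched here on the mtcars data analyzed below, applies the Shapiro-Wilk test within each group and Bartlett's test for equal variances:

# Normality within each group (Shapiro-Wilk test)
tapply(mtcars$mpg, mtcars$gear, shapiro.test)

# Homogeneity of variances across groups (Bartlett's test)
bartlett.test(mpg ~ factor(gear), data = mtcars)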
Why does ANOVA work?
According to Cochran's theorem, under the normality assumption the sums of squares are independent quadratic forms with chi-squared distributions: SSB/σ² ~ χ²(k − 1) and SSW/σ² ~ χ²(N − k). The ratio of two independent chi-squared variables, each divided by its degrees of freedom, follows an F distribution, so under the null hypothesis F = MSB/MSW ~ F(k − 1, N − k).
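This can be verified empirically. The simulation below (settings are my own, purely illustrative) draws many datasets under the null hypothesis, computes SSB and SSW for each, and checks that their scaled averages match the chi-squared expectations while the two quantities remain essentially uncorrelated:

set.seed(1)
k <- 3; n <- 10; sigma <- 2                # arbitrary simulation settings
sims <- replicate(10000, {
  y <- rnorm(k * n, mean = 5, sd = sigma)  # H0 holds: one common mean
  g <- rep(1:k, each = n)
  group_means <- tapply(y, g, mean)
  SSB <- sum(n * (group_means - mean(y))^2)
  SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))
  c(SSB, SSW)
})
cor(sims[1, ], sims[2, ])   # near 0, consistent with independence
mean(sims[1, ]) / sigma^2   # near k - 1 = 2, the mean of chi-squared(k - 1)
mean(sims[2, ]) / sigma^2   # near N - k = 27, the mean of chi-squared(N - k)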
ANOVA in R from scratch
We will rely on the tapply() function, which applies a function over subsets of a vector (for example, to calculate the mean of each group).
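For instance, the mean mpg for each number of gears in mtcars (the dataset we analyze below) is obtained with a single call:

tapply(mtcars$mpg, mtcars$gear, mean)
#        3        4        5
# 16.10667 24.53333 21.38000

With this building block in place, the full function looks as follows: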
# Function to perform one-way ANOVA
one_way_anova <- function(response, group) {
  # Data preparation
  data <- data.frame(response, group)
  unique_groups <- unique(group)
  # Overall mean of all observations
  overall_mean <- mean(response)
  # Sum of squares between groups (SSB); table() and tapply() order the
  # groups identically, so the group sizes and group means align
  SSB <- sum(table(data$group) *
             (tapply(data$response, data$group, mean) - overall_mean)^2)
  # Sum of squares within groups (SSW)
  SSW <- sum(tapply(data$response, data$group, function(x) sum((x - mean(x))^2)))
  # Degrees of freedom
  df_between <- length(unique_groups) - 1
  df_within <- length(response) - length(unique_groups)
  # Mean squares
  MS_between <- SSB / df_between
  MS_within <- SSW / df_within
  # F-statistic (named F_stat to avoid masking R's built-in F)
  F_stat <- MS_between / MS_within
  # p-value: upper-tail probability of the F distribution
  p_value <- pf(F_stat, df_between, df_within, lower.tail = FALSE)
  # Return results
  list(F_statistic = F_stat, p_value = p_value,
       df_between = df_between, df_within = df_within)
}
Example using mtcars:
data(mtcars)
result_custom_anova <- one_way_anova(mtcars$mpg, mtcars$gear)
print(result_custom_anova)
Result: F ≈ 10.90 with 2 and 29 degrees of freedom and p ≈ 0.0003, statistically significant at the 1% level.
Let's compare with the built-in aov() function. Note that the grouping variable (gear in our case) must be converted to a factor.
result_2 <- aov(mpg ~ factor(gear), data = mtcars)
summary(result_2)
As you can see, the results are identical. Therefore, there is a statistically significant difference in average mpg across cars with different numbers of forward gears, which we can additionally verify visually.
library(ggplot2)
# Create a boxplot with a different color for each number of gears and overlay the data points
ggplot(mtcars, aes(x = factor(gear), y = mpg, fill = factor(gear))) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2), alpha = 0.5) +  # add data points
  scale_fill_manual(values = c("red", "green", "blue")) +              # one color per level
  labs(title = "Boxplot of MPG at Different Gears",
       x = "Gears",
       y = "MPG") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)  # center the title
  )
Conclusion
One limitation of analysis of variance (ANOVA) is that it doesn't explicitly identify which specific group means are different from each other. While ANOVA determines whether there are statistically significant differences in at least one pair of group means, it does not pinpoint the specific pairs that exhibit these differences. This limitation prompts the need for additional post hoc tests or pairwise comparisons.
Post hoc tests are employed after ANOVA to perform detailed pairwise comparisons between group means. These tests help identify which groups differ significantly from one another. Commonly used post hoc tests include Tukey's Honestly Significant Difference (HSD), Bonferroni correction, Scheffé's test, and others. Each of these tests has its own strengths and considerations, and the choice often depends on the specific characteristics of the data and the research question.
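As a quick preview of part 3, Tukey's HSD is built into base R and can be applied directly to the aov fit from above:

# Pairwise comparisons of mean mpg between gear groups (preview of part 3)
TukeyHSD(result_2)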
In part 2 we will discuss two-way ANOVA, and in part 3 we will talk about post hoc tests.
Stay tuned!