A Deep Dive into ANOVA (part 1)

Analysis of Variance (ANOVA) is a statistical method used to assess the equality of means across multiple groups. In this article we will focus on one-way ANOVA, which deals with situations where there is one independent variable with more than two levels (for comparisons between just two levels, refer to the t-test). We are going to delve into the theoretical foundations of one-way ANOVA, including the formulas, derivations and assumptions, and then implement it from scratch to analyze R's built-in mtcars dataset.


Formulas and Derivations

Consider a scenario with k groups.

Hypotheses (for groups $i = 1, \dots, k$):

Null hypothesis: $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ (all means are equal)
Alternative hypothesis: $H_a: \mu_i \neq \mu_j$ for at least one pair $(i, j)$ (at least two means are different)

The overall mean is defined as the average of all observations:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}$$

where $N = \sum_{i=1}^{k} n_i$ is the total number of observations and $n_i$ is the size of the $i$-th group.

The group mean for the $i$-th group is calculated as follows:

$$\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$$

The sum of squares between groups (SSB) is the sum of squared deviations of group means from the overall mean, weighted by the group sample sizes:

$$SSB = \sum_{i=1}^{k} n_i \left( \bar{x}_i - \bar{x} \right)^2$$

SSB indicates how much the group means differ from the overall mean.

The sum of squares within groups (SSW) is the sum of squared deviations of individual observations from their respective group means:

$$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_i \right)^2$$

SSW captures how much individual observations deviate from their respective group means. Together the two components partition the total sum of squares: $SST = SSB + SSW$.

Degrees of freedom (the maximum number of independent values that can vary) for SSB and SSW are:

$$df_{between} = k - 1, \qquad df_{within} = N - k$$

The mean squares (MS) for between groups and within groups are calculated as the sum of squares divided by their respective degrees of freedom:

$$MSB = \frac{SSB}{k - 1}, \qquad MSW = \frac{SSW}{N - k}$$

The F-statistic is the ratio of the mean squares:

$$F = \frac{MSB}{MSW}$$
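
To make the formulas concrete, here is a tiny worked sketch on made-up data: for groups (1, 2, 3) and (4, 5, 6), the overall mean is 3.5, the group means are 2 and 5, SSB = 3(2 - 3.5)^2 + 3(5 - 3.5)^2 = 13.5, SSW = 2 + 2 = 4, MSB = 13.5/1 = 13.5, MSW = 4/4 = 1, and so F = 13.5. The same numbers can be reproduced in R:

# Toy example: two made-up groups of three observations each
y <- c(1, 2, 3, 4, 5, 6)
g <- rep(c("a", "b"), each = 3)
summary(aov(y ~ factor(g)))  # reports SSB = 13.5, SSW = 4, F = 13.5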

The F-statistic is compared against the critical value from the F-distribution to determine statistical significance. The critical value is determined by the chosen significance level (alpha) and the degrees of freedom for SSB and SSW. The corresponding p-value is also obtained.
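
As a minimal sketch of this step (alpha, the degrees of freedom, and the F-statistic below are placeholder values, not taken from any particular dataset), the critical value and p-value come from R's qf() and pf():

alpha <- 0.05       # hypothetical significance level
df_between <- 2     # k - 1 (placeholder)
df_within <- 27     # N - k (placeholder)
F_stat <- 5.0       # placeholder F-statistic

# Critical value: reject H0 if F_stat exceeds it
qf(1 - alpha, df_between, df_within)

# p-value: upper-tail probability of the F distribution
pf(F_stat, df_between, df_within, lower.tail = FALSE)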

It is important to keep in mind that the p-value does not represent the probability that the result is due to chance, which is a very common misconception. What it represents is the probability that, given a chance model, results as extreme as the observed results could occur. In the context of ANOVA, the p-value is the probability of observing an F-statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.

$$p\text{-value} = P\left( F \geq F_{observed} \mid H_0 \right)$$

Assumptions:

  1. Normality: The data within each group should be approximately normally distributed. Violations of this assumption matter less with larger sample sizes due to the Central Limit Theorem.
  2. Homogeneity of Variances: The variances within each group should be roughly equal. This assumption, also known as homoscedasticity, is crucial for the validity of the F-test.
  3. Independence: Observations within each group should be independent of each other.
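
These assumptions can be checked before running the test. A minimal sketch using base R (shapiro.test() for per-group normality and bartlett.test() for equal variances; the mtcars variables anticipate the example later in this article):

# Normality: Shapiro-Wilk test within each group
tapply(mtcars$mpg, mtcars$gear, shapiro.test)

# Homogeneity of variances: Bartlett's test across groups
bartlett.test(mpg ~ factor(gear), data = mtcars)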


Why does ANOVA work?

According to Cochran's theorem, under the assumption of data normality the relevant quadratic forms (in our case, the sums of squares scaled by the common variance) are independent and Chi-squared distributed under the null hypothesis. The ratio of two independent Chi-squared variables, each divided by its degrees of freedom, follows an F distribution, which is exactly how the F-statistic above is constructed.
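
We can illustrate this empirically with a small simulation sketch (the number of groups, group size, and replication count below are arbitrary choices): draw all groups from the same normal distribution so that the null hypothesis holds, compute the F-statistic repeatedly, and compare its empirical distribution with the theoretical F density.

set.seed(42)
k <- 3      # number of groups
n <- 10     # observations per group
reps <- 10000

# Simulate the F-statistic under H0 (all groups share the same mean)
sim_F <- replicate(reps, {
  y <- rnorm(k * n)
  g <- factor(rep(1:k, each = n))
  summary(aov(y ~ g))[[1]][["F value"]][1]
})

# Histogram of simulated F-statistics with the F(k-1, N-k) density overlaid
hist(sim_F, breaks = 50, freq = FALSE, main = "Simulated F-statistic under H0")
curve(df(x, df1 = k - 1, df2 = k * n - k), add = TRUE, col = "red", lwd = 2)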


ANOVA in R from scratch

We will rely on the tapply() function, which applies a function over subsets of a vector (for example, to calculate the mean of each group):

# Function to perform one-way ANOVA
one_way_anova <- function(response, group) {
  # Data preparation
  data <- data.frame(response, group)
  unique_groups <- unique(group)
  
  # Calculate overall mean
  overall_mean <- mean(response)
  
  # Calculate sum of squares between groups (SSB)
  SSB <- sum(table(data$group) * (tapply(data$response, data$group, mean) - overall_mean)^2)
  
  # Calculate sum of squares within groups (SSW)
  SSW <- sum(tapply(data$response, data$group, function(x) sum((x - mean(x))^2)))
  
  # Degrees of freedom
  df_between <- length(unique_groups) - 1
  df_within <- length(response) - length(unique_groups)
  
  # Mean squares
  MS_between <- SSB / df_between
  MS_within <- SSW / df_within
  
  # F-statistic (named F_stat to avoid masking R's built-in F, an alias for FALSE)
  F_stat <- MS_between / MS_within
  
  # p-value: upper tail of the F distribution
  p_value <- pf(F_stat, df_between, df_within, lower.tail = FALSE)
  
  # Return results
  return(list(F_statistic = F_stat, p_value = p_value, df_between = df_between, df_within = df_within))
}        

Example using mtcars:

data(mtcars)
result_custom_anova <- one_way_anova(mtcars$mpg, mtcars$gear)
print(result_custom_anova)        

Result (statistically significant at the 1% level):

Custom ANOVA function run on the mtcars dataset (comparing mpg across gear levels)

Let's compare with the built-in aov() function. Note that the grouping variable (gear in our case) must be converted to a factor.

result_2 <- aov(mpg ~ factor(gear), data = mtcars)

summary(result_2)        
Built-in ANOVA results
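
A quick programmatic check (a sketch; the indexing relies on summary.aov() returning a list of ANOVA tables) confirms that the two sets of numbers match:

# Extract F and p from the built-in result and compare with the custom function
builtin <- summary(result_2)[[1]]
all.equal(result_custom_anova$F_statistic, builtin[["F value"]][1])
all.equal(result_custom_anova$p_value, builtin[["Pr(>F)"]][1])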

As you can see, the results are identical. Therefore, there does exist a difference in average mpg across different numbers of forward gears, which we can additionally verify visually.

library(ggplot2)

# Create a boxplot with different colors for each level of gears and add data points
ggplot(mtcars, aes(x = factor(gear), y = mpg, fill = factor(gear))) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2), alpha = 0.5) +  # Add data points
  scale_fill_manual(values = c("red", "green", "blue")) +  # Specify colors for each level
  labs(title = "Boxplot of MPG at Different Gears",
       x = "Gears",
       y = "MPG") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)  # Center the title
  )        
MPG for different number of forward gears

Conclusion

One limitation of analysis of variance (ANOVA) is that it doesn't explicitly identify which specific group means are different from each other. While ANOVA determines whether there are statistically significant differences in at least one pair of group means, it does not pinpoint the specific pairs that exhibit these differences. This limitation prompts the need for additional post hoc tests or pairwise comparisons.

Post hoc tests are employed after ANOVA to perform detailed pairwise comparisons between group means. These tests help identify which groups differ significantly from one another. Commonly used post hoc tests include Tukey's Honestly Significant Difference (HSD), Bonferroni correction, Scheffé's test, and others. Each of these tests has its own strengths and considerations, and the choice often depends on the specific characteristics of the data and the research question.
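
As a brief preview (a minimal sketch reusing the result_2 object fitted above), Tukey's HSD is available directly in base R for a fitted aov model:

# Pairwise comparisons of mean mpg between gear levels with adjusted p-values
TukeyHSD(result_2)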

In part 2 we will discuss two-way ANOVA, and in part 3 we will talk about post hoc tests.

Stay tuned!

