Comparing means of different groups (Analysis of Variance)
José Jaime Comé
Information Management Associate @ UNHCR ? Data Specialist/Statistician (Python||R||SQL||PowerBI||Excel) ? Youtube: 15K+ subscribers
Analysis of Variance (ANOVA) is collection of statistical tests used to analyze the difference between means of more than two groups. ANOVA simultaneously compares means across several groups and determine if observed differences are due to chance or genuine reflection of data variation. While ANOVA compare means of more than two groups, t-test are only used for comparing two groups at one time.
The researcher begins by set hypothesis where null hypothesis states that the means of all groups are the same while any difference between means of these groups will be due the random change. Alternative hypothesis stating that at least the means of one group is difference from others.
Type of Analysis of Variance
·?????? One-way ANOVA involves a single independent variable (single factor).
·?????? Two-way ANOVA involves two independent variables (two or more factors).
·?????? One-way Multivariate Analysis of Variance (MANOVA) allows for assessment of the impact of one independent variable (factor) on multiple dependent variables.
·?????? Two-way MANOVA allows for simultaneous assessment of the impact of two independent variables (factors) on multiple dependent variables.
Multivariate Analysis of Covariance (MANCOVA)
MANCOVA is extension of ANOVA with covariate(s). In MANCOVA are assessed multiple continuous dependent variables by an independent grouping variables, while controlling for a third variable called the covariate. Multiple covariates can be used. MANCOVA also can be ran as one-way or two-way.
Assumptions
·?????? Dependent variables are continuous, as such, should be measured at the interval or ratio level.
·?????? Independent variable should consist of two or more categorical, independent groups.
·?????? Variance is comparable in different experiment groups.
领英推荐
·?????? Dependent variable is normally distributed.
·?????? The larger sample size, the better.
·?????? No univariate or multivariate outliers.
·?????? For MANOVA and MANCOVA, linear relationship between each pair of dependent variables for each group of the independent variable.
·?????? For MANCOVA, there is homogeneity of variance-covariance matrices.
·?????? For MANCOVA, there is no multicollinearity.
Limitations
Analysis of Variance only tell if there is a significant difference between the means of more than two groups, but without explain which pair differs. If there are difference of means between groups, follow up tests must be performed. Analysis of Variance assume that the variations are same or similar across groups, if this assumption fails, the conclusion maybe inaccurate.
?Analysis of Variance in R
One-way ANOVA
#Importing library
library(car)
# Attach the data make it ease to call later
attach(iris)
#Display the Histogram
hist(iris$Sepal.Width)
# Test of Noarmality
shapiro.test(iris$Sepal.Width)
# The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed.
# If it’s less than 0.05, then the data is not normal distribuited.
# Testing Variances
lm.iris1<-lm(Sepal.Width~Species,data=iris)
leveneTest(lm.iris1)
# The variances are equal, as the p-value is 0.556 (greater than 0.05). This dataset satisfies the assumptions, ANOVA can be ran!
# Running the one-way ANOVA
anova(lm.iris1)
# There is at least one group that is significantly different (p is super small: p< 2.2 x 10^-16).
# Display the Summary of Anova
summary(lm.iris1)
# Post-hoc Multiple Comparisons
# Tukey HSD is the follow up to one-way ANOVA, when the F-test has revealed the existence of a significant difference between groups
iris.aov<-aov(Sepal.Width~Species,data=iris)
TukeyHSD(iris.aov)
# Actual difference between the means under diff and the adjusted p-value (p adj) for each pairwise comparison
Two-way ANOVA
# Importing library
library(car)
# Attach the data make it ease to call later
attach(iris)
iris$level <- c('low', 'high')
# Display the Histogram
hist(iris$Sepal.Width)
# Test of Noarmality
shapiro.test(iris$Sepal.Width)
# The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed.
# If it’s less than 0.05, then the data is not normal distribuited.
# Testing Variances
lm.iris1<-lm(Sepal.Width~Species*level,data=iris)
leveneTest(lm.iris1)
# The variances are equal, as the p-value is 0.556 (greater than 0.05). This dataset satisfies the assumptions of equality or similarity of variance, ANOVA can be ran!
# Running the two-way ANOVA
anova(lm.iris1)
# There is at least one group that is significantly different (p is super small: p< 2.2 x 10^-16).
# Display the Summary of Anova
summary(lm.iris1)
# Species:level row. The farthest right column (Pr(>F)) shows a p-value of 0.42, the interaction of the 2 variables is not significant.
# In the level row, we see a p-value of 0.81 - level does not have a significant effect on sepal width.
# Species has a super small p-value (less than 2 X 10^-16), it means that it has a significant effect on sepal width.
# Post-hoc Multiple Comparisons
# The only signicant factor was Species, this post-hoc comparison will be as follow
iris1.aov<-aov(Sepal.Width~Species,data=iris)
TukeyHSD(iris1.aov)
# Actual difference between the means under diff and the adjusted p-value (p adj) for each pairwise comparison
# If there was significant interaction of Species*level the code would be as follow
iris2.aov<-aov(Sepal.Width~Species*level,data=iris)
TukeyHSD(iris2.aov)
One and two-way MANOVA
# Importing library
library(car)
library(tidyverse)
library(ggpubr)
library(rstatix)
library(GGally)
# Attach the data make it ease to call later
attach(iris)
# Boxplot
ggboxplot(
iris, x = "Species", y = c("Sepal.Length", "Petal.Length"),
merge = TRUE, palette = "jco"
)
# Statistics
iris %>%
group_by(Species) %>%
get_summary_stats(Sepal.Length, Petal.Length, type = "mean_sd")
# Sample size
iris %>%
group_by(Species) %>%
summarise(N = n())
# Univariate outliers
iris %>%
group_by(Species) %>%
identify_outliers(Sepal.Length)
iris %>%
group_by(Species) %>%
identify_outliers(Petal.Length)
# Multivariate outliers
iris %>%
group_by(Species) %>%
mahalanobis_distance() %>%
filter(is.outlier == TRUE) %>%
as.data.frame()
# No multivariate outliers in the data, Mahalanobis distance (p > 0.001).
# Univariate normality
iris %>%
group_by(Species) %>%
shapiro_test(Sepal.Length, Petal.Length) %>%
arrange(variable)
# Sepal.Length and Petal.length normally distributed for each Species groups, Shapiro-Wilk's test (p > 0.05).
# QQ plot of Sepal.Length
ggqqplot(iris, "Sepal.Length", facet.by = "Species",
ylab = "Sepal Length", ggtheme = theme_bw())
# QQ plot of Petal.Length
ggqqplot(iris, "Petal.Length", facet.by = "Species",
ylab = "Petal Length", ggtheme = theme_bw())
# Multivariate normality
iris %>%
select(Sepal.Length, Petal.Length) %>%
mshapiro_test()
# With (p > 0.05), can be assumed multivariate normality in data.
# Multicollinearity
iris %>% cor_test(Sepal.Length, Petal.Length)
# No multicollinearity, Pearson correlation (r = 0.87, p < 0.0001).
# Linearity
# Create a scatterplot matrix by group
results <- iris %>%
select(Sepal.Length, Petal.Length, Species) %>%
group_by(Species) %>%
doo(~ggpairs(.) + theme_bw(), result = "plots")
results
# Show the plots
results$plots
# Homogeneity of covariances
box_m(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)
# As (p < 0.001), assumption of homogeneity of variance-covariance matrices as been violated.
# Homogneity of variance
iris %>%
gather(key = "variable", value = "value", Sepal.Length, Petal.Length) %>%
group_by(variable) %>%
levene_test(value ~ Species)
# No homogeneity of variances as (p < 0.05).
# One way Manova
model <- lm(cbind(Sepal.Length, Petal.Length) ~ Species, iris)
Manova(model, test.statistic = "Pillai")
# Can be seen that there was a statistically significant difference between the Species on the combined dependent variables (Sepal.Length and Petal.Length)
# F(4, 294) = 71.829, p < 0.0001.
# Two way Manova
iris$Level <- c('low', 'high')
model <- lm(cbind(Sepal.Length, Petal.Length) ~ Species * Level , iris)
Manova(model, test.statistic = "Pillai")
Supply Associate at UNHCR, the UN Refugee Agency
8 个月Interesting!
Social Development Specialist
8 个月Interesting!