登录查看更多内容

Comparing means of different groups (Analysis of Variance)

José Jaime Comé

Information Management Associate @ UNHCR ? Data Specialist/Statistician (Python||R||SQL||PowerBI||Excel) ? Youtube: 15K+ subscribers

发布日期: 2024年6月29日

Analysis of Variance (ANOVA) is collection of statistical tests used to analyze the difference between means of more than two groups. ANOVA simultaneously compares means across several groups and determine if observed differences are due to chance or genuine reflection of data variation. While ANOVA compare means of more than two groups, t-test are only used for comparing two groups at one time.

The researcher begins by set hypothesis where null hypothesis states that the means of all groups are the same while any difference between means of these groups will be due the random change. Alternative hypothesis stating that at least the means of one group is difference from others.

Type of Analysis of Variance

·?????? One-way ANOVA involves a single independent variable (single factor).

·?????? Two-way ANOVA involves two independent variables (two or more factors).

·?????? One-way Multivariate Analysis of Variance (MANOVA) allows for assessment of the impact of one independent variable (factor) on multiple dependent variables.

·?????? Two-way MANOVA allows for simultaneous assessment of the impact of two independent variables (factors) on multiple dependent variables.

Multivariate Analysis of Covariance (MANCOVA)

MANCOVA is extension of ANOVA with covariate(s). In MANCOVA are assessed multiple continuous dependent variables by an independent grouping variables, while controlling for a third variable called the covariate. Multiple covariates can be used. MANCOVA also can be ran as one-way or two-way.

Assumptions

·?????? Dependent variables are continuous, as such, should be measured at the interval or ratio level.

·?????? Independent variable should consist of two or more categorical, independent groups.

·?????? Variance is comparable in different experiment groups.

领英推荐

Normal Distribution or (Gaussian) distribution in…

Lean Manufacturing & Six Sigma Worldwide 1 年前

One Way ANOVA in Statistics (VIDEO??)

Lean Manufacturing & Six Sigma Worldwide 1 年前

Test Assumptions Only After Running an Initial Model

The Analysis Factor 1 个月前

·?????? Dependent variable is normally distributed.

·?????? The larger sample size, the better.

·?????? No univariate or multivariate outliers.

·?????? For MANOVA and MANCOVA, linear relationship between each pair of dependent variables for each group of the independent variable.

·?????? For MANCOVA, there is homogeneity of variance-covariance matrices.

·?????? For MANCOVA, there is no multicollinearity.

Limitations

Analysis of Variance only tell if there is a significant difference between the means of more than two groups, but without explain which pair differs. If there are difference of means between groups, follow up tests must be performed. Analysis of Variance assume that the variations are same or similar across groups, if this assumption fails, the conclusion maybe inaccurate.

?Analysis of Variance in R

One-way ANOVA

#Importing library
library(car)

# Attach the data make it ease to call later
attach(iris)

#Display the Histogram
hist(iris$Sepal.Width)

# Test of Noarmality
shapiro.test(iris$Sepal.Width)

# The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed. 
# If it’s less than 0.05, then the data is not normal distribuited.

# Testing Variances
lm.iris1<-lm(Sepal.Width~Species,data=iris)
leveneTest(lm.iris1)

# The variances are equal, as the p-value is 0.556 (greater than 0.05). This dataset satisfies the assumptions, ANOVA can be ran!
  
# Running the one-way ANOVA
anova(lm.iris1)
# There is at least one group that is significantly different (p is super small: p< 2.2 x 10^-16).

# Display the Summary of Anova
summary(lm.iris1)


# Post-hoc Multiple Comparisons
# Tukey HSD is the follow up to one-way ANOVA, when the F-test has revealed the existence of a significant difference between groups
iris.aov<-aov(Sepal.Width~Species,data=iris)
TukeyHSD(iris.aov)

# Actual difference between the means under diff and the adjusted p-value (p adj) for each pairwise comparison

Two-way ANOVA

# Importing library
library(car)

# Attach the data make it ease to call later
attach(iris)

iris$level <- c('low', 'high')

# Display the Histogram
hist(iris$Sepal.Width)

# Test of Noarmality
shapiro.test(iris$Sepal.Width)
# The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed. 
# If it’s less than 0.05, then the data is not normal distribuited.

# Testing Variances
lm.iris1<-lm(Sepal.Width~Species*level,data=iris)
leveneTest(lm.iris1)
# The variances are equal, as the p-value is 0.556 (greater than 0.05). This dataset satisfies the assumptions of equality or similarity of variance, ANOVA can be ran!

# Running the two-way ANOVA
anova(lm.iris1)
# There is at least one group that is significantly different (p is super small: p< 2.2 x 10^-16).

# Display the Summary of Anova
summary(lm.iris1)
# Species:level row. The farthest right column (Pr(>F)) shows a p-value of 0.42, the interaction of the 2 variables is not significant.
# In the level row, we see a p-value of 0.81 - level does not have a significant effect on sepal width.
# Species has a super small p-value (less than 2 X 10^-16), it means that it has a significant effect on sepal width.

# Post-hoc Multiple Comparisons
# The only signicant factor was Species, this post-hoc comparison will be as follow
iris1.aov<-aov(Sepal.Width~Species,data=iris)
TukeyHSD(iris1.aov)
# Actual difference between the means under diff and the adjusted p-value (p adj) for each pairwise comparison

# If there was significant interaction of Species*level the code would be as follow
iris2.aov<-aov(Sepal.Width~Species*level,data=iris)
TukeyHSD(iris2.aov)

One and two-way MANOVA

# Importing library
library(car)
library(tidyverse)
library(ggpubr)
library(rstatix)
library(GGally)

# Attach the data make it ease to call later
attach(iris)

# Boxplot
ggboxplot(
  iris, x = "Species", y = c("Sepal.Length", "Petal.Length"), 
  merge = TRUE, palette = "jco"
)

# Statistics
iris %>%
  group_by(Species) %>%
  get_summary_stats(Sepal.Length, Petal.Length, type = "mean_sd")

# Sample size
iris %>%
  group_by(Species) %>%
  summarise(N = n())

# Univariate outliers
iris %>%
  group_by(Species) %>%
  identify_outliers(Sepal.Length)

iris %>%
  group_by(Species) %>%
  identify_outliers(Petal.Length)

# Multivariate outliers
iris %>%
  group_by(Species) %>%
  mahalanobis_distance() %>%
  filter(is.outlier == TRUE) %>%
  as.data.frame()
# No multivariate outliers in the data, Mahalanobis distance (p > 0.001).

# Univariate normality
iris %>%
  group_by(Species) %>%
  shapiro_test(Sepal.Length, Petal.Length) %>%
  arrange(variable)
# Sepal.Length and Petal.length normally distributed for each Species groups, Shapiro-Wilk's test (p > 0.05).

# QQ plot of Sepal.Length
ggqqplot(iris, "Sepal.Length", facet.by = "Species",
         ylab = "Sepal Length", ggtheme = theme_bw())

# QQ plot of Petal.Length
ggqqplot(iris, "Petal.Length", facet.by = "Species",
         ylab = "Petal Length", ggtheme = theme_bw())


# Multivariate normality
iris %>%
  select(Sepal.Length, Petal.Length) %>%
  mshapiro_test()
# With (p > 0.05), can be assumed multivariate normality in data.

# Multicollinearity
iris %>% cor_test(Sepal.Length, Petal.Length)
# No multicollinearity, Pearson correlation (r = 0.87, p < 0.0001).

# Linearity
# Create a scatterplot matrix by group
results <- iris %>%
  select(Sepal.Length, Petal.Length, Species) %>%
  group_by(Species) %>%
  doo(~ggpairs(.) + theme_bw(), result = "plots")
results

# Show the plots
results$plots

# Homogeneity of covariances
box_m(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)
# As (p < 0.001), assumption of homogeneity of variance-covariance matrices as been violated.

# Homogneity of variance
iris %>% 
  gather(key = "variable", value = "value", Sepal.Length, Petal.Length) %>%
  group_by(variable) %>%
  levene_test(value ~ Species)
# No homogeneity of variances as (p < 0.05).

# One way Manova
model <- lm(cbind(Sepal.Length, Petal.Length) ~ Species, iris)
Manova(model, test.statistic = "Pillai")
# Can be seen that there was a statistically significant difference between the Species on the combined dependent variables (Sepal.Length and Petal.Length)
# F(4, 294) = 71.829, p < 0.0001.

# Two way Manova
iris$Level <- c('low', 'high')
model <- lm(cbind(Sepal.Length, Petal.Length) ~ Species * Level , iris)
Manova(model, test.statistic = "Pillai")

Nilton Tomás

Supply Associate at UNHCR, the UN Refugee Agency

8 个月

Interesting!

1 次回应

Anselmo CIPRIANO

Social Development Specialist

8 个月

Interesting!

1 次回应

查看更多评论

要查看或添加评论，请登录

José Jaime Comé的更多文章

Machine Learning: Predicting outcomes using Binary Logistic Regression

2024年8月7日

Machine Learning: Predicting outcomes using Binary Logistic Regression

Logistic regression is a statistical model that is used for binary classification by linear combination of data of one…
Prediction Model using Autoregressive Integrated Moving Average (ARIMA)

2024年7月10日

Prediction Model using Autoregressive Integrated Moving Average (ARIMA)

An autoregressive integrated moving average (ARIMA) is a statistical analysis model that predict values based on…
Linear Discriminant Analysis

2024年6月21日

Linear Discriminant Analysis

Linear discriminant analysis (LDA) group data into categories, as such, this technique is used for dimensionality…

1 条评论
Factor Analysis

2024年6月11日

Factor Analysis

Factor analysis is a statistical method used to describe variability among large number of observed, correlated…

1 条评论
Principal Component Analysis (PCA)

2024年5月27日

Principal Component Analysis (PCA)

The number of features or dimensions in a dataset can lead to issues such as overfitting, increasing computation…

1 条评论
Data Governance

2024年5月15日

Data Governance

While Data management is part of the overall management of data. Data governance in short is just documentation…
Data Mining with Cluster Analysis

2024年2月25日

Data Mining with Cluster Analysis

The Cluster analysis is technique of statistical analysis and one of the method of data mining that consist of dividing…

See all articles

Comparing means of different groups (Analysis of Variance)

José Jaime Comé

Information Management Associate @ UNHCR ? Data Specialist/Statistician (Python||R||SQL||PowerBI||Excel) ? Youtube: 15K+ subscribers

领英推荐

José Jaime Comé的更多文章

社区洞察

其他会员也浏览了

Outliers: To Drop or Not to Drop

Demystifying the P-Value: A Simple Guide for Non-Statisticians

Moving Averages in Time Series Analysis

The Magic Number 30: Why a Sample Size of 30 Is Often Considered Sufficient for Statistical Significance

Four Flaws in Foundations of Statistics

Can Likert Scale Data ever be Continuous?

Chapter 2: Histogram-based Outlier Score (HBOS)

Multiple Regression

Going Deeper into Time Series Analysis by Focusing on Frequency

领英推荐

José Jaime Comé的更多文章

Machine Learning: Predicting outcomes using Binary Logistic Regression

Prediction Model using Autoregressive Integrated Moving Average (ARIMA)

Linear Discriminant Analysis

Factor Analysis

Principal Component Analysis (PCA)

Data Governance

Data Mining with Cluster Analysis

社区洞察

其他会员也浏览了

Outliers: To Drop or Not to Drop

Demystifying the P-Value: A Simple Guide for Non-Statisticians

Moving Averages in Time Series Analysis

The Magic Number 30: Why a Sample Size of 30 Is Often Considered Sufficient for Statistical Significance

Four Flaws in Foundations of Statistics

Can Likert Scale Data ever be Continuous?

Chapter 2: Histogram-based Outlier Score (HBOS)

Multiple Regression

Going Deeper into Time Series Analysis by Focusing on Frequency