Comparing means of different groups (Analysis of Variance)

Comparing means of different groups (Analysis of Variance)

Analysis of Variance (ANOVA) is collection of statistical tests used to analyze the difference between means of more than two groups. ANOVA simultaneously compares means across several groups and determine if observed differences are due to chance or genuine reflection of data variation. While ANOVA compare means of more than two groups, t-test are only used for comparing two groups at one time.

The researcher begins by set hypothesis where null hypothesis states that the means of all groups are the same while any difference between means of these groups will be due the random change. Alternative hypothesis stating that at least the means of one group is difference from others.

Type of Analysis of Variance

·?????? One-way ANOVA involves a single independent variable (single factor).

·?????? Two-way ANOVA involves two independent variables (two or more factors).

·?????? One-way Multivariate Analysis of Variance (MANOVA) allows for assessment of the impact of one independent variable (factor) on multiple dependent variables.

·?????? Two-way MANOVA allows for simultaneous assessment of the impact of two independent variables (factors) on multiple dependent variables.

ANOVA and MANOVA

Multivariate Analysis of Covariance (MANCOVA)

MANCOVA is extension of ANOVA with covariate(s). In MANCOVA are assessed multiple continuous dependent variables by an independent grouping variables, while controlling for a third variable called the covariate. Multiple covariates can be used. MANCOVA also can be ran as one-way or two-way.

MANCOVA

Assumptions

·?????? Dependent variables are continuous, as such, should be measured at the interval or ratio level.

·?????? Independent variable should consist of two or more categorical, independent groups.

·?????? Variance is comparable in different experiment groups.

·?????? Dependent variable is normally distributed.

·?????? The larger sample size, the better.

·?????? No univariate or multivariate outliers.

·?????? For MANOVA and MANCOVA, linear relationship between each pair of dependent variables for each group of the independent variable.

·?????? For MANCOVA, there is homogeneity of variance-covariance matrices.

·?????? For MANCOVA, there is no multicollinearity.

Limitations

Analysis of Variance only tell if there is a significant difference between the means of more than two groups, but without explain which pair differs. If there are difference of means between groups, follow up tests must be performed. Analysis of Variance assume that the variations are same or similar across groups, if this assumption fails, the conclusion maybe inaccurate.

?Analysis of Variance in R

One-way ANOVA

#Importing library
library(car)

# Attach the data make it ease to call later
attach(iris)

#Display the Histogram
hist(iris$Sepal.Width)

# Test of Noarmality
shapiro.test(iris$Sepal.Width)

# The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed. 
# If it’s less than 0.05, then the data is not normal distribuited.

# Testing Variances
lm.iris1<-lm(Sepal.Width~Species,data=iris)
leveneTest(lm.iris1)

# The variances are equal, as the p-value is 0.556 (greater than 0.05). This dataset satisfies the assumptions, ANOVA can be ran!
  
# Running the one-way ANOVA
anova(lm.iris1)
# There is at least one group that is significantly different (p is super small: p< 2.2 x 10^-16).

# Display the Summary of Anova
summary(lm.iris1)


# Post-hoc Multiple Comparisons
# Tukey HSD is the follow up to one-way ANOVA, when the F-test has revealed the existence of a significant difference between groups
iris.aov<-aov(Sepal.Width~Species,data=iris)
TukeyHSD(iris.aov)

# Actual difference between the means under diff and the adjusted p-value (p adj) for each pairwise comparison        

Two-way ANOVA

# Importing library
library(car)

# Attach the data make it ease to call later
attach(iris)

iris$level <- c('low', 'high')

# Display the Histogram
hist(iris$Sepal.Width)

# Test of Noarmality
shapiro.test(iris$Sepal.Width)
# The Shapiro-Wilk test shows a p-value of 0.10, which is greater than 0.05, suggesting that the data is normally distributed. 
# If it’s less than 0.05, then the data is not normal distribuited.

# Testing Variances
lm.iris1<-lm(Sepal.Width~Species*level,data=iris)
leveneTest(lm.iris1)
# The variances are equal, as the p-value is 0.556 (greater than 0.05). This dataset satisfies the assumptions of equality or similarity of variance, ANOVA can be ran!

# Running the two-way ANOVA
anova(lm.iris1)
# There is at least one group that is significantly different (p is super small: p< 2.2 x 10^-16).

# Display the Summary of Anova
summary(lm.iris1)
# Species:level row. The farthest right column (Pr(>F)) shows a p-value of 0.42, the interaction of the 2 variables is not significant.
# In the level row, we see a p-value of 0.81 - level does not have a significant effect on sepal width.
# Species has a super small p-value (less than 2 X 10^-16), it means that it has a significant effect on sepal width.

# Post-hoc Multiple Comparisons
# The only signicant factor was Species, this post-hoc comparison will be as follow
iris1.aov<-aov(Sepal.Width~Species,data=iris)
TukeyHSD(iris1.aov)
# Actual difference between the means under diff and the adjusted p-value (p adj) for each pairwise comparison

# If there was significant interaction of Species*level the code would be as follow
iris2.aov<-aov(Sepal.Width~Species*level,data=iris)
TukeyHSD(iris2.aov)        

One and two-way MANOVA

# Importing library
library(car)
library(tidyverse)
library(ggpubr)
library(rstatix)
library(GGally)

# Attach the data make it ease to call later
attach(iris)

# Boxplot
ggboxplot(
  iris, x = "Species", y = c("Sepal.Length", "Petal.Length"), 
  merge = TRUE, palette = "jco"
)

# Statistics
iris %>%
  group_by(Species) %>%
  get_summary_stats(Sepal.Length, Petal.Length, type = "mean_sd")

# Sample size
iris %>%
  group_by(Species) %>%
  summarise(N = n())

# Univariate outliers
iris %>%
  group_by(Species) %>%
  identify_outliers(Sepal.Length)

iris %>%
  group_by(Species) %>%
  identify_outliers(Petal.Length)

# Multivariate outliers
iris %>%
  group_by(Species) %>%
  mahalanobis_distance() %>%
  filter(is.outlier == TRUE) %>%
  as.data.frame()
# No multivariate outliers in the data, Mahalanobis distance (p > 0.001).

# Univariate normality
iris %>%
  group_by(Species) %>%
  shapiro_test(Sepal.Length, Petal.Length) %>%
  arrange(variable)
# Sepal.Length and Petal.length normally distributed for each Species groups, Shapiro-Wilk's test (p > 0.05).

# QQ plot of Sepal.Length
ggqqplot(iris, "Sepal.Length", facet.by = "Species",
         ylab = "Sepal Length", ggtheme = theme_bw())

# QQ plot of Petal.Length
ggqqplot(iris, "Petal.Length", facet.by = "Species",
         ylab = "Petal Length", ggtheme = theme_bw())


# Multivariate normality
iris %>%
  select(Sepal.Length, Petal.Length) %>%
  mshapiro_test()
# With (p > 0.05), can be assumed multivariate normality in data.

# Multicollinearity
iris %>% cor_test(Sepal.Length, Petal.Length)
# No multicollinearity, Pearson correlation (r = 0.87, p < 0.0001).

# Linearity
# Create a scatterplot matrix by group
results <- iris %>%
  select(Sepal.Length, Petal.Length, Species) %>%
  group_by(Species) %>%
  doo(~ggpairs(.) + theme_bw(), result = "plots")
results

# Show the plots
results$plots

# Homogeneity of covariances
box_m(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)
# As (p < 0.001), assumption of homogeneity of variance-covariance matrices as been violated.

# Homogneity of variance
iris %>% 
  gather(key = "variable", value = "value", Sepal.Length, Petal.Length) %>%
  group_by(variable) %>%
  levene_test(value ~ Species)
# No homogeneity of variances as (p < 0.05).

# One way Manova
model <- lm(cbind(Sepal.Length, Petal.Length) ~ Species, iris)
Manova(model, test.statistic = "Pillai")
# Can be seen that there was a statistically significant difference between the Species on the combined dependent variables (Sepal.Length and Petal.Length)
# F(4, 294) = 71.829, p < 0.0001.

# Two way Manova
iris$Level <- c('low', 'high')
model <- lm(cbind(Sepal.Length, Petal.Length) ~ Species * Level , iris)
Manova(model, test.statistic = "Pillai")        
Nilton Tomás

Supply Associate at UNHCR, the UN Refugee Agency

8 个月

Interesting!

Anselmo CIPRIANO

Social Development Specialist

8 个月

Interesting!

要查看或添加评论,请登录

José Jaime Comé的更多文章

  • Machine Learning: Predicting outcomes using Binary Logistic Regression

    Machine Learning: Predicting outcomes using Binary Logistic Regression

    Logistic regression is a statistical model that is used for binary classification by linear combination of data of one…

  • Prediction Model using Autoregressive Integrated Moving Average (ARIMA)

    Prediction Model using Autoregressive Integrated Moving Average (ARIMA)

    An autoregressive integrated moving average (ARIMA) is a statistical analysis model that predict values based on…

  • Linear Discriminant Analysis

    Linear Discriminant Analysis

    Linear discriminant analysis (LDA) group data into categories, as such, this technique is used for dimensionality…

    1 条评论
  • Factor Analysis

    Factor Analysis

    Factor analysis is a statistical method used to describe variability among large number of observed, correlated…

    1 条评论
  • Principal Component Analysis (PCA)

    Principal Component Analysis (PCA)

    The number of features or dimensions in a dataset can lead to issues such as overfitting, increasing computation…

    1 条评论
  • Data Governance

    Data Governance

    While Data management is part of the overall management of data. Data governance in short is just documentation…

  • Data Mining with Cluster Analysis

    Data Mining with Cluster Analysis

    The Cluster analysis is technique of statistical analysis and one of the method of data mining that consist of dividing…

社区洞察

其他会员也浏览了