A Primer on Statistical Power and Power Analysis
Keith McNulty
Leader in Technology, Science and Analytics | Mathematician, Statistician and Psychometrician | Author and Teacher | Coder, Engineer, Architect
If your experience is anything like mine, you’ve probably heard numerous people talk about ‘statistical power’ in conversations at work. I’m pretty sure that for the most part these people are pushing for larger sample sizes on the basis of some vague conception that a larger n is always good.
But how many of these people can actually define what statistical power is? In this article I want to take a look at the concept and definition of statistical power and identify where it is useful as a measure.
Hypothesis testing
The term ‘statistical power’ only has meaning when it is referring to a hypothesis test. You may recall that a hypothesis test involves using the statistical properties of a sample of data to determine the level of certainty of a statement about the overall population from which that sample was drawn. Let’s take an example. The salespeople data set from the peopleanalyticsdata R package contains the data of a sample of salespeople in a technology company, including their annual sales figures in thousands of dollars and their recent performance ratings on an increasing ordinal scale of 1 to 4. Let’s take a look at the first few rows.
library(peopleanalyticsdata)
# remove any rows with missing values
salespeople <- salespeople[complete.cases(salespeople), ]
head(salespeople)
Now let’s take this statement: the mean sales of top performing salespeople is different from that of bottom performing salespeople in the overall population. We start by assuming this is not true — that is, that the mean sales of top performers is the same as that of bottom performers — and call this the null hypothesis. We then perform a test to establish the probability that our samples would look the way they do if the null hypothesis were indeed true in the population — known as the p-value of the test. In this case we conduct Welch’s t-test, which compares the means of two samples without assuming that they have equal variances.
# sales of top performers
sales4 <- salespeople$sales[salespeople$performance == 4]
# sales of bottom performers
sales1 <- salespeople$sales[salespeople$performance == 1]
# p-value of null hypothesis that their means are the same
t.test(sales4, sales1)$p.value
## 1.093244e-05
This indicates that it is highly unlikely that our samples would look the way they do if our null hypothesis were true in the overall population. We define a level of likelihood below which we agree to reject the null hypothesis, and this is known as alpha. Often alpha is 0.05, but sometimes it can be much lower. If we take our alpha to be 0.05 here, we comfortably reject the null hypothesis and conclude the alternative hypothesis — that there is a difference in the mean sales of low and high performers in the population. Note that choosing an alpha of 0.05 means that, in situations where the null hypothesis is actually true, we will wrongly reject it on average 1 in every 20 times. Hypothesis testing is about likelihood, not about certainty.
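Concretely, the decision rule boils down to a single comparison. Here is a minimal sketch, reusing the sales4 and sales1 samples from above:
alpha <- 0.05
# the decision rule: reject the null hypothesis when the p-value is below alpha
t.test(sales4, sales1)$p.value < alpha
## TRUE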
Defining statistical power
We can see that hypothesis testing is about the level of certainty at which we are comfortable concluding a difference in the population, acknowledging that we can only observe a sample of that population. Nothing is ever 100% certain for an unobserved population, and therefore four scenarios can occur:
1. The null hypothesis is true in the population, and we do not reject it based on our sample (a correct conclusion).
2. The null hypothesis is true in the population, but we reject it based on our sample (a false positive, also known as a Type I error).
3. The null hypothesis is false in the population, but we fail to reject it based on our sample (a false negative, also known as a Type II error).
4. The null hypothesis is false in the population, and we reject it based on our sample (a correct conclusion).
Statistical power relates to number 4 — it is the probability that the null hypothesis will be rejected based on the sample, given that it is false in the population. Formally, power equals 1 − β, where β is the probability of a Type II error. Intuitively, you can imagine that this depends on the size of your sample, the actual (unobserved) difference in the population (appropriately normalized), and the standard of certainty at which you reject the null hypothesis (alpha). For example, if the actual population difference is larger, you might detect it with a smaller sample. If alpha is smaller, you might need a greater population difference or a higher n to meet your standard of certainty.
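To make this definition concrete, here is a minimal simulation sketch, assuming for illustration a true population difference of 0.5 standard deviations and samples of 30 per group (these values are hypothetical, not drawn from the salespeople data). The proportion of tests that reject a null hypothesis we know to be false approximates the statistical power.
# illustrative simulation with assumed values: estimate power as the
# proportion of tests that correctly reject a false null hypothesis
set.seed(123)
n_sims <- 10000
alpha <- 0.05
rejections <- replicate(n_sims, {
  x <- rnorm(30, mean = 0, sd = 1)    # sample from population 1
  y <- rnorm(30, mean = 0.5, sd = 1)  # sample from population 2 (true difference of 0.5)
  t.test(x, y)$p.value < alpha        # does this test reject the null?
})
# proportion of rejections estimates the statistical power
mean(rejections)
With these assumed values the estimated power comes out at around 0.48, meaning fewer than half of such experiments would detect a difference that really exists.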
The elephant in the room here is of course that we will never know the difference in our population — we only know the difference in our sample. Therefore we usually satisfy ourselves with an observed statistical power, calculated using the observed difference in our sample. For our salespeople example, because it is a t-test, we use Cohen’s effect size d (the difference in sample means divided by the pooled standard deviation) as our normalized observed difference. Combining this with our sample sizes and an alpha of 0.05, we can calculate a statistical power of 0.996 for our hypothesis test. We can be pretty confident that the null hypothesis will be correctly rejected.
library(effectsize)
library(WebPower)
# sample sizes
n4 <- length(sales4)
n1 <- length(sales1)
# cohen's effect size d
d <- cohens_d(sales4, sales1)$Cohens_d
# statistical power
wp.t(n4, n1, d = d, type = "two.sample.2n")
## Unbalanced two-sample t-test
##
##  n1  n2         d alpha    power
##  55  60 0.8741483  0.05 0.996347
When is statistical power useful?
Not that often, to be brutally honest. In situations where you have your samples and your data and you have already conducted hypothesis testing, statistical power is really just a measure of how well you have cleared your alpha bar. The less stringent your alpha, the higher the power. Take a look.
library(ggplot2)
# statistical power across a range of alpha values
test <- WebPower::wp.t(n4, n1, d = d, type = "two.sample.2n",
                       alpha = seq(0.05, 0.0001, by = -0.0001))
test_df <- data.frame(
  Alpha = test$alpha,
  Power = test$power
)
# plot power against alpha
ggplot(test_df, aes(x = Alpha, y = Power)) +
  geom_point(color = "pink") +
  theme_minimal()
If you haven’t yet collected your sample data or done any hypothesis testing, and you are planning an experiment or research effort that could involve a lot of work, statistical power can be a genuinely helpful measure. Because sample size plays a role, you can in theory calculate the minimum sample size needed to achieve a certain power at your chosen alpha.
But in practice I find that whole process highly speculative, because it requires an effect size as input, and of course you don’t know the effect size because you haven’t run your experiment yet. Therefore, most sample size estimates that come from statistical power calculations tend to take the form of sensitivity ranges across a set of assumed effect sizes, as in the sketch below.
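Here is a minimal sketch of such a sensitivity range, using assumed effect sizes of 0.2, 0.5 and 0.8 (Cohen’s conventional small, medium and large values, not observed ones) and asking wp.t to solve for the sample size per group that achieves 80% power:
# sensitivity sketch with assumed effect sizes: solve for the minimum n
# per group that achieves 80% power at alpha = 0.05
assumed_d <- c(0.2, 0.5, 0.8)
sapply(assumed_d, function(d) {
  # leaving n unspecified tells wp.t to solve for it
  WebPower::wp.t(d = d, power = 0.8, alpha = 0.05, type = "two.sample")$n
})
This lands at roughly 394, 64 and 26 per group respectively, illustrating just how sensitive the answer is to the assumed effect size.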
Experiments can be hard to organize and resource, and statistical power can be of some use in gauging the scale of what is needed. It can also help illustrate when you get diminishing returns on your n. For example, if we test a range of sample sizes on a paired t-test with a medium effect size (d = 0.5) and an alpha of 0.05, we see that there is a point beyond which extra n won’t make much of a difference to the power.
# test a range of sample sizes
sample_sizes <- 20:100
power <- wp.t(n1 = sample_sizes, d = 0.5, type = "paired")
power_df <- data.frame(
  n = power$n,
  Power = power$power
)
# plot power against sample size
ggplot(power_df, aes(x = n, y = Power)) +
  geom_point(color = "lightblue") +
  theme_minimal()
Overall, statistical power is a blunt instrument. In some ways you can think of it as a ‘bolt-on’ to hypothesis testing that is only useful in certain situations mostly related to experimental design.
If you are interested in exploring the mathematics and theory of statistical power further, and in learning about the different statistics used in power analysis on hypothesis tests and regression models, check out Chapter 11 of Handbook of Regression Modeling in People Analytics.