Impossible Hypotheses

There are mathematical boundaries to the magnitudes of effect sizes in the population. This means that, when considering multiple variables jointly, some hypothesized effect sizes may be outright impossible. This notion has profound consequences for hypothesis formulation, power analysis (or defining Bayesian priors), and evaluations of the importance or triviality of effects prior to conducting research.

It has become increasingly common in the sciences to accompany hypotheses with statements about expected effect sizes. This may happen explicitly in the hypotheses themselves or, perhaps more commonly, in discussions of power analyses or the importance of an effect. Determining these effect sizes, however, is a challenging task. In a recent article, Van Tilburg and Van Tilburg (2023) offer a useful approach for calculating the mathematical boundaries of an effect size while formulating hypotheses.

Imagine two researchers interested in the relation between nostalgia, positive affect, and negative affect. The researchers know, based on existing theory, that feelings of nostalgia are accompanied by a mix of positive and negative affect. As such, they both predict a moderate-to-large correlation between nostalgia and affect. They also know, based on a meta-analysis (Busseri, 2018), that positive and negative affect are inversely correlated in the population, at around r = -0.59.

The two correlation matrices below (Hypothesis Groups 1 and 2) show the correlations among the three variables that each of these researchers hypothesizes.

Table 1. Hypothesized correlations between nostalgia, positive affect, and negative affect for Hypothesis Groups 1 and 2.

As we can see, Researcher 1 predicts a slightly higher correlation between nostalgia and affect than Researcher 2. It turns out that the first researcher's hypothesis (Group 1) is mathematically impossible.

Hypothesis Group 2, while possible in isolation, enters the realm of impossibility once we consider other unmeasured correlates of nostalgia in the population. Let’s take a few steps back…

Boundaries of Effect Size Estimates

The presence of a correlation between two variables sets limits on how these variables might relate to a third variable. We can think of this limit in two ways: (1) geometric representations of the variables, and (2) limits on multiple correlations and variance explained.

(1) Impossible hypotheses have impossible geometry.

We can intuitively think of correlation coefficients as representing the cosines of the angles between variables' axes. For example, if two variables are completely uncorrelated (r = 0.0), we can think of them as a set of perpendicular axes (cos 90° = 0). If two variables are perfectly correlated (e.g., r = +1.0), they can be represented with a coaxial arrangement (i.e., they share a common axis; cos 0° = +1). The larger a correlation coefficient gets, the smaller the angle becomes between the corresponding variable axes.

Now, if we add a third variable to the mix, we will impose restrictions on the angles between the three axes. Let's look at an example:

Figure 1a

Figure 1a represents three uncorrelated variables. The angle between each pair of variable axes is 90 degrees.

If we change the correlation between one of the variables (X3) and the other two to a moderate correlation of r = 0.35, the X3 axis moves closer to the other two axes, forming an angle of approximately 70 degrees with each (Figure 1b).

Figure 1b

As we keep increasing the correlation between X3 and the other two variables, the X3 axis eventually reaches its limit by touching the X1-X2 plane (Figure 1c). At this point, the X3 axis forms a 45-degree angle with each of the other two axes.

Figure 1c

Accordingly, r = cos 45° ≈ 0.71 is the largest positive correlation possible between X3 and the other two (uncorrelated) variables.
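
To make this concrete, here is a small numerical sketch (a Python/NumPy illustration of my own, not part of Van Tilburg and Van Tilburg's article). It converts correlations to angles and then checks, via the smallest eigenvalue of the correlation matrix, that values just above 0.71 are impossible when X1 and X2 are uncorrelated. The eigenvalue check works because a set of correlations can exist in a population only if the full correlation matrix is positive semi-definite.

```python
import numpy as np

# A correlation corresponds to the cosine of the angle between two variable axes.
for r in [0.0, 0.35, 0.71]:
    print(f"r = {r:.2f}  ->  angle = {np.degrees(np.arccos(r)):.1f} degrees")

# Here X1 and X2 are uncorrelated, and X3 correlates r with both of them.
# The matrix is possible only if its smallest eigenvalue is not negative.
def smallest_eigenvalue(r):
    R = np.array([[1.0, 0.0, r],
                  [0.0, 1.0, r],
                  [r,   r,   1.0]])
    return np.linalg.eigvalsh(R).min()

print(smallest_eigenvalue(0.70))  # slightly positive: still possible
print(smallest_eigenvalue(0.72))  # negative: beyond the ~0.707 ceiling, impossible
```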

This geometric approach is an intuitive way to demonstrate that the presence of a correlation between two variables sets limits on how these variables might relate to a third variable.

A second approach, using the familiar concept of R-squared, can lead to a simple determination of the mathematical limits of effect sizes.

(2) Impossible hypotheses exceed the R-squared ≤ 1 boundary.

The maximum proportion of variance that two variables have in common (i.e., R-squared) is 100%. This is why the absolute correlation between two variables cannot exceed 1. We can apply this "R-squared ≤ 1" rule to cases where more than two variables are correlated. For example, if we have three correlated variables, we know that the proportion of variance in each variable accounted for by the other two variables cannot exceed 100%.

Given three variables (X1, X2, X3), we can calculate the proportion of variance in one variable (X1) accounted for by the other two variables (X2 and X3) jointly using this equation:

Equation 1:

R^2_{1.23} = (r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}) / (1 - r_{23}^2)

Going back to the specific case of the impossible correlation coefficients between nostalgia, positive affect, and negative affect (Table 1): if we use the equation above to calculate the squared multiple correlation for each of the three variables in Hypothesis Group 1, we find that the R-squared value for nostalgia is 122%, and the R-squared for both positive and negative affect is 119%. This is why Hypothesis Group 1 is mathematically impossible.
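
As a quick check, here is a short Python sketch of my own (not code from the original article) that applies Equation 1 to Hypothesis Group 1, using r = .50 for nostalgia with each affect variable and r = -.59 between positive and negative affect:

```python
# Squared multiple correlation of X1 predicted jointly by X2 and X3 (Equation 1).
def r2_joint(r12, r13, r23):
    return (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)

# Hypothesis Group 1: r(nostalgia, PA) = .50, r(nostalgia, NA) = .50, r(PA, NA) = -.59
print(r2_joint(0.50, 0.50, -0.59))   # nostalgia ~ PA + NA -> about 1.22 (122%)
print(r2_joint(0.50, -0.59, 0.50))   # PA ~ nostalgia + NA -> about 1.19 (119%)
print(r2_joint(0.50, -0.59, 0.50))   # NA ~ nostalgia + PA -> about 1.19 (119%)
```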

The R-squared values for Hypothesis Group 2, however, are 94%, 96%, and 96%. These are extremely high, but not mathematically impossible.

Determining the Minimum and Maximum Possible Correlations

Building on Equation 1, and given that R-squared ≤ 1, we can calculate the minimum and maximum possible values of correlation coefficients between three variables using these equations:

Equation 2:

r_{13}^{min} = r_{12} r_{23} - \sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}
r_{13}^{max} = r_{12} r_{23} + \sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}

For example, following the predictions in Hypothesis Group 1 above, if the correlation between nostalgia and positive affect is r = 0.50, and the correlation between positive and negative affect is r = -0.59, the correlation between nostalgia and negative affect must be somewhere between -0.99 and 0.40. Positive correlations larger than 0.40 are impossible.
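
Here is a brief Python sketch of that calculation (my own illustration of Equation 2, not code from the original article):

```python
import math

# Minimum and maximum possible r(X1, X3), given r(X1, X2) and r(X2, X3) (Equation 2).
def correlation_bounds(r12, r23):
    center = r12 * r23
    half_width = math.sqrt((1 - r12**2) * (1 - r23**2))
    return center - half_width, center + half_width

# r(nostalgia, positive affect) = .50 and r(positive affect, negative affect) = -.59
low, high = correlation_bounds(0.50, -0.59)
print(low, high)   # about -0.99 and 0.40
```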

Note that this approach extends to larger correlation matrices (e.g., when we hypothesize about 5 or 10 variables). It also applies to hypotheses about group differences (as opposed to correlations), where effect sizes are commonly reported using Cohen's d. The Hypothesis Evaluation Tool is a useful Shiny App where you can check whether your hypothesized correlation matrix is possible.
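
For matrices with more than three variables, the pairwise bounds above are no longer sufficient on their own; a general check is to test whether the entire hypothesized correlation matrix is positive semi-definite (all eigenvalues non-negative). Below is a minimal Python sketch of such a check (my own illustration, not the Shiny App's code):

```python
import numpy as np

def is_possible(R, tol=1e-10):
    """A correlation matrix can describe a population only if it is symmetric,
    has a unit diagonal, and is positive semi-definite."""
    R = np.asarray(R, dtype=float)
    assert np.allclose(R, R.T) and np.allclose(np.diag(R), 1.0)
    return np.linalg.eigvalsh(R).min() >= -tol

# Hypothesis Group 1 (nostalgia, positive affect, negative affect)
group1 = [[1.00,  0.50,  0.50],
          [0.50,  1.00, -0.59],
          [0.50, -0.59,  1.00]]
print(is_possible(group1))  # False: no population can have these correlations
```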

Hypotheses that are Impossible within a Multivariate Population

I mentioned before that Hypothesis Group 2, though mathematically possible in isolation, becomes impossible once we take into account the other correlates of nostalgia in the literature.

Many hypotheses are about two or three "key variables." This does not mean that only these variables constrain the effect size boundaries. Researchers should consider the role of unmeasured variables in the multivariate population in restricting effect sizes for the key variables of a study.

Hypotheses are statements about the relation between some variables in the population. Beyond the key variables in a study, there are likely many other (unmeasured) variables that influence an outcome variable. Since these other variables explain part of the variance in the outcome, they can add to the mathematical restrictions of the hypotheses about the key variables.

As an example, let's go back to Hypothesis Group 2. Past research suggests that nostalgia is correlated with loneliness (r = 0.14), which in turn is correlated with negative affect (r = 0.47) and positive affect (r = -0.56). Based on this knowledge, we can extend the correlation matrix for Hypothesis Group 2 by adding loneliness (an unmeasured variable):

Table 2. Hypothesized correlations for Hypothesis Group 2, extended with loneliness as an unmeasured correlate.

If we use the logic of Equation 1, extended to three predictors per outcome, to calculate the R-squared values for these four variables, we see that the percentage of variance accounted for exceeds 100% for every variable. Effectively, when we take into account the relation between an unmeasured variable (i.e., loneliness) and the three measured ones, the hypothesized effect sizes become impossible.
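
To illustrate with code (a sketch of my own rather than the article's exact numbers): the text does not restate the Group 2 correlations between nostalgia and the two affect variables, so I assume r = .44 for each, values consistent with the 94-96% R-squared figures reported earlier, combined with the loneliness correlations given above. Under these assumed values, the four-variable matrix fails the possibility check and every variable's squared multiple correlation exceeds 100%:

```python
import numpy as np

# Order: nostalgia, positive affect, negative affect, loneliness.
# The .44 entries are assumed for illustration; the others come from the text.
R = np.array([[1.00,  0.44,  0.44,  0.14],
              [0.44,  1.00, -0.59, -0.56],
              [0.44, -0.59,  1.00,  0.47],
              [0.14, -0.56,  0.47,  1.00]])

print(np.linalg.eigvalsh(R).min())   # negative -> this matrix is impossible

# Squared multiple correlation of each variable on the other three:
# R^2_i = 1 - 1 / (R^-1)_ii. With these values, all four exceed 1 (i.e., > 100%).
r_squared = 1 - 1 / np.diag(np.linalg.inv(R))
print(np.round(r_squared, 2))
```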

This is, in my opinion, one of the most consequential aspects of the mathematical boundaries of effect sizes.

Practical Implications

The insights about hypothesis boundaries have direct implications for researchers:

  • Before finalizing a hypothesis, conducting a power analysis, or evaluating the practical importance of variables based on their hypothesized effect sizes, researchers should check whether their hypotheses are within the realm of possibility.
  • Even when the hypotheses are technically possible, researchers should carefully examine how close their effect sizes are to the boundaries of mathematical impossibility. Given the role of unmeasured variables in the multivariate population in restricting effect sizes for the key (measured) variables in a study, it is probably wiser to avoid hypothesizing effect sizes that sit too close to the mathematical limits of possibility.
  • The impact that "unmeasured variables" can have on the (im)possibility of hypotheses highlights the importance of conducting a thorough literature review. To avoid proposing impossible hypotheses, and to the extent that past research is trustworthy (i.e., not affected by questionable research practices or issues such as small sample size) and generalizable (in terms of population and methodology), researchers should incorporate all relevant variables in the multivariate population when formulating their hypotheses. This is a potentially profound paradigm shift in understanding variability (especially in the social sciences). Even when we are interested in the relation between two variables, it is critical to keep in mind that the actual "vectors of the mind" extend well beyond the limited set of key variables we are studying.



Ian Tingen

Behavioral Research & Design Expert

Fantastic article, Pooya! Using this thinking "in reverse" has been useful for me in looking at archival data / associations / assumptions, as well. For instance, I've seen data where a construct (an axis) isn't defined / deployed 1:1 across multiple experiments. Those differences in operationalization and implementation can lead to wacky results! TL;DR: when you see impossible relationships, a quick peek at construct validity and deployment can be SUPER informative, too
