When Linear Models Don’t Fit Your Data, Now What?
Karen Grace-Martin
Statistical Consultant, Trainer, and Mentor for Researchers at The Analysis Factor
When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, linear models don’t fit. The data just will not meet the assumptions of linear models. But there’s good news: other models exist for many types of dependent variables.
Today I’m going to go into more detail about 6 common types of dependent variables that are either discrete, bounded, or measured on a nominal or ordinal scale and the tests that work for them instead. Some are all of these.
Distributional Assumptions in Linear Models
Let’s take a moment to review the assumptions that will fail here.
Two key assumptions about the errors in linear models are that they all come from the same normal distribution and that they have a constant variance. There are more assumptions, but those are the ones we’re focusing on here.
These errors are in the population, but we estimate them with the sample residuals to check the feasibility of these assumptions.
There are many data sets with variables that could theoretically follow these assumptions, but don’t. Here we’re talking about dependent variables that just won’t ever give you the residual distribution linear models need. So you can try fitting a linear model and then testing the assumptions, but it will pretty much always fail.
The usual advice is to do one of two things. One is to transform your dependent variable. And that can definitely work in some situations. But not for these variables.
The other is to use nonparametric tests when normality assumptions fail. That works when you’re doing something simple, like a correlation or comparing group means. But if you’re including covariates or interactions in a model, you need a real model.
Categorical Dependent Variables
Both binary (2 values) and multicategory (3 or more values) variables clearly fail all three criteria. But there are other types of regression models that work just fine for these variables.
For binary variables, probit and logistic regression models are the most common. For multicategorical variables, use multinomial logistic regression.
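To see why a logistic model suits a binary outcome, here is a minimal sketch of the inverse logit function it is built on. The intercept and slope are made-up values for illustration only, not estimates from any data. The point is that the linear predictor can be any number, but the predicted probability always stays between 0 and 1.

```python
import math

def inv_logit(eta):
    """Map a linear predictor (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -2.0, 0.8

for x in (-5, 0, 5, 10):
    eta = intercept + slope * x   # linear predictor: unbounded
    p = inv_logit(eta)            # predicted probability: bounded
    print(f"x={x:>3}  linear predictor={eta:6.2f}  probability={p:.4f}")
```

A straight line fit to the same 0/1 outcome would eventually predict values below 0 or above 1.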
Ordinal Variables
These variables are made up of ordered categories. They include rank and Likert-item variables, although they are not limited to these.
Although ordinal variables look like numbers, the distances between their values aren’t equal in a true numerical sense. So it doesn’t make sense to apply numerical operations like addition and division to them. Hence means, the basis of linear models, don’t really compute.
Like unordered categorical variables, ordinal variables require specialized logistic or probit models, such as the proportional odds model. There are a few other types of ordinal models, but the proportional odds model is most commonly available.
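As a rough sketch of how the proportional odds model works: it uses one slope for the predictor and a set of ordered cutpoints, and models the cumulative probability of being at or below each category. The function name, slope, and cutpoints below are hypothetical values for illustration, and the sign convention (cutpoint minus slope times x) is one common choice; software packages differ on it.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def proportional_odds_probs(x, slope, cutpoints):
    """Category probabilities for an ordinal outcome.

    Assumes P(Y <= k) = inv_logit(cutpoint_k - slope * x) and that
    cutpoints are in increasing order.
    """
    cumulative = [inv_logit(c - slope * x) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

# Hypothetical 4-category item (e.g. "never" to "often") with 3 cutpoints.
cuts = [-1.0, 0.5, 2.0]
for x in (0.0, 2.0):
    print(x, [round(p, 3) for p in proportional_odds_probs(x, 1.0, cuts)])
```

Notice that nothing here treats the category labels as numbers with equal spacing. Only their order matters.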
Count Variables
Discrete counts fail the assumptions of linear models for many reasons. The most obvious is that the normal distribution of linear models allows any value on the number scale, but counts are bounded at 0. It just doesn’t make sense to predict negative numbers of cigarettes smoked each day, children in a family, or aggressive incidents.
But Poisson regression, or related models like negative binomial, are designed to accurately model count data.
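Here is a minimal sketch of a Poisson regression with a log link and a single predictor, fit by Newton-Raphson using only the standard library. The data are made up for illustration, and in practice you would use your statistical software's GLM routine rather than hand-rolled code like this. Because the model works on the log of the mean, the predicted counts can never go negative.

```python
import math

def fit_poisson(x, y, n_iter=50):
    """Newton-Raphson for Poisson regression: log(mu) = b0 + b1 * x."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Made-up counts for illustration only.
x = [0, 1, 2, 3, 4, 5]
y = [1, 1, 3, 4, 8, 13]
b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(round(b0, 3), round(b1, 3))
print([round(m, 2) for m in fitted])  # every fitted mean is positive
```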
Zero Inflated Variables
Zero-inflated data have a spike in the distribution at 0.
They are common in Poisson data, but can occur with any distribution. A recent example I saw was scores on a depression scale. The scale ran from 0 to 20, and 0 was by far the most common value (which is a good thing for the state of humanity, but it really messes up the linear model assumptions).
Even if the rest of the distribution is normal, you can’t transform zero-inflated data to look normal. A zero-inflated model, however, incorporates the high number of zeros by simultaneously modeling the excess zeros with a logistic regression and the remaining values with another distribution, such as a Poisson. It’s pretty cool, actually.
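To make the idea concrete, here is a small sketch of the probability function for a zero-inflated Poisson, one common member of this family. The mixing probability and Poisson mean are made-up values. In a full model each of them would get its own regression equation, which this sketch leaves out.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson probability of observing the count k.

    With probability pi_zero the observation is an "extra" zero;
    otherwise it is an ordinary draw from Poisson(lam), which can
    also be zero.
    """
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

# Hypothetical values: 40% extra zeros, Poisson mean of 3.
for k in range(5):
    print(k, round(poisson_pmf(k, 3.0), 4), round(zip_pmf(k, 0.4, 3.0), 4))
```

Comparing the two columns shows the spike at 0 that an ordinary Poisson cannot reproduce.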
Censored Variables
Censored data have full information about the values of the DV only for some values. The distribution gets cut off for some values, often at the end of the distribution.
Examples include surveys that have exact income information for everyone up to $200k, but beyond that, everyone is just given “over $200k.” In surveys, this is done for privacy issues: there just aren’t many people with such high incomes.
But sometimes it’s just a measurement issue. Tobit regression models are designed to handle the imprecise measurements on some parts of the scale.
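Here is a sketch of the log-likelihood behind a tobit model for data censored at an upper cap, like the income example. Values below the cap contribute the usual normal density. Values at the cap contribute only the probability that the true value is at or above the cap. The function names, the normal-errors setup, and the numbers are assumptions for illustration; a real analysis would maximize this over the regression coefficients and sigma using your software's tobit routine.

```python
import math

def norm_logpdf(z):
    """Log density of a standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def norm_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def tobit_loglik(y, mu, sigma, cap):
    """Log-likelihood for right-censored normal data.

    y     : observed values, recorded as `cap` when censored
    mu    : model-predicted means, one per observation
    sigma : error standard deviation
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi >= cap:
            # Censored: we only know the true value is at least `cap`.
            total += math.log(norm_sf((cap - mi) / sigma))
        else:
            # Fully observed: ordinary normal log density.
            total += norm_logpdf((yi - mi) / sigma) - math.log(sigma)
    return total

# Hypothetical incomes in $1000s, top-coded at 200.
incomes = [45.0, 80.0, 120.0, 200.0, 200.0]
print(tobit_loglik(incomes, [100.0] * 5, 60.0, 200.0))
```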
Proportions
Proportions, bounded at 0 and 1, or percentages, bounded at 0 and 100, really become problematic if much of the data are close to the bounds.
If all the data fall in the middle portion, say in the .2 to .8 range, a linear model can give reasonably good results. But beyond that, you need to use either a beta regression, if the proportion is continuous, or a logistic regression, if the proportion measures discrete events with a certain outcome (for example, the proportion of questions answered correctly).
Generalized Linear Models
So the next time linear models don’t fit your data, consider a different type of model.
Most of the models I’ve described here fit into the family of regression models called Generalized Linear Models. If you ever work with any of the variables described here, it’s worth learning them.
PhD Candidate at The Chinese University of Hong Kong
1 month ago: This post is a reminder that the current dominant paradigm of teaching a "test framework" of statistics, which boxes people into a dizzying number of seemingly unrelated techniques, is vastly inferior to the "model framework" of statistics, which is unified and has some form of sensibility.
Behavioral Science Research Methods | Applied Statistics | Social, personality, and clinical psychology research
1 month ago: Your articles are always so well written and hit a lot of good points. Unless I'm coerced, I will not transform the data (given the dependent variable is calculated already). I prefer to lean on the appropriate option you laid out. I remember learning about the different parameters estimated by these different statistical distributions (there are many that are interconnected or special cases of one another). But in practice I only resort to a few options, depending on which model is appropriate. For example, negative binomial regression can be thought of as a Poisson regression with the (over)dispersion parameter freely estimated. Poisson regression fixes the dispersion, i.e. the scale parameter, to 1. To determine whether to use negative binomial or Poisson regression, one can use Wald or likelihood ratio tests on the scale parameter. One can also note the magnitude of the deviation of the estimated value from the fixed value. If the scale parameter is significantly different from the fixed value (i.e., 1), then negative binomial might be the most appropriate statistical model. Then, in likelihood terms, we can use the statistical model and the scientific model to test theory, explore data, or predict. Anyway, thank you again!
Adjunct torturer (I teach math and stats) and push boundaries that should never be.
1 month ago: I asked my students what to do at various stages of solving math problems. Crying is always an option. Afterwards, they need to try a new approach. (Sadly, a lot of the female students think I'm making some sexist comment. In private, several male students admit they wanted to cry and rage with exuberant anger...)