When Linear Models Don’t Fit Your Data, Now What?

Karen Grace-Martin

Statistical Consultant, Trainer, and Mentor for Researchers at The Analysis Factor

Published: October 23, 2024

When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, linear models don’t fit. The data just will not meet the assumptions of linear models. But there’s good news: other models exist for many types of dependent variables.

Today I’m going to go into more detail about six common types of dependent variables that are discrete, bounded, or measured on a nominal or ordinal scale, and the models that work for them instead. Some are all of these.

Distributional Assumptions in Linear Models

Let’s take a moment to review the assumptions that will fail here.

Two key assumptions about the errors in linear models are that they all come from the same normal distribution and that they have a constant variance. There are more assumptions, but those are the ones we’re focusing on here.

These errors are in the population, but we estimate them with the sample residuals to check the feasibility of these assumptions.

There are many data sets with variables that could theoretically follow these assumptions, but don’t. Here we’re talking about dependent variables that just won’t ever give you the residual distribution linear models need. So you can try fitting a linear model and then testing the assumptions, but it will pretty much always fail.
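To make this concrete, here is a minimal sketch using a binary dependent variable. The values of `x` and `y` are made up for illustration. It fits an ordinary least squares line by hand and prints the residuals, which can take only two possible values at any given `x`.

```python
# Made-up data (an assumption for illustration): a binary dependent
# variable y and a single predictor x.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 0, 1, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares slope and intercept for one predictor.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# At any given x the residual is either (0 - fitted) or (1 - fitted),
# so the residuals cannot all come from one normal distribution.
for xi, fi, ri in zip(x, fitted, residuals):
    print(f"x={xi}  fitted={fi:.3f}  residual={ri:+.3f}")

# The straight line also leaves the 0-1 range once x is large enough.
prediction_at_10 = intercept + slope * 10
print(f"prediction at x=10: {prediction_at_10:.3f}")
```

With these made-up values the fitted line stays inside 0 to 1 over the observed range, but the prediction at `x = 10` is above 1, which is not a possible value for a binary outcome.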

The usual advice is to do one of two things. One is to transform your dependent variable. And that can definitely work in some situations. But not for these variables.

The other is to use nonparametric tests when normality assumptions fail. That works when you’re doing something simple, like a correlation or comparing group means. But if you’re including covariates or interactions in a model, you need a real model.

Categorical Dependent Variables

Both binary (2 values) and multicategory (3 or more values) variables clearly fail all three criteria. But there are other types of regression models that work just fine for these variables.

For binary variables, probit and logistic regression models are the most common. For multicategory variables, use multinomial logistic regression.
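As a rough sketch of why logistic regression suits a binary outcome, the code below passes a linear predictor through the inverse logit function. The coefficients are hypothetical, chosen only for illustration and not estimated from any data. A probit model works the same way but uses the normal cumulative distribution function in place of the inverse logit.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients (an assumption for illustration only).
intercept, slope = -2.0, 0.6

for x in [-5, 0, 5, 10]:
    eta = intercept + slope * x   # unbounded, like a linear model's prediction
    p = inverse_logit(eta)        # always strictly between 0 and 1
    print(f"x={x:>3}  linear predictor={eta:+.2f}  P(y=1)={p:.3f}")
```

The linear predictor can be any number, but the predicted probability never leaves the 0 to 1 range.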

Ordinal Variables

These variables are made up of ordered categories. They include rank and Likert-item variables, although they are not limited to these.

Although ordinal variables look like numbers, the distances between their values aren’t equal in a true numerical sense. So it doesn’t make sense to apply numerical operations like addition and division to them. Hence means, the basis of linear models, don’t really compute.

Like unordered categorical variables, ordinal variables require specialized logistic or probit models, such as the proportional odds model. There are a few other types of ordinal models, but the proportional odds model is most commonly available.
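Here is a small sketch of how a proportional odds model turns one linear predictor into a probability for each ordered category. The cutpoints and predictor values are hypothetical. The sign convention used here, P(Y <= j) = logistic(cutpoint_j - eta), is one common choice, and some software reverses the sign.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(eta, cutpoints):
    """Proportional odds model: P(Y <= j) = logistic(cutpoint_j - eta).

    The same eta (and so the same slopes) applies at every cutpoint;
    only the cutpoints differ. Cutpoints must be in increasing order.
    """
    cumulative = [logistic_cdf(c - eta) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for lower, upper in zip(cumulative, cumulative[1:]):
        probs.append(upper - lower)
    return probs

# Hypothetical cutpoints for a 4-category ordinal outcome.
cutpoints = [-1.0, 0.5, 2.0]

for eta in [0.0, 1.5]:
    probs = category_probabilities(eta, cutpoints)
    print(f"eta={eta}: " + ", ".join(f"{p:.3f}" for p in probs))
```

A larger linear predictor shifts probability toward the higher categories without ever treating the category labels as equally spaced numbers.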


Count Variables

Discrete counts fail the assumptions of linear models for many reasons. The most obvious is that the normal distribution of linear models allows any value on the number scale, but counts are bounded at 0. It just doesn’t make sense to predict negative numbers of cigarettes smoked each day, children in a family, or aggressive incidents.

But Poisson regression, or related models like negative binomial, are designed to accurately model count data.
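The sketch below fits a one-predictor Poisson regression with a log link to made-up counts, using plain Newton-Raphson steps. This is only an illustration of the idea. Statistical software typically uses a closely related algorithm and reports standard errors and fit statistics as well.

```python
import math

# Made-up count data (an assumption for illustration).
x = [0, 1, 2, 3, 4]
y = [1, 2, 4, 7, 12]

# Start from an intercept-only model: log of the mean count, zero slope.
b0, b1 = math.log(sum(y) / len(y)), 0.0

for _ in range(25):
    # Log link: the predicted mean exp(b0 + b1*x) is always positive.
    mu = [math.exp(b0 + b1 * xi) for xi in x]

    # Score: gradient of the Poisson log-likelihood.
    g0 = sum(yi - mi for yi, mi in zip(y, mu))
    g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))

    # 2x2 Fisher information, inverted by hand for the Newton step.
    i00 = sum(mu)
    i01 = sum(mi * xi for xi, mi in zip(x, mu))
    i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
    det = i00 * i11 - i01 * i01
    b0 += (i11 * g0 - i01 * g1) / det
    b1 += (i00 * g1 - i01 * g0) / det

mu = [math.exp(b0 + b1 * xi) for xi in x]
print(f"intercept={b0:.3f}, slope={b1:.3f}")
print("fitted means:", [round(m, 2) for m in mu])
```

Because the model works on the log scale, no value of the predictor can produce a negative predicted count.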

Zero Inflated Variables

Zero Inflated data have a spike in the distribution at 0.

They are common in Poisson data, but can occur with any distribution. A recent example I saw was a set of scores on a depression scale. The scale ran from 0 to 20, and 0 was by far the most common value (which is a good thing for the state of humanity, but it really messes up the linear model assumptions).

Even if the rest of the distribution is normal, you can’t transform zero inflated data to look normal. A Zero-Inflated model, however, incorporates the high number of zeros by simultaneously modeling 0/Not 0 as a logistic regression and all the Not 0 values as another distribution. It’s pretty cool, actually.
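As a rough illustration, here is the probability function of a zero-inflated Poisson distribution, written as the usual two-part mixture. The parameter values are hypothetical. In a full model, `pi_zero` would come from a logistic regression and `lam` from a Poisson regression, each with its own predictors. Note that in this standard formulation the Poisson part can also produce zeros. The closely related hurdle model instead puts every zero in the binary part.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: with probability pi_zero the value is an
    'extra' zero; otherwise it comes from a Poisson(lam) distribution,
    which can itself produce zeros."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

# Hypothetical parameter values (an assumption for illustration).
lam, pi_zero = 3.0, 0.4

for k in range(6):
    print(f"k={k}  Poisson={poisson_pmf(k, lam):.3f}  "
          f"zero-inflated={zip_pmf(k, lam, pi_zero):.3f}")
```

The zero-inflated version shows the spike at 0 while keeping the Poisson shape for the remaining values.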

Censored Variables

Censored data have full information about the values of the DV only for some values. The distribution gets cut off for some values, often at the end of the distribution.

Examples include surveys that have exact income information for everyone up to $200k, but beyond that, everyone is just given “over $200k.” In surveys, this is done for privacy reasons: there just aren’t many people with such high incomes.

But sometimes it’s just a measurement issue. Tobit regression models are designed to handle the imprecise measurements on some parts of the scale.
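The sketch below shows the core idea behind a Tobit-style likelihood for data that are right-censored at a known limit, like the income example. Fully observed values contribute a normal density, while censored values contribute only the probability of being above the limit. For simplicity it uses a single mean `mu` rather than a regression equation, and the income figures are made up.

```python
import math

def normal_logpdf(y, mu, sigma):
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(values, censored, mu, sigma, limit):
    """Log-likelihood for normal data right-censored at `limit`."""
    total = 0.0
    for y, is_censored in zip(values, censored):
        if is_censored:
            # All we know is that the true value lies above the limit.
            total += math.log(1.0 - normal_cdf((limit - mu) / sigma))
        else:
            total += normal_logpdf(y, mu, sigma)
    return total

# Made-up incomes in $1000s (an assumption); the last two are "over 200".
incomes = [45, 80, 120, 200, 200]
is_censored = [False, False, False, True, True]

for mu in [100, 130, 160]:
    ll = tobit_loglik(incomes, is_censored, mu, sigma=60, limit=200)
    print(f"mu={mu}: log-likelihood={ll:.3f}")
```

Estimation software searches for the parameter values that maximize this kind of likelihood, so the censored cases inform the fit without being treated as if they were exactly 200.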

Proportions

Proportions, bounded at 0 and 1, or percentages, bounded at 0 and 100, really become problematic if much of the data are close to the bounds.

If all the data fall in the middle portion, say in the .2 to .8 range, a linear model can give reasonably good results. But beyond that, you need to use either a beta regression if the proportion is continuous or logistic regression if the proportion measures discrete events with a certain outcome (for example, the proportion of questions answered correctly).
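One way to see why the bounds matter is to look at the variance of a beta distribution under the mean and precision parameterization commonly used for beta regression. The sketch below assumes that parameterization, with a hypothetical precision value `phi`. The variance shrinks as the mean approaches 0 or 1, which conflicts with the constant variance assumption of a linear model.

```python
def beta_shape_parameters(mu, phi):
    """Convert a mean (0 < mu < 1) and precision (phi > 0) into the
    usual Beta(a, b) shape parameters."""
    return mu * phi, (1.0 - mu) * phi

def beta_variance(mu, phi):
    """Variance of a beta distribution with mean mu and precision phi."""
    return mu * (1.0 - mu) / (1.0 + phi)

phi = 10.0  # hypothetical precision (an assumption for illustration)

for mu in [0.05, 0.2, 0.5, 0.8, 0.95]:
    a, b = beta_shape_parameters(mu, phi)
    print(f"mean={mu:.2f}  a={a:.1f}  b={b:.1f}  "
          f"variance={beta_variance(mu, phi):.4f}")
```

Between .2 and .8 the variance changes relatively little, which fits the point that a linear model can do reasonably well in the middle of the range.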

Generalized Linear Models

So the next time linear models don’t fit your data, consider a different type of model.

Most of the models I’ve described here fit into the family of regression models called Generalized Linear Models. If you ever work with any of the variables described here, it’s worth learning them.

Shawn Hemelstrand

PhD Candidate at The Chinese University of Hong Kong

1 month ago

This post is a reminder that the current dominant paradigm of teaching a "test framework" of statistics, which boxes people into a dizzying number of seemingly unrelated techniques, is vastly inferior to the "model framework" of statistics, which is unified and has some form of sensibility.

Michael F. Wagner, Ph.D.

Behavioral Science Research Methods | Applied Statistics | Social, personality, and clinical psychology research

1 month ago

Your articles are always so well written and hit a lot of good points. Unless I'm coerced, I will not transform the data (given the dependent variable is calculated already). I prefer to lean on the appropriate option you laid out. I remember learning about the different parameters estimated by these different statistical distributions (there are many that are interconnected or special cases of one another). But in practice I only resort to a few options, depending on which model is appropriate. For example, negative binomial regression can be thought of as a Poisson regression with the (over)dispersion parameter freely estimated. Poisson regression fixes the dispersion, i.e. the scale parameter, to 1. To determine whether to use negative binomial or Poisson regression, one can use Wald or likelihood ratio tests on the scale parameter. One can also note the magnitude of the deviation of the estimated value from the fixed one. If the scale parameter is significantly different from the fixed value (i.e., 1), then negative binomial might be the most appropriate statistical model. Then, in likelihood terms, we can use the statistical model and the scientific model to test theory, explore data, or predict. Anyway, thank you again!

Andrew Ekstrom

Adjunct torturer (I teach math and stats) and push boundaries that should never be.

1 month ago

I asked my students what to do at various stages of solving math problems. Crying is always an option. Afterwards, they need to try a new approach. (Sadly, a lot of the female students think I'm making some sexist comment. In private, several male students admit they wanted to cry and rage with exuberant anger...)
