Overfitting in Regression Models
Karen Grace-Martin
Statistical Consultant, Trainer, and Mentor for Researchers at The Analysis Factor
The practice of choosing predictors for a regression model, called model building, is an area of real craft.
There are many possible strategies and approaches, and they all work well in some situations. Every one of them requires making a lot of decisions along the way. As you make decisions, one danger to look out for is overfitting—creating a model that is too complex for the data.
What Overfitting Looks Like
Overfitting can sneak up on you. When it occurs, everything looks great. You have strong model fit statistics. You have large coefficients, with small p-values.
An overfit model appears to predict well with the existing sample of data. But unfortunately, it doesn’t reflect the population.
Regression coefficients are too large.
An overfit model understates standard errors, so confidence intervals are too narrow and p-values are too small.
It does not fit future observations.
And most importantly, it does not replicate.
It’s like custom-tailoring a suit to a tall, thin individual with especially narrow shoulders and long arms, then hoping to sell it off the rack to the general public.
It will look great on the person it was made for, but it won’t fit almost anyone else.
A visual example of overfitting in regression
Below we see two scatter plots with the same data. I’ve chosen this to be a bit of an extreme example, just so you can visualize it.
On the left is a linear model for these points; on the right is a model that fits the data almost perfectly. The model on the right uses many more regression parameters and is overfit.
You can see why the model on the right looks great for this data set. But the only way it could work with another sample is if the points fell in nearly the exact same places. It’s too customized to the data in this sample.
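The two panels above can be sketched numerically. This is a minimal NumPy illustration (not the article's actual figure data): fitting the same ten noisy points with a 2-parameter line versus a 10-parameter polynomial that passes through every point.

```python
import numpy as np

rng = np.random.default_rng(42)

# Ten noisy points from a simple linear relationship.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.3, size=x.size)

# Left panel: a straight line (2 parameters).
linear_fit = np.polyfit(x, y, deg=1)
# Right panel: a degree-9 polynomial (10 parameters) --
# one parameter per observation, so it interpolates every point.
overfit = np.polyfit(x, y, deg=9)

def train_mse(coefs):
    """Mean squared error on the same sample the model was fit to."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

print(f"linear train MSE:   {train_mse(linear_fit):.4f}")
print(f"degree-9 train MSE: {train_mse(overfit):.2e}")
# The degree-9 fit looks "perfect" on this sample, but its wiggles
# are tailored to this particular noise and won't recur in new data.
```

The in-sample error of the overfit model is essentially zero, which is exactly the seductive part: every fit statistic computed on this sample will look wonderful.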
How and when does overfitting occur?
Models with multiplicative effects
Multiplicative effects include polynomial terms (quadratic, cubic, etc.) and interactions. These terms aren’t generally bad, but they have the potential to add a lot of complexity to a model.
Models with many predictor variables
This is especially true for small data sets. Harrell describes a rule of thumb for avoiding overfitting: a minimum of 10 observations per regression parameter in the model. Remember that each numerical predictor in the model adds one parameter. Each categorical predictor adds k-1 parameters, where k is the number of categories.
So a small data set of 50 observations (much larger than my example above!) should have at most 5 parameters.
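The counting in the rule of thumb is easy to get wrong for categorical predictors, so here is a small sketch of the arithmetic. The function names are mine, not from any library, and the count covers predictor parameters only (the intercept is ignored, matching the counting above).

```python
def count_parameters(n_numeric, category_sizes):
    """Parameters contributed by predictors: one per numeric
    predictor, k - 1 per categorical predictor with k levels.
    (Intercept not counted, following the article's convention.)"""
    return n_numeric + sum(k - 1 for k in category_sizes)

def max_parameters(n_obs, per_param=10):
    """Harrell-style rule of thumb: at least 10 observations
    per regression parameter in the model."""
    return n_obs // per_param

# Example: 2 numeric predictors plus one 4-level categorical predictor
p = count_parameters(2, [4])      # 2 + (4 - 1) = 5 parameters
budget = max_parameters(50)       # 50 observations / 10 = 5
print(p, budget)
```

So two numeric predictors and a single 4-level factor already use up the entire parameter budget of a 50-observation data set.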
Automated models
There are a number of automated model selection techniques. You may have heard them called stepwise regression, or forward or backward selection. They’re great for quickly identifying the most predictive variables.
Unfortunately, they’re also prone to overfitting. Newer selection techniques, like the LASSO, have been developed specifically to avoid overfitting.
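To make the LASSO idea concrete, here is a minimal coordinate-descent sketch in plain NumPy. This is a toy implementation under simplifying assumptions (roughly standardized predictors, no intercept); in practice you would use a packaged implementation such as scikit-learn's Lasso or R's glmnet.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Soft-thresholding: shrinks toward zero, and sets small
    coefficients exactly to zero -- this is what does the selection."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Coordinate descent for the LASSO objective
    (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual with predictor j removed
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n)
    return b

# Ten candidate predictors, but only the first two truly matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

b = lasso_cd(X, y, alpha=0.8)
print(np.round(b, 2))  # most noise coefficients come out exactly zero
```

Unlike stepwise selection, the penalty zeroes out weak predictors and shrinks the surviving coefficients, which counteracts the too-large coefficients that overfitting produces.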
Ways to avoid overfitting in regression
Use large samples whenever possible. If you have a small sample, you’ll be limited in the number of predictors you can include.
Validate the model. There are many ways to do this, such as testing it on another sample or through resampling techniques such as bootstrapping and jackknifing.
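The simplest version of validation is a holdout split: fit on one half of the data, evaluate on the other. A minimal NumPy sketch of that idea, with simulated data of my own (many candidate predictors, only one real signal):

```python
import numpy as np

rng = np.random.default_rng(1)

# 80 observations, 20 candidate predictors, only one of which matters.
X = rng.normal(size=(80, 20))
y = X[:, 0] + rng.normal(size=80)

# Split into a training half and a validation half.
X_train, X_valid = X[:40], X[40:]
y_train, y_valid = y[:40], y[40:]

# Fit ordinary least squares on the training half only.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

mse_train = np.mean((X_train @ beta - y_train) ** 2)
mse_valid = np.mean((X_valid @ beta - y_valid) ** 2)
print(f"train MSE: {mse_train:.2f}  validation MSE: {mse_valid:.2f}")
# A large gap between the two is the signature of overfitting.
```

Resampling methods like the bootstrap and jackknife generalize this idea by repeating the fit-and-evaluate cycle over many resampled versions of the data instead of a single split.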
Use scientific theory in model selection. Even models whose sole purpose is prediction benefit from thoughtful reflection about which variables should be included in a model.
A great reference if you want to learn more is Harrell, F. E. (2015). Regression Modeling Strategies (2nd ed.). Springer.
Chief Strategy & Measurement Officer
1w · I think overfitting is quite like p-hacking, but using R² or adjusted R² instead. They are both great examples of Goodhart's Law: once a metric becomes a goal, it is no longer useful as a metric. In business, unlike in academia, the point of analytics or modelling is to find something interesting that your competitors have not identified, rather than a more ‘accurate’ model. That means a lower-R² model could actually be more valuable than a higher-R² model built on more obvious variables.
Adjunct torturer (I teach math and stats) and push boundaries that should never be.
1w · A way to help prevent overfitting of a model is to break the data into training and testing data. See which variables are "important." Repeat this many more times with different random seeds. After doing this, say, 10 times with 10 random seeds, you've eliminated most of the spurious correlations and some of the weaker variables. This leads to a model that doesn't include "all" the important variables, but it eliminates most of the non-important ones. Thus you have a more general model. Then we have to have the discussion: "Would you rather have a generalizable model that works well most of the time, or an overfit model?"
Adjunct torturer (I teach math and stats) and push boundaries that should never be.
1w · I wish I had the presentation recorded... (almost). I did a presentation on optimizing regression models. The main idea was that overfit models lead to worse optimal solutions, which means that a lot of the ideas we have in stats about how to handle a model might not be that great. Eliminate all the "non-significant" terms, even when stats says to keep one because it's part of a quadratic or interaction term.
Thanks for sharing, Karen Grace-Martin, it's amazing
Skilled Biostatistician & Epidemiologist Mentor, Experienced, Knowledgeable & Passionate Instructor - Preclinical & Clinical Research, Genomics & Aeronautics + 30 yrs experience.
1w · Dear Karen Grace-Martin, I would like to take a minute to thank you for posting interesting info, not the copied-and-pasted, misleading info we see so often on LinkedIn.