Overfitting in Regression Models
Karen Grace-Martin
Statistical Consultant, Trainer, and Mentor for Researchers at The Analysis Factor
The practice of choosing predictors for a regression model, called model building, is an area of real craft.
There are many possible strategies and approaches, and they all work well in some situations. Every one of them requires making a lot of decisions along the way. As you make decisions, one danger to look out for is overfitting—creating a model that is too complex for the data.
What Overfitting Looks Like
Overfitting can sneak up on you. When it occurs, everything looks great. You have strong model fit statistics. You have large coefficients, with small p-values.
An overfit model appears to predict well with the existing sample of data. But unfortunately, it doesn’t reflect the population.
Regression coefficients are too large.
An overfit model understates standard errors, so confidence intervals are too narrow and p-values are too small.
It does not fit future observations.
And most importantly, it does not replicate.
It’s like custom-tailoring a suit to a tall, thin individual with especially narrow shoulders and long arms, then hoping to sell it off the rack to the general public.
It will look great on the person it was made for, but it won’t fit almost anyone else.
A visual example of overfitting in regression
Below we see two scatter plots with the same data. I’ve chosen this to be a bit of an extreme example, just so you can visualize it.
On the left is a linear model for these points; on the right is a model that fits the data almost perfectly. The model on the right uses many more regression parameters and is overfit.
You can see why the model on the right looks great for this data set. But the only way it could work with another sample is if the points fell in nearly the exact same places. It’s too customized to the data in this sample.
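The two panels above can be sketched numerically. This is a minimal NumPy illustration (not the article's actual figure data): fitting the same ten noisy points with a 2-parameter line versus a 10-parameter polynomial that passes through every point.

```python
import numpy as np

rng = np.random.default_rng(42)

# Ten noisy points from a simple linear relationship.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.3, size=x.size)

# Left panel: a straight line (2 parameters).
linear_fit = np.polyfit(x, y, deg=1)
# Right panel: a degree-9 polynomial (10 parameters) --
# one parameter per observation, so it interpolates every point.
overfit = np.polyfit(x, y, deg=9)

def train_mse(coefs):
    """Mean squared error on the same sample the model was fit to."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

print(f"linear train MSE:   {train_mse(linear_fit):.4f}")
print(f"degree-9 train MSE: {train_mse(overfit):.2e}")
# The degree-9 fit looks "perfect" on this sample, but its wiggles
# are tailored to this particular noise and won't recur in new data.
```

The in-sample error of the overfit model is essentially zero, which is exactly the seductive part: every fit statistic computed on this sample will look wonderful.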
How and when does overfitting occur?
Models with multiplicative effects
Multiplicative effects include polynomial terms (quadratic, cubic, etc.) and interactions. These terms aren’t generally bad, but they have the potential to add a lot of complexity to a model.
Models with many predictor variables
This is especially true for small data sets. Harrell describes a rule of thumb for avoiding overfitting: a minimum of 10 observations per regression parameter in the model. Remember that each numerical predictor in the model adds one parameter. Each categorical predictor adds k-1 parameters, where k is the number of categories.
So a small data set of 50 observations (much larger than my example above!) should have at most 5 parameters.
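The counting in the rule of thumb is easy to get wrong for categorical predictors, so here is a small sketch of the arithmetic. The function names are mine, not from any library, and the count covers predictor parameters only (the intercept is ignored, matching the counting above).

```python
def count_parameters(n_numeric, category_sizes):
    """Parameters contributed by predictors: one per numeric
    predictor, k - 1 per categorical predictor with k levels.
    (Intercept not counted, following the article's convention.)"""
    return n_numeric + sum(k - 1 for k in category_sizes)

def max_parameters(n_obs, per_param=10):
    """Harrell-style rule of thumb: at least 10 observations
    per regression parameter in the model."""
    return n_obs // per_param

# Example: 2 numeric predictors plus one 4-level categorical predictor
p = count_parameters(2, [4])      # 2 + (4 - 1) = 5 parameters
budget = max_parameters(50)       # 50 observations / 10 = 5
print(p, budget)
```

So two numeric predictors and a single 4-level factor already use up the entire parameter budget of a 50-observation data set.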
Automated models
There are a number of automated model selection techniques. You may have heard them called stepwise regression, or forward or backward selection. They’re great for quickly identifying the most predictive variables.
Unfortunately, they’re also prone to overfitting. Newer selection techniques, like the LASSO, have been developed specifically to avoid overfitting.
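To make the LASSO idea concrete, here is a minimal coordinate-descent sketch in plain NumPy. This is a toy implementation under simplifying assumptions (roughly standardized predictors, no intercept); in practice you would use a packaged implementation such as scikit-learn's Lasso or R's glmnet.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Soft-thresholding: shrinks toward zero, and sets small
    coefficients exactly to zero -- this is what does the selection."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Coordinate descent for the LASSO objective
    (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual with predictor j removed
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n)
    return b

# Ten candidate predictors, but only the first two truly matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

b = lasso_cd(X, y, alpha=0.8)
print(np.round(b, 2))  # most noise coefficients come out exactly zero
```

Unlike stepwise selection, the penalty zeroes out weak predictors and shrinks the surviving coefficients, which counteracts the too-large coefficients that overfitting produces.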
Ways to avoid overfitting in regression
Use large samples whenever possible. If you have a small sample, you’ll be limited in the number of predictors you can include.
Validate the model. There are many ways to do this, such as testing it on another sample or through resampling techniques such as bootstrapping and jackknifing.
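The simplest version of validation is a holdout split: fit on one half of the data, evaluate on the other. A minimal NumPy sketch of that idea, with simulated data of my own (many candidate predictors, only one real signal):

```python
import numpy as np

rng = np.random.default_rng(1)

# 80 observations, 20 candidate predictors, only one of which matters.
X = rng.normal(size=(80, 20))
y = X[:, 0] + rng.normal(size=80)

# Split into a training half and a validation half.
X_train, X_valid = X[:40], X[40:]
y_train, y_valid = y[:40], y[40:]

# Fit ordinary least squares on the training half only.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

mse_train = np.mean((X_train @ beta - y_train) ** 2)
mse_valid = np.mean((X_valid @ beta - y_valid) ** 2)
print(f"train MSE: {mse_train:.2f}  validation MSE: {mse_valid:.2f}")
# A large gap between the two is the signature of overfitting.
```

Resampling methods like the bootstrap and jackknife generalize this idea by repeating the fit-and-evaluate cycle over many resampled versions of the data instead of a single split.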
Use scientific theory in model selection. Even models whose sole purpose is prediction benefit from thoughtful reflection about which variables should be included in a model.
A great reference if you want to learn more is Harrell, F. E. (2015). Regression Modeling Strategies (2nd ed.). Springer.
Chief Strategy & Measurement Officer
1w · I think overfitting is quite like p-hacking, but using R² or adjusted R² instead. They are both great examples of Goodhart's Law: once a metric becomes a goal, it is no longer useful as a metric. In business, unlike in academia, the point of analytics or modelling is to find something interesting that your competitors have not identified, rather than a more ‘accurate’ model. That means a lower-R² model could actually be more valuable than a higher-R² model built on more obvious variables.
Adjunct torturer (I teach math and stats) and push boundaries that should never be.
1w · A way to help prevent overfitting of a model is to break the data into training and testing data. See which variables are "important." Repeat this many more times with different random seeds. After doing this, say, 10 times with 10 random seeds, you've eliminated most of the spurious correlations and some of the weaker variables. This leads to a model that doesn't include "all" the important variables, but it eliminates most of the non-important ones. Thus you have a more general model. Then we have to have the discussion: "Would you rather have a generalizable model that works well most of the time, or an overfit model?"
Adjunct torturer (I teach math and stats) and push boundaries that should never be.
1w · I wish I had the presentation recorded... (almost). I did a presentation on optimizing regression models. The main idea was that overfit models lead to worse optimal solutions, which means that a lot of the ideas we have in stats about how to handle a model might not be that great. Eliminate all the "non-significant" terms, even when stats says to keep one because it's part of a quadratic or interaction term.
Thanks for sharing, Karen Grace-Martin, it's amazing
Skilled Biostatistician & Epidemiologist Mentor, Experienced, Knowledgeable & Passionate Instructor - Preclinical & Clinical Research, Genomics & Aeronautics + 30 yrs experience.
1w · Dear Karen Grace-Martin, I would like to take a minute to thank you for posting interesting info, not the copied-and-pasted, misleading info we see so often on LinkedIn.