Test Assumptions Only After Running an Initial Model
Like the chicken and the egg, there’s a question about which comes first: run a model or test assumptions? Unlike the chickens’, the model’s question has an easy answer.
There are two types of assumptions in a statistical model. Some are distributional assumptions about the residuals. Examples include independence, normality, and constant variance in a linear model.
Others are about the form of the model. They include linearity and including the right predictors.
You can get clues about whether most of these assumptions will be met before running a model. But you can’t check them.
All the distributional assumptions of linear models are about the residuals. Many of the others can be checked by looking at residuals.
And you can’t get residuals until you run a model.
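To make that dependency concrete, here is a minimal Python sketch on simulated data. The variable names, sample size, and coefficients are all made up for illustration. Residuals are observed values minus fitted values, so there is nothing to compute until a model has produced fitted values.

```python
import random


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated data (an assumption for illustration, not real measurements).
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

# Step one: run the initial model.
intercept, slope = fit_line(x, y)

# Step two: only now do residuals exist to be checked.
fitted = [intercept + slope * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
```

Every later check (a normal probability plot, a plot of residuals against fitted values) takes `resid` and `fitted` as its input, which is why the model has to come first.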
In the steps I use to run a model, testing assumptions is Step 11. Running an initial model is Step 9. Here is the full list, in case you haven’t seen it.
A Big Fat Caveat
So don’t start running normal probability plots or checking variances before you are reasonably sure you have what is close to a final model.
But that doesn’t mean you should put a lot of work into model refinement without a reasonable idea of whether the model is appropriate for the data.
You want to be thinking about the most appropriate type and form of model from the very beginning.
If you’ve done the foundational work in the early steps, testing assumptions is about looking for minor deviations, not major transgressions.
The Design
In Step 2, you defined the design. You checked for things like repeated measures, pairing, cluster sampling, or nested factors. Any of these would make residuals non-independent.
If any of these design issues exist in your data, you’re not going to apply a linear model and only notice non-independence once you get to the 11th step.
Instead, you’d choose a model that accounts for the non-independence.
So yes, you should still check whether the non-independence exists in the data during Step 11. (Sometimes it doesn’t, even though the design indicates it’s likely.) But you should look for it and incorporate it into the analysis plan much, much earlier.
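As a rough illustration of what checking for non-independence can look like, the sketch below simulates clustered data, fits a plain linear model that ignores the clusters, and then estimates how much of the residual variance sits between clusters. The intraclass correlation used here is one possible diagnostic that I have chosen for the example, not a step from the list above, and the data, cluster sizes, and function names are all hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_icc(resid, cluster_ids):
    """One-way ANOVA estimate of the intraclass correlation.

    Assumes every cluster has the same number of observations.
    """
    groups = {}
    for r, c in zip(resid, cluster_ids):
        groups.setdefault(c, []).append(r)
    k = len(groups)                        # number of clusters
    m = len(next(iter(groups.values())))   # observations per cluster
    grand = sum(resid) / len(resid)
    means = {c: sum(v) / m for c, v in groups.items()}
    msb = m * sum((mu - grand) ** 2 for mu in means.values()) / (k - 1)
    msw = sum((r - means[c]) ** 2
              for c, v in groups.items() for r in v) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)


# Simulated clustered design: 30 clusters of 10, each with its own shift.
rng = random.Random(7)
cluster_ids, x, y = [], [], []
for c in range(30):
    cluster_effect = rng.gauss(0, 2)
    for _ in range(10):
        xi = rng.uniform(0, 10)
        cluster_ids.append(c)
        x.append(xi)
        y.append(1.0 + 0.5 * xi + cluster_effect + rng.gauss(0, 1))

# A model that ignores the clustering leaves the cluster shifts in the residuals.
intercept, slope = fit_line(x, y)
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

icc_clustered = residual_icc(resid, cluster_ids)

# Baseline: with cluster labels shuffled, the estimate should sit near zero.
shuffled_ids = cluster_ids[:]
rng.shuffle(shuffled_ids)
icc_shuffled = residual_icc(resid, shuffled_ids)
```

A value of `icc_clustered` well above the shuffled baseline suggests the residuals are not independent within clusters. That is the situation where a model that accounts for the clustering should have been in the analysis plan from the start.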
The Scales of Measurement
In Step 3, you defined the measurement scales of all variables.
Remember, an outcome variable (Y) does not have to be normally distributed for a linear model’s assumptions to be met.
The residuals do.
So don’t bother running tests of normality on Y. All that will do is make you panic unnecessarily if it’s a bit skewed. (But do look at the distribution in an upcoming step, before you run a single model.)
Since the predictor variables (the Xs) affect the shape of Y’s distribution, it’s possible for the residuals to be normally distributed even when Y isn’t.
But this can only happen if Y is continuous, unbounded, and measured on an interval or ratio scale.
If any of these fail, it’s nearly impossible to get normally distributed residuals, even with remedial transformations.
Types of variables that will generally fail these criteria include:
So if you find your outcome variable isn’t continuous, run a more appropriate initial model.
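To illustrate the earlier point that residuals can be normal when Y is not, here is a small simulation sketch. It assumes a continuous, unbounded Y driven by one strongly skewed predictor plus normal noise. The data and the `skewness` helper are invented for this example.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Simulated data: a right-skewed predictor pulls Y's distribution to the right.
rng = random.Random(0)
x = [rng.expovariate(1.0) for _ in range(5000)]
y = [3.0 * xi + rng.gauss(0, 1) for xi in x]

intercept, slope = fit_line(x, y)
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_y = skewness(y)          # clearly positive: Y itself is skewed
skew_resid = skewness(resid)  # close to zero: the residuals are symmetric
```

In this setup a normality test on Y would raise an alarm, while the residuals, which are what the assumption is actually about, show no skew.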
Run descriptive statistics first.
Likewise, in Step 6, you ran univariate and bivariate descriptive statistics—and even better, graphs—on all variables you planned to use in your model.
The univariate graphs will illuminate any distributional hiccups on even continuous data. Skew may not be a problem, especially if it’s not extreme. But other distributional issues can be. Here you’re looking for issues like:
The bivariate graphs will help you see any non-linearity in relationships and give you inklings of non-constant variance. This will allow you to incorporate these issues into the initial model run in Step 9, before you even get to checking assumptions.
When you finally do check the assumptions, you may still have some surprises. But they will be the kind you can remedy, not the kind that forces you to start over.