How to justify assumptions behind a Fit Least Squares model in JMP?
Variance heterogeneity issue

Fit Least Squares models are often used to make critical decisions. It is therefore important to justify the assumptions behind the models. In our experience, Health Authorities are placing increased focus on this during audits.

There are basically three assumptions, where the last one applies only to models with random effects:

1. Residuals are normally distributed

2. There is variance homogeneity

3. Random factor effects follow a normal distribution

Assumptions 1 and 2 can be checked at the same time by looking at an externally studentized residual plot, e.g. in JMP from SAS.

Assumption 3 can be checked by looking at the BLUPs (Best Linear Unbiased Predictors of the random effects), also using an externally studentized residual plot.

In our opinion this is the best and easiest way of doing it: a unified approach where the same tool is used for all three assumptions.

In the following, the tool is described in more detail, including three examples showing that all three assumptions can be checked with the same tool.


Tool: Externally studentized residual plot

Studentized residuals are residuals scaled by the standard error of the residual.

Externally studentized residuals are residuals scaled by the standard error of individuals, where the observation under test is excluded from the model when calculating both the residual and the standard error. This increases the sensitivity of the test.

Since these are scaled residuals, the limits for studentized and externally studentized residuals are, respectively:

$\pm\, t_{1-\alpha/N}(N-1-k) \quad\text{and}\quad \pm\, t_{1-\alpha/N}(N-2-k)$

For externally studentized residuals you must subtract 1 from the degrees of freedom (DF) of the residuals to correct for taking the observation under test out. The DF equals N − 1 − k, where N is the number of observations and k is the number of parameters in the model (not counting the intercept); α is the one-sided false-alert rate.

The great thing about these limits is that they take the estimation uncertainty of the standard deviation into account by using the t-quantile, and they adapt the false-alert rate alpha to the number of observations by dividing it by N, so that in total we only run an alpha risk of getting a false alert across all observations. Thereby a one-sided alpha of 2.5% works in most situations. In JMP you will also see the individual limits where alpha is not divided by N; they are green. The limits where alpha is divided by N, the so-called Bonferroni limits, are red.
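To make the limit construction concrete, here is a minimal sketch in Python (statsmodels/scipy). JMP computes these limits for you, so the code only illustrates the arithmetic behind the green and red lines; the data and model are simulated assumptions, not taken from the article:

```python
# Minimal sketch: externally studentized residuals with individual (green)
# and Bonferroni (red) limits. Data and model are simulated assumptions.
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

X = sm.add_constant(x)                      # intercept plus k = 1 parameter
fit = sm.OLS(y, X).fit()
r_ext = fit.get_influence().resid_studentized_external

N, k = y.size, 1
alpha = 0.025                               # one-sided false-alert rate
df_ext = N - 2 - k                          # one extra DF lost for the left-out point
green = st.t.ppf(1 - alpha, df_ext)         # individual limit (alpha not divided by N)
red = st.t.ppf(1 - alpha / N, df_ext)       # Bonferroni limit (alpha divided by N)

print("points beyond the red limits:", np.where(np.abs(r_ext) > red)[0])
```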

In case of few levels J of a random factor, like 3 validation batches, it might be better to look at a studentized residual plot instead of an externally studentized one, since the limits get very wide with few degrees of freedom. When J is low, there is a big difference in the width of the t-distribution with J − 1 versus J − 2 degrees of freedom.


Example with lognormally distributed data

Contamination data typically follows a lognormal distribution as shown below in Figure 1.

Figure 1: Distribution test on contamination data

If you make a Fit Least Squares model in JMP, you will get a clear warning on the externally studentized residual plot. A lot of points are outside the red limits, but only on the high side, due to the upward-skewed distribution, as shown in Figure 2.

Figure 2: Externally studentized residual plot on contamination data

However, just below the externally studentized residual plot you have the Box-Cox transformation test as shown in Figure 3.

Figure 3: Box-Cox transformation test on contamination data

The best lambda is close to 0 (log transformation), and if you use Refit with Transform with lambda 0, you will see that there are no longer any outliers, because the assumption of normally distributed residuals is now fulfilled after the transformation. This is shown in Figure 4.

Figure 4: Externally studentized residual plot on log transformed contamination data.
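For readers who want to reproduce the lambda estimate outside JMP, here is a minimal sketch using scipy's Box-Cox fit on simulated lognormal-like data (the numbers are assumptions, not the article's contamination data):

```python
# Minimal sketch: Box-Cox lambda on lognormal-like data; lambda near 0
# indicates a log transformation. Simulated data, not the article's data set.
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)
contamination = rng.lognormal(mean=1.0, sigma=0.8, size=100)

transformed, lmbda = st.boxcox(contamination)
print(f"best lambda: {lmbda:.2f}")          # expect a value close to 0

# Refitting with lambda = 0 then corresponds to modelling np.log(contamination).
```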


Example with lack of variance homogeneity

Figure 5 below shows the ISPE (International Society for Pharmaceutical Engineering) validation data set for Blend Uniformity. Batch B varies much more within a blender location than Batches A and C.

Figure 5: ISPE Blend Uniformity data set

If you build a model with Batch, Location and Batch*Location, this is detected on the externally studentized residual plot. A single point is outside the red line, and many are outside the green line, all of them coming from Batch B, as shown in Figure 6.

Figure 6: Fit Least Squares model

You cannot just go on with the model here. You would be pooling unequal residual variances, which does not make sense. Transforming does not help either, since the variance issue is not related to the level of the response. You need to either scale down the variance of Batch B so it can be assumed to be the same as that of Batches A and C, or scale up the variance of Batches A and C so it can be assumed to be the same as that of Batch B. Let us assume we have found and removed the cause of the larger variance in Batch B; then we are allowed to scale it down. This can be done by fitting a log-variance model to find out how much more it varies, and then doing weighted regression with inverse-variance weights, ensuring that the weight factor is 1 for Batches A and C, as sketched below.
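A minimal sketch of that weighting step in Python (statsmodels), with simulated data standing in for the ISPE set and a simple per-batch residual-variance estimate used in place of a formal log-variance model:

```python
# Minimal sketch: inverse-variance weighted regression. Simulated data
# standing in for the ISPE blend-uniformity set; batch/location labels assumed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "batch": np.repeat(["A", "B", "C"], 30),
    "location": np.tile(np.repeat([1, 2, 3, 4, 5], 6), 3),
})
sd = np.where(df["batch"] == "B", 2.0, 1.0)      # Batch B varies twice as much
df["y"] = 100 + rng.normal(scale=sd)

# Per-batch residual variance from an unweighted fit (a log-variance model
# would be the more formal route).
ols = smf.ols("y ~ C(batch) * C(location)", data=df).fit()
df["r"] = ols.resid
var_b = df.groupby("batch")["r"].transform("var")

# Inverse-variance weights, scaled so Batches A and C get weight 1.
base = var_b[df["batch"] == "A"].iloc[0]
df["w"] = base / var_b

wls = smf.wls("y ~ C(batch) * C(location)", data=df, weights=df["w"]).fit()
print(wls.params.head())
```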

The externally studentized residual plot is then free of outliers, and Batch B no longer stands out from the rest, as shown in Figure 7.

Figure 7: Externally studentized residual plot on a weighted Fit Least Squares Model



Example with Random Factors

Since the ISPE data set is a validation data set, Batch should be a random factor, because we need to prove that future batches will also be good. We then regard the validation batches as a random sample of all future batches. Consequently, Location*Batch is also random.

Location remains a systematic (fixed) factor, since the locations are deliberately selected to cover worst-case positions in the blender.

However, if you make a model with random factors, you will not get limits on the studentized residual plot in JMP, and it will not be externally studentized. This is because we cannot just look at individual outliers; a complete level of a random factor might be an outlier.

We need to justify that Batch and Location*Batch can be treated as random factors, i.e. that their effects follow a normal distribution. This can be done by extracting the random-effect BLUPs and making a studentized residual plot on them for each random factor, Batch and Location*Batch, separately. Since we only have 3 batches, a studentized residual plot might be preferable to an externally studentized one. But why not make both, as shown in Figure 8? Both the studentized residuals (blue) and the externally studentized residuals (red) are inside their respective limits, so we can justify setting both Batch and Batch*Location as random factors. Note the much wider limits on the external residuals for batches, of which there are only three.

Figure 8: Studentized (blue) and externally studentized (red) residual plots on the BLUPs
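As a rough illustration of the BLUP check, the sketch below fits a mixed model to the simulated data frame df from the previous sketch and scales the three batch BLUPs. The scaling here is a plain standardization, not JMP's exact studentization, and the model setup is an assumption:

```python
# Minimal sketch: batch BLUPs from a mixed model, standardized and compared
# with t-based limits. Continues the simulated df from the previous sketch.
import numpy as np
import scipy.stats as st
import statsmodels.formula.api as smf

md = smf.mixedlm("y ~ C(location)", df, groups=df["batch"],
                 re_formula="1",                         # random Batch intercept
                 vc_formula={"loc": "0 + C(location)"})  # Batch*Location component
mfit = md.fit()

# One random-intercept BLUP per batch (statsmodels labels it "Group").
blups = np.array([eff["Group"] for eff in mfit.random_effects.values()])
z = (blups - blups.mean()) / blups.std(ddof=1)

J = len(blups)                                # only 3 batches
alpha = 0.025
lim_stud = st.t.ppf(1 - alpha / J, J - 1)     # studentized limit
lim_ext = st.t.ppf(1 - alpha / J, J - 2)      # externally studentized, one DF fewer
print(z, lim_stud, lim_ext)                   # note how much wider lim_ext is
```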

From the model with Batch and Batch*Location as random factors and Location as systematic, 99.73% prediction limits for future batches can be calculated as shown in Figure 10. Note that the prediction limits are inside the specification limits for all locations except location 1. It is thereby proven with confidence that future batches will also be inside specification for all locations except location 1, provided we can maintain the validated state. The dotted lines show control limits calculated from the estimated mean, the estimated total standard deviation and the normal quantile (without confidence). They are inside specification for location 1 as well. So, without confidence, all future batches are predicted to be fine, but to prove with confidence that location 1 is fine we need more batches. Since we are close, just one more batch should be sufficient. The difference between the two kinds of limits is sketched below.
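A minimal sketch of that difference, under assumed numbers: the mean, total standard deviation and batch count are made up, and the prediction-limit formula is a simplification rather than JMP's exact variance-component calculation:

```python
# Minimal sketch: 99.73% control limits (normal quantile, no confidence) versus
# a simplified prediction limit with confidence (t-quantile with J - 1 = 2 DF).
# All numbers are assumptions; this is not JMP's exact formula.
import numpy as np
import scipy.stats as st

mean, s_total, J = 100.0, 1.8, 3
p = 0.99865                                    # one-sided coverage for 99.73%

ctrl = mean + np.array([-1, 1]) * st.norm.ppf(p) * s_total
pred = mean + np.array([-1, 1]) * st.t.ppf(p, J - 1) * s_total * np.sqrt(1 + 1 / J)
print("control:", ctrl)                        # roughly +/- 3 sigma, no confidence
print("prediction:", pred)                     # much wider with only 3 batches
```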

Batch B (green) is outside both the control and the prediction limits, since the raw data is plotted and the limits are calculated as if Batch B had the same variance as Batches A and C. The conclusions above are only valid if we have found and removed the cause of the higher variance in Batch B.

Figure 10: Prediction and control limits versus location and batch.


