Why value creation with statistics often fails and how to avoid it.

1 Introduction

For years I have run a training program in applied statistics using JMP from SAS. This document has the triple purpose of:

1. making a summary of the learnings for participants

2. being an appetizer for those who might be interested in it

3. making users of applied statistics in general aware of pitfalls

Working as a consultant in applied statistics, I have realized that many companies fail to create value with applied statistics. Below I have listed some of the typical errors I see. I am an autodidact “applied statistician” with a Ph.D. in Physics. I have made all of these errors myself but hopefully learned from them, not making the same mistake too many times. That is the best way to remember.

Some companies hire statisticians to prevent these errors. I have learned a LOT about statistics working with them. Although they are, for obvious reasons, good at statistics, their ability to combine the results with mechanistic understanding is not always the best. I strongly believe that the combination of applied statistical skills and mechanistic knowledge is what makes the difference. We must train technical people in statistics. Fortunately, with the existence of statistical software packages, this is much easier today than previously. My recommendation for software is JMP from SAS, but several other packages exist. It is more important how you use the software than which software you use.

We still need to prevent statistically trained people from making the errors listed below, by making them aware of the pitfalls and how to handle them. It is by no means a complete list, just the typical issues I am able to see with my limited knowledge and experience. I have grouped them into themes so the reader can go directly to the relevant theme.

2 Sample Size

Sample size is becoming a hot issue, especially within the pharmaceutical industry. It is a balance between compliance and QC costs.

2.1 Issue 1: Sample size rationale is missing or unclear. Compliance issue.

A typical remark during audits is that the rationale for the sample size is either missing or unclear. The typical reason for the remark is that quality decisions are based on the estimated value of a summary statistic in the sample, whereas they should be based on the true value in the population. Therefore, a certain sample size is needed to ensure that the estimated value is close enough to the true value. In practice this is the same as requiring that the whole confidence interval around the estimated value is acceptable.

Solution: By far the easiest solution is to build confidence into the acceptance criteria. If the quality criteria are passed with confidence, the sample size has been sufficient. Sample size calculations are then no longer quality critical; they still need to be made to ensure the criteria will be passed, but now for business reasons.
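As an illustration of what "passing with confidence" can mean in practice, here is a minimal sketch in Python (the article itself uses JMP). The data and specification limits are made up; the rule is simply that the whole two-sided 95% confidence interval for the mean must lie inside the specification.

```python
# Minimal sketch (hypothetical data and spec limits): pass the quality criterion
# only if the whole 95% confidence interval for the mean is inside specification.
import numpy as np
from scipy import stats

measurements = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0])  # hypothetical sample
lsl, usl = 9.5, 10.5                                          # hypothetical spec limits

n = len(measurements)
mean = measurements.mean()
sem = stats.sem(measurements)                                 # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

passed = (ci_low >= lsl) and (ci_high <= usl)
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}] -> pass with confidence: {passed}")
```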

2.2 Issue 2: Sample size is a guess. Yield & QC costs issue.

The result of a sample size calculation is mainly decided by the gap between the actual and the required performance. If the gap is large, there is room for a wide confidence interval, and only a small sample size is needed, and vice versa. The required performance should be known, although it is sometimes unclear where the requirement comes from, and requirements are often set tighter than necessary, especially when enforced with confidence. The true performance, however, is never known, so it will always to some degree be a qualified guess. Thereby the sample size is also just a guess: it might be too low (false rejection) or too high (too costly).

Solution: Go from a fixed sample size to sequential sampling, where sampling is done in sequences until the acceptance criteria are either passed or failed with confidence. It requires a stop rule. If the whole confidence interval is outside the acceptable range, the next sequence is not allowed, and a non-conformity must be raised.
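A minimal sketch of such a sequential scheme, with assumed spec limits, sequence size and stop rule (all placeholders), could look as follows: sample in sequences, recompute the confidence interval, and stop as soon as a confident pass or fail is reached.

```python
# Minimal sketch of sequential sampling (assumed limits, sequence size and stop rule):
# keep sampling until the 95% CI for the mean is entirely inside the spec (pass)
# or entirely outside it (fail); otherwise take the next sample sequence.
import numpy as np
from scipy import stats

lsl, usl = 9.5, 10.5          # hypothetical spec limits
step, max_n = 5, 50           # hypothetical sequence size and stop rule

rng = np.random.default_rng(1)
data = []
while len(data) < max_n:
    data.extend(rng.normal(10.0, 0.2, step))   # stands in for the next sample sequence
    n = len(data)
    lo, hi = stats.t.interval(0.95, n - 1, loc=np.mean(data), scale=stats.sem(data))
    if lo >= lsl and hi <= usl:
        print(f"PASS with confidence after {n} samples")
        break
    if hi < lsl or lo > usl:
        print(f"FAIL with confidence after {n} samples -> raise non-conformity")
        break
else:
    print("Stop rule reached without a confident decision")
```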

3 Measurement Systems Analysis

When investigating Out of Specification (OOS) issues at a client, we will in most cases conclude that they come from the measurements, not the products. This is due to the two issues described below.

3.1 Issue 1: Precision/Tolerance ratio too large. Yield issue.

Classically, a GageRR/Intermediate Precision standard deviation is estimated to qualify a measurement system. The requirement is typically in absolute units or as a GageRR%, where it is calculated how much of the total variation comes from the measurement system. Neither of these directly ensures that the GageRR/Intermediate Precision does not take up too large a part of the tolerance and thereby leads to false rejects.

Solution: Enforce the ratio between the GageRR/Intermediate Precision standard deviation and the +/- tolerance during method qualification. We recommend that it be lower than 10%, but in special cases, where Production only needs a small part of the tolerance, it can be higher.
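A minimal sketch of the check, with made-up numbers, just to make the ratio concrete:

```python
# Minimal sketch (hypothetical numbers): check that the GageRR / Intermediate Precision
# standard deviation is below 10% of the +/- tolerance, as recommended above.
sigma_grr = 0.03        # hypothetical GageRR standard deviation
half_tolerance = 0.5    # hypothetical +/- tolerance

ratio = sigma_grr / half_tolerance
print(f"Precision/Tolerance ratio: {ratio:.1%} -> {'OK' if ratio <= 0.10 else 'too large'}")
```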

3.2 Issue 2: Precision without confidence. Yield issue.

Most companies qualify their measurement systems' GageRR/Intermediate Precision standard deviation without confidence, although guidance on how to do it with confidence (like ISO GUM) has existed for years. Typically, only a few levels (classically 3) of the reproducibility factors (operator/equipment/day) are used, making the estimate of reproducibility very uncertain, so there can be a large difference between the estimated and the true value. Even though the estimated precision is fine, the true value might not be, leading to false approval of measurement systems and thereby a yield issue.

Solution: Enforce the ratio between the GageRR/Intermediate Precision standard deviation and the +/- tolerance with confidence. Although statistical software packages typically only give the estimated GageRR/Intermediate Precision without confidence in their Measurement Systems Analysis platforms, they have the toolbox to do it with confidence in their modelling platforms with random factors. With a little bit of scripting, this can be fully automated.
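To illustrate the idea of enforcing the criterion with confidence, here is a simplified sketch for the single-variance-component case, using a chi-square upper confidence bound on the standard deviation (the numbers are made up). A full GageRR study with several random factors would instead use a mixed model (e.g. REML) and confidence intervals on the variance components, as described above.

```python
# Minimal sketch (single variance component, hypothetical numbers): enforce the
# precision/tolerance criterion on a 95% upper confidence bound of the standard
# deviation rather than on the point estimate.
from scipy import stats

s, df = 0.035, 9          # hypothetical estimated sd and its degrees of freedom
half_tolerance = 0.5      # hypothetical +/- tolerance

s_upper = s * (df / stats.chi2.ppf(0.05, df)) ** 0.5   # 95% upper bound on sigma
ratio_upper = s_upper / half_tolerance
print(f"Upper 95% bound on sd: {s_upper:.4f}, ratio vs tolerance: {ratio_upper:.1%}")
```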

4 Statistical Modelling

Statistical modelling is the heart of analytical statistics. Classical Fit Least Squares models can describe many processes and thereby form the basis for data-based optimization and control. Often you do not need to go to more advanced modelling techniques like Machine Learning or Artificial Intelligence, if you are aware of and act on the issues described below. There are typically some assumptions behind these models that must be fulfilled to a certain degree to make the conclusions reliable. These assumptions are listed in Issues 1 & 2 below, including how to handle them if they are not fulfilled.

4.1 Issue 1: Residuals are not normally distributed.

Residuals not being normally distributed can have several causes:

1. Despite the name, many processes are by nature not normally distributed

2. Outliers

3. Due to drift over time, the normal distribution moves

4. If data from several units are pooled, the result might be multimodal.

Solutions:

1. Transform data with, e.g., a Box-Cox transformation to make the residuals normally distributed (a sketch follows this list)

2. Exclude outliers

3. Enter sampling time as a random factor in the model

4. Enter unit as a systematic factor in the model
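A minimal sketch of solution 1, on simulated skewed data, showing how a Box-Cox transformation can bring the data close to normality (the data and the normality check are only illustrative):

```python
# Minimal sketch (simulated data): Box-Cox transform a skewed response so that
# residuals of a subsequent model become approximately normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=1.0, sigma=0.4, size=200)   # stands in for a skewed response

y_bc, lam = stats.boxcox(y)                        # transformed data and fitted lambda
print(f"Box-Cox lambda: {lam:.2f}")
print(f"Shapiro-Wilk p-value, raw: {stats.shapiro(y).pvalue:.3f}, "
      f"transformed: {stats.shapiro(y_bc).pvalue:.3f}")
```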

4.2 Issue 2: Variance homogeneity is not fulfilled.

In Fit Least Squares, a model is made where the average squared residual is minimized. If the residuals depend on either:

1. the level

2. the factor settings

the model will fit best the observations where the residuals are largest, because this gives the lowest average squared residual. This does not make sense: it means we pay more attention to observations the noisier they are. It should be the opposite.

Solutions:

1. Box-Cox transformation to obtain variance homogeneity, or weighted regression with inverse variance to scale the residuals so they become comparable. From the Box-Cox lambda the weighting factor can be calculated.

2. Weighted regression as above. Make a LogVariance model to find the variance dependency on the factor settings (a weighted-regression sketch follows this list).
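A minimal sketch of weighted regression on simulated data where the residual variance grows with the predictor. Here the variance model is assumed known; in practice it would come from a LogVariance model or from the Box-Cox lambda as described above.

```python
# Minimal sketch (simulated data): weight each observation with its inverse variance
# so noisy observations get less influence on the fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 100)
sigma = 0.2 * x                                   # assumed variance dependency on x
y = 2.0 + 0.5 * x + rng.normal(0, sigma)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                          # ignores the variance heterogeneity
wls = sm.WLS(y, X, weights=1.0 / sigma**2).fit()  # inverse-variance weights
print("OLS parameters:", ols.params)
print("WLS parameters:", wls.params)
```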

4.3 Issue 3: No test for outliers

Especially with models based on many data points, it is unavoidable that there will be outliers, which need to be excluded from the model and reported separately. An outlier is an observation that cannot be described by the model. Often we see model results being used without testing for outliers. This is serious, because outliers cannot be predicted by the model, so even if the model predicts a good process, the prediction might not be right. Outliers shall be reported separately.

Solution: Look at an externally studentized residual plot with 95% simultaneous limits. If an observation is outside the limits, it must be treated as an outlier and the model rebuilt without it. Before testing for outliers, ensure that the assumptions behind the model are fulfilled and, if needed, corrected for as described in Issues 1 & 2 above.
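A minimal sketch on simulated data with a planted outlier; a Bonferroni-adjusted t-limit is used here as a simple stand-in for the simultaneous 95% limits mentioned above.

```python
# Minimal sketch (simulated data): flag outliers with externally studentized residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.8 * x + rng.normal(0, 0.3, 50)
y[20] += 3.0                                            # planted outlier

res = sm.OLS(y, sm.add_constant(x)).fit()
t_ext = OLSInfluence(res).resid_studentized_external
n, p = len(y), 2
limit = stats.t.ppf(1 - 0.05 / (2 * n), df=n - p - 1)   # Bonferroni "simultaneous" limit
print("Suspected outliers at indices:", np.where(np.abs(t_ext) > limit)[0])
```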

4.4 Issue 4: Identified outliers are included in the model

Even if an outlier test has been performed and outliers have been identified, they are typically included in the model if the laboratory investigation is inconclusive. This is wrong: it is still an outlier that cannot be described by the model. So even if the model is acceptable with the outlier included, it is not sufficient.

Solution: Report the outliers separately.

4.5 Issue 5: Model reduction is done based on P-values

Classically, model reduction is done by removing terms with high P-values from the model to avoid overfitting. However, the P-value criterion needs to be adapted to the size of the data set and the noise level. In a large data set with a low noise level, a low P-value threshold should be used; otherwise, parameters with no practical influence stay in the model. In a small data set with a high noise level, a high P-value threshold should be used; otherwise, parameters with a large practical influence might be removed.

Solution: Use log-likelihood criteria corrected for the number of fitting parameters when reducing models. For large data sets with few predictors, the Bayesian Information Criterion (BIC) is recommended. For small data sets with many predictors, the corrected Akaike Information Criterion (AICc) is recommended. Model comparison with BIC and AICc can be found in the All Possible Models comparison in JMP. At all times, combine with mechanistic understanding from process experts.
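Outside JMP, the same comparison can be sketched in a few lines; below is an illustrative version on simulated data, where AICc is derived from the reported AIC with the usual small-sample correction (the parameter count follows the AIC definition used by the library).

```python
# Minimal sketch (simulated data): compare candidate models with BIC and AICc
# instead of reducing on P-values alone.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"x1": rng.normal(size=30), "x2": rng.normal(size=30)})
df["y"] = 1.0 + 2.0 * df.x1 + rng.normal(0, 0.5, 30)      # x2 has no real effect

def aicc(fit):
    # small-sample correction added to the AIC reported by statsmodels
    k, n = fit.df_model + 1, fit.nobs
    return fit.aic + 2 * k * (k + 1) / (n - k - 1)

for formula in ["y ~ x1", "y ~ x1 + x2", "y ~ x1 * x2"]:
    fit = smf.ols(formula, data=df).fit()
    print(f"{formula:14s}  BIC={fit.bic:7.2f}  AICc={aicc(fit):7.2f}")
```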


4.6 Issue 6: Lack of variability and combinations

To build a model and find relations between inputs and outputs, both need to vary. In historical data sets, especially from processes with a fixed recipe, the variation might not be sufficient. In addition, to test for interactions, many combinations of factors need to be present in the data set, which is often not the case.

Solution: Design of Experiments, where factors are varied on purpose more than in Production, and where the design ensures we have the combinations needed to test for interactions.


5 Design of Experiments

As described above, Design of Experiments (DoE) solves the issues with lack of variability and combinations of the predictors. In DoE, the classical approach is first to make a screening experiment among many factors to find the critical few, and then in a second round make a model experiment on the critical few to establish the relations, including curvature and interactions.

5.1 Issue 1: Wrong view on what is important.

Classical screening experiments do not take curvature and interactions into consideration. They check each factor at a low and a high level only, followed by some center points where all factors are in the middle. If there is curvature on a factor, it can be important, even though the output is the same at the low and the high level. In a classical screening design, you can test whether there is curvature to get an alert, but you cannot see which factor caused it, since all factors were in the middle at the same time. In addition, in a classical screening design there is full confounding between main effects and two-factor interactions. This means that a main effect can appear stronger or weaker than it is, in case of a two-factor interaction.

Solution: Definitive Screening Designs: designs where the center points are displaced and where two-factor interactions are not confounded at all with main effects.

5.2 Issue 2: Long time to perform experiments.

Often some factors are more difficult to change than others. If the experiments are fully randomized, as they should be, it then takes a long time to perform them, because the hard-to-change factors must be changed between each run. Alternatively, the experimental plan can be sorted so that there are as few changes as possible. However, then the experiment is no longer randomized, and drift will be confounded with the hard-to-change factors, leading to potentially wrong estimation of their effects.

Solution: Make a design with hard-to-change factors, so they are randomized less than the others, and then include Whole Plots as a random factor in the model to correct for the different degrees of randomization.

5.3 Issue 3: Model reduction with P-values is especially difficult when using hard-to-change factors.

When using hard-to-change factors as described in Issue 2, the hard-to-change factors have fewer degrees of freedom than the others due to the fewer changes of the factors. This means the P-value will be higher for the same influence. P-values are therefore not a fair criterion for model reduction. Hard-to-change factors come together with a Whole Plot as a random factor, which is not compatible with All Possible Models as described in 4.5.

Solution: Look at both parameter estimates (practical significance) and P-Values (statistical significance) when reducing models and combine with mechanistic understanding from process experts.

5.4 Issue 4: Wrong interpretation of significance.

When doing experiments, the time-consuming part is often making the changes between runs. It is therefore tempting to take more than one sample per run once the process settings are made. But we must then be careful in the analysis: this is NOT the same as performing the same experiment at another point in time, and it can lead to wrong interpretation of statistical significance.

Solution:

Build a model on the mean of the samples within an experiment, or preferably build a model on the individual observations, including the experiment number as a random attribute factor in the model. The model is then corrected for the fact that these are several measurements on the same experiment and not the same experiment performed at a different time.
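A minimal sketch of the second option on simulated data: several samples are measured per run, so the run number is included as a random factor. The column names (y, x, run) are placeholders.

```python
# Minimal sketch (simulated data): mixed model with the experimental run as a
# random factor, so repeated samples within a run are not treated as replicates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
runs = np.repeat(np.arange(8), 3)                        # 8 runs, 3 samples per run
x = np.repeat(rng.uniform(0, 1, 8), 3)                   # factor is set once per run
run_effect = np.repeat(rng.normal(0, 0.3, 8), 3)         # between-run variation
y = 1.0 + 2.0 * x + run_effect + rng.normal(0, 0.1, 24)  # within-run variation

df = pd.DataFrame({"y": y, "x": x, "run": runs})
fit = smf.mixedlm("y ~ x", data=df, groups=df["run"]).fit()
print(fit.summary())
```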

5.5 Issue 5: It takes a long time with 2 rounds

Classically, DoE has been a two-step approach, with first a screening experiment and then a model experiment. Because of the two rounds, with waiting time in between for the screening results before the model DoE can be designed, it can take a long time.

Solution: Make the screening DoE slightly larger, include curvature, and minimize aliasing between main effects and interactions. Then there is a high chance that the screening DoE can be reused as the model DoE if half of the screened factors have no or low influence, and there is no need for a second round of DoE.

6 Statistical Process Control

Statistical process control consists of two subjects: control charts and capability indices. The classical control charts and capability indices assume normally distributed data, which is rarely fulfilled. Note that the solution to all the issues below is the same.

6.1 Issue 1: Wrong control limits from non-normal data

Solution: Use prediction limits from a model on Box-Cox transformed data that describes the process as control limits. The transformed data will be normally distributed. In JMP from SAS, the prediction limits are automatically back-transformed.
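Outside JMP, the same idea can be sketched by hand for the simplest case (an intercept-only model on individual observations); the data are simulated and the 3-sigma-equivalent coverage is an assumption for illustration.

```python
# Minimal sketch (simulated data): Box-Cox transform, compute prediction limits on
# the transformed scale, and back-transform them to use as control limits.
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(7)
y = rng.lognormal(1.0, 0.3, 60)                    # stands in for skewed process data

y_bc, lam = stats.boxcox(y)
n, mean, sd = len(y_bc), y_bc.mean(), y_bc.std(ddof=1)
t = stats.t.ppf(0.99865, df=n - 1)                 # roughly the one-sided 3-sigma level
half_width = t * sd * np.sqrt(1 + 1 / n)           # prediction half-width

lcl = inv_boxcox(mean - half_width, lam)
ucl = inv_boxcox(mean + half_width, lam)
print(f"Back-transformed control limits: [{lcl:.2f}, {ucl:.2f}]")
```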

6.2 Issue 2: Wrong control limits on Shewhart charts

Because between-subgroup variation is by nature different from within-subgroup variation, be careful with Shewhart charts that enforce the within-subgroup variation across subgroups.

Solution: Use prediction limits from a model with subgroup as a random factor describing the process as control limits.

6.3 Issue 3: Wrong alert limits due to large estimation error

Especially for new products, the first control limits must be set based on few data. There is then a large standard error on both the mean and especially the standard deviation, and thereby a high risk that the limits are too narrow and that a future Out of Trend is a false alert.

Solution: Use prediction limits from a model describing the process as control limits. Prediction limits take the estimation error into consideration. They must be updated when more data are available, in order not to have too wide limits.
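The effect of the estimation error is easy to see in the simplest case; the sketch below (made-up estimates, intercept-only model) compares naive mean +/- 3*sd limits with prediction limits, which widen through the t-quantile and the sqrt(1 + 1/n) factor when n is small.

```python
# Minimal sketch (hypothetical estimates): prediction limits vs naive 3-sigma limits.
import numpy as np
from scipy import stats

mean, sd = 10.0, 0.2                 # hypothetical estimates from early production
for n in (5, 15, 50):
    t = stats.t.ppf(0.99865, df=n - 1)         # roughly the one-sided 3-sigma level
    pred = t * sd * np.sqrt(1 + 1 / n)         # prediction half-width
    print(f"n={n:3d}: naive +/- {3 * sd:.3f}  vs  prediction +/- {pred:.3f}")
```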

6.4 Issue 4: Capability indices have the same issues 1-3.

Solution: Calculate Capabilities from prediction limits.

7 Validation

Validation is about predicting the future, not the past. It is no longer enough to be able to make 3 good batches. We must make several batches and, based on these, predict that future batches will be good.

7.1 Issue 1: Compliance issue, not predicting the future

Solution: Make a model that describes the validation data set with batch as a random factor. Then the prediction intervals contain future batches. If necessary, the prediction limits can be converted to capability indices.

7.2 Issue 2: It takes many batches to predict the future

Solution: Pass stage 2 when the control limits without confidence are inside specification, and stage 3A when the prediction limits are inside. Then we can market batches after the first 3, as in the old validation concept. However, the inspection level cannot be reduced or removed until the prediction limits are inside specification.


7.3 Issue 3: High QC costs due to insufficient validation

Since it is well known that being able to make 3 good batches does not in itself guarantee that all future batches will be good, a costly QC test is needed on each batch afterwards to ensure it is acceptable.

Solution: Make a predictive validation as in Issue 1; then end-product testing is not needed going forward if the validated state is maintained. This can be done by measuring on the process instead of the product.

Patrick Galler

QC Manager - Data and Analytics at Aker BioMarine

2y

Which of the courses available here https://www.nne.com/about-us/events-and-courses/ would one have to book to cover as much as possible of what is described in this post? Thank you!

Allen Scott

Management / Quality Consultant “The measure of quality, no matter what the definition of quality may be is a variable.” (Shewhart, 1931)

2y

You have fallen into the trap of Pearson the younger, the trap Dr. Shewhart was careful to avoid and you can forget about creating any value until you address your problem. Classic statistics are inappropriate for quality. This short piece might help you right the ship. https://www.spcpress.com/pdf/DJW287.pdf

"The classical control charts and capability indices assumes normal distributed data," Utter garbage! Please read Wheeler, Deming, Shewhart.

Per Vase

Managing Consultant at NNE, Denmark

2y

Hi all, Thank you for the nice words and the constructive feedback. Based on your inputs I have added chapter 4.5, 5.3 and 7.3.

Mads Peter Rabøl

Process Engineer | GMP | Data Analysis

2y

A good read! Thanks Per!
