Detecting Outliers in Correlation Analysis

Outliers complicate correlation analysis, as well as modeling and the analysis of forecast performance. A single outlier can have a significant impact on a correlation coefficient. When the values of one time series (or variable) are paired with the corresponding values of a related time series (or variable), the relationship between the two can be depicted in a scatter diagram. Exhibit 1 shows a scatter diagram of 60 values simulated from a bivariate normal distribution with population correlation coefficient ρ = 0.9. One point was moved to become an outlier, and the ordinary correlation coefficient is now calculated to be 0.84. Exhibit 1 shows that, except for this single point (tagged as 1 on the labelled curve -0.06), the scatter is quite linear; in fact, with this outlier removed, the estimated correlation coefficient is 0.9 (Source: Susan Devlin et al., “Robust estimation and outlier detection with correlation coefficients,” Biometrika, 1975). Outlier-resistant and robust alternatives offer useful protection in these instances, and elsewhere when non-normal error distributions need to be accommodated.

Exhibit 1 Scatter Plot with Influence Function Contours for a Sample of Bivariate Normal Data with the Added Outlier (n = 60, r = 0.9, with outlier r = 0.84) (Source: Data from R.A. Fisher, The Design of Experiments, 1935, digitized 2009).

Linear Association and Correlation. In a scatter plot or diagram, one variable is plotted on the horizontal scale and the other on the vertical scale. Such a plot is a valuable tool for studying the relationship between a pair of variables as a preliminary step in building regression models for forecasting. When two series have a strong positive association, the scatter diagram shows a pattern of points along a line of positive slope. A negative association shows up as a scatter of points along a line with negative slope. A conventional measure of linear association between a pair of variables Y and X is the ordinary correlation coefficient r, which is computed by an averaging formula that uses the sample mean and sample standard deviation of each variable. It is also known as the Pearson product moment correlation coefficient.

Nowadays, spreadsheet and statistical software programs routinely calculate r, as it is just the result of an averaging process; namely, of the average of a product of standardized variables: Average {(Standardized Yt)*(Standardized Xt)}, with a divisor of (n-1) instead of n, where n is the common number of X, Y pairs. When a variable is standardized, it has a zero mean and a unit standard deviation, which is useful for making comparisons and correlations between variables that have very different sizes or scales of measurement. A standardized value is obtained by subtracting the sample mean from the data and dividing by the sample standard deviation.
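The averaging formula described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pearson_r` and the small demo series are made up for this example and are not taken from the spreadsheet in the article.

```python
import numpy as np

def pearson_r(y, x):
    """Ordinary correlation as the averaged product of standardized values."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    # Standardize: subtract the sample mean and divide by the sample
    # standard deviation (ddof=1 uses the n-1 divisor).
    zy = (y - y.mean()) / y.std(ddof=1)
    zx = (x - x.mean()) / x.std(ddof=1)
    # Sum of the products divided by (n - 1) instead of n.
    return float(np.sum(zy * zx) / (n - 1))

# Made-up illustrative values, not data from the article.
y_demo = [12.0, 15.0, 11.0, 18.0, 14.0, 20.0]
x_demo = [3.1, 3.9, 2.8, 4.6, 3.5, 5.2]
r_demo = pearson_r(y_demo, x_demo)
print(round(r_demo, 4))
```

The result should agree with a library routine such as `np.corrcoef(y_demo, x_demo)[0, 1]`, which is a convenient cross-check on any hand-built spreadsheet calculation.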

The table shows a spreadsheet calculation using standardized variables for annual housing starts and mortgage rates series. The coefficient can vary between +1 and -1, so r = -0.20 suggests a weak negative association between housing starts and mortgage rates. Although the product moment correlation coefficient for the housing starts versus mortgage rates data is only about -0.20, the correlation coefficient for the respective annual changes in these variables turns out to be -0.57. Both are negative, as expected, but the latter reflects a much stronger linear association. This suggests that the strength of the relationship between housing starts and mortgage rates is reflected in their respective growth rates rather than in their levels.
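The comparison of levels with annual changes can be reproduced mechanically as follows. This is a sketch only: the two series below are invented for illustration and are not the housing starts and mortgage rates data behind the figures quoted above, and "annual changes" is assumed here to mean simple first differences (percentage changes would be computed analogously).

```python
import numpy as np

# Hypothetical annual series, invented for illustration only.
starts = np.array([1.50, 1.62, 1.45, 1.70, 1.55, 1.80, 1.68, 1.95, 1.82, 2.05])
rates = np.array([7.0, 6.6, 7.4, 6.9, 7.6, 7.0, 7.7, 7.1, 7.9, 7.4])

# Correlation between the levels of the two series.
r_levels = np.corrcoef(starts, rates)[0, 1]

# Correlation between the year-over-year changes (first differences).
d_starts = np.diff(starts)
d_rates = np.diff(rates)
r_changes = np.corrcoef(d_starts, d_rates)[0, 1]

print(f"levels:  r = {r_levels:.2f}")
print(f"changes: r = {r_changes:.2f}")
```

Differencing removes any shared trend in the levels, so the two coefficients can differ substantially in size, and even in sign, for the same pair of series.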

The product moment correlation coefficient is a measure of linear association between two variables. An outlier-resistant measure of correlation, r*(SSD), explained later, comes up with a value of 0.9 for the level, as that represents the linear relationship shown in the bulk of the data. Contrasting r with r*(SSD) for the same sets of numbers may reveal deviations from linearity due to outlying or non-typical data.

The forecaster needs to periodically review the underlying data for outlier(s) in the patterns. An outlier may not necessarily appear to be visually extreme from the bulk of the data in these situations. In routine practice, a forecaster needs to calculate both estimates of correlation. If the numbers are close, the forecaster can report the classical r. If deemed too different, the forecaster should delve deeper into the data underlying the calculations.

The Need for Outlier-Resistance in Correlation Analysis. One outlier-resistant estimator of correlation, known as r*(SSD), is less affected by outliers than the ordinary correlation coefficient r. It is derived from the standardized sums and differences of two variables, say Y and X, as introduced in the Biometrika paper “Robust estimation and outlier detection with correlation coefficients” by Susan Devlin et al. (1975).

The first step in deriving r*(SSD) is to standardize both Y and X robustly by constructing two new variables Ỹ and X̃: Ỹ = (Y - Y*)/SY* and X̃ = (X - X*)/SX*, where Y* and X* are robust estimates of location and SY* and SX* are robust estimates of scale. Now, let Z1 = Ỹ + X̃ and Z2 = Ỹ - X̃, the sum and difference vectors, respectively. Then the robust variances of the sum vector Z1 and the difference vector Z2 are calculated; they are denoted by V+* and V-*, respectively.

The variances are used in the calculation of the robust correlation estimate, given by r*(SSD) = (V+* - V-*) / (V+* + V-*). The justification for this formula can be seen by inspecting the formula for the variance of the sum of the two robustly standardized variables Ỹ and X̃: Var(Z1) = Var(Ỹ) + Var(X̃) + 2 Cov(Ỹ, X̃), where Cov denotes the covariance between Ỹ and X̃. Since Ỹ and X̃ are standardized, centered about zero with unit scale, the expected variance of Z1 is approximately Var(Z1) ≈ 1 + 1 + 2ρ(Ỹ, X̃) = 2(1 + ρ), where ρ is the theoretical correlation between Ỹ and X̃. Similarly, for Z2, Var(Z2) = Var(Ỹ) + Var(X̃) - 2 Cov(Ỹ, X̃) ≈ 1 + 1 - 2ρ(Ỹ, X̃) = 2(1 - ρ). Notice that the expression [Var(Z1) - Var(Z2)] / [Var(Z1) + Var(Z2)] ≈ [2(1 + ρ) - 2(1 - ρ)] / [2(1 + ρ) + 2(1 - ρ)] = ρ. Some robust estimates of the (square root of the) variance required in the formula for r*(SSD) include the Median Absolute Deviation from the median (MdAD) and the InterQuartile Range (IQR). This approach is documented in my book entitled Change&Chance Embraced: Achieving Agility with Smarter Forecasting in the Supply Chain (available through Amazon.com).
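A minimal Python sketch of the r*(SSD) calculation follows. It rests on several assumptions that are not specified in the text: the median is used as the robust location estimate, the MdAD is used as the robust scale estimate throughout (the IQR would be another choice), and the function names `mdad` and `r_ssd` are hypothetical. The simulated data only mimic the setup of Exhibit 1 and are not the original sample.

```python
import numpy as np

def mdad(v):
    """Median absolute deviation from the median (MdAD), unscaled."""
    v = np.asarray(v, dtype=float)
    return float(np.median(np.abs(v - np.median(v))))

def r_ssd(y, x):
    """Robust correlation from standardized sums and differences.

    Assumes median for location and MdAD for scale, and that neither
    input has an MdAD of zero. Any constant consistency factor on the
    MdAD cancels in the final ratio, so none is applied here.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Step 1: robust standardization of each variable.
    y_t = (y - np.median(y)) / mdad(y)
    x_t = (x - np.median(x)) / mdad(x)
    # Step 2: sum and difference vectors.
    z1 = y_t + x_t
    z2 = y_t - x_t
    # Step 3: robust variances (squared robust scales) of Z1 and Z2.
    v_plus = mdad(z1) ** 2
    v_minus = mdad(z2) ** 2
    # Step 4: r*(SSD) = (V+* - V-*) / (V+* + V-*).
    return (v_plus - v_minus) / (v_plus + v_minus)

# Simulated sample: 60 bivariate normal points with rho = 0.9,
# then one point moved to become an outlier (positions are arbitrary).
rng = np.random.default_rng(0)
cov = [[1.0, 0.9], [0.9, 1.0]]
sample = rng.multivariate_normal([0.0, 0.0], cov, size=60)
y_sim, x_sim = sample[:, 0].copy(), sample[:, 1].copy()
y_sim[0], x_sim[0] = -2.5, 2.5

r_classical = float(np.corrcoef(y_sim, x_sim)[0, 1])
r_robust = r_ssd(y_sim, x_sim)
print(f"r = {r_classical:.2f}, r*(SSD) = {r_robust:.2f}")
```

Reporting both numbers side by side supports the routine check described earlier: when they are close, the classical r can be reported, and when they differ noticeably, the underlying data deserve a closer look.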

Hans Levenbach, PhD is Executive Director, CPDF Training and Certification Programs. He conducts hands-on Professional Development Workshops on Demand Forecasting for multi-national supply chain companies worldwide.

Dr. Levenbach is the author of a business forecasting and demand planning book entitled Change&Chance Embraced: Achieving Agility with Smarter Forecasting in the Supply Chain. The book has been updated with the new LZI method for intermittent demand forecasting in the Supply Chain.

With endorsement from the International Institute of Forecasters (IIF), he created CPDF, the first IIF certification curriculum for the professional development of demand forecasters, and has conducted numerous hands-on Professional Development Workshops for Demand Planners and Operations Managers in multi-national supply chain companies worldwide. Hans is an elected Fellow, Past President, former Treasurer, and former member of the Board of Directors of the International Institute of Forecasters.

The 2021 CPDF Workshop Manual is available for self-study, online workshops, or in-house professional development courses.

He is group manager of the LinkedIn groups (1) Demand Forecaster Training and Certification, Blended Learning, Predictive Visualization, and (2) New Product Forecasting and Innovation Planning, Cognitive Modeling, Predictive Visualization.

This type of analysis is documented in my recently updated paperback book, Change&Chance Embraced, Achieving Agility with Smarter Forecasting in the Supply Chain, available on Amazon worldwide.
