Detecting Outliers in Correlation Analysis

Hans Levenbach PhD CPDF

Published: September 22, 2016

Outliers in Correlation Analysis

Outliers are a source of complexity in correlation analysis, as well as in modeling and forecast performance analysis. A single outlier can have a significant impact on a correlation coefficient. When the values of one time series (or variable) are paired with corresponding values of a related time series (or variable), the relationship between the variables can be depicted in a scatter diagram. Exhibit 1 shows a scatter diagram of 60 values simulated from a bivariate normal distribution with population correlation coefficient ρ = 0.9. One point was moved to become an outlier, and the ordinary correlation coefficient is now calculated to be 0.84. Exhibit 1 shows that, except for this single point (tagged as 1 on the curve labelled -0.06), the scatter is quite linear; in fact, with this outlier removed, the estimated correlation coefficient is 0.9 (Source: “Robust estimation and outlier detection with correlation coefficients,” Susan Devlin et al., Biometrika, 1975). Outlier-resistant and robust alternatives offer useful protection in these instances, and elsewhere when non-normal error distributions need to be accommodated.

Exhibit 1 Scatter Plot with Influence Function Contours for a Sample of Bivariate Normal Data with the Added Outlier (n = 60, r = 0.9, with outlier r = 0.84) (Source: Data from R.A. Fisher, The Design of Experiments, 1935, digitized 2009).

Linear Association and Correlation. In a scatter plot or diagram, one variable is plotted on the horizontal scale and the other is plotted on the vertical scale. Such a plot is a valuable tool for studying the relationship between a pair of variables as a preliminary step in building regression models for forecasting. When two series have a strong positive association, the scatter diagram shows a pattern of points along a line of positive slope. A negative association shows up as a scatter of points along a line with negative slope. A conventional measure of linear association between a pair of variables Y and X is the ordinary correlation coefficient r, an averaging formula that uses the sample mean and sample standard deviation of each of the two variables. It is more formally known as the Pearson product moment correlation coefficient.

Nowadays, spreadsheet and statistical software programs routinely calculate r, as it is just the result of an averaging process; namely, of the average of a product of standardized variables: Average {(Standardized Yt)*(Standardized Xt)}, with a divisor of (n-1) instead of n, where n is the common number of X, Y pairs. When a variable is standardized, it has a zero mean and a unit standard deviation, which is useful for making comparisons and correlations between variables that have very different sizes or scales of measurement. A standardized value is obtained by subtracting the sample mean from the data and dividing by the sample standard deviation.
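The averaging formula described above can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet calculation from the article: the function name `pearson_r` and the five data pairs are made up for the example, and NumPy is assumed to be available.

```python
import numpy as np

def pearson_r(y, x):
    """Ordinary correlation as the (n - 1)-divisor average of the
    products of standardized values."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    # Standardize: subtract the sample mean, divide by the sample
    # standard deviation (ddof=1 gives the n - 1 divisor).
    zy = (y - y.mean()) / y.std(ddof=1)
    zx = (x - x.mean()) / x.std(ddof=1)
    return float(np.sum(zy * zx) / (n - 1))

# Illustrative values only (hypothetical, not from the article).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r = pearson_r(y, x)
print(round(r, 4))
```

The result should agree with what `np.corrcoef(y, x)[0, 1]` returns, since both compute the same quantity.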

The table shows a spreadsheet calculation using standardized variables for annual housing starts and mortgage rates series. The coefficient can vary between +1 and -1, so r = -0.20 suggests a weak negative association between housing starts and mortgage rates. Although the product moment correlation coefficient for the housing starts versus mortgage rates data is only about -0.20, the correlation coefficient for the respective annual changes in these variables turns out to be -0.57. Both are negative, as expected, but the latter reflects a much stronger linear association. This suggests that the strength of the relationship between housing starts and mortgage rates is reflected in their respective growth rates rather than in their levels.
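The comparison of levels against annual changes can be sketched as follows. The two short series below are invented purely for illustration (they are not the housing starts and mortgage rates data from the article), and "annual changes" is assumed here to mean first differences computed with `np.diff`.

```python
import numpy as np

# Hypothetical annual series, for illustration only.
rates = np.array([7.0, 7.5, 7.2, 8.0, 7.8, 8.5, 8.1, 9.0])
starts = np.array([1.50, 1.44, 1.56, 1.46, 1.58, 1.51, 1.67, 1.59])

# Correlation between the levels of the two series.
r_levels = float(np.corrcoef(starts, rates)[0, 1])

# Correlation between the year-over-year changes (first differences).
r_changes = float(np.corrcoef(np.diff(starts), np.diff(rates))[0, 1])

print(f"levels:  {r_levels:.2f}")
print(f"changes: {r_changes:.2f}")
```

In this made-up example both series drift upward, so the levels say little about the year-to-year relationship, while the differences expose the negative association. Percentage changes could be substituted for first differences if growth rates are the quantity of interest.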

The product moment correlation coefficient is a measure of linear association between two variables. An outlier-resistant measure of correlation, r*(SSD), explained later, comes up with a value of 0.9 for the level, as that represents the linear relationship shown in the bulk of the data. A marked difference between r and r*(SSD) for the same set of numbers may indicate departures from linearity due to outlying or atypical data.

The forecaster needs to periodically review the underlying data for outlier(s) in the patterns. An outlier may not necessarily appear to be visually extreme from the bulk of the data in these situations. In routine practice, a forecaster needs to calculate both estimates of correlation. If the numbers are close, the forecaster can report the classical r. If deemed too different, the forecaster should delve deeper into the data underlying the calculations.

The Need for Outlier-Resistance in Correlation Analysis. One outlier-resistant estimator of correlation, known as r*(SSD), is less affected by outliers than the ordinary correlation coefficient r. It is derived from the standardized sums and differences of two variables, say Y and X, as introduced in the Biometrika paper “Robust estimation and outlier detection with correlation coefficients,” by Susan Devlin et al. (1975).

The first step in deriving r*(SSD) is to standardize both Y and X robustly by constructing two new variables Ỹ and X̃: Ỹ = (Y - Y*)/SY* and X̃ = (X - X*)/SX*, where Y* and X* are robust estimates of location and SY* and SX* are robust estimates of scale. Now let Z1 = Ỹ + X̃ and Z2 = Ỹ - X̃ be the sum and difference vectors, respectively. Then the robust variances of the sum vector Z1 and the difference vector Z2 are calculated; they are denoted by V+* and V-*, respectively.
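The robust standardization and the sum and difference vectors can be sketched as below. The article does not fix a particular choice of robust location and scale at this point, so this sketch assumes the median for location and the median absolute deviation (rescaled by 1.4826) for scale; the function name `robust_standardize` and the data are hypothetical.

```python
import numpy as np

def robust_standardize(v):
    """Center at the median and scale by the rescaled median absolute
    deviation (an assumed choice of robust location and scale)."""
    v = np.asarray(v, dtype=float)
    center = np.median(v)
    scale = 1.4826 * np.median(np.abs(v - center))
    if scale == 0:
        raise ValueError("robust scale is zero; more than half the values are identical")
    return (v - center) / scale

# Illustrative values only; the last y value plays the role of an outlier.
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0, 30.0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

y_t = robust_standardize(y)   # robustly standardized Y
x_t = robust_standardize(x)   # robustly standardized X

z1 = y_t + x_t                # sum vector Z1
z2 = y_t - x_t                # difference vector Z2
print(z1)
print(z2)
```

Because the median and the median absolute deviation barely react to the single extreme value, the bulk of the standardized points keep a sensible scale and only the outlying pair stands out in Z1 and Z2.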

The variances are used in the calculation of the robust correlation estimate r*(SSD), given by r*(SSD) = (V+* - V-*) / (V+* + V-*). The justification for this formula can be seen by inspecting the formula for the variance of the sum of two variables: Var(Z1) = Var(Ỹ) + Var(X̃) + 2 Cov(Ỹ, X̃), where Cov denotes the covariance between Ỹ and X̃. Since Ỹ and X̃ are standardized, centered about zero with unit scale, the expected variance of Z1 is approximately Var(Z1) ≈ 1 + 1 + 2ρ(Ỹ, X̃) = 2(1 + ρ), where ρ is the theoretical correlation between Ỹ and X̃. Similarly, for Z2, Var(Z2) = Var(Ỹ) + Var(X̃) - 2 Cov(Ỹ, X̃) ≈ 1 + 1 - 2ρ(Ỹ, X̃) = 2(1 - ρ). Notice that the expression [Var(Z1) - Var(Z2)] / [Var(Z1) + Var(Z2)] ≈ [2(1 + ρ) - 2(1 - ρ)] / [2(1 + ρ) + 2(1 - ρ)] = 4ρ / 4 = ρ.

Some robust estimates of the (square root of the) variance required in the formula for r*(SSD) include the Median Absolute Deviation from the median (MdAD) and the InterQuartile Range (IQR). This approach is documented in my book entitled Change&Chance Embraced: Achieving Agility with Smarter Forecasting in the Supply Chain (available through Amazon.com).
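Putting the steps together, a minimal sketch of r*(SSD) might look like the following. It assumes the median as the robust location and the MdAD (rescaled by 1.4826) as the robust scale, which is only one of the choices mentioned above; the names `mad_scale` and `r_ssd`, the random seed, and the position of the planted outlier are all assumptions made for the illustration and do not reproduce Exhibit 1.

```python
import numpy as np

def mad_scale(v):
    """Median absolute deviation from the median, rescaled by 1.4826 so
    it is comparable to a standard deviation for normal data."""
    v = np.asarray(v, dtype=float)
    return 1.4826 * np.median(np.abs(v - np.median(v)))

def r_ssd(y, x):
    """Robust correlation from standardized sums and differences:
    (V+ - V-) / (V+ + V-), with MdAD-based robust variances."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    sy, sx = mad_scale(y), mad_scale(x)
    if sy == 0 or sx == 0:
        raise ValueError("robust scale of an input is zero")
    y_t = (y - np.median(y)) / sy
    x_t = (x - np.median(x)) / sx
    v_plus = mad_scale(y_t + x_t) ** 2    # robust variance of Z1
    v_minus = mad_scale(y_t - x_t) ** 2   # robust variance of Z2
    if v_plus + v_minus == 0:
        raise ValueError("both robust variances are zero")
    return float((v_plus - v_minus) / (v_plus + v_minus))

# Simulated bivariate normal sample (rho = 0.9, n = 60) with one point
# moved to become an outlier; seed and outlier position are arbitrary.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=60)
sample[0] = [-2.5, 2.5]
y, x = sample[:, 0], sample[:, 1]

r_classical = float(np.corrcoef(y, x)[0, 1])
r_robust = r_ssd(y, x)
print(f"r       = {r_classical:.2f}")
print(f"r*(SSD) = {r_robust:.2f}")
```

Reporting both numbers side by side supports the routine check of comparing the classical and the robust estimate. The 1.4826 constant cancels in the final ratio, so it affects only the intermediate scales, not r*(SSD) itself.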

Hans Levenbach, PhD is Executive Director, CPDF Training and Certification Programs. He conducts hands-on Professional Development Workshops on Demand Forecasting for multi-national supply chain companies worldwide.

Dr. Levenbach is the author of a business forecasting and demand planning book entitled Change&Chance Embraced: Achieving Agility with Smarter Forecasting in the Supply Chain. The book has been updated with the new LZI method for intermittent demand forecasting in the Supply Chain.

With endorsement from the International Institute of Forecasters (IIF), he created CPDF, the first IIF certification curriculum for the professional development of demand forecasters, and has conducted numerous hands-on Professional Development Workshops for Demand Planners and Operations Managers in multi-national supply chain companies worldwide. Hans is an elected Fellow, Past President and former Treasurer, and former member of the Board of Directors of the International Institute of Forecasters.

The 2021 CPDF Workshop Manual is available for self-study, online workshops, or in-house professional development courses.

He is group manager of the LinkedIn groups (1) Demand Forecaster Training and Certification, Blended Learning, Predictive Visualization, and (2) New Product Forecasting and Innovation Planning, Cognitive Modeling, Predictive Visualization.

Hans Levenbach PhD CPDF

4 years ago

This type of analysis is documented in my recently updated paperback book, Change&Chance Embraced, Achieving Agility with Smarter Forecasting in the Supply Chain, available on Amazon worldwide.

Pranav R.

Pharma | Business Insights | Consulting | Analytics | Forecasting

8 years ago

Thank you! This actually relates to a question I recently faced
