How to Detect and Correct Outliers in Correlation Analysis

In a correlation analysis, outliers can have a significant impact on the interpretation of a correlation coefficient, as well as on modeling and forecast performance. The classical (Pearson product-moment) correlation coefficient r measures the strength and direction of a linear relationship between two variables. The table shows a spreadsheet calculation using standardized variables for an annual housing starts and mortgage rates series. The coefficient can vary between +1 and -1, so r = -0.20 suggests a weak negative association between housing starts and mortgage rates:

Housing Starts and Mortgage Rates Data

Transformations

Transforming variables (e.g., using logarithmic or square root transformations) can sometimes improve the interpretation of correlations. Nowadays, spreadsheet and statistical software programs routinely calculate r, as it is just the result of an averaging process: the average of the product of standardized variables, Average{(Standardized Yt) × (Standardized Xt)}, with a divisor of (n - 1) instead of n, where n is the common number of (X, Y) pairs. When a variable is standardized, it has a zero mean and a unit standard deviation, which is useful for making comparisons and correlations between variables that have very different sizes or scales of measurement. A standardized value is obtained by subtracting the sample mean from each data value and dividing by the sample standard deviation.

Although the product moment correlation coefficient for the housing starts series versus mortgage rates data, in this example, is only about –0.20, the correlation coefficient for the respective annual changes in these variables turns out to be -0.57. Both are negative, as expected, but the latter reflects a much stronger linear association. This suggests that the strength of the relationship between housing starts and mortgage rates is better reflected in their respective growth rates, a transformation of the levels.

Visual Inspection

Create a scatter plot of the data to visually identify outliers. This can help you understand the distribution of data points and their influence on the correlation. In a scatter plot or diagram, one variable is plotted on the horizontal scale and the other is plotted on the vertical scale. Such a plot is a valuable tool for studying the relationship between a pair of variables as a preliminary step in building regression models for forecasting. When two series have a strong positive association, the scatter diagram has a pattern of points along a line of positive slope. A negative association shows up as a scatter of points along a line with negative slope.

The scatter plot depicts influence function contours for a sample of bivariate normal data with an added outlier (n = 60, r = 0.9; with the outlier, r = 0.84) (Source: data from R.A. Fisher, The Design of Experiments, 1935, digitized 2009). Outliers are data points that deviate significantly from the overall pattern of the data, and they can influence a correlation analysis in several ways:

  • Inflated or Deflated Correlation Coefficient:

Outliers can artificially inflate or deflate the correlation coefficient. If an outlier has a high leverage (i.e., it is far from the mean in one or both variables), it can have a strong influence on the slope of the regression line and, consequently, on the correlation coefficient.

  • Misleading Strength of Relationship:

Outliers may give a misleading impression of the strength of the relationship between variables. If a few extreme data points are driving the correlation, it may not accurately represent the majority of the data.

  • Masked Relationships:

Outliers can mask existing relationships between variables. The presence of outliers may obscure true patterns in the data, leading to incorrect conclusions about the nature and strength of the relationship.

  • Nonlinear Relationships:

Common correlation analyses generally assume a linear relationship between variables. Outliers that follow a nonlinear pattern may distort the correlation coefficient, as correlation is sensitive to linear relationships.

To address the impact of outliers in correlation analysis, consider the following steps:

Use Outlier-Resistant Measures

Consider using outlier-resistant correlation coefficients, such as r*(SSD). This outlier-resistant measure of correlation, explained below, yields r*(SSD) = 0.9 for the levels, as that represents the linear relationship shown in the bulk of the data. When r and r*(SSD) differ markedly for the same set of numbers, the contrast may indicate departures from linearity due to outlying or non-typical data.

The forecaster needs to routinely review the underlying data for outliers in the patterns. An outlier may not necessarily appear visually extreme relative to the bulk of the data in these situations. In routine practice, a prudent forecaster calculates both estimates of correlation. If the numbers are close, the forecaster can report the classical r. If they are deemed too different, the forecaster should delve deeper into the data underlying the calculations.

One outlier-resistant estimator of correlation, known as r*(SSD), is less affected by outliers than the ordinary correlation coefficient r. It is derived from the standardized sums and differences of two variables, say Y and X, as introduced in the Biometrika paper “Robust estimation and outlier detection with correlation coefficients” by Susan Devlin et al. (1975).

The first step in deriving r*(SSD) is to standardize both Y and X robustly by constructing two new variables Ỹ and X̃, where Ỹ = (Y - Y*)/SY* and X̃ = (X - X*)/SX*, and where Y* and X* are robust estimates of location and SY* and SX* are robust estimates of scale. Now let Z1 = Ỹ + X̃ and Z2 = Ỹ - X̃, the sum and difference vectors, respectively. Then the robust variances of the sum vector Z1 and the difference vector Z2 are calculated; they are denoted by V+* and V-*, respectively.

The variances are used in the calculation of the robust correlation estimate r*(SSD), given by r*(SSD) = (V+* - V-*) / (V+* + V-*). The justification for this formula can be seen by inspecting the formula for the variance of the sum of two variables: Var(Z1) = Var(Ỹ) + Var(X̃) + 2 Cov(Ỹ, X̃), where Cov denotes the covariance between Ỹ and X̃. Since Ỹ and X̃ are standardized, centered about zero with unit scale, the expected variance of Z1 is approximately Var(Z1) ≈ 1 + 1 + 2 ρ(Ỹ, X̃) = 2(1 + ρ), where ρ is the theoretical correlation between Ỹ and X̃. Similarly, for Z2, Var(Z2) ≈ Var(Ỹ) + Var(X̃) - 2 Cov(Ỹ, X̃) = 1 + 1 - 2 ρ(Ỹ, X̃) = 2(1 - ρ). Notice that the expression [Var(Z1) - Var(Z2)] / [Var(Z1) + Var(Z2)] ≈ [2(1 + ρ) - 2(1 - ρ)] / [2(1 + ρ) + 2(1 - ρ)] = ρ. Some robust estimates of the (square root of the) variance required in the formula for r*(SSD) include the Median Absolute Deviation from the median (MdAD) and the InterQuartile Range (IQR). This approach is documented in my book Change&Chance Embraced: Achieving Agility with Smarter Forecasting in the Supply Chain (available online through Amazon websites worldwide).
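
The recipe can be sketched in a few lines of code. This is an illustrative implementation (the function names are mine, not from the Devlin et al. paper), choosing the median for robust location and the MdAD for robust scale:

```python
import statistics

def mad(v):
    """Median absolute deviation from the median: a robust scale estimate."""
    med = statistics.median(v)
    return statistics.median([abs(u - med) for u in v])

def r_ssd(x, y):
    """Robust correlation r*(SSD) from standardized sums and differences."""
    # Step 1: robustly standardize each variable (median location, MdAD scale).
    xs = [(v - statistics.median(x)) / mad(x) for v in x]
    ys = [(v - statistics.median(y)) / mad(y) for v in y]
    # Step 2: form the sum and difference vectors Z1 and Z2.
    z1 = [a + b for a, b in zip(ys, xs)]
    z2 = [a - b for a, b in zip(ys, xs)]
    # Step 3: robust variances V+* and V-*, then the SSD formula.
    v_plus, v_minus = mad(z1) ** 2, mad(z2) ** 2
    return (v_plus - v_minus) / (v_plus + v_minus)
```

Because medians and MdADs ignore extreme values, a single wild point barely moves r*(SSD), while the same point can move the classical r substantially — exactly the contrast the text recommends checking.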

Outlier Removal

Depending on the nature of the data and the study objectives, you may choose to remove outliers. However, this should be done carefully and with justification, as it can affect the validity of the analysis.

Takeaways

  • Always document any decisions made regarding outlier handling and be transparent about the methods used in your analysis.
  • Employ diagnostic tools, such as residual analysis, to identify influential observations and assess the overall fit of the model.


Paul Goodwin

Emeritus Professor at University of Bath

10 months ago

Thanks Hans!

Layton Franko

adjunct professor at Queens college

11 months ago

I think my solution to outliers would be to get more data. However, it sometimes happens that outliers are a result of specification errors, so check your model.
