How to Detect and Correct Outliers in Correlation Analysis

In a correlation analysis, outliers can have a significant impact on the interpretation of a correlation coefficient, as well as on modeling and forecast performance. The classical (Pearson product-moment) correlation coefficient r measures the strength and direction of a linear relationship between two variables. The table shows a spreadsheet calculation using standardized variables for an annual housing starts and mortgage rates series. The coefficient can vary between +1 and -1, so r = -0.20 suggests a weak negative association between housing starts and mortgage rates:

Housing Starts and Mortgage Rates Data

Transformations

Transforming variables (e.g., using logarithmic or square root transformations) can sometimes improve the interpretation of correlations. Nowadays, spreadsheet and statistical software programs routinely calculate r, as it is just the result of an averaging process: the average of the product of standardized variables, Average{(Standardized Yt) × (Standardized Xt)}, with a divisor of (n - 1) instead of n, where n is the common number of (X, Y) pairs. When a variable is standardized, it has a zero mean and a unit standard deviation, which is useful for making comparisons and correlations between variables that have very different sizes or scales of measurement. A standardized value is obtained by subtracting the sample mean from each data value and dividing by the sample standard deviation.

Although the product moment correlation coefficient for the housing starts series versus mortgage rates data, in this example, is only about –0.20, the correlation coefficient for the respective annual changes in these variables turns out to be -0.57. Both are negative, as expected, but the latter reflects a much stronger linear association. This suggests that the strength of the relationship between housing starts and mortgage rates is better reflected in their respective growth rates, a transformation of the levels.

Visual Inspection

Create a scatter plot of the data to visually identify outliers. This can help you understand the distribution of data points and their influence on the correlation. In a scatter plot or diagram, one variable is plotted on the horizontal scale and the other is plotted on the vertical scale. Such a plot is a valuable tool for studying the relationship between a pair of variables as a preliminary step in building regression models for forecasting. When two series have a strong positive association, the scatter diagram has a pattern of points along a line of positive slope. A negative association shows up as a scatter of points along a line with negative slope.

The scatter plot depicts influence function contours for a sample of bivariate normal data with an added outlier (n = 60, r = 0.9; with the outlier, r = 0.84) (Source: data from R.A. Fisher, The Design of Experiments, 1935, digitized 2009). Outliers are data points that deviate significantly from the overall pattern of the data, and they can influence a correlation analysis in several ways:

  • Inflated or Deflated Correlation Coefficient:

Outliers can artificially inflate or deflate the correlation coefficient. If an outlier has a high leverage (i.e., it is far from the mean in one or both variables), it can have a strong influence on the slope of the regression line and, consequently, on the correlation coefficient.

  • Misleading Strength of Relationship:

Outliers may give a misleading impression of the strength of the relationship between variables. If a few extreme data points are driving the correlation, it may not accurately represent the majority of the data.

  • Masked Relationships:

Outliers can mask existing relationships between variables. The presence of outliers may obscure true patterns in the data, leading to incorrect conclusions about the nature and strength of the relationship.

  • Nonlinear Relationships:

Common correlation analyses generally assume a linear relationship between variables. Outliers that follow a nonlinear pattern may distort the correlation coefficient, as correlation is sensitive to linear relationships.

To address the impact of outliers in correlation analysis, consider the following steps:

Use Outlier-Resistant Measures

Consider using outlier-resistant correlation coefficients, such as r*(SSD). This outlier-resistant measure of correlation, explained below, yields r*(SSD) = 0.9 for the levels, as that represents the linear relationship shown in the bulk of the data. When r and r*(SSD) differ markedly for the same set of numbers, the contrast may indicate departures from linearity due to outlying or non-typical data.

The forecaster needs to routinely review the underlying data for outliers in the patterns. An outlier may not necessarily appear visually extreme relative to the bulk of the data in these situations. In routine practice, a prudent forecaster calculates both estimates of correlation. If the numbers are close, the forecaster can report the classical r. If they are deemed too different, the forecaster should delve deeper into the data underlying the calculations.

One outlier-resistant estimator of correlation, known as r*(SSD), is less affected by outliers than the ordinary correlation coefficient r. It is derived from the standardized sums and differences of two variables, say Y and X, as introduced in the Biometrika paper “Robust estimation and outlier detection with correlation coefficients” by Susan Devlin et al. (1975).

The first step in deriving r*(SSD) is to standardize both Y and X robustly by constructing two new variables Ỹ and X̃, where Ỹ = (Y - Y*)/SY* and X̃ = (X - X*)/SX*, and where Y* and X* are robust estimates of location and SY* and SX* are robust estimates of scale. Now let Z1 = Ỹ + X̃ and Z2 = Ỹ - X̃, the sum and difference vectors, respectively. Then the robust variances of the sum vector Z1 and the difference vector Z2 are calculated; they are denoted by V+* and V-*, respectively.

The variances are used in the calculation of the robust correlation estimate r*(SSD), given by r*(SSD) = (V+* - V-*) / (V+* + V-*). The justification for this formula can be seen by inspecting the formula for the variance of the sum of two variables: Var(Z1) = Var(Ỹ) + Var(X̃) + 2 Cov(Ỹ, X̃), where Cov denotes the covariance between Ỹ and X̃. Since Ỹ and X̃ are standardized, centered about zero with unit scale, the expected variance of Z1 is approximately Var(Z1) ≈ 1 + 1 + 2 ρ(Ỹ, X̃) = 2(1 + ρ), where ρ is the theoretical correlation between Ỹ and X̃. Similarly, for Z2, Var(Z2) ≈ Var(Ỹ) + Var(X̃) - 2 Cov(Ỹ, X̃) = 1 + 1 - 2 ρ(Ỹ, X̃) = 2(1 - ρ). Notice that the expression [Var(Z1) - Var(Z2)] / [Var(Z1) + Var(Z2)] ≈ [2(1 + ρ) - 2(1 - ρ)] / [2(1 + ρ) + 2(1 - ρ)] = ρ. Some robust estimates of the (square root of the) variance required in the formula for r*(SSD) include the Median Absolute Deviation from the median (MdAD) and the InterQuartile Range (IQR). This approach is documented in my book Change&Chance Embraced: Achieving Agility with Smarter Forecasting in the Supply Chain (available online through Amazon websites worldwide).
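
The recipe can be sketched in a few lines of code. This is an illustrative implementation (the function names are mine, not from the Devlin et al. paper), choosing the median for robust location and the MdAD for robust scale:

```python
import statistics

def mad(v):
    """Median absolute deviation from the median: a robust scale estimate."""
    med = statistics.median(v)
    return statistics.median([abs(u - med) for u in v])

def r_ssd(x, y):
    """Robust correlation r*(SSD) from standardized sums and differences."""
    # Step 1: robustly standardize each variable (median location, MdAD scale).
    xs = [(v - statistics.median(x)) / mad(x) for v in x]
    ys = [(v - statistics.median(y)) / mad(y) for v in y]
    # Step 2: form the sum and difference vectors Z1 and Z2.
    z1 = [a + b for a, b in zip(ys, xs)]
    z2 = [a - b for a, b in zip(ys, xs)]
    # Step 3: robust variances V+* and V-*, then the SSD formula.
    v_plus, v_minus = mad(z1) ** 2, mad(z2) ** 2
    return (v_plus - v_minus) / (v_plus + v_minus)
```

Because medians and MdADs ignore extreme values, a single wild point barely moves r*(SSD), while the same point can move the classical r substantially — exactly the contrast the text recommends checking.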

Outlier Removal

Depending on the nature of the data and the study objectives, you may choose to remove outliers. However, this should be done carefully and with justification, as it can affect the validity of the analysis.

Takeaways

  • Always document any decisions made regarding outlier handling and be transparent about the methods used in your analysis.
  • Employ diagnostic tools, such as residual analysis, to identify influential observations and assess the overall fit of the model.


Paul Goodwin

Emeritus Professor at University of Bath

10 months ago

Thanks Hans!

Layton Franko

adjunct professor at Queens college

11 months ago

I think my solution to outliers would be to get more data. However, it sometimes happens that outliers are a result of specification errors, so check your model.
