AIML 11- Choosing the appropriate correlation coefficient

An overview of the concept of correlation and its usage in data science projects.

Written By: Yuval Cohen, David Grabois

Blog Post: Medium Blog

Follow me: Alok Tiwari, PhD, IIT-BHU

Introduction:

One of the most fundamental questions in statistical learning concerns the relationship between variables. Estimating a measure of association between two variables helps us make the right decisions in everyday data science problems, for several reasons:

  1. Reduce model complexity — by implementing correlation-based feature selection.
  2. Reduce multicollinearity — by removing or combining highly correlated variables.
  3. Reduce model uncertainty and infer insights.

This post deals with correlation coefficients. Correlation is the statistical term describing the degree of relationship between any two variables. Since relying on the wrong correlation coefficient might lead to critical mistakes, it is essential to choose the correct correlation coefficient for our problem.

We will examine different correlation coefficients, considering different variable types (i.e., numerical, ordinal, and categorical) and types of relationships (linear/nonlinear).

The intuition behind correlation and common mistakes:

Correlation is a statistical relationship, whether causal or not, between any two random variables.

A correlation coefficient should quantify the strength of the relationship and answer the following questions:

  1. Is there a statistically significant relationship between the two variables?
  2. What is the direction of the relationship (if it exists)?

Common mistakes:

Measuring correlation is tricky since there are many correlation coefficients. Furthermore, each coefficient suits different setups and assumptions.

A common mistake is to use the Pearson coefficient in every setup. It’s important to emphasize that using the Pearson correlation coefficient on a non-linear relationship between two variables is fundamentally wrong.

For example, in the plot below, we can observe a strong non-linear relationship between x and y. However, the Pearson coefficient of x and y is only -0.01, which is very low considering the clear non-linear relationship. Therefore, we should use a different, more suitable coefficient for this setup to identify the relationship.


Fig 1. Pearson coefficient of x and y is -0.01
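For illustration, here is a minimal sketch that generates data of this kind (the exact data behind Fig 1 is not available, so we assume a similar quadratic shape):

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 1000)
y = x**2 + rng.normal(0, 0.05, 1000)  # clear non-linear (quadratic) dependence
print(np.corrcoef(x, y)[0, 1])        # Pearson's r comes out close to 0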

The following sections are divided by combinations of variable types, such as numerical-numerical, numerical-ordinal, etc.

Possible types of variables are numerical, categorical, and ordinal. For each setup, we provide at least one appropriate correlation coefficient, describe it, and provide its code implementation.

We use a simplified version of the PetFinder dataset as a running example. Each row in the PetFinder dataset describes a pet, and each column represents a variable.

import pandas as pd
df = pd.read_csv('petfinder-mini.csv')

Following is a description of this dataset.

[Image: table describing the dataset variables]


Numerical vs Numerical:

Setup: X and Y are numerical variables.

Pearson’s r

Pearson’s r is the ratio between the covariance of two variables and the product of their standard deviations. The result ranges between −1 and 1: 1/−1 indicates a perfect positive/negative linear relationship, and 0 means no linear relationship.

Assumptions:

The primary assumption of the model is a “linear relationship”. We can scatter plot the data (in two dimensions) and visually test if it has a linear relationship.
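For example, a quick visual check (a sketch using the PetFinder columns introduced above):

import matplotlib.pyplot as plt

plt.scatter(df['PhotoAmt'], df['Fee'], alpha=0.3)  # inspect the shape of the relationship
plt.xlabel('PhotoAmt')
plt.ylabel('Fee')
plt.show()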

Pros and cons:

The main drawback is that it ignores nonlinear relationship forms (e.g., a quadratic form of correlation, etc.). On the other hand, the advantages are low time complexity and model simplicity.

Pearson’s r Equation:

r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Python implementation:

df['PhotoAmt'].corr(df['Fee'], method='pearson')        
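To connect the formula to the code, here is a minimal from-scratch computation (a sketch; pandas’ corr additionally drops pairs with missing values):

import numpy as np

def pearson_r(x, y):
    xc = x - np.mean(x)   # center x
    yc = y - np.mean(y)   # center y
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

pearson_r(df['PhotoAmt'].to_numpy(), df['Fee'].to_numpy())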

Spearman’s rank correlation coefficient

Spearman’s rank correlation coefficient determines the strength and direction of the monotonic relationship between two variables. This coefficient can be calculated between any mixture of numerical and ordinal variables. The result ranges between −1 and 1, similar to Pearson’s r.

Assumptions:

The primary assumption of the model is a “monotonic relationship”. We can scatter plot the data and visually test if it has a monotonic relationship. A function is called monotonic if and only if it is either entirely non-increasing or entirely non-decreasing.

Pros and cons:

The main drawback of Spearman’s rank correlation is that it ignores non-monotonic relationship forms. Its advantages are low time complexity, model simplicity, and the ability to be calculated on numerical or ordinal variables.

Equation:

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

Where d_i is the difference between the two ranks of the i-th observation and n is the sample size (this form assumes no tied ranks).

Python implementation:

df['PhotoAmt'].corr(df['Fee'], method='spearman')        
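Equivalently, Spearman’s coefficient is Pearson’s r computed on the ranks of the data, which we can verify directly:

ranks = df[['PhotoAmt', 'Fee']].rank()                  # average ranks assigned to ties
ranks['PhotoAmt'].corr(ranks['Fee'], method='pearson')  # matches method='spearman' above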

Distance correlation

Distance correlation is a fairly new correlation measurement introduced in 2005 by Gábor J. Székely. It measures the dependency between any two paired random vectors of arbitrary, not necessarily equal, dimension. The distance correlation coefficient is zero if and only if the random vectors are independent, and it measures both linear and nonlinear association between two random variables or random vectors.

Assumptions:

Distance correlation assumes the data are i.i.d. and that both X and Y have finite first moments.

Pros and cons:

The most significant drawbacks of distance correlation:

  1. The distance correlation is always non-negative and does not indicate the direction of the relationship.
  2. High computational cost.

The advantages:

  1. Handling various types of relationships between variables.
  2. Distance correlation of zero implies independence, unlike other correlation coefficients.
  3. It can be calculated on variables of unequal dimensions.

Equation:

\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}

\mathrm{dCov}^2_n(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k} B_{j,k}

where A and B are the double-centered pairwise-distance matrices of the X and Y samples.

Python implementation:

import dcor
dcor.distance_correlation(df['PhotoAmt'], df['Fee'])        
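To see the advantage over Pearson’s r, a small sketch on simulated quadratic data (the same assumed shape as in Fig 1):

import numpy as np
import dcor

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x**2 + rng.normal(0, 0.05, 1000)

print(np.corrcoef(x, y)[0, 1])           # Pearson: close to 0
print(dcor.distance_correlation(x, y))   # distance correlation: clearly positive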

Categorical vs Categorical:

Setup: X and Y are categorical variables with two or more categories.

Goodman-Kruskal’s lambda

Goodman–Kruskal’s lambda is an asymmetric measure of proportional reduction in error in contingency-table analysis. It measures the relationship by asking how much knowing one variable helps us predict the dependent variable better than a naive prediction. The lambda value ranges from 0 to 1, where 0 indicates no association and 1 indicates perfect association.

Assumptions:

Categorical vs categorical setup

Pros and cons:

A very intuitive measure that tells us the strength of association without making strong assumptions. However, this measure fails when the dependent variable has the same dominant class at every level of the independent variable.

Equation:

We will use the following Wikipedia example:

[Table: contingency table of Relationship Status (Married/Unmarried) by Blood Pressure, from Wikipedia]

Since this measure is asymmetric, we will get different results for different directions of calculations. Let’s consider “Relationship Status” to be Y (dependent variable) and “Blood Pressure” to be X. We first naively predict all Y values to be the most frequent category in the given sample. Therefore, we will predict y_i to be “Married” for each i =1…n, and refer to the number of mistakes as E1.

Then, given the value of x_i, we predict y_i to be the most frequent category within the corresponding x_i level (“Married” when Blood Pressure is “High” and “Unmarried” otherwise), and we do that for every i = 1…n. We refer to the number of mistakes as E2.

The lambda value is defined to be the percent of error reduction.

\lambda = \frac{E_1 - E_2}{E_1}

So, in our example, we get:

[Image: lambda computed from the E1 and E2 counts of the example table]

Suppose we consider “Blood Pressure” as the dependent variable instead. In that case, lambda will equal zero due to the dominance of “Normal” at every level of “Relationship Status”, which results in no error reduction in our calculation.

Python implementation:

from pypair.association import categorical_categorical
categorical_categorical(df['Type'], df['Color2'], 'gk_lambda')        
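To make the E1/E2 logic concrete, here is a from-scratch sketch that should agree with the pypair result (assuming the dependent variable has a modal class, so E1 > 0):

import pandas as pd

def gk_lambda(x, y):
    # y is the dependent variable we try to predict
    ct = pd.crosstab(x, y)
    n = ct.values.sum()
    e1 = n - ct.sum(axis=0).max()   # errors when always predicting the overall mode of y
    e2 = n - ct.max(axis=1).sum()   # errors when predicting the mode of y per level of x
    return (e1 - e2) / e1

gk_lambda(df['Type'], df['Color2'])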

Cramér’s v

Cramér’s v is a symmetrical measure of association for categorical variables. Cramér’s v is based on Pearson’s chi-squared statistic, and its value ranges from 0 to 1, where 0 indicates no association and 1 indicates perfect association.

Assumptions:

Categorical vs. categorical setup. The observations are independent of one another, both within and across groups.

Pros and cons:

Cramér’s v is a common choice to measure the association of two categorical variables. Since it is based on a chi-squared statistic, it makes no assumptions about the distribution of the population. However, it is not appropriate for dependent data.

Equation:

V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\, r - 1)}}

where n is the total sample size and r and k are the numbers of rows and columns of the contingency table.

Python implementation:

import scipy.stats as stats
import numpy as np
import pandas as pd

def cramers_v(var1, var2):
    data = pd.crosstab(var1, var2).values      # contingency table of observed counts
    chi_2 = stats.chi2_contingency(data)[0]    # Pearson's chi-squared statistic
    n = data.sum()                             # total sample size
    phi_2 = chi_2 / n
    r, k = data.shape                          # rows and columns of the table
    return np.sqrt(phi_2 / min((k - 1), (r - 1)))

cramers_v(df['Type'], df['Color2'])

Numerical vs Categorical:

Setup: Y is a numerical variable; X is a categorical variable with two or more categories.

Point-biserial correlation coefficient:

The point-biserial correlation coefficient ranges between −1 and +1. Values close to ±1 indicate a strong positive/negative relationship, and values close to zero indicate no relationship between the two variables.

Calculation of the point-biserial correlation coefficient is accomplished by coding the two levels of the binary variable “0” and “1” and obtaining the coefficient between the continuous variable (Y) and this coded binary variable.
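In other words, the point-biserial coefficient is simply Pearson’s r applied to the 0/1-coded variable, which we can verify:

import scipy.stats as stats

is_cat = (df['Type'] == 'Cat').astype(int)       # code the binary variable as 0/1
stats.pointbiserialr(is_cat, df['PhotoAmt'])[0]  # equals stats.pearsonr(is_cat, df['PhotoAmt'])[0]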

Assumptions:

Numerical and binary variable (a categorical variable with only two categories). If the categorical variable has more than two categories, we can overcome this by creating a series of dummy variables for the categorical variable (e.g., one-hot encoding) and calculating the point-biserial correlation between the numerical variable and each dummy variable, as the sketch below shows. Of course, this trick will produce many correlation scores, so we might want to use the max value, or some other aggregation of those scores, as our final measure of the relationship.
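A sketch of this trick on the PetFinder data (using 'Color2' as the assumed multi-category variable):

import pandas as pd
import scipy.stats as stats

dummies = pd.get_dummies(df['Color2'], dtype=int)   # one dummy column per category
scores = {level: stats.pointbiserialr(dummies[level], df['PhotoAmt'])[0]
          for level in dummies.columns}
max(scores.values(), key=abs)                       # e.g., keep the strongest association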

Equation:

r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}

where M_1 and M_0 are the means of Y in the groups coded 1 and 0, n_1 and n_0 are the group sizes, n = n_1 + n_0, and s_n is the standard deviation of Y.

Python implementation:

import scipy.stats as stats
point_biserial, p_value = stats.pointbiserialr((df['Type'] == 'Cat'), df['PhotoAmt'])        

Categorical vs Ordinal:

Setup: Y is a categorical variable with two or more categories; X is an ordinal variable.

Rank-biserial correlation coefficient:

The rank-biserial correlation is similar to the point-biserial correlation coefficient, but it aims to measure the correlation between a binary and an ordinal variable.

The calculation of the rank-biserial correlation coefficient is accomplished by coding the two levels of the binary variable “0” and “1” and obtaining the coefficient between the ranked variable (Y) and this coded binary variable.

Assumptions:

Binary vs ordinal setup. However, we can use the same one-hot encoding trick used for the point-biserial coefficient to make this coefficient relevant also for categorical variables with more than two categories.

Equation:

r_{rb} = \frac{2(\bar{R}_1 - \bar{R}_0)}{n}

where \bar{R}_1 and \bar{R}_0 are the mean ranks of the ordinal variable in the groups coded 1 and 0, and n is the total sample size.

Python implementation:

from pypair.association import binary_continuous
binary_continuous((df['Type'] == 'Cat'), df['AdoptionSpeed'], 'rank_biserial')        

Ordinal vs Ordinal, Ordinal vs Numerical

Setup: Y is an ordinal variable; X is an ordinal or numerical variable.

Spearman rank correlation:

Explained above

Kendall rank correlation coefficient (tau):


When the sample size is small and has many tied ranks, the Kendall rank correlation coefficient (tau) is a better choice of correlation coefficient. There are three variants of the Kendall rank correlation (tau-a, tau-b, and tau-c); here we refer to tau-b. Similar to Spearman’s coefficient, it ranges between −1 and 1.

Assumptions:

Ordinal or numerical data. Monotonic relationships between the two variables.

Pros and cons:

It works better on small samples with many tied ranks.

Equation:

\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}

where n_c and n_d are the numbers of concordant and discordant pairs, n_0 = n(n-1)/2, n_1 = \sum_i t_i(t_i - 1)/2, n_2 = \sum_j u_j(u_j - 1)/2, and t_i and u_j are the sizes of the groups of tied values in the first and second variable, respectively.


Python implementation:

import scipy.stats as stats
tau, p_value = stats.kendalltau(df['FurLength'], df['MaturitySize'])        

All in one solution — Phik (φk)

We have seen many different correlation coefficients so far. However, none of the coefficients above can be used for all setups and types of relationships.

We recently encountered the Phik (φk) method, which is based on several refinements of Pearson’s χ² (chi-squared) contingency test and can capture non-linear relationships between all variable types.

Pros and cons:

The calculation of φk is computationally expensive. Furthermore, it has no closed-form formula, and it does not indicate the direction of the relationship.

Also, other appropriate correlation coefficients will be more precise when used in their proper setup.

However, Phik (φk) gives us a global and generic solution for all types of variables and relationships between variables.

Python implementation:

import phik
from phik.report import plot_correlation_matrix

phik_overview = df.phik_matrix()        
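By default, phik guesses which columns are interval (numeric) variables; it is often safer to state them explicitly (a sketch, assuming 'PhotoAmt' and 'Fee' are the numeric columns):

phik_overview = df.phik_matrix(interval_cols=['PhotoAmt', 'Fee'])
phik_overview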

Decision table:

  - Numerical vs numerical, linear relationship: Pearson’s r
  - Numerical vs numerical, monotonic relationship: Spearman’s rho
  - Numerical vs numerical, any relationship: distance correlation
  - Categorical vs categorical: Goodman–Kruskal’s lambda, Cramér’s v
  - Numerical vs categorical: point-biserial (with dummy variables for more than two categories)
  - Categorical vs ordinal: rank-biserial
  - Ordinal vs ordinal / ordinal vs numerical: Spearman’s rho, Kendall’s tau-b
  - Any combination of variable types: Phik (φk)


Conclusions:

In this post, we covered different setups of bivariate analysis and the appropriate correlation coefficient for each.

We saw that using the Pearson correlation as our only correlation coefficient is wrong in most cases. Instead, the correct way to measure correlation is to precisely understand your setup in terms of variable types, relationship types, distributional assumptions, outliers, etc., and then choose the most appropriate correlation measure. However, when using correlation coefficients as part of our EDA, we can use different correlation coefficients in parallel to account for many variable type combinations and assumptions. For example, we can always use Cramér’s v and Spearman’s rho together to cover many cases and get a good idea of the pairwise relationships in the data at low computational cost.

We should note that we can’t always compare measurements, since they do not necessarily measure the same thing. For example, a Spearman score of 0.3 is not necessarily stronger than a distance correlation score of 0.2. Finally, if computational time is not much of a problem, we can try a more general method such as Phik (φk) and see all pairwise φk correlation scores at once.
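As a sketch of this parallel approach (reusing the cramers_v function defined above; the column groupings are assumptions about the dataset):

import itertools

numeric = ['PhotoAmt', 'Fee']        # assumed numeric columns
categorical = ['Type', 'Color2']     # assumed categorical columns

for a, b in itertools.combinations(numeric, 2):
    print(a, b, df[a].corr(df[b], method='spearman'))
for a, b in itertools.combinations(categorical, 2):
    print(a, b, cramers_v(df[a], df[b]))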

Note: More assumptions might be implied when using correlation coefficients as part of a statistical test.
