AIML 11- Choosing the appropriate correlation coefficient

An overview of the concept of correlation and its usage in data science projects.

Written By: Yuval Cohen, David Grabois

Blog Post: Medium Blog

Follow me: Alok Tiwari, PhD, IIT-BHU

Introduction:

One of the most fundamental questions in statistical learning concerns the relationship between variables. Estimating a measure of association between two variables helps us make the right decisions in everyday data science problems, for several reasons:

  1. Reduce model complexity — by implementing correlation-based feature selection.
  2. Reduce multicollinearity — by removing or combining highly correlated variables.
  3. Reduce model uncertainty and infer insights.

This post deals with correlation coefficients. Correlation is the statistical term describing the degree of relationship between any two variables. Since relying on the wrong correlation coefficient might lead to critical mistakes, it is essential to choose the correct correlation coefficient for our problem.

We will examine different correlation coefficients, considering different variable types (i.e., numerical, ordinal, and categorical) and types of relationships (linear/nonlinear).

The intuition behind correlation and common mistakes:

Correlation is a statistical relationship, whether causal or not, between any two random variables.

A correlation coefficient should quantify the strength of the relationship and answer the following questions:

  1. Is there a statistically significant relationship between the two variables?
  2. What is the direction of the relationship (if it exists)?

Common mistakes:

Measuring correlation is tricky since there are many correlation coefficients. Furthermore, each coefficient suits different setups and assumptions.

A common mistake is to use the Pearson coefficient in every setup. It’s important to emphasize that using the Pearson correlation coefficient on a non-linear relationship between two variables is fundamentally wrong.

For example, in the plot below, we can observe a strong non-linear relationship between x and y. However, the Pearson coefficient of x and y is only -0.01, which is very low considering the clear non-linear relationship. Therefore, we should use a different, more suitable coefficient for this setup to identify the relationship.


Fig 1. Pearson coefficient of x and y is -0.01
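For illustration, here is a minimal sketch that generates data of this kind (the exact data behind Fig 1 is not available, so we assume a similar quadratic shape):

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 1000)
y = x**2 + rng.normal(0, 0.05, 1000)  # clear non-linear (quadratic) dependence
print(np.corrcoef(x, y)[0, 1])        # Pearson's r comes out close to 0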

The following sections are divided by combinations of variable types, such as numerical-numerical, numerical-ordinal, etc.

Possible types of variables are numerical, categorical, and ordinal. For each setup, we provide at least one appropriate correlation coefficient, describe it, and provide its code implementation.

We use a simplified version of the PetFinder dataset as a running example. Each row in the PetFinder dataset describes a pet, and each column represents a variable.

import pandas as pd
df = pd.read_csv('petfinder-mini.csv')

Following is a description of this dataset.

[Image: table describing the dataset variables]


Numerical vs Numerical:

Setup: X and Y are numerical variables.

Pearson’s r

Pearson’s r is the ratio between the covariance of two variables and the product of their standard deviations. The result ranges between −1 and 1: 1/−1 indicates a perfect positive/negative linear relationship, and 0 means no linear relationship.

Assumptions:

The primary assumption of the model is a “linear relationship”. We can scatter plot the data (in two dimensions) and visually test if it has a linear relationship.
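For example, a quick visual check (a sketch using the PetFinder columns introduced above):

import matplotlib.pyplot as plt

plt.scatter(df['PhotoAmt'], df['Fee'], alpha=0.3)  # inspect the shape of the relationship
plt.xlabel('PhotoAmt')
plt.ylabel('Fee')
plt.show()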

Pros and cons:

The main drawback is that it ignores nonlinear relationship forms (e.g., a quadratic form of correlation, etc.). On the other hand, the advantages are low time complexity and model simplicity.

Pearson’s r Equation:

r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Python implementation:

df['PhotoAmt'].corr(df['Fee'], method='pearson')        
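To connect the formula to the code, here is a minimal from-scratch computation (a sketch; pandas’ corr additionally drops pairs with missing values):

import numpy as np

def pearson_r(x, y):
    xc = x - np.mean(x)   # center x
    yc = y - np.mean(y)   # center y
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

pearson_r(df['PhotoAmt'].to_numpy(), df['Fee'].to_numpy())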

Spearman’s rank correlation coefficient

Spearman’s rank correlation coefficient determines the strength and direction of the monotonic relationship between two variables. This coefficient can be calculated between any mixture of numerical and ordinal variables. The result ranges between −1 and 1, similar to Pearson’s r.

Assumptions:

The primary assumption of the model is a “monotonic relationship”. We can scatter plot the data and visually test if it has a monotonic relationship. A function is called monotonic if and only if it is either entirely non-increasing or entirely non-decreasing.

Pros and cons:

The main drawback of Spearman’s rank correlation is that it ignores non-monotonic relationship forms. Its advantages are low time complexity, model simplicity, and the ability to be calculated on numerical or ordinal variables.

Equation:

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

Where d_i is the difference between the two ranks of the i-th observation and n is the sample size (this form assumes no tied ranks).

Python implementation:

df['PhotoAmt'].corr(df['Fee'], method='spearman')        
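Equivalently, Spearman’s coefficient is Pearson’s r computed on the ranks of the data, which we can verify directly:

ranks = df[['PhotoAmt', 'Fee']].rank()                  # average ranks assigned to ties
ranks['PhotoAmt'].corr(ranks['Fee'], method='pearson')  # matches method='spearman' above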

Distance correlation

Distance correlation is a fairly new correlation measurement introduced in 2005 by Gábor J. Székely. It measures the dependency between any two paired random vectors of arbitrary, not necessarily equal, dimension. The distance correlation coefficient is zero if and only if the random vectors are independent, and it measures both linear and nonlinear association between two random variables or random vectors.

Assumptions:

Distance correlation assumes the data are i.i.d. and that both X and Y have finite first moments.

Pros and cons:

The most significant drawbacks of distance correlation:

  1. The distance correlation is always non-negative and does not indicate the direction of the relationship.
  2. High computational cost.

The advantages:

  1. Handling various types of relationships between variables.
  2. Distance correlation of zero implies independence, unlike other correlation coefficients.
  3. It can be calculated on variables of unequal dimensions.

Equation:

\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}

\mathrm{dCov}^2_n(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k} B_{j,k}

where A and B are the double-centered pairwise-distance matrices of the X and Y samples.

Python implementation:

import dcor
dcor.distance_correlation(df['PhotoAmt'], df['Fee'])        
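To see the advantage over Pearson’s r, a small sketch on simulated quadratic data (the same assumed shape as in Fig 1):

import numpy as np
import dcor

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x**2 + rng.normal(0, 0.05, 1000)

print(np.corrcoef(x, y)[0, 1])           # Pearson: close to 0
print(dcor.distance_correlation(x, y))   # distance correlation: clearly positive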

Categorical vs Categorical:

Setup: X and Y are categorical variables with two or more categories.

Goodman-Kruskal’s lambda

Goodman–Kruskal’s lambda is an asymmetric measure of proportional reduction in error in contingency-table analysis. It measures the relationship by asking how much knowing one variable helps us predict the dependent variable better than a naive prediction. The lambda value ranges from 0 to 1, where 0 indicates no association and 1 indicates perfect association.

Assumptions:

Categorical vs categorical setup

Pros and cons:

A very intuitive measure that tells us the strength of association without making strong assumptions. However, this measure fails when the dependent variable has the same dominant class at every level of the independent variable.

Equation:

We will use the following Wikipedia example:

[Table: contingency table of Relationship Status (Married/Unmarried) by Blood Pressure, from Wikipedia]

Since this measure is asymmetric, we will get different results for different directions of calculations. Let’s consider “Relationship Status” to be Y (dependent variable) and “Blood Pressure” to be X. We first naively predict all Y values to be the most frequent category in the given sample. Therefore, we will predict y_i to be “Married” for each i =1…n, and refer to the number of mistakes as E1.

Then, given the value of x_i, we predict y_i to be the most frequent category within the corresponding x_i level (“Married” when Blood Pressure is “High” and “Unmarried” otherwise), and we do that for every i = 1…n. We refer to the number of mistakes as E2.

The lambda value is defined to be the percent of error reduction.

\lambda = \frac{E_1 - E_2}{E_1}

So, in our example, we get:

[Image: lambda computed from the E1 and E2 counts of the example table]

Suppose we consider “Blood Pressure” as the dependent variable instead. In that case, lambda will equal zero due to the dominance of “Normal” at every level of “Relationship Status”, which results in no error reduction in our calculation.

Python implementation:

from pypair.association import categorical_categorical
categorical_categorical(df['Type'], df['Color2'], 'gk_lambda')        
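To make the E1/E2 logic concrete, here is a from-scratch sketch that should agree with the pypair result (assuming the dependent variable has a modal class, so E1 > 0):

import pandas as pd

def gk_lambda(x, y):
    # y is the dependent variable we try to predict
    ct = pd.crosstab(x, y)
    n = ct.values.sum()
    e1 = n - ct.sum(axis=0).max()   # errors when always predicting the overall mode of y
    e2 = n - ct.max(axis=1).sum()   # errors when predicting the mode of y per level of x
    return (e1 - e2) / e1

gk_lambda(df['Type'], df['Color2'])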

Cramér’s v

Cramér’s v is a symmetrical measure of association for categorical variables. Cramér’s v is based on Pearson’s chi-squared statistic, and its value ranges from 0 to 1, where 0 indicates no association and 1 indicates perfect association.

Assumptions:

Categorical vs. categorical setup. The observations are independent of one another, both within and across groups.

Pros and cons:

Cramér’s v is a common choice to measure the association of two categorical variables. Since it is based on a chi-squared statistic, it makes no assumptions about the distribution of the population. However, it is not appropriate for dependent data.

Equation:

V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\, r - 1)}}

where n is the total sample size and r and k are the numbers of rows and columns of the contingency table.

Python implementation:

import scipy.stats as stats
import numpy as np
import pandas as pd

def cramers_v(var1, var2):
    data = pd.crosstab(var1, var2).values      # contingency table of observed counts
    chi_2 = stats.chi2_contingency(data)[0]    # Pearson's chi-squared statistic
    n = data.sum()                             # total sample size
    phi_2 = chi_2 / n
    r, k = data.shape                          # rows and columns of the table
    return np.sqrt(phi_2 / min((k - 1), (r - 1)))

cramers_v(df['Type'], df['Color2'])

Numerical vs Categorical:

Setup: Y is a numerical variable; X is a categorical variable with two or more categories.

Point-biserial correlation coefficient:

The point-biserial correlation coefficient ranges between −1 and +1. Values close to ±1 indicate a strong positive/negative relationship, and values close to zero indicate no relationship between the two variables.

Calculation of the point-biserial correlation coefficient is accomplished by coding the two levels of the binary variable “0” and “1” and obtaining the coefficient between the continuous variable (Y) and this coded binary variable.
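In other words, the point-biserial coefficient is simply Pearson’s r applied to the 0/1-coded variable, which we can verify:

import scipy.stats as stats

is_cat = (df['Type'] == 'Cat').astype(int)       # code the binary variable as 0/1
stats.pointbiserialr(is_cat, df['PhotoAmt'])[0]  # equals stats.pearsonr(is_cat, df['PhotoAmt'])[0]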

Assumptions:

Numerical and binary variable (a categorical variable with only two categories). If the categorical variable has more than two categories, we can overcome this by creating a series of dummy variables for the categorical variable (e.g., one-hot encoding) and calculating the point-biserial correlation between the numerical variable and each dummy variable, as the sketch below shows. Of course, this trick will produce many correlation scores, so we might want to use the max value, or some other aggregation of those scores, as our final measure of the relationship.
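A sketch of this trick on the PetFinder data (using 'Color2' as the assumed multi-category variable):

import pandas as pd
import scipy.stats as stats

dummies = pd.get_dummies(df['Color2'], dtype=int)   # one dummy column per category
scores = {level: stats.pointbiserialr(dummies[level], df['PhotoAmt'])[0]
          for level in dummies.columns}
max(scores.values(), key=abs)                       # e.g., keep the strongest association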

Equation:

r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}

where M_1 and M_0 are the means of Y in the groups coded 1 and 0, n_1 and n_0 are the group sizes, n = n_1 + n_0, and s_n is the standard deviation of Y.

Python implementation:

import scipy.stats as stats
point_biserial, p_value = stats.pointbiserialr((df['Type'] == 'Cat'), df['PhotoAmt'])        

Categorical vs Ordinal:

Setup: Y is a categorical variable with two or more categories; X is an ordinal variable.

Rank-biserial correlation coefficient:

The rank-biserial correlation is similar to the point-biserial correlation coefficient, but it aims to measure the correlation between a binary and an ordinal variable.

The calculation of the rank-biserial correlation coefficient is accomplished by coding the two levels of the binary variable “0” and “1” and obtaining the coefficient between the ranked variable (Y) and this coded binary variable.

Assumptions:

Binary vs ordinal setup. However, we can use the same one-hot encoding trick used for the point-biserial coefficient to make this coefficient relevant also for categorical variables with more than two categories.

Equation:

r_{rb} = \frac{2(\bar{R}_1 - \bar{R}_0)}{n}

where \bar{R}_1 and \bar{R}_0 are the mean ranks of the ordinal variable in the groups coded 1 and 0, and n is the total sample size.

Python implementation:

from pypair.association import binary_continuous
binary_continuous((df['Type'] == 'Cat'), df['AdoptionSpeed'], 'rank_biserial')        

Ordinal vs Ordinal, Ordinal vs Numerical

Setup: Y is an ordinal variable; X is an ordinal or numerical variable.

Spearman rank correlation:

Explained above

Kendall rank correlation coefficient (tau):


When the sample size is small and has many tied ranks, the Kendall rank correlation coefficient (tau) is a better choice of correlation coefficient. There are three variants of the Kendall rank correlation (tau-a, tau-b, and tau-c); here we refer to tau-b. Similar to Spearman’s coefficient, it ranges between −1 and 1.

Assumptions:

Ordinal or numerical data. Monotonic relationships between the two variables.

Pros and cons:

It works better on small samples with many tied ranks.

Equation:

\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}

where n_c and n_d are the numbers of concordant and discordant pairs, n_0 = n(n-1)/2, n_1 = \sum_i t_i(t_i - 1)/2, n_2 = \sum_j u_j(u_j - 1)/2, and t_i and u_j are the sizes of the groups of tied values in the first and second variable, respectively.


Python implementation:

import scipy.stats as stats
tau, p_value = stats.kendalltau(df['FurLength'], df['MaturitySize'])        

All in one solution — Phik (φk)

We have seen many different correlation coefficients so far. However, none of the coefficients above can be used for all setups and types of relationships.

We recently encountered the Phik (φk) method, which is based on several refinements of Pearson’s χ² (chi-squared) contingency test and can capture non-linear relationships between all variable types.

Pros and cons:

The calculation of φk is computationally expensive. Furthermore, it has no closed-form formula, and it does not indicate the direction of the relationship.

Also, other appropriate correlation coefficients will be more precise when used in their proper setup.

However, Phik (φk) gives us a global and generic solution for all types of variables and relationships between variables.

Python implementation:

import phik
from phik.report import plot_correlation_matrix

phik_overview = df.phik_matrix()        
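By default, phik guesses which columns are interval (numeric) variables; it is often safer to state them explicitly (a sketch, assuming 'PhotoAmt' and 'Fee' are the numeric columns):

phik_overview = df.phik_matrix(interval_cols=['PhotoAmt', 'Fee'])
phik_overview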

Decision table:

  - Numerical vs numerical, linear relationship: Pearson’s r
  - Numerical vs numerical, monotonic relationship: Spearman’s rho
  - Numerical vs numerical, any relationship: distance correlation
  - Categorical vs categorical: Goodman–Kruskal’s lambda, Cramér’s v
  - Numerical vs categorical: point-biserial (with dummy variables for more than two categories)
  - Categorical vs ordinal: rank-biserial
  - Ordinal vs ordinal / ordinal vs numerical: Spearman’s rho, Kendall’s tau-b
  - Any combination of variable types: Phik (φk)


Conclusions:

In this post, we covered different setups of bivariate analysis and the appropriate correlation coefficient for each.

We saw that using the Pearson correlation as our only correlation coefficient is wrong in most cases. Instead, the correct way to measure correlation is to precisely understand your setup in terms of variable types, relationship types, distributional assumptions, outliers, etc., and then choose the most appropriate correlation measure. However, when using correlation coefficients as part of our EDA, we can use different correlation coefficients in parallel to account for many variable type combinations and assumptions. For example, we can always use Cramér’s v and Spearman’s rho together to cover many cases and get a good idea of the pairwise relationships in the data at low computational cost.

We should note that we can’t always compare measurements, since they do not necessarily measure the same thing. For example, a Spearman score of 0.3 is not necessarily stronger than a distance correlation score of 0.2. Finally, if computational time is not much of a problem, we can try a more general method such as Phik (φk) and see all pairwise φk correlation scores at once.
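As a sketch of this parallel approach (reusing the cramers_v function defined above; the column groupings are assumptions about the dataset):

import itertools

numeric = ['PhotoAmt', 'Fee']        # assumed numeric columns
categorical = ['Type', 'Color2']     # assumed categorical columns

for a, b in itertools.combinations(numeric, 2):
    print(a, b, df[a].corr(df[b], method='spearman'))
for a, b in itertools.combinations(categorical, 2):
    print(a, b, cramers_v(df[a], df[b]))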

Note: More assumptions might be implied when using correlation coefficients as part of a statistical test.
