Correlation In-depth Intuition!

Correlation analysis is a vast topic in statistics. We will look at some of its important concepts in depth, with code!

Correlation analysis examines the linear relationship between two variables and measures the strength of that relationship.

Some Notations to remember:

  1. IVs - Independent Variables.
  2. DVs - Dependent Variables.

Correlation:

Data type: correlation assumes interval, ratio, or discrete (numerical) data.

  • Correlation analysis computes the correlation coefficient (referred to as 'r'), which quantifies the strength of the relationship between two features.
  • We can measure IVs against DVs, IVs against other IVs, or DVs against other DVs.


  • The correlation coefficient is the output of a correlation method; 'correlation coefficient' is often used interchangeably with 'correlation'.
  • It ranges from -1 to +1 and is unitless.
  • (-1) - Perfect negative correlation.
  • (+1) - Perfect positive correlation.
  • (0) - Zero correlation.


  • A positive correlation means the two features trend together: as one increases, the other increases.
  • A negative correlation means the two features are inversely related: as one increases, the other decreases.
  • Zero correlation means there is no linear relationship between the two features.

Covariance:

  • Covariance is also measured between two numerical features.
  • It is a single number that measures the direction of the linear relationship between two variables.

Are correlation and covariance the same?

  • No. To find the correlation, we first compute the covariance, then scale it by the product of the two standard deviations.
  • If that isn't clear yet, don't worry. We will look at more examples.

r = cov(X, Y) / (σ_X × σ_Y)

If you look at the formula, covariance and correlation are directly related: correlation is just covariance, normalized.

  • Covariance - gives the direction of the linear relationship between two variables; it is not normalized.
  • Correlation - gives both the direction and strength of the linear relationship; it is normalized (see the sketch below).
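
To make the normalization concrete, here is a minimal sketch (with made-up numbers, purely illustrative) showing that scaling the covariance by the two standard deviations yields the correlation:

import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 5.0])

# covariance, then scale by the product of the standard deviations
cov_ab = np.cov(a, b)[0, 1]
r_manual = cov_ab / (np.std(a, ddof=1) * np.std(b, ddof=1))

print(r_manual)                  # manual correlation
print(np.corrcoef(a, b)[0, 1])   # matches np.corrcoef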

This is all about correlation and covariance. Let's attack it in code!

  • Simulate the data

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

## simulate data
N = 66

# generate correlated data
x = np.random.randn(N)
y = x + np.random.randn(N)

# plot the data
plt.plot(x, y, 'kp', markerfacecolor='b', markersize=12)
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.xticks([])
plt.yticks([])
plt.show()

  • Implementing covariance and correlation

### the Covariance
covar3 = np.cov(np.vstack((x, y)))

### the Correlation
corr2 = np.corrcoef(np.vstack((x, y)))

print('Covariance:', str(covar3))
print('Correlation:', str(corr2))

  • Output: a 2×2 covariance matrix and a 2×2 correlation matrix.

  • You may be thinking: what is this? I said the result of correlation is a single number between -1 and 1, so what are these grids of numbers?
  • These are a covariance matrix and a correlation matrix. You will see them often in machine learning and deep learning.
  • Interpreting correlation and covariance matrices is simple. We will look at it now!

Correlation Matrix


1 - Blue sine wave.

2 - Green wave.

3 - Orange wave.

  • We have three signals here.
  • The correlation of the blue wave with itself is 1, and the same holds for every wave; see the diagonal entries.
  • The correlation between the green wave and the blue wave is 0.67; see the entry in the second row, first column.
  • The same reading applies to every cell. This is a pretty straightforward concept, and I hope it is clear now!

Covariance Matrix


  • Same layout as the previous one, but the diagonal elements are no longer all 1!
  • Why? Look at the covariance formula: the diagonal entries are the variances of the individual signals, and nothing normalizes them to 1.
  • Also, because this is sample data, the formula divides by n-1 degrees of freedom, which slightly reduces the values.
  • The sketch below reproduces both matrices. I hope you have found some valuable insights about correlation and covariance; after that, let's look at some correlation analysis methods!
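
The three waves here are my own stand-ins for the blue, green, and orange signals in the figures, purely for illustration:

import numpy as np

# three related signals (stand-ins for the blue, green, and orange waves)
t = np.linspace(0, 4*np.pi, 200)
blue   = np.sin(t)
green  = blue + 0.5*np.random.randn(len(t))
orange = blue + 0.8*np.random.randn(len(t))

signals = np.vstack((blue, green, orange))

# diagonal of the correlation matrix is all 1s
print(np.corrcoef(signals))

# diagonal of the covariance matrix holds each signal's variance (n-1 normalization)
print(np.cov(signals))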

Partial Correlation

  • Similar to correlation, but it accounts for the influence of other variables.
  • Partial - calculating the relationship between two (or more) variables while controlling for external variables.
  • Let's implement it in code!

Import Libraries!


# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import pingouin as pg

  • Implementing Partial correlation

## now for the dataset
N = 76

# correlated datasets
x1 = np.linspace(1,10,N) + np.random.randn(N)
x2 = x1 + np.random.randn(N)
x3 = x1 + np.random.randn(N)

# let's convert these data to a pandas dataframe
df = pd.DataFrame()
df['x1'] = x1
df['x2'] = x2
df['x3'] = x3

# compute the "raw" correlation matrix
cormatR = df.corr()
print(cormatR)

# print out one value
print(' ')
print(cormatR.values[1,0])

# partial correlation
pc = pg.partial_corr(df, x='x3', y='x2', covar='x1')
print(' ')
print(pc)
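
Under the hood, a partial correlation is just the plain correlation between the residuals left over after regressing the control variable out of both x2 and x3. Here is a minimal sketch of that idea, reusing df from above, to cross-check the pingouin result:

# regress x1 out of x2 and x3, then correlate the residuals
beta2 = np.polyfit(df['x1'], df['x2'], 1)
beta3 = np.polyfit(df['x1'], df['x3'], 1)

res2 = df['x2'] - np.polyval(beta2, df['x1'])   # part of x2 not explained by x1
res3 = df['x3'] - np.polyval(beta3, df['x1'])   # part of x3 not explained by x1

print(np.corrcoef(res2, res3)[0, 1])   # matches pingouin's r (up to floating point)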

Problem with 'Pearson Correlation'

  • The problem with Pearson correlation is that it captures only linear relationships; if the relationship is nonlinear, it does not work well. We will use other methods to overcome this.

Major problem - If your data contains outliers or does not follow a Gaussian distribution, the Pearson correlation value may not be sensible; in those cases we will use other methods.

Conclusion:

  • Pearson correlation is sensitive to outliers.
  • Pearson is appropriate for normally distributed data without outliers.
  • Pearson works well on roughly normally distributed data.

Spearman Rank Correlation:

  • Spearman rank correlation is robust to outliers and to nonlinear, non-normally distributed data (see the sketch below).
  • Spearman correlation is also referred to as "Spearman's rho".
  • Spearman tests the monotonic relationship between two variables, regardless of whether that relationship is linear or nonlinear.
  • In other words, Spearman measures the strength of the relationship in monotonic data.
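
A quick sketch of that robustness claim, with illustrative made-up data: one extreme outlier drags the Pearson value far away, while Spearman stays strongly positive.

import numpy as np
import scipy.stats as stats

x = np.arange(20, dtype=float)
y = x + np.random.randn(20)          # clean, roughly linear data

y_out = y.copy()
y_out[-1] = -100                     # inject one extreme outlier

print('Pearson  without/with outlier: %.2f / %.2f'
      % (stats.pearsonr(x, y)[0], stats.pearsonr(x, y_out)[0]))
print('Spearman without/with outlier: %.2f / %.2f'
      % (stats.spearmanr(x, y)[0], stats.spearmanr(x, y_out)[0]))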

What is a monotonic relationship?


  • A monotonic relationship is one in which the values consistently increase or consistently decrease together, regardless of the spacing between them.

How does Spearman work?

Step 1: Transform both variables to ranks.

Step 2: Compute the Pearson correlation coefficient on the ranks. The p-value is computed the same way as for Pearson correlation.
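
Those two steps are easy to check by hand. A minimal sketch, using a small made-up monotonic dataset:

import numpy as np
import scipy.stats as stats

x = np.array([1, 4, 9, 16, 25, 36], dtype=float)   # nonlinear but monotonic
y = np.array([2, 3, 5, 8, 13, 21], dtype=float)

# Step 1: transform both variables to ranks
rx = stats.rankdata(x)
ry = stats.rankdata(y)

# Step 2: Pearson correlation on the ranks
print(stats.pearsonr(rx, ry)[0])   # manual Spearman
print(stats.spearmanr(x, y)[0])    # matches scipy's Spearman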

  • r_p = Pearson correlation.
  • r_s = Spearman correlation.

Looking at the second panel, r_s is low while r_p is high: Spearman reflects the nonlinearity, while Pearson overstates it as a strong linear trend.

Compare the other panels in the same way!


Spearman in action!

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from IPython import display
display.set_matplotlib_formats('svg')  # render figures as SVG (notebook only)

Implement Spearman

## Anscombe's quartet

anscombe = np.array([
     # series 1     series 2      series 3       series 4
    [10,  8.04,    10,  9.14,    10,  7.46,      8,  6.58],
    [ 8,  6.95,     8,  8.14,     8,  6.77,      8,  5.76],
    [13,  7.58,    13,  8.76,    13, 12.74,      8,  7.71],
    [ 9,  8.81,     9,  8.77,     9,  7.11,      8,  8.84],
    [11,  8.33,    11,  9.26,    11,  7.81,      8,  8.47],
    [14,  9.96,    14,  8.10,    14,  8.84,      8,  7.04],
    [ 6,  7.24,     6,  6.13,     6,  6.08,      8,  5.25],
    [ 4,  4.26,     4,  3.10,     4,  5.39,      8,  5.56],
    [12, 10.84,    12,  9.13,    12,  8.15,      8,  7.91],
    [ 7,  4.82,     7,  7.26,     7,  6.42,      8,  6.89],
    [ 5,  5.68,     5,  4.74,     5,  5.73,     19, 12.50]
    ])


# plot and compute correlations
fig, ax = plt.subplots(2, 2, figsize=(6, 6))
ax = ax.ravel()

for i in range(4):
    ax[i].plot(anscombe[:, i*2], anscombe[:, i*2+1], 'ko')
    ax[i].set_xticks([])
    ax[i].set_yticks([])
    corr_p = stats.pearsonr(anscombe[:, i*2], anscombe[:, i*2+1])[0]
    corr_s = stats.spearmanr(anscombe[:, i*2], anscombe[:, i*2+1])[0]
    ax[i].set_title('r_p = %g, r_s = %g' % (np.round(corr_p*100)/100, np.round(corr_s*100)/100))

plt.show()

Fisher Z Transformation for correlation:

  • A correlation coefficient is bounded between -1 and +1, so its distribution is not Gaussian (in the simulation below, it is uniform).

Why do we need to use Fisher Z Transformation?



  • After you compute a correlation, if you want to run a secondary analysis on the correlation values themselves, you need to transform them toward a Gaussian distribution.
  • This is because correlation values are bounded (uniform in our simulation), while most statistical tests assume Gaussian data, so a transform is needed before secondary analysis.
  • To push bounded correlation values toward a Gaussian distribution, we use the Fisher Z Transformation.


  • Python has no dedicated Fisher-Z function, but the transform is simply the inverse hyperbolic tangent, which NumPy provides as np.arctanh. We will use it to transform the data.
  • Let us see it in code!

Import the packages!

## Fisher-Z transform

import numpy as np
import matplotlib.pyplot as plt

# simulate correlation coefficients
N = 10000
r = 2*np.random.rand(N) - 1

# Fisher-Z
fz = np.arctanh(r)

Visualize the data

# histogram of the Fisher-Z values
y, x = np.histogram(fz, 30)
x = (x[1:] + x[0:-1])/2   # bin centers
plt.bar(x, y)
plt.show()
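If you later need to map a Fisher-Z value back to a correlation coefficient, the inverse transform is the hyperbolic tangent (reusing r and fz from above):

r_back = np.tanh(fz)            # inverse of np.arctanh
print(np.allclose(r_back, r))   # True: the original values are recovered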

Kendall Correlation:

  • When you have ordinal data in your dataset, you can use this method.

How does it work?

  • Transform the data to ranks and count the pair concordance, as the sketch below shows.
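
As a rough sketch of what rank concordance means, here is a manual pair count (the tau-a variant, which ignores ties; illustrative only):

import numpy as np
from itertools import combinations

X = [1, 2, 3, 4, 5, 6, 7]
Y = [1, 3, 6, 2, 7, 4, 5]

concordant = discordant = 0
for i, j in combinations(range(len(X)), 2):
    # a pair is concordant when both variables order it the same way
    if np.sign(X[i] - X[j]) == np.sign(Y[i] - Y[j]):
        concordant += 1
    else:
        discordant += 1

# tau = (concordant - discordant) / total pairs; matches kendalltau here (no ties)
print((concordant - discordant) / (concordant + discordant))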


Now let's compute the same value with SciPy!



# Import required libraries
from scipy.stats import kendalltau

# Taking values from the above example in Lists
X = [1, 2, 3, 4, 5, 6, 7]
Y = [1, 3, 6, 2, 7, 4, 5]

# Calculating Kendall Rank correlation
corr, _ = kendalltau(X, Y)
print('Kendall Rank correlation: %.5f' % corr)

Cosine similarity!

  • A fancy term, right?
  • It's not a complicated topic: cosine similarity is a variant of Pearson correlation.
  • The two are essentially the same, with one minor difference.
  • If your data is mean-centered, or naturally has a mean close to 0, you can use cosine similarity. Why? Because it behaves like Pearson correlation on mean-centered data.



r = Pearson correlation coefficient

cos(θ) = cosine similarity

  • If you look at the formulas, they are almost identical. The difference is that r subtracts the mean (x̄) from each value, while cos(θ) does not; this is why cosine similarity matches Pearson correlation only on mean-centered data.

Key to remember: If your data has a mean of roughly 0, you can go for cosine similarity!

Cosine in action!

from scipy import spatial

List1 = [4, 47, 8, 3]
List2 = [3, 52, 12, 16]

# scipy returns cosine distance; similarity = 1 - distance
result = 1 - spatial.distance.cosine(List1, List2)
print(result)
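
To see the mean-centering point concretely, here is a small sketch using the same two lists: mean-center them first, and the cosine similarity of the centered vectors equals the Pearson correlation of the raw data.

import numpy as np

a = np.array([4, 47, 8, 3], dtype=float)
b = np.array([3, 52, 12, 16], dtype=float)

# mean-center both vectors
ac = a - a.mean()
bc = b - b.mean()

# cosine similarity of the centered vectors ...
cos_sim = np.dot(ac, bc) / (np.linalg.norm(ac) * np.linalg.norm(bc))

# ... equals the Pearson correlation of the raw vectors
print(cos_sim, np.corrcoef(a, b)[0, 1])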

We have covered a lot of ground in correlation analysis; it should give you a solid foundation for starting an analytics career!

Thank you!

Name: R. Aravindan

Position: Content Writer.

Company: Artificial Neurons.AI


