Correlation In-depth Intuition!
Correlation analysis is a vast topic in statistics. We will look at some of its most important concepts in depth, with code!
Correlation is helpful for analyzing the linear relationship between two variables and for measuring the strength of that relationship.
Some Notations to remember:
Correlation: r = Σ(x_i − x̄)(y_i − ȳ) / √[Σ(x_i − x̄)² · Σ(y_i − ȳ)²]
Data type: correlation assumes numerical data on an interval, ratio, or discrete scale.
Covariance: cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)
Are correlation and covariance the same? Not quite, but the formulas relate them directly: the correlation is just the covariance divided by the product of the two standard deviations, r = cov(x, y) / (s_x · s_y), which rescales it to the range −1 to +1.
That covers the ideas behind correlation and covariance. Let's attack it in code!
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

## simulate data
N = 66

# generate correlated data
x = np.random.randn(N)
y = x + np.random.randn(N)

# plot the data
plt.plot(x,y,'kp',markerfacecolor='b',markersize=12)
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.xticks([])
plt.yticks([])
plt.show()

### the covariance
covar3 = np.cov(np.vstack((x,y)))

### the correlation
corr2 = np.corrcoef(np.vstack((x,y)))

print('Covariance:')
print(covar3)
print('Correlation:')
print(corr2)
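As a quick sanity check, here is a minimal sketch (reusing x and y from the block above) showing that the correlation really is just the covariance divided by the product of the two standard deviations.

# correlation from the covariance "by hand"
cov_xy = np.cov(x, y)[0, 1]                        # off-diagonal entry of the covariance matrix
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print('Manual correlation:', r_manual)
print('np.corrcoef:', np.corrcoef(x, y)[0, 1])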
Correlation Matrix and Covariance Matrix
(Figure: the correlation matrix and the covariance matrix of three signals - a blue sine wave, a green wave, and an orange wave.)
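The matrices themselves are not reproduced here, but a minimal sketch along these lines (assuming three phase-shifted, differently scaled waves as stand-ins for the blue, green, and orange signals) produces the same kind of output: the correlation matrix ignores the different amplitudes, while the covariance matrix does not.

# hypothetical stand-ins for the three plotted signals
import numpy as np
t = np.linspace(0, 4*np.pi, 200)
blue = np.sin(t)
green = 2*np.sin(t + 0.5) + 0.5*np.random.randn(len(t))
orange = 0.5*np.sin(t + 1.0) + 0.5*np.random.randn(len(t))
signals = np.vstack((blue, green, orange))    # one row per signal
print('Correlation matrix:')
print(np.corrcoef(signals))
print('Covariance matrix:')
print(np.cov(signals))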
Partial Correlation
Import Libraries!
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import pingouin as pg

## now for the dataset
N = 76

# correlated datasets
x1 = np.linspace(1,10,N) + np.random.randn(N)
x2 = x1 + np.random.randn(N)
x3 = x1 + np.random.randn(N)

# let's convert these data to a pandas dataframe
df = pd.DataFrame()
df['x1'] = x1
df['x2'] = x2
df['x3'] = x3

# compute the "raw" correlation matrix
cormatR = df.corr()
print(cormatR)

# print out one value
print(' ')
print(cormatR.values[1,0])

# partial correlation
pc = pg.partial_corr(df,x='x3',y='x2',covar='x1')
print(' ')
print(pc)
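To see what the partial correlation is doing, here is a minimal sketch (reusing df from above) that regresses x1 out of both x2 and x3 and then correlates what is left over; the value should match the r reported by pg.partial_corr up to rounding.

# partial correlation "by hand": correlate the residuals after removing x1
beta2 = np.polyfit(df['x1'], df['x2'], 1)          # linear fit of x2 on x1
beta3 = np.polyfit(df['x1'], df['x3'], 1)          # linear fit of x3 on x1
resid2 = df['x2'] - np.polyval(beta2, df['x1'])    # part of x2 not explained by x1
resid3 = df['x3'] - np.polyval(beta3, df['x1'])    # part of x3 not explained by x1
print(np.corrcoef(resid2, resid3)[0, 1])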
Problem with 'Pearson Correlation'
Major problem: if your data contains outliers or does not follow a Gaussian distribution, the Pearson correlation value may not be sensible, so we turn to other methods.
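To make the problem concrete, here is a minimal sketch (with made-up data, purely for illustration) in which a single extreme outlier drags the Pearson coefficient down, while the rank-based Spearman coefficient covered next barely moves.

import numpy as np
import scipy.stats as stats
np.random.seed(0)
x = np.random.randn(20)
y = x + 0.3*np.random.randn(20)
x[0], y[0] = 10, -10                            # plant one extreme outlier
print('Pearson :', stats.pearsonr(x, y)[0])     # pulled down hard by the outlier
print('Spearman:', stats.spearmanr(x, y)[0])    # much less affected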
Conclusion: when the data contains outliers or is clearly non-Gaussian, prefer a rank-based correlation such as Spearman or Kendall, which we cover next.
Spearman Rank Correlation:
What is a monotonic relationship? One in which, as one variable increases, the other consistently increases (or consistently decreases), though not necessarily at a constant rate.
How does Spearman work?
Step 1: Transform both variables to ranks.
Step 2: Compute the Pearson correlation coefficient on the ranks; the p-value is interpreted the same way as for the Pearson correlation. (A sketch of both steps follows below.)
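Here is a minimal sketch of those two steps on some made-up numbers (assumed purely for illustration): rank-transform both variables, feed the ranks to the Pearson formula, and the result matches scipy's spearmanr.

import numpy as np
import scipy.stats as stats
x = np.array([10, 20, 30, 40, 1000])     # note the big outlier
y = np.array([1.2, 2.4, 2.9, 4.1, 5.0])
# Step 1: transform both variables to ranks
xr = stats.rankdata(x)
yr = stats.rankdata(y)
# Step 2: Pearson correlation on the ranks
print('Pearson on ranks:', stats.pearsonr(xr, yr)[0])
print('spearmanr       :', stats.spearmanr(x, y)[0])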
In the second Anscombe panel below, r_s is lower than r_p, which shows that the rank-based coefficient picks up the nonlinearity that the Pearson coefficient misses. Compare the other panels in the same way.
Spearman in action!
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from IPython import display
display.set_matplotlib_formats('svg')
Implement Spearman
## Anscombe's quartet
anscombe = np.array([
     # series 1     series 2      series 3       series 4
    [10,  8.04,    10,  9.14,    10,  7.46,      8,  6.58],
    [ 8,  6.95,     8,  8.14,     8,  6.77,      8,  5.76],
    [13,  7.58,    13,  8.76,    13, 12.74,      8,  7.71],
    [ 9,  8.81,     9,  8.77,     9,  7.11,      8,  8.84],
    [11,  8.33,    11,  9.26,    11,  7.81,      8,  8.47],
    [14,  9.96,    14,  8.10,    14,  8.84,      8,  7.04],
    [ 6,  7.24,     6,  6.13,     6,  6.08,      8,  5.25],
    [ 4,  4.26,     4,  3.10,     4,  5.39,      8,  5.56],
    [12, 10.84,    12,  9.13,    12,  8.15,      8,  7.91],
    [ 7,  4.82,     7,  7.26,     7,  6.42,      8,  6.89],
    [ 5,  5.68,     5,  4.74,     5,  5.73,     19, 12.50]
    ])
# plot and compute correlations
fig,ax = plt.subplots(2,2,figsize=(6,6))
ax = ax.ravel()
for i in range(4):
    ax[i].plot(anscombe[:,i*2],anscombe[:,i*2+1],'ko')
    ax[i].set_xticks([])
    ax[i].set_yticks([])
    corr_p = stats.pearsonr(anscombe[:,i*2],anscombe[:,i*2+1])[0]
    corr_s = stats.spearmanr(anscombe[:,i*2],anscombe[:,i*2+1])[0]
    ax[i].set_title('r_p = %g, r_s = %g'%(np.round(corr_p*100)/100,np.round(corr_s*100)/100))

plt.show()
Fisher Z Transformation for correlation:
Why do we need the Fisher Z transformation? Sample correlation coefficients are confined to the range −1 to +1, so their sampling distribution is skewed, especially when the true correlation is strong. The Fisher-Z transform, z = arctanh(r) = 0.5·ln((1 + r)/(1 − r)), stretches the coefficients onto the whole real line and makes them approximately normally distributed, which is what standard confidence intervals and averaging across samples assume.
Import the packages!
## Fisher-Z transform
import matplotlib.pyplot as plt
import numpy as np

# simulate correlation coefficients
N = 10000
r = 2*np.random.rand(N) - 1

# Fisher-Z
fz = np.arctanh(r)
Visualize the data
y,x = np.histogram(fz,30)
x = (x[1:]+x[0:-1])/2
plt.bar(x,y)
plt.show()
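As a quick check on the transform itself: arctanh is the same as 0.5·ln((1 + r)/(1 − r)), and tanh undoes it.

r_test = 0.7
print(np.arctanh(r_test))                        # Fisher-Z via arctanh
print(0.5*np.log((1 + r_test)/(1 - r_test)))     # the explicit formula gives the same value
print(np.tanh(np.arctanh(r_test)))               # the back-transform recovers r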
Kendall Correlation:
How does it work?
* Transform the data to ranks and measure rank concordance: for every pair of observations, check whether the two variables order them the same way (concordant) or in opposite ways (discordant); see the pair-counting sketch after the code below.
Let's implement it in code!
# import required libraries
from scipy.stats import kendalltau

# example data as lists
X = [1, 2, 3, 4, 5, 6, 7]
Y = [1, 3, 6, 2, 7, 4, 5]

# calculating the Kendall rank correlation
corr, _ = kendalltau(X, Y)
print('Kendall Rank correlation: %.5f' % corr)
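For intuition, here is a minimal sketch that counts concordant and discordant pairs by brute force for the same X and Y; since there are no tied values, (C − D) divided by the total number of pairs reproduces the kendalltau result.

n = len(X)
concordant = discordant = 0
for i in range(n):
    for j in range(i + 1, n):
        s = (X[i] - X[j]) * (Y[i] - Y[j])   # same sign -> concordant, opposite sign -> discordant
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
tau_manual = (concordant - discordant) / (n*(n - 1)/2)
print('Manual tau: %.5f' % tau_manual)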
Cosine similarity!
r = Pearson correlation coefficient; cos(θ) = cosine similarity.
The two formulas are nearly identical: the Pearson correlation is simply the cosine similarity computed on mean-centered data.
Key to remember: if your data already has a mean of roughly 0, cosine similarity and the Pearson correlation give essentially the same answer, so you can go for cosine similarity!
Cosine in action!
from scipy import spatial

List1 = [4, 47, 8, 3]
List2 = [3, 52, 12, 16]

# cosine similarity = 1 - cosine distance
result = 1 - spatial.distance.cosine(List1, List2)
print(result)
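To connect this back to Pearson, here is a minimal sketch showing that once the two lists are mean-centered, cosine similarity and the Pearson correlation coefficient agree.

import numpy as np
from scipy import spatial, stats
a = np.array(List1, dtype=float)
b = np.array(List2, dtype=float)
# mean-center both vectors, then compare cosine similarity with Pearson r
cos_centered = 1 - spatial.distance.cosine(a - a.mean(), b - b.mean())
print('Cosine (mean-centered):', cos_centered)
print('Pearson r:', stats.pearsonr(a, b)[0])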
We have covered a lot of topics in correlation analysis; this should give you a solid foundation for starting an analytics career!
Thank you!
Name : R. Aravindan
Position : Content Writer.
Company : Artificial Neurons.AI