Exploratory data analysis
Nourhan Elsherbiny
ITI Trainee - Java Enterprise & Web Apps development || ISTQB® CTFL
Exploratory data analysis (EDA) helps us describe data by means of statistical and visualization techniques, bringing important aspects of that data into focus for further analysis.
The four main types of EDA:
_________________________________________________
Uni-variate non-graphical analysis
Here the data being analyzed consists of just one variable. The main purpose of uni-variate analysis is to describe that variable: its center, spread, and range.
X.describe()
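As a minimal sketch, assuming `X` is a pandas Series or DataFrame (the article does not say), `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum. The `prices` name is hypothetical, and its values reuse the price list from the line plot example in this article.

```python
import pandas as pd

# Illustrative single numeric variable (name and values are assumptions)
prices = pd.Series([77, 76, 68, 70, 78, 79, 74, 75], name="Price")

# Summary statistics: count, mean, std, min, quartiles, max
summary = prices.describe()
print(summary)

# Individual statistics are also available directly
median_price = prices.median()
print("median:", median_price)
print("skewness:", prices.skew())
```

Reading the mean and the median side by side gives a quick hint about skew before any plot is drawn.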
Uni-variate graphical analysis
Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of uni-variate graphics include:
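Histograms and box plots are two widely used examples. The following is a minimal sketch with matplotlib, assuming synthetic, normally distributed data; the `values` array and the bin count are illustrative choices, not taken from the article.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 200 draws from a normal distribution
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the shape of the distribution
counts, bin_edges, _ = ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# Box plot: shows the median, quartiles, and potential outliers
ax_box.boxplot(values)
ax_box.set_title("Box plot")

plt.show()
```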
Multivariate non-graphical analysis
Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
Note: it's important to know that correlation doesn't imply causation between variables.
Methods that can show relationships between variables include:
Pearson correlation: it measures the strength of the linear correlation between 2 features with a correlation coefficient and a p-value.
For a strong correlation ==> the correlation coefficient should be close to 1 or -1
==> a p-value below 0.001 gives strong evidence that the correlation is statistically significant
from scipy.stats import pearsonr
corr, p_val = pearsonr(var1, var2)
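The call above leaves `var1` and `var2` undefined. A minimal runnable sketch, assuming synthetic data: `var2` is built to depend linearly on `var1`, while `var3` (a hypothetical third variable) is generated independently for contrast.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
var1 = rng.normal(size=100)
# var2 depends linearly on var1 plus a little noise
var2 = 3 * var1 + rng.normal(scale=0.5, size=100)
# var3 is generated independently of var1
var3 = rng.normal(size=100)

corr_strong, p_strong = pearsonr(var1, var2)
corr_weak, p_weak = pearsonr(var1, var3)

print(f"var1 vs var2: corr={corr_strong:.3f}, p-value={p_strong:.3g}")
print(f"var1 vs var3: corr={corr_weak:.3f}, p-value={p_weak:.3g}")
```

The first pair should show a coefficient close to 1 with a very small p-value; the second pair should show a coefficient much nearer to 0.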
ANOVA (analysis of variance): a statistical method used to test whether there are significant differences between the means of 2 or more groups. It returns 2 parameters (F-test score, p-value).
F-test score: The F-statistic is a ratio of two variances; in ANOVA, the variance between the group means divided by the variance within the groups. Variance is a measure of dispersion, or how far the data are scattered from the mean. Larger values represent greater dispersion.
P-value: The p-value tells how statistically significant our calculated score value is.
For example, if a target variable such as price differs strongly across the groups of the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.
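A minimal sketch of a one-way ANOVA using `scipy.stats.f_oneway`. The three groups are made-up prices split by a hypothetical categorical variable; they are not data from the article.

```python
from scipy.stats import f_oneway

# Hypothetical prices grouped by the levels of a categorical variable
group_a = [10.1, 9.8, 10.4, 10.0, 9.9]
group_b = [15.2, 14.8, 15.5, 15.1, 14.9]
group_c = [10.2, 10.0, 9.7, 10.3, 9.9]

# Groups with clearly different means: large F, small p-value
f_diff, p_diff = f_oneway(group_a, group_b)

# Groups with similar means: small F, large p-value
f_same, p_same = f_oneway(group_a, group_c)

print(f"a vs b: F={f_diff:.2f}, p-value={p_diff:.3g}")
print(f"a vs c: F={f_same:.2f}, p-value={p_same:.3g}")
```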
Chi-square test: it's a test for association between categorical variables.
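from scipy.stats import chi2_contingency

# Contingency table of observed counts (rows and columns are categories)
data = [[207, 282, 241], [234, 242, 232]]
# Returns the test statistic, p-value, degrees of freedom, and expected counts
stat, p, dof, expected = chi2_contingency(data)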
The test doesn't tell us the type of relationship that exists between the two variables, only whether a relationship exists.
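In practice the contingency table is usually built from raw categorical columns by cross-tabulation. The sketch below assumes a hypothetical survey DataFrame with `gender` and `preference` columns, and a conventional 0.05 significance threshold; none of these come from the article.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey: two categorical variables per respondent
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"] * 25,
    "preference": ["A", "B", "A", "B", "A", "A", "B", "B"] * 25,
})

# Cross-tabulation: observed counts for each pair of categories
observed = pd.crosstab(df["gender"], df["preference"])
print(observed)

stat, p, dof, expected = chi2_contingency(observed)
print(f"chi2={stat:.2f}, p-value={p:.4g}, dof={dof}")

alpha = 0.05  # conventional threshold, an assumption here
if p < alpha:
    print("Evidence of an association between the variables")
else:
    print("No evidence of an association")
```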
Multivariate graphical analysis
Multivariate graphical analysis uses graphics to display relationships between two or more sets of data. Common types of multivariate graphics include:
# Scatter plot: three variables plotted against x
import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(100)
y1 = x * 5 + 9               # positive linear relation with x
y2 = -5 * x                  # negative linear relation with x
y3 = np.random.randn(100)    # no relation with x

plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
plt.scatter(x, y1, label='First variable')
plt.scatter(x, y2, label='Second variable')
plt.scatter(x, y3, label='Third variable')

plt.title('Multivariate relation')
plt.legend()
plt.show()
# Line plot: price over time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"Date": ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019',
                            '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019'],
                   "Price": [77, 76, 68, 70, 78, 79, 74, 75]})
plt.figure(figsize=(15, 8))
sns.lineplot(x='Date', y='Price', data=df)
plt.show()
# Bubble chart: marker size encodes petal length
import plotly.express as px

df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species",
                 size='petal_length',
                 hover_data=['petal_width'])
fig.show()
# Heat map of a 10 x 12 matrix of uniform random values
import numpy as np
import seaborn as sns
np.random.seed(0)  # reproducible random values
sns.set_theme()
uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data)
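In EDA a heat map is more often drawn over a correlation matrix than over random values. A minimal sketch of building such a matrix with pandas, assuming a small synthetic DataFrame; the column names `x`, `y_pos`, and `y_neg` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Hypothetical dataset with three numeric columns
df = pd.DataFrame({
    "x": x,
    "y_pos": 5 * x + 9,   # exact positive linear function of x
    "y_neg": -5 * x,      # exact negative linear function of x
})

# Pairwise Pearson correlation coefficients between all columns
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

The resulting `corr_matrix` can be passed to `sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1)` to color each pair of variables by the strength and sign of its correlation.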
___________________________________________________
Specific statistical functions and techniques you can perform with EDA tools include:
_____________________________________________
Data preprocessing article:
You can check the guide for Data preprocessing
Reference: Data Analysis with Python - IBM