Exploratory data analysis
Let your data talk

Exploratory data analysis

Exploratory data analysis (EDA) describes data by means of statistical and visualization techniques, bringing important aspects of the data into focus for further analysis.

The four main types of EDA:

  1. Uni-variate non-graphical analysis
  2. Uni-variate graphical analysis
  3. Multivariate non-graphical analysis
  4. Multivariate graphical analysis

_________________________________________________

Uni-variate non-graphical analysis

Here the data being analyzed consists of just one variable. The main purpose of uni-variate analysis is to describe the data and find the patterns that exist within it.

  • Call the pandas 'describe' method to get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the variable.

X.describe()
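
As a minimal sketch, assuming X is a pandas Series of numeric values (the sample numbers below are made up for illustration), the call could look like this:

```python
import pandas as pd

# Hypothetical sample data; X stands in for any numeric column.
X = pd.Series([12, 15, 11, 19, 15, 22, 18], name="X")

# describe() returns a Series indexed by statistic name.
summary = X.describe()
print(summary)

# Individual statistics can be read from the result by label.
print("median:", summary["50%"])
```

The result is itself a pandas Series, so single statistics such as the median ("50%") can be picked out by label for later checks.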

Uni-variate graphical analysis

Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of uni-variate graphics include:

  • Stem-and-leaf plots, which show all data values and the shape of the distribution.

  • Histograms, bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.

  • Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
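
The numbers behind these three graphics can be computed directly. The sketch below is illustrative only: the sample values and the helper name stem_and_leaf are hypothetical, the stem-and-leaf display is printed as text, and the histogram counts and five-number summary are computed with NumPy rather than drawn.

```python
from collections import defaultdict

import numpy as np

# Hypothetical sample of two-digit values.
values = [12, 15, 21, 23, 23, 29, 34, 35, 38, 41, 47, 52]


def stem_and_leaf(data):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    stems = defaultdict(list)
    for v in sorted(data):
        stems[v // 10].append(v % 10)
    return dict(stems)


plot = stem_and_leaf(values)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

# Histogram counts: how many values fall in each bin of width 10.
counts, bin_edges = np.histogram(values, bins=[10, 20, 30, 40, 50, 60])
print("histogram counts:", counts)

# Five-number summary, the values a basic box plot is built from.
five_numbers = np.percentile(values, [0, 25, 50, 75, 100])
print("min, Q1, median, Q3, max:", five_numbers)
```

Note that plotting libraries may place box plot whiskers differently (for example, at a multiple of the interquartile range), so the drawn whiskers do not always coincide with the minimum and maximum.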

Multivariate non-graphical analysis

Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.

Note: correlation between variables does not imply causation.

Methods that can show relationships between variables include:

  • Pearson correlation

It measures the strength of the linear relationship between two features and reports it as a correlation coefficient together with a p-value.

For a strong correlation, the correlation coefficient should be close to 1 or -1. A p-value below 0.001 is strong evidence that the correlation is statistically significant.

from scipy.stats import pearsonr

# var1 and var2 are two numeric arrays of equal length
corr, p_val = pearsonr(var1, var2)
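
To see how the coefficient and p-value behave, here is a small sketch on synthetic data. The variables var1, var2, and var3 are generated for illustration only: var2 is built to depend on var1, while var3 is independent of it.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic data: var2 depends linearly on var1 plus a little noise,
# while var3 is generated independently of var1.
var1 = rng.normal(size=200)
var2 = 3 * var1 + rng.normal(scale=0.5, size=200)
var3 = rng.normal(size=200)

corr_strong, p_strong = pearsonr(var1, var2)
corr_weak, p_weak = pearsonr(var1, var3)

print(f"var1 vs var2: corr={corr_strong:.3f}, p-value={p_strong:.3g}")
print(f"var1 vs var3: corr={corr_weak:.3f}, p-value={p_weak:.3g}")
```

The first pair should give a coefficient near 1 with a very small p-value, while the second should give a coefficient near 0.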


  • Analysis of variance (ANOVA)

It's a statistical method used to test whether there are significant differences between the means of two or more groups. It returns two values: the F-test score and the p-value.

F-test score: The F-statistic is a ratio of two variances: the variation between the group means divided by the variation within the groups. Variance is a measure of dispersion, or how far the data are scattered from the mean; larger values represent greater dispersion. A larger F-test score therefore means the group means differ more relative to the spread inside each group.

P-value: The p-value tells how statistically significant the calculated score is.

If a target variable such as price is strongly related to the categorical variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.
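
One way to run a one-way ANOVA is scipy.stats.f_oneway, which takes one array of observations per group. The sketch below uses made-up price samples for three hypothetical categories (group_a, group_b, group_c) and is only meant to show how the two returned values react to a real difference in means.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical prices for three categories; group_c has a clearly higher mean.
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=102, scale=10, size=50)
group_c = rng.normal(loc=150, scale=10, size=50)

# Compare all three groups, then only the two similar ones.
f_all, p_all = f_oneway(group_a, group_b, group_c)
f_ab, p_ab = f_oneway(group_a, group_b)

print(f"a vs b vs c: F={f_all:.1f}, p-value={p_all:.3g}")
print(f"a vs b:      F={f_ab:.1f}, p-value={p_ab:.3g}")
```

Including the group with the clearly different mean should produce a much larger F-test score than comparing the two similar groups alone.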

  • Chi-square

It's a test for association between categorical variables.

from scipy.stats import chi2_contingency

# Contingency table of observed counts (rows and columns are categories)
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

The test does not tell us what type of relationship exists between the two variables, only whether there is evidence that one exists.
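
In practice the contingency table usually comes from a cross-tabulation of two categorical columns. The following sketch assumes a hypothetical DataFrame with columns named group and answer; pandas.crosstab builds the table of counts, which is then passed to chi2_contingency.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey records with two categorical columns.
df = pd.DataFrame({
    "group": ["A"] * 60 + ["B"] * 60,
    "answer": ["yes"] * 45 + ["no"] * 15 + ["yes"] * 20 + ["no"] * 40,
})

# Cross-tabulation: counts for every combination of the two categories.
table = pd.crosstab(df["group"], df["answer"])
print(table)

stat, p, dof, expected = chi2_contingency(table)
print(f"chi-square={stat:.2f}, p-value={p:.3g}, dof={dof}")
```

The expected array holds the counts that would be expected if the two columns were independent, which is useful for seeing where the observed counts deviate most.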


Multivariate graphical analysis

Multivariate graphical analysis uses graphics to display relationships between two or more variables. Common types of multivariate graphics include:

  • Scatter plot, which plots data points on a horizontal and a vertical axis to show how one variable relates to another.

  • Multivariate chart, which is a graphical representation of the relationships between factors and a response.

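import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(100)
y1 = x * 5 + 9              # positive linear relation with x
y2 = -5 * x                 # negative linear relation with x
y3 = np.random.randn(100)   # no relation with x

# Plot each variable against x
plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
plt.scatter(x, y1, label='First variable')
plt.scatter(x, y2, label='Second variable')
plt.scatter(x, y3, label='Third variable')

# Add the title and legend, then display the figure
plt.title('Multivariate relation')
plt.legend()
plt.show()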

  • Time series plot, which is a line graph of data plotted over time.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"Date": ['01/01/2019','01/02/2019','01/03/2019','01/04/2019',
                             '01/05/2019','01/06/2019','01/07/2019','01/08/2019'],
                   "Price": [77,76,68,70,78,79,74,75]})
plt.figure(figsize=(15, 8))
sns.lineplot(x='Date', y='Price', data=df)
plt.show()

  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.

import plotly.express as px

df = px.data.iris()

# Bubble size encodes petal_length; color encodes species
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species",
                 size='petal_length',
                 hover_data=['petal_width'])

fig.show()

  • Heat map, which is a graphical representation of data where values are depicted by color.

import numpy as np; np.random.seed(0)

import seaborn as sns; sns.set_theme()

uniform_data = np.random.rand(10, 12)

ax = sns.heatmap(uniform_data)

___________________________________________________

Specific statistical functions and techniques you can perform with EDA tools include:

  • Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
  • Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
  • K-means Clustering is a clustering method in unsupervised learning where data points are assigned into K groups, i.e. the number of clusters, based on the distance from each group’s centroid. The data points closest to a particular centroid will be clustered under the same category. K-means Clustering is commonly used in market segmentation, pattern recognition, and image compression.
  • Predictive models, such as linear regression, use statistics and data to predict outcomes.
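
The K-means description above (assign each point to the nearest centroid, then move each centroid to the mean of its points) can be sketched in a few lines of NumPy. This is a bare-bones illustration, not a production implementation: the function name kmeans, its parameters, and the two synthetic groups of points are all assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two hypothetical, well-separated groups of 2-D points.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
print("centroids:\n", centroids)
print("cluster sizes:", np.bincount(labels, minlength=2))
```

On data this cleanly separated, each group should end up in its own cluster. Real datasets typically need feature scaling and several random restarts, which this sketch leaves out.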

_____________________________________________

Data preprocessing article:

You can check the guide for Data preprocessing

Reference: Data Analysis with Python - IBM

#data_analysis #data_preprocessing #Nourhan_Elsherbiny



