Exploratory data analysis

Nourhan Elsherbiny

ITI Trainee - Java Enterprise & Web Apps development || ISTQB® CTFL

Published: October 21, 2021

Exploratory data analysis (EDA) helps us describe data by means of statistical and visualization techniques, bringing important aspects of that data into focus for further analysis.

The four main types of EDA:

Uni-variate non-graphical analysis
Uni-variate graphical analysis
Multivariate non-graphical analysis
Multivariate graphical analysis

_________________________________________________

Uni-variate non-graphical analysis

In uni-variate analysis, the data being analyzed consists of just one variable. The main purpose is to describe that variable's data, such as its center and spread.

It can be done by calling the 'describe' method, which shows summary statistics for the variable (here X is assumed to be a pandas Series or DataFrame).

X.describe()
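
As a minimal sketch of what 'describe' reports, the example below builds a small pandas Series and prints its summary. The variable name `scores` and its values are made up for illustration and are not from the original article.

```python
import pandas as pd

# Hypothetical sample: ten exam scores for a single variable
scores = pd.Series([55, 61, 67, 70, 72, 75, 78, 84, 90, 98], name="score")

# describe() returns count, mean, std, min, quartiles, and max
summary = scores.describe()
print(summary)
```

The result is itself a Series, so individual statistics can be read by label, for example `summary["mean"]` or `summary["50%"]` for the median.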

Uni-variate graphical analysis

Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of uni-variate graphics include:

Stem-and-leaf plots, which show all data values and the shape of the distribution.

Histograms, bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.

Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
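
To make the first of these concrete, here is a rough text-based stem-and-leaf plot in plain Python. The helper `stem_and_leaf` is a hypothetical function written for this sketch, and it assumes non-negative integers with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (stem = tens, leaf = units)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    lines = []
    # Walk every stem in range so that empty stems still show as gaps
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

data = [55, 61, 67, 70, 72, 75, 78, 84, 90, 98]  # made-up sample values
plot_lines = stem_and_leaf(data)
print("\n".join(plot_lines))
```

Every original value can be read back from the plot, and the row lengths give the shape of the distribution, with most values here falling in the 70s.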

Multivariate non-graphical analysis

Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
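
Cross-tabulation can be sketched with pandas as follows. The column names `gender` and `plan` and all values are invented for illustration; `pd.crosstab` counts how often each pair of categories occurs together.

```python
import pandas as pd

# Hypothetical data: two categorical variables observed on eight cases
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "plan": ["basic", "basic", "premium", "basic",
             "premium", "premium", "basic", "basic"],
})

# Rows are gender, columns are plan, cells are co-occurrence counts
table = pd.crosstab(df["gender"], df["plan"])
print(table)
```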

Note: correlation doesn't imply causation between variables.

Methods that can show relationships between variables include:

Pearson correlation

It measures the strength of the linear correlation between two features and returns a correlation coefficient and a p-value.

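A strong correlation has a correlation coefficient close to 1 or -1. The p-value indicates how certain that result is: a p-value below 0.001 is strong evidence that the correlation is statistically significant.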

from scipy.stats import pearsonr
corr, p_val = pearsonr(var1, var2)
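
A runnable sketch of the same call on simulated data is shown below. The arrays `var1`, `var2`, and `var3` are generated here purely for illustration: `var2` is built from `var1` plus noise, while `var3` is independent noise.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

var1 = rng.normal(size=200)
var2 = 2.0 * var1 + rng.normal(scale=0.5, size=200)  # strongly related to var1
var3 = rng.normal(size=200)                          # unrelated to var1

corr_strong, p_strong = pearsonr(var1, var2)
corr_weak, p_weak = pearsonr(var1, var3)

print(f"var1 vs var2: corr={corr_strong:.3f}, p={p_strong:.3g}")
print(f"var1 vs var3: corr={corr_weak:.3f}, p={p_weak:.3g}")
```

The first pair should give a coefficient near 1 with a very small p-value, while the second pair should give a coefficient near 0.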

Analysis of variance (ANOVA)

It's a statistical method used to test whether there are significant differences between the means of two or more groups. It returns two values: the F-test score and the p-value.

F-test score: The F-statistic is a ratio of two variances. In ANOVA, it is the variation between the group means divided by the variation within the groups. Variance is a measure of dispersion, or how far the data are scattered from the mean; larger values represent greater dispersion.

P-value: The p-value tells us how statistically significant the calculated F-test score is.

For example, if a target variable such as price is strongly related to the categorical variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.
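
One way to run a one-way ANOVA is `scipy.stats.f_oneway`, sketched below. The three groups and their price values are made up for illustration; each list holds the prices observed for one category.

```python
from scipy.stats import f_oneway

# Hypothetical prices, grouped by the value of a categorical feature
group_a = [20, 21, 19, 22, 20]
group_b = [30, 31, 29, 32, 30]
group_c = [25, 26, 24, 27, 25]

# f_oneway returns the F-test score and the p-value
f_score, p_value = f_oneway(group_a, group_b, group_c)
print(f"F={f_score:.2f}, p={p_value:.3g}")
```

Because the group means are far apart compared with the spread inside each group, the F-test score is large and the p-value is small.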

Chi-square

It's a test for association between categorical variables.

from scipy.stats import chi2_contingency

# Contingency table of observed counts (one row per group, one column per category)
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

The test doesn't tell us the type of relationship between the two variables, only whether a relationship exists or not.
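
To turn the test output into a decision, the p-value is usually compared with a significance level. The sketch below reuses the table from the snippet above; the threshold `alpha = 0.05` is a common convention assumed here, not a value given in the original article.

```python
from scipy.stats import chi2_contingency

data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

alpha = 0.05  # assumed significance level
if p < alpha:
    verdict = "evidence of an association"
else:
    verdict = "no evidence of an association"

print(f"chi2={stat:.3f}, dof={dof}, p={p:.3f} -> {verdict}")
```

For this table the p-value is above 0.05, so the test does not find evidence that the two variables are associated.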

Multivariate graphical analysis

Multivariate graphical analysis uses graphics to display relationships between two or more sets of data. Common types of multivariate graphics include:

Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how one variable relates to another.

Multivariate chart, which is a graphical representation of the relationships between factors and a response.

import numpy as np
import matplotlib.pyplot as plt

# Simulated data: y1 rises with x, y2 falls with x, y3 is unrelated noise
x = np.random.randn(100)
y1 = x * 5 + 9
y2 = -5 * x
y3 = np.random.randn(100)

# Plot each variable against x
plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
plt.scatter(x, y1, label='First variable')
plt.scatter(x, y2, label='Second variable')
plt.scatter(x, y3, label='Third variable')

# Add title and legend, then display
plt.title('Multivariate relation')
plt.legend()
plt.show()

Time series plot, which is a line graph of data plotted over time.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Small example dataset: one price per date
df = pd.DataFrame({"Date": ['01/01/2019','01/02/2019','01/03/2019','01/04/2019',
                             '01/05/2019','01/06/2019','01/07/2019','01/08/2019'],
                   "Price": [77,76,68,70,78,79,74,75]})
plt.figure(figsize = (15,8))
sns.lineplot(x = 'Date', y = 'Price', data = df)
plt.show()

Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.

import plotly.express as px

# Built-in iris sample dataset
df = px.data.iris()

# Bubble size encodes petal_length, color encodes species
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species",
                 size='petal_length',
                 hover_data=['petal_width'])

fig.show()

Heat map, which is a graphical representation of data where values are depicted by color.

import numpy as np
import seaborn as sns

np.random.seed(0)
sns.set_theme()

# 10 x 12 grid of random values in [0, 1)
uniform_data = np.random.rand(10, 12)

ax = sns.heatmap(uniform_data)

___________________________________________________

Specific statistical functions and techniques you can perform with EDA tools include:

Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
Univariate visualization of each field in the raw dataset, with summary statistics.
Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
K-means Clustering is a clustering method in unsupervised learning where data points are assigned into K groups, i.e. the number of clusters, based on the distance from each group’s centroid. The data points closest to a particular centroid will be clustered under the same category. K-means Clustering is commonly used in market segmentation, pattern recognition, and image compression.
Predictive models, such as linear regression, use statistics and data to predict outcomes.
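
The K-means description above can be sketched in a few lines of NumPy. This is a simplified illustration of the usual assign-then-update loop, not a production implementation; the helpers `assign` and `kmeans`, the two simulated blobs, and the fixed number of iterations are all assumptions made for this example.

```python
import numpy as np

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=20, seed=0):
    """Basic K-means: random initial centroids, then repeated assign and update."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].copy()
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return assign(points, centroids), centroids

# Two well-separated simulated groups of 2-D points
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)
```

With groups this far apart, the points of each blob should end up sharing a single label, and the two centroids should sit near (0, 0) and (10, 10).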

_____________________________________________

Data preprocessing article:

You can check the guide for Data preprocessing

Reference: Data Analysis with Python - IBM

#data_analysis #data_preprocessing #Nourhan_Elsherbiny
