Exploratory Data Analysis

What is exploratory data analysis?

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis-testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.

Why is exploratory data analysis important in data science?

The main purpose of EDA is to help you look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

Exploratory data analysis tools

Specific statistical functions and techniques you can perform with EDA tools include:

  • Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
  • Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
  • K-means clustering is a clustering method in unsupervised learning in which data points are assigned to K groups (the number of clusters) based on their distance from each group’s centroid. The data points closest to a particular centroid are clustered under the same category. K-means clustering is commonly used in market segmentation, pattern recognition, and image compression; a minimal sketch follows this list.
  • Predictive models, such as linear regression, use statistics and data to predict outcomes.
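For intuition only, here is a minimal K-means sketch. It uses scikit-learn and made-up two-dimensional data; neither is part of the wine-quality walkthrough later in this article, so treat it as an illustrative sketch rather than a recipe.

```python
# Minimal K-means sketch on synthetic 2-D data (illustration only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))        # toy data: 300 points in 2-D

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])           # cluster assigned to the first 10 points
print(kmeans.cluster_centers_)       # one centroid per cluster
```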

Types of exploratory data analysis

There are four primary types of EDA:

  • Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
  • Univariate graphical. Non-graphical methods don’t provide a full picture of the data, so graphical methods are also required. Common types of univariate graphics (two of which appear in the sketch after these lists) include:
  • Stem-and-leaf plots, which show all data values and the shape of the distribution.
  • Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
  • Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate non-graphical. Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
  • Multivariate graphical. Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot or bar chart, with each group representing one level of one variable and each bar within a group representing the levels of the other variable.

Other common types of multivariate graphics include:

  • Scatter plot, which plots data points on a horizontal and a vertical axis to show how much one variable is affected by another (see the sketch after this list).
  • Multivariate chart, which is a graphical representation of the relationships between factors and a response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
  • Heat map, which is a graphical representation of data where values are depicted by color.
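To make a few of these concrete, here is a minimal Matplotlib sketch of a histogram, a box plot, a scatter plot, and a bubble chart. The data is synthetic and purely for illustration; it is not related to the wine data set used later in this article.

```python
# A few of the univariate and multivariate graphics described above,
# drawn from synthetic data purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=300)   # one toy variable
y = 2 * x + rng.normal(scale=2, size=300)   # a second, loosely related variable
z = rng.uniform(10, 200, size=300)          # a third variable, used as bubble size

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(x, bins=30)                 # univariate: histogram
axes[0, 0].set_title("Histogram")
axes[0, 1].boxplot(x)                       # univariate: box plot
axes[0, 1].set_title("Box plot")
axes[1, 0].scatter(x, y)                    # bivariate: scatter plot
axes[1, 0].set_title("Scatter plot")
axes[1, 1].scatter(x, y, s=z, alpha=0.5)    # multivariate: bubble chart
axes[1, 1].set_title("Bubble chart")
plt.tight_layout()
plt.show()
```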

EDA explained using a sample data set

To share my understanding of the concepts and techniques I know, I’ll take the white variant of the Wine Quality data set, which is available on the UCI Machine Learning Repository, and try to extract as many insights from it as possible using EDA.


To start with, I imported the necessary libraries (for this example: pandas, NumPy, Matplotlib, and Seaborn) and loaded the data set.

  • Note: whatever inferences I could extract, I’ve mentioned as bullet points.


  • The original data is separated by the delimiter ";" in the given data set.
  • To take a closer look at the data, I used the .head() function of the pandas library, which returns the first five observations of the data set. Similarly, .tail() returns the last five observations (see the sketch below).
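A minimal sketch of these first steps follows. The local file name winequality-white.csv is an assumption about where the UCI file was saved; adjust the path to your copy.

```python
# Load the white wine quality data set and take a first look.
# NOTE: the file path below is an assumption; point it at your copy of the
# UCI "winequality-white.csv" file.
import pandas as pd

df = pd.read_csv("winequality-white.csv", sep=";")   # values are ';'-separated

print(df.head())   # first five observations
print(df.tail())   # last five observations
```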

I found out the total number of rows and columns in the data set using “.shape”.
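Continuing the same hypothetical df from the sketch above:

```python
# Number of rows and columns as a (rows, columns) tuple.
print(df.shape)
```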


  • The data set comprises 4,898 observations and 12 characteristics.
  • One of these is the dependent variable and the remaining 11 are independent variables (physico-chemical characteristics).

It is also good practice to know the columns and their corresponding data types, and to check whether they contain null values.
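A short sketch of that check, continuing the same df:

```python
# Column names, dtypes and non-null counts in one call,
# plus an explicit per-column count of missing values.
df.info()
print(df.isnull().sum())
```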



  • The data has only float and integer values.
  • No column has null/missing values.

The describe() function in pandas is very handy for getting various summary statistics. This function returns the count, mean, standard deviation, minimum and maximum values, and the quantiles of the data.
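For example:

```python
# Count, mean, std, min, 25%/50%/75% quantiles and max for every column.
print(df.describe())
```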


  • Here, as you can notice, the mean value is greater than the median value (represented by 50%, i.e. the 50th percentile, in the index column) for most of the columns.
  • There is a notably large difference between the 75th percentile and the maximum values of the predictors "residual sugar", "free sulfur dioxide", and "total sulfur dioxide".
  • Observations 1 and 2 together suggest that there are extreme values (outliers) in our data set.

A few key insights from just looking at the dependent variable are as follows:


  • The target/dependent variable is discrete and categorical in nature.
  • The "quality" score scale ranges from 1 to 10, where 1 is poor and 10 is the best.
  • Quality ratings of 1, 2 and 10 are not given in any observation; the only scores obtained are between 3 and 9.


  • This tells us the vote count for each quality score in descending order (the value_counts() call in the sketch below reproduces it).
  • "quality" has most of its values concentrated in the categories 5, 6 and 7.
  • Only a few observations fall into categories 3 and 9.
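A minimal sketch of how these two views of the target were obtained, continuing the same df:

```python
# Distinct quality scores present, and how often each one occurs.
print(df["quality"].unique())          # distinct scores actually present
print(df["quality"].value_counts())    # counts per score, highest first
```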

I got a good glimpse of the data. But that’s the thing with data science: the more you get involved, the harder it is to stop exploring. Let’s now explore the data with some beautiful graphs. Python has a visualization library, Seaborn, which is built on top of Matplotlib. It provides very attractive statistical graphs for performing both univariate and multivariate analysis.

To use linear regression for modelling, it’s necessary to remove correlated variables to improve the model. One can find correlations using pandas’ .corr() function and visualize the correlation matrix using a heatmap in Seaborn.
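A sketch of that heatmap, assuming the usual Seaborn and Matplotlib imports:

```python
# Pairwise correlations of all (numeric) columns, shown as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f")   # annot=True writes the values in the cells
plt.show()
```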


  • Dark shades represent positive correlation while lighter shades represent negative correlation.
  • If you set annot=True, you’ll get the values by which the features are correlated to each other in the grid cells.

It’s a good practice to remove correlated variables during feature selection.


  • Here we can infer that "density" has a strong positive correlation with "residual sugar", whereas it has a strong negative correlation with "alcohol".
  • "free sulfur dioxide" and "citric acid" have almost no correlation with "quality".
  • Since the correlation is close to zero, we can infer that there is no linear relationship between these two predictors and the target; hence it is safe to drop these features if you’re applying a linear regression model to the data set.

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables. The box shows the quartiles of the data set while the whiskers extend to show the rest of the distribution.

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary:

  • Minimum
  • First quartile
  • Median
  • Third quartile
  • Maximum.

In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR).

A segment inside the rectangle shows the median and “whiskers” above and below the box show the locations of the minimum and maximum.



Points that fall more than 1.5×IQR above the third quartile or below the first quartile are typically flagged as outliers; points more than 3×IQR away are considered extreme outliers.
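One way to draw such box plots for every feature with Seaborn (a sketch; the 3×4 subplot grid is just a convenient choice):

```python
# One box plot per feature; points beyond the whiskers are drawn as outliers.
import matplotlib.pyplot as plt
import seaborn as sns

features = df.columns.drop("quality")      # the 11 physico-chemical columns
fig, axes = plt.subplots(3, 4, figsize=(16, 10))
for ax, col in zip(axes.flat, features):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```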

  • In our data set, all feature columns except "alcohol" show outliers.

Now, to check the distribution of the variables, it is good practice to plot a distribution graph for each feature and look for skewness. A kernel density estimate (KDE) is a quite useful tool for plotting the shape of a distribution.
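A sketch of those distribution plots, assuming Seaborn 0.11+ (histplot with kde=True; older versions used distplot instead):

```python
# Histogram with a KDE overlay for each feature, to eyeball skewness.
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(3, 4, figsize=(16, 10))
for ax, col in zip(axes.flat, df.columns.drop("quality")):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```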


  • The "pH" column appears to be normally distributed.
  • All the remaining independent variables are right-skewed (positively skewed).

Lastly, to sum up: exploratory data analysis is a philosophical and artistic approach to gauging every nuance of the data at an early encounter.


