Exploratory Data Analysis
What is exploratory data analysis?
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be widely used in the data discovery process today.
Why is exploratory data analysis important in data science?
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.
Exploratory data analysis tools
Specific statistical functions and techniques you can perform with EDA tools include:
Types of exploratory data analysis
There are four primary types of EDA:
Other common types of multivariate graphics include:
EDA explained using a sample data set
To share my understanding of the concept and the techniques I know, I'll take the white variant of the Wine Quality data set, which is available on the UCI Machine Learning Repository, as an example and try to draw as many insights from it as possible using EDA.
To start with, I imported the necessary libraries (for this example pandas, numpy, matplotlib and seaborn) and loaded the data set.
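A minimal sketch of the loading step might look like the following. It assumes the UCI file is semicolon-delimited, and it reads a small inline sample instead of the real file so that it runs without a download. The helper name load_wine, the reduced column set and the sample rows are illustrative assumptions, not the actual data.

```python
import io

import pandas as pd

# Made-up stand-in for the real file: same semicolon-delimited layout,
# but only a few columns and invented rows.
SAMPLE_CSV = """fixed acidity;residual sugar;density;alcohol;quality
7.1;12.4;0.9980;9.0;6
6.4;1.8;0.9935;10.2;6
8.0;6.5;0.9952;10.0;5
7.3;8.1;0.9958;9.7;5
"""


def load_wine(source):
    """Read a semicolon-delimited wine quality file (path, URL or buffer)."""
    return pd.read_csv(source, sep=";")


wine = load_wine(io.StringIO(SAMPLE_CSV))
print(wine.head())  # first rows, as a quick sanity check
```

With the real file on disk, the same helper would be called with its path in place of the in-memory buffer.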
I found out the total number of rows and columns in the data set using “.shape”.
It is also good practice to check the columns and their data types, and to find out whether they contain null values.
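These first checks can be sketched as below. The DataFrame df is a hypothetical miniature table with one missing value, used only so that the example is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set with one missing value, for illustration only.
df = pd.DataFrame({
    "alcohol": [8.8, 9.5, np.nan, 9.9],
    "pH": [3.00, 3.30, 3.26, 3.19],
    "quality": [6, 6, 6, 5],
})

n_rows, n_cols = df.shape        # (number of rows, number of columns)
dtypes = df.dtypes               # data type of each column
null_counts = df.isnull().sum()  # missing values per column

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(null_counts)
df.info()  # compact overview: dtypes and non-null counts in one call
```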
The describe() function in pandas is very handy for getting summary statistics. For each numeric column it returns the count, mean, standard deviation, minimum and maximum values, and the quantiles (25%, 50% and 75% by default).
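As a small illustration of what describe() returns, here is a sketch on a hypothetical two-column table (the values are invented, not taken from the wine data).

```python
import pandas as pd

# Hypothetical values, chosen so the statistics are easy to verify by hand.
df = pd.DataFrame({
    "alcohol": [8.0, 9.0, 10.0, 11.0, 12.0],
    "quality": [5, 5, 6, 6, 8],
})

summary = df.describe()  # one column of statistics per numeric column
print(summary)

# Individual statistics can be picked out by row label.
print(summary.loc[["mean", "50%"]])
```

Comparing the mean with the median (the 50% row) is a quick first hint of skew: a mean well above the median suggests a long right tail.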
A few key insights from looking at the dependent variable alone are as follows:
I got a good glimpse of the data. But that's the thing with data science: the more you get involved, the harder it is to stop exploring. Let's now explore the data with beautiful graphs. Python has a visualization library, Seaborn, which is built on top of matplotlib. It provides very attractive statistical graphs for performing both univariate and multivariate analysis.
To use linear regression for modelling, it's necessary to remove highly correlated predictor variables, since they carry redundant information and make the coefficient estimates unstable. One can find correlations using the pandas “.corr()” function and visualize the correlation matrix using a heatmap in seaborn.
It’s a good practice to remove correlated variables during feature selection.
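A sketch of this step, on synthetic data, could look like the following. The column names echo the wine data but the values are randomly generated, and the helper names correlated_pairs and plot_corr_heatmap as well as the 0.8 threshold are illustrative assumptions rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Synthetic features: density is built to depend on residual sugar,
# while pH is generated independently of both.
rng = np.random.default_rng(0)
sugar = rng.normal(6.0, 2.0, 200)
df = pd.DataFrame({
    "residual sugar": sugar,
    "density": 0.99 + 0.001 * sugar + rng.normal(0.0, 0.0002, 200),
    "pH": rng.normal(3.2, 0.15, 200),
})

corr = df.corr()  # pairwise Pearson correlation matrix


def correlated_pairs(corr, threshold=0.8):
    """List column pairs whose absolute correlation reaches the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs


def plot_corr_heatmap(corr):
    """Draw the correlation matrix as an annotated seaborn heatmap."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    return ax


pairs = correlated_pairs(corr)
print(pairs)
```

Calling plot_corr_heatmap(corr) draws the heatmap. From each pair that is flagged, one variable is a candidate for removal during feature selection.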
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution.
The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR).
A segment inside the rectangle shows the median and “whiskers” above and below the box show the locations of the minimum and maximum.
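Under this convention, outliers are points that lie 3×IQR or more above the third quartile or 3×IQR or more below the first quartile. A widely used looser rule flags points beyond 1.5×IQR from the quartiles as suspected outliers.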
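The quantities behind a box plot can also be computed directly. The sketch below uses invented values and a hypothetical helper iqr_fences. The multiplier k is a parameter: 3 matches the outlier rule stated above, while 1.5 is another common convention for flagging suspected outliers.

```python
import numpy as np

# Hypothetical measurements, chosen so both kinds of outlier appear.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 16, 30], dtype=float)

# Five-number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1  # interquartile range, the height of the box


def iqr_fences(q1, q3, k):
    """Return (lower, upper) fences placed k * IQR outside the quartiles."""
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


inner_low, inner_high = iqr_fences(q1, q3, k=1.5)
outer_low, outer_high = iqr_fences(q1, q3, k=3.0)

# Points at or beyond the 3 x IQR fences.
outliers = values[(values <= outer_low) | (values >= outer_high)]
# Points beyond the 1.5 x IQR fences that are not already outliers.
beyond_inner = (values <= inner_low) | (values >= inner_high)
suspected = values[beyond_inner & ~np.isin(values, outliers)]

print("five-number summary:", minimum, q1, median, q3, maximum)
print("outliers:", outliers, "suspected:", suspected)
```

Note that np.percentile interpolates linearly between data points by default, so other quartile definitions can give slightly different fences on small samples.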
Now, to check how each variable is distributed before fitting a linear model, it is good practice to plot a distribution graph and look for skewness in the features. A kernel density estimate (KDE) is a useful tool for plotting the shape of a distribution.
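Skewness can be quantified alongside the plot. The following sketch uses two synthetic columns (one roughly symmetric, one right-skewed) rather than the wine data, and the helper name plot_kde is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Synthetic features: one roughly symmetric, one with a long right tail.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "symmetric": rng.normal(10.0, 1.0, 1000),
    "right_skewed": rng.lognormal(0.0, 0.75, 1000),
})

# Sample skewness per column: near 0 for symmetric data,
# positive when the right tail is longer.
skewness = df.skew()
print(skewness)


def plot_kde(series):
    """Plot a kernel density estimate of one feature with seaborn."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots()
    sns.kdeplot(series, ax=ax)
    ax.set_title(f"{series.name} (skew = {series.skew():.2f})")
    return ax
```

Calling plot_kde(df["right_skewed"]) draws the density curve. A strongly skewed feature is often a candidate for a transformation such as a logarithm before modelling.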
Lastly, to sum up, exploratory data analysis is a philosophical and artistic approach to gauging every nuance of the data at first encounter.