Exploratory Data Analysis (EDA) using Python
What is Exploratory Data Analysis (EDA)?
In simple terms, EDA means getting to know the given data better so that we can make sense of it.
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
EDA in Python uses data visualization to uncover meaningful patterns and insights. It also involves preparing data sets for analysis by removing irregularities in the data.
Based on the results of EDA, companies also make business decisions, which can have repercussions later.
- If EDA is not done properly then it can hamper the further steps in the machine learning model building process.
- If done well, it may improve the efficacy of everything we do next.
1. Data Sourcing
Data Sourcing is the process of finding and loading the data into our system. Broadly there are two ways in which we can find data.
- Private Data
- Public Data
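Whether the data is private or public, in Python it is usually loaded with pandas. The sketch below uses an in-memory CSV string as a stand-in for a file path or public URL (the column names and values are hypothetical, chosen only to illustrate the call):

```python
import io
import pandas as pd

# In practice the CSV would come from a file path or a public URL;
# here an in-memory string stands in for it.
csv_text = """age,salary,response
30,50000,yes
45,80000,no
"""

# pd.read_csv accepts file paths, URLs, and file-like objects alike.
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # (2, 3)
```

The same `pd.read_csv` call works unchanged once the string is replaced by a real file path or URL.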
2. Data Cleaning
After completing the Data Sourcing, the next step in the process of EDA is Data Cleaning. It is very important to get rid of the irregularities and clean the data after sourcing it into our system.
Irregularities in the data come in different forms:
- Missing Values
- Incorrect Format
- Incorrect Headers
- Anomalies/Outliers
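The first two kinds of irregularities can be spotted with a couple of pandas calls. A minimal sketch, using a small hypothetical frame with deliberate problems:

```python
import numpy as np
import pandas as pd

# A small frame with deliberate irregularities (hypothetical values).
df = pd.DataFrame({
    "age": [25, np.nan, 40],                # missing value
    "salary": ["50000", "60000", "70000"],  # numeric data stored as text
})

# Missing values per column.
print(df.isnull().sum())

# Incorrect format: coerce the text column to numbers.
df["salary"] = pd.to_numeric(df["salary"])
print(df.dtypes)
```

Checks like these are usually the first thing run after loading a dataset, before any cleaning decisions are made.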
We will use a Jupyter Notebook for the analysis. Import the necessary libraries and load the data into our system.
# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the "Marketing Analysis" data set into data.
data = pd.read_csv("marketing_analysis.csv")

# Print the data.
data
There are discrepancies in the column headers for the first two rows; the correct data starts from index number 1, so we have to fix the first two rows.
This is called Fixing the Rows and Columns.
# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the file into data without the first two rows, as they are of no use.
data = pd.read_csv("marketing_analysis.csv", skiprows=2)

# Print the head of the data frame.
data.head()
Following are the steps to be taken while Fixing Rows and Columns:
- Delete Summary Rows and Columns
- Delete Header and Footer Rows
- Delete Extra Rows like blank rows, page numbers, etc.
- We can merge different columns if it makes for better understanding of the data
- Similarly, we can also split one column into multiple columns based on our requirements or understanding.
- Add Column names, it is very important to have column names to the dataset.
The customerid column is of no importance to our analysis, and the jobedu column contains both the job and education information.
So we will drop the customerid column, split the jobedu column into two new columns, job and education, and then drop the jobedu column as well.
# Drop the customerid column as it is of no use.
data.drop('customerid', axis=1, inplace=True)

# Extract job and education from the "jobedu" column into new columns.
data['job'] = data["jobedu"].apply(lambda x: x.split(",")[0])
data['education'] = data["jobedu"].apply(lambda x: x.split(",")[1])

# Drop the "jobedu" column from the dataframe.
data.drop('jobedu', axis=1, inplace=True)

# Print the dataset.
data
Missing Values
If there are missing values in the dataset, we need to handle them before doing any statistical analysis.
There are mainly three types of missing values.
- MCAR(Missing completely at random)
- MAR(Missing at random)
- MNAR(Missing not at random)
# Checking the missing values
data.isnull().sum()
Drop the Missing Values
# Drop the records with age missing in the data dataframe.
data = data[~data.age.isnull()].copy()

# Check the missing values in the dataset.
data.isnull().sum()

# Drop the records with response missing in data.
data = data[~data.response.isnull()].copy()

# Count the missing values in each column of the data frame.
data.isnull().sum()
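Dropping rows discards data, so imputation is a common alternative, especially when values are MCAR. A minimal sketch on a hypothetical frame, filling missing ages with the column mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0]})

# Fill missing ages with the column mean
# (a reasonable choice only when values are missing completely at random).
df["age"] = df["age"].fillna(df["age"].mean())
print(df["age"].isnull().sum())  # 0
```

For MAR or MNAR data, a group-wise or model-based imputation is usually more appropriate than a global mean.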
Handling Outliers
Outliers are values that lie far away from the other data points.
There are two types of outliers:
- Univariate outliers: Univariate outliers are the data points whose values lie beyond the range of expected values based on one variable.
- Multivariate outliers: While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value.
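Univariate outliers are often flagged with the interquartile-range (IQR) rule, one common heuristic among several. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical series in which 95 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# The IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```

Multivariate outliers cannot be caught this way; they only show up when variables are plotted or modeled together, for example in a scatter plot.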
3. Visual Analysis
Scatter Plot
Take three columns, 'balance', 'age' and 'salary', from our dataset and see what we can infer by plotting scatter plots of salary vs. balance and age vs. balance.
# Plot the scatter plot of the balance and salary variables in data.
plt.scatter(data.salary, data.balance)
plt.show()

# Plot the scatter plot of the balance and age variables in data.
data.plot.scatter(x="age", y="balance")
plt.show()
Bar Plot
# Plot the bar graph of marital status with the average value of response_rate.
data.groupby('marital')['response_rate'].mean().plot.bar()
plt.show()
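The response_rate column used above is not created in the earlier snippets. Assuming the response column holds "yes"/"no" values (an assumption, not part of the original code), it could be derived along these lines:

```python
import pandas as pd

# Hypothetical data: marital status and a yes/no campaign response.
data = pd.DataFrame({
    "marital": ["married", "single", "married", "single"],
    "response": ["yes", "no", "no", "yes"],
})

# Map yes/no to 1/0 so the mean per group becomes a response rate.
data["response_rate"] = data["response"].map({"yes": 1, "no": 0})
rates = data.groupby("marital")["response_rate"].mean()
print(rates)
```

With a 0/1 encoding, the group-wise mean is exactly the fraction of "yes" responses per category, which is what the bar plot visualizes.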
Pie plot
# Calculate the percentage of each education category.
data.education.value_counts(normalize=True)

# Plot the pie chart of education categories.
data.education.value_counts(normalize=True).plot.pie()
plt.show()
Conclusion
This is how we do Exploratory Data Analysis. EDA helps us look beyond the raw data and draw insights from it.