Exploratory Data Analysis(EDA) using Python

Exploratory Data Analysis(EDA) using Python

No alt text provided for this image

What is Exploratory Data Analysis(EDA)?

If we want to explain EDA in simple terms, it means trying to understand the given data much better, so that we can make some sense out of it.

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

EDA in Python uses data visualization to draw meaningful patterns and insights. It also involves the preparation of data sets for analysis by removing irregularities in the data.

Based on the results of EDA, companies also make business decisions, which can have repercussions later.

  • If EDA is not done properly then it can hamper the further steps in the machine learning model building process.
  • If done well, it may improve the efficacy of everything we do next.

1. Data Sourcing

Data Sourcing is the process of finding and loading the data into our system. Broadly there are two ways in which we can find data.

  1. Private Data
  2. Public Data

2. Data Cleaning

After completing the Data Sourcing, the next step in the process of EDA is Data Cleaning. It is very important to get rid of the irregularities and clean the data after sourcing it into our system.

Irregularities are of different types of data.

  • Missing Values
  • Incorrect Format
  • Incorrect Headers
  • Anomalies/Outliers

Using Jupyter Notebook for analysis.

Import the necessary libraries and store the data in our system for analysis.

#import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the data set of "Marketing Analysis" in data.
data= pd.read_csv("marketing_analysis.csv")

# Printing the data
data
No alt text provided for this image

There are some discrepancies in the Column header for the first 2 rows. The correct data is from the index number 1. So, we have to fix the first two rows.

This is called Fixing the Rows and Columns.

#import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the file in data without first two rows as it is of no use.
data = pd.read_csv("marketing_analysis.csv",skiprows = 2)

#print the head of the data frame.
data.head()
No alt text provided for this image

Following are the steps to be taken while Fixing Rows and Columns:

  1. Delete Summary Rows and Columns
  2. Delete Header and Footer Rows
  3. Delete Extra Rows like blank rows, page numbers, etc.
  4. We can merge different columns if it makes for better understanding of the data
  5. Similarly, we can also split one column into multiple columns based on our requirements or understanding.
  6. Add Column names, it is very important to have column names to the dataset.

 The customerid column has of no importance to our analysis, and also the jobedu column has both the information of job and education in it.

So, what we’ll do is, we’ll drop the customerid column and we’ll split the jobedu column into two other columns job and education and after that, we’ll drop the jobedu column as well.

# Drop the customer id as it is of no use.
data.drop('customerid', axis = 1, inplace = True)

#Extract job  & Education in newly from "jobedu" column.
data['job']= data["jobedu"].apply(lambda x: x.split(",")[0])
data['education']= data["jobedu"].apply(lambda x: x.split(",")[1])

# Drop the "jobedu" column from the dataframe.
data.drop('jobedu', axis = 1, inplace = True)

# Printing the Dataset
data
No alt text provided for this image

Missing Values

If there are missing values in the Dataset before doing any statistical analysis, we need to handle those missing values.

There are mainly three types of missing values.

  1. MCAR(Missing completely at random)
  2. MAR(Missing at random)
  3. MNAR(Missing not at random)

# Checking the missing values

data.isnull().sum()

No alt text provided for this image

Drop the missing Values

# Dropping the records with age missing in data dataframe.
data  ?= data[~data.age.isnull()].copy()

# Checking the missing values in the dataset.
data.isnull().sum()



#drop the records with response missing in data.
data  ?= data[~data.response.isnull()].copy()
# Calculate the missing values in each column of #data frame 

data.isnull().sum()
No alt text provided for this image

Handling Outliers

Outliers are the values that are far beyond the next nearest data points.

There are two types of outliers:

  1. Univariate outliers: Univariate outliers are the data points whose values lie beyond the range of expected values based on one variable.
  2. Multivariate outliers: While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value.
No alt text provided for this image

3. Visual Analysis

Scatter Plot

Take three columns ‘Balance’, ‘Age’ and ‘Salary’ from our dataset and see what we can infer by plotting to scatter plot between salary balance and age balance.

#plot the scatter plot of balance and salary variable in data
plt.scatter(data.salary,data.balance)
plt.show()

#plot the scatter plot of balance and age variable in data
data.plot.scatter(x="age",y="balance")
plt.show()
No alt text provided for this image

Bar Plot

#plot the bar graph of marital status with average value of response_rate
data.groupby('marital')['response_rate'].mean().plot.bar()
plt.show()
No alt text provided for this image

Pie plot

#calculate the percentage of each education category.
data.education.value_counts(normalize=True)

#plot the pie chart of education categories
data.education.value_counts(normalize=True).plot.pie()
plt.show()
No alt text provided for this image

Conclusion

This is how we’ll do Exploratory Data Analysis. Exploratory Data Analysis (EDA) helps us to look beyond the data.

Reference:

  • https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
  • https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14
  • https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
  • https://www.ibm.com/cloud/learn/exploratory-data-analysis

要查看或添加评论,请登录

ONASVEE BANARSE的更多文章

社区洞察

其他会员也浏览了