Exploratory Data Analysis (EDA) using Python
What is Exploratory Data Analysis (EDA)?
In simple terms, EDA means getting to know the given data better so that we can make sense of it.
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
EDA in Python uses data visualization to uncover meaningful patterns and insights. It also involves preparing data sets for analysis by removing irregularities in the data.
Based on the results of EDA, companies also make business decisions, which can have repercussions later.
- If EDA is not done properly then it can hamper the further steps in the machine learning model building process.
- If done well, it may improve the efficacy of everything we do next.
1. Data Sourcing
Data Sourcing is the process of finding and loading the data into our system. Broadly there are two ways in which we can find data.
- Private Data
- Public Data
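Whether the data is private or public, in Python it is usually loaded with pandas. The sketch below uses an in-memory CSV string as a stand-in for a file path or public URL (the column names and values are hypothetical, chosen only to illustrate the call):

```python
import io
import pandas as pd

# In practice the CSV would come from a file path or a public URL;
# here an in-memory string stands in for it.
csv_text = """age,salary,response
30,50000,yes
45,80000,no
"""

# pd.read_csv accepts file paths, URLs, and file-like objects alike.
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # (2, 3)
```

The same `pd.read_csv` call works unchanged once the string is replaced by a real file path or URL.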
2. Data Cleaning
After completing the Data Sourcing, the next step in the process of EDA is Data Cleaning. It is very important to get rid of the irregularities and clean the data after sourcing it into our system.
Irregularities in the data come in different forms:
- Missing Values
- Incorrect Format
- Incorrect Headers
- Anomalies/Outliers
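The first two kinds of irregularities can be spotted with a couple of pandas calls. A minimal sketch, using a small hypothetical frame with deliberate problems:

```python
import numpy as np
import pandas as pd

# A small frame with deliberate irregularities (hypothetical values).
df = pd.DataFrame({
    "age": [25, np.nan, 40],                # missing value
    "salary": ["50000", "60000", "70000"],  # numeric data stored as text
})

# Missing values per column.
print(df.isnull().sum())

# Incorrect format: coerce the text column to numbers.
df["salary"] = pd.to_numeric(df["salary"])
print(df.dtypes)
```

Checks like these are usually the first thing run after loading a dataset, before any cleaning decisions are made.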
We will use a Jupyter Notebook for the analysis. Import the necessary libraries and load the data into our system.
# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the "Marketing Analysis" data set into data.
data = pd.read_csv("marketing_analysis.csv")

# Print the data.
data
There are discrepancies in the column headers for the first two rows; the correct data starts from index number 1, so we have to fix the first two rows.
This is called Fixing the Rows and Columns.
# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the file into data without the first two rows, as they are of no use.
data = pd.read_csv("marketing_analysis.csv", skiprows=2)

# Print the head of the data frame.
data.head()
Following are the steps to be taken while Fixing Rows and Columns:
- Delete Summary Rows and Columns
- Delete Header and Footer Rows
- Delete Extra Rows like blank rows, page numbers, etc.
- We can merge different columns if it makes for better understanding of the data
- Similarly, we can also split one column into multiple columns based on our requirements or understanding.
- Add Column names, it is very important to have column names to the dataset.
The customerid column is of no importance to our analysis, and the jobedu column contains both the job and education information.
So we will drop the customerid column, split the jobedu column into two new columns, job and education, and then drop the jobedu column as well.
# Drop the customerid column as it is of no use.
data.drop('customerid', axis=1, inplace=True)

# Extract job and education from the "jobedu" column into new columns.
data['job'] = data["jobedu"].apply(lambda x: x.split(",")[0])
data['education'] = data["jobedu"].apply(lambda x: x.split(",")[1])

# Drop the "jobedu" column from the dataframe.
data.drop('jobedu', axis=1, inplace=True)

# Print the dataset.
data
Missing Values
If there are missing values in the dataset, we need to handle them before doing any statistical analysis.
There are mainly three types of missing values.
- MCAR(Missing completely at random)
- MAR(Missing at random)
- MNAR(Missing not at random)
# Checking the missing values
data.isnull().sum()
Drop the Missing Values
# Drop the records with age missing in the data dataframe.
data = data[~data.age.isnull()].copy()

# Check the missing values in the dataset.
data.isnull().sum()

# Drop the records with response missing in data.
data = data[~data.response.isnull()].copy()

# Count the missing values in each column of the data frame.
data.isnull().sum()
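Dropping rows discards data, so imputation is a common alternative, especially when values are MCAR. A minimal sketch on a hypothetical frame, filling missing ages with the column mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0]})

# Fill missing ages with the column mean
# (a reasonable choice only when values are missing completely at random).
df["age"] = df["age"].fillna(df["age"].mean())
print(df["age"].isnull().sum())  # 0
```

For MAR or MNAR data, a group-wise or model-based imputation is usually more appropriate than a global mean.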
Handling Outliers
Outliers are values that lie far away from the other data points.
There are two types of outliers:
- Univariate outliers: Univariate outliers are the data points whose values lie beyond the range of expected values based on one variable.
- Multivariate outliers: While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value.
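Univariate outliers are often flagged with the interquartile-range (IQR) rule, one common heuristic among several. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical series in which 95 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# The IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```

Multivariate outliers cannot be caught this way; they only show up when variables are plotted or modeled together, for example in a scatter plot.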
3. Visual Analysis
Scatter Plot
Take three columns, 'balance', 'age' and 'salary', from our dataset and see what we can infer by plotting scatter plots of salary vs. balance and age vs. balance.
# Plot the scatter plot of the balance and salary variables in data.
plt.scatter(data.salary, data.balance)
plt.show()

# Plot the scatter plot of the balance and age variables in data.
data.plot.scatter(x="age", y="balance")
plt.show()
Bar Plot
# Plot the bar graph of marital status with the average value of response_rate.
data.groupby('marital')['response_rate'].mean().plot.bar()
plt.show()
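The response_rate column used above is not created in the earlier snippets. Assuming the response column holds "yes"/"no" values (an assumption, not part of the original code), it could be derived along these lines:

```python
import pandas as pd

# Hypothetical data: marital status and a yes/no campaign response.
data = pd.DataFrame({
    "marital": ["married", "single", "married", "single"],
    "response": ["yes", "no", "no", "yes"],
})

# Map yes/no to 1/0 so the mean per group becomes a response rate.
data["response_rate"] = data["response"].map({"yes": 1, "no": 0})
rates = data.groupby("marital")["response_rate"].mean()
print(rates)
```

With a 0/1 encoding, the group-wise mean is exactly the fraction of "yes" responses per category, which is what the bar plot visualizes.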
Pie plot
# Calculate the percentage of each education category.
data.education.value_counts(normalize=True)

# Plot the pie chart of education categories.
data.education.value_counts(normalize=True).plot.pie()
plt.show()
Conclusion
This is how we do Exploratory Data Analysis. EDA helps us look beyond the raw data and draw insights from it.