Exploratory Data Analysis in Python

Complete implementation of Exploratory Data Analysis (EDA) in Python

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis is all about analyzing the dataset and summarizing the key insights and characteristics of the data. EDA is one of the first steps that we follow in a Data Science Project to understand the data better. We can also include some Data Visualization tasks in EDA. Once we get this basic understanding, we can move on with the predictive & prescriptive part.

Checklist for EDA:

1. Checking the different features present in the dataset & its shape

2. Checking the data type of each column

3. Encoding the labels for classification problems

4. Checking for missing values

5. Descriptive summary of the dataset

6. Checking the distribution of the target variable

7. Grouping the data based on target variable

Data Visualization:

8. Distribution plot for all the columns

9. Count plot for Categorical columns

10. Pair plot

11. Checking for Outliers

12. Correlation matrix

13. Inference from EDA

Let’s try to understand the first 7 steps with a use case. I’ll create a separate post on Data Visualization after this.

Understanding EDA with an interesting use case in Python:

Dataset: To understand EDA, we will work with the Breast Cancer Wisconsin (Diagnostic) Data Set. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and they describe characteristics of the cell nuclei present in the image. You can find this dataset on Kaggle or in the UCI ML Repository. You can also download the dataset from here.

Using this dataset, we can build a classification system which can predict whether a person has a Benign or a Malignant tumor. Malignant tumors are considered cancerous. In the EDA part, we will try to understand the characteristics of the data and its descriptive measures.

As a starter, let’s import the dependencies.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder        

The next step is to load the dataset from the csv file to a pandas DataFrame:

breast_cancer_data = pd.read_csv('/content/data.csv')        
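
If the CSV file is not at hand, scikit-learn ships a copy of the same Wisconsin Diagnostic data. The sketch below is only an alternative way to get a DataFrame, not what the rest of this post uses. The bundled copy has no “id” column, names its columns like “mean radius” instead of “radius_mean”, and encodes Malignant as 0 and Benign as 1, so the code flips the labels to match the convention used later here. The name `sk_data` is just for illustration.

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled copy as a pandas DataFrame (30 feature columns + 'target').
bunch = load_breast_cancer(as_frame=True)
sk_data = bunch.frame.copy()

# scikit-learn encodes Malignant as 0 and Benign as 1. Flip it so that
# 0 = Benign and 1 = Malignant, as in the rest of this post.
sk_data['target'] = 1 - sk_data['target']

print(sk_data.shape)
print(sk_data['target'].value_counts())
```

Because the column names differ from the CSV version, any code that refers to a column by name would need adjusting when using this copy.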

  1. Checking the different features present in the dataset:

For this, we can use the head() function in pandas

breast_cancer_data.head()
        
[Image: output of head(), showing the first five rows of the dataset]

Checking the shape of the dataset:

breast_cancer_data.shape        

(569, 32)

As we can see here, the dataset contains 569 rows (data points) and 32 columns: an “id” column, the “diagnosis” column and 30 numerical features.
The second column is “diagnosis”, where “M” represents Malignant & “B” represents Benign. This is our Target column.

2. Checking the data type of each column and the non-null count:

breast_cancer_data.info()        

[Image: first 10 rows of the info() output]

As we can see here, “id” is an integer column and “diagnosis” has the “object” dtype, so it is a categorical variable. The remaining columns are continuous numerical variables.
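
Instead of reading the info() output by eye, we can also ask pandas to split the columns by type. This is a minimal sketch on a small stand-in DataFrame (the name `toy` and its values are only illustrative, not the real data); `select_dtypes` is a standard pandas method.

```python
import pandas as pd

# Toy stand-in for the real data: one id, one label column, two measurements.
toy = pd.DataFrame({
    'id': [101, 102, 103],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.99, 13.54, 12.45],
    'texture_mean': [10.38, 14.36, 15.70],
})

# Integer and float columns are the numerical ones.
numerical_cols = toy.select_dtypes(include='number').columns.tolist()
# Everything that is not numeric (here, the text labels) is a categorical candidate.
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('numerical:', numerical_cols)
print('categorical:', categorical_cols)
```

Note that “id” shows up as numerical because it is stored as an integer, even though it is only an identifier and not a real feature.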

3. Encoding the labels for classification problems:

Now let’s encode the “diagnosis” column, so that all the columns are in the numerical format. We will encode “B” as 0 and “M” as 1.

label_encode = LabelEncoder()
labels = label_encode.fit_transform(breast_cancer_data['diagnosis'])
breast_cancer_data['target'] = labels
breast_cancer_data.drop(columns=['id', 'diagnosis'], inplace=True)
Here, we are encoding the “diagnosis” column, storing it in a different column called “target” and removing the “diagnosis” column. We are also removing the “id” column as it is not necessary.
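
To confirm which number each label received, we can inspect the fitted encoder. The sketch below uses a short made-up series of labels (not the real column). LabelEncoder assigns codes in sorted order of the labels, which is why “B” becomes 0 and “M” becomes 1. The explicit `map` at the end is an alternative I am adding for comparison, not part of the original steps.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the real 'diagnosis' column.
diagnosis = pd.Series(['M', 'B', 'B', 'M', 'B'], name='diagnosis')

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is the code it was given.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# Alternative: spell the mapping out ourselves, so nothing is implicit.
explicit = diagnosis.map({'B': 0, 'M': 1})
print(explicit.tolist())
```

The explicit mapping is handy when we want the code to document the encoding on its own line.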

4. Checking for missing values:

Now, let’s check whether there are any missing values in the dataset.

breast_cancer_data.isnull().sum()        

[Image: first few rows of the output]

The above line of code reports how many missing values there are in each column.
As we can see here, there are no missing values in this case. If a dataset does have missing values, we handle them in the “Feature Engineering” part.
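
Since our dataset has nothing missing, here is a small sketch of what the same check looks like when values are missing. The DataFrame `toy` is made up for illustration. Taking the mean of `isnull()` gives the fraction of missing values per column, which is often easier to read than raw counts.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps in it.
toy = pd.DataFrame({
    'radius_mean': [14.2, np.nan, 12.9, 20.1],
    'texture_mean': [19.5, 21.0, np.nan, np.nan],
    'target': [0, 1, 0, 1],
})

# Number of missing values per column.
missing_count = toy.isnull().sum()
# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({'missing': missing_count, 'percent': missing_pct})
print(summary)

# One number for the whole dataset.
total_missing = int(missing_count.sum())
print('total missing:', total_missing)
```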

5. Descriptive summary of the dataset:

The next step is to get some statistical measures about the dataset. This is what we call “Descriptive Statistics”, which is a summarization of the data. For this, we can use the describe() function in pandas.

breast_cancer_data.describe()        

[Image: a few columns of the describe() output]

The main inference here is that, for most of the columns, the mean is larger than the median (the 50% row, i.e. the 50th percentile). This indicates that those features are right-skewed. The skew will become visible when we create distribution plots for individual features in the Data Visualization part.
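
The mean-versus-median comparison can be sketched on a tiny made-up feature with one large value in its right tail. The series `values` is purely illustrative. pandas also provides `skew()`, where a positive result points to a right-skewed distribution. Treat “mean greater than median” as a quick rule of thumb rather than a strict test.

```python
import pandas as pd

# Made-up feature: mostly small values plus one large value on the right.
values = pd.Series([1, 1, 2, 2, 3, 3, 4, 20], name='toy_feature')

mean_value = values.mean()      # pulled upwards by the large value
median_value = values.median()  # barely affected by it
skewness = values.skew()        # positive for a right-skewed sample

print('mean:', mean_value)
print('median:', median_value)
print('mean > median:', mean_value > median_value)
print('skewness is positive:', skewness > 0)
```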

6. Checking the distribution of the target variable:

The next step is to check the distribution of the dataset based on the target variable to see if there is an imbalance. This is an exclusive step for Classification problems.

breast_cancer_data['target'].value_counts()
0    357
1    212
Name: target, dtype: int64

As we can see, there is a slight imbalance in the dataset: there are more Benign (0) cases than Malignant (1) cases, roughly 63% versus 37%. The imbalance is not large enough to worry about in this case.
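
To put numbers on the imbalance, we can look at proportions instead of raw counts. The sketch below rebuilds a target column from the counts reported above (357 zeros and 212 ones) rather than loading the file. The name `imbalance_ratio` is my own shorthand for majority count divided by minority count.

```python
import pandas as pd

# Rebuild a target column with the class counts reported above.
target = pd.Series([0] * 357 + [1] * 212, name='target')

counts = target.value_counts()
# normalize=True returns the share of each class instead of the count.
proportions = target.value_counts(normalize=True).round(3)
# How many majority-class rows there are per minority-class row.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print('imbalance ratio:', round(imbalance_ratio, 2))
```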

7. Grouping the data based on target variable:

This step also applies only to Classification problems. Here we group the dataset based on the target variable, with 0 & 1 representing Benign & Malignant respectively. For each group, we compute the mean value of every column.

breast_cancer_data.groupby('target').mean()        
[Image: output of groupby('target').mean()]

This clearly tells us that the mean value of most features is greater for Malignant cases than for Benign cases. This inference is very important.
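
With 30 features, comparing the two rows by eye gets tedious. One possible way to rank the features is to divide the Malignant mean by the Benign mean. This is a sketch on made-up numbers (the DataFrame `toy` is not the real data), and the ratio is my own choice of comparison, not something from the original steps.

```python
import pandas as pd

# Made-up data: two features, two Benign (0) and two Malignant (1) rows.
toy = pd.DataFrame({
    'radius_mean': [12.0, 13.0, 18.0, 20.0],
    'area_mean': [450.0, 500.0, 1000.0, 1200.0],
    'target': [0, 0, 1, 1],
})

# One row per class, one column per feature.
class_means = toy.groupby('target').mean()
print(class_means)

# Ratio of the Malignant mean to the Benign mean, largest first.
# A ratio above 1 means the feature is larger on average for Malignant cases.
ratio = (class_means.loc[1] / class_means.loc[0]).sort_values(ascending=False)
print(ratio)
```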


Inferences so far:

  • The dataset has 569 rows & 32 columns.
  • We don’t have any missing values in the dataset.
  • We could see that the data is right skewed for most of the features.
  • There is a slight imbalance in the dataset (Benign cases are more than Malignant cases).
  • The mean value of most features is greater for Malignant cases than for Benign cases.
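
The checks behind these inferences can be gathered into one small helper. This is only a sketch: the function `quick_eda_summary` and the DataFrame `toy` are hypothetical names I am introducing here, and “mean above median” is used as a rough sign of right skew.

```python
import pandas as pd

def quick_eda_summary(df, target_col):
    """Collect a few basic EDA facts about a DataFrame into a dict."""
    features = df.drop(columns=[target_col])
    numeric = features.select_dtypes(include='number')
    # Count numeric features whose mean exceeds their median.
    mean_above_median = int((numeric.mean() > numeric.median()).sum())
    return {
        'rows': df.shape[0],
        'columns': df.shape[1],
        'missing_values': int(df.isnull().sum().sum()),
        'features_with_mean_above_median': mean_above_median,
        'class_counts': df[target_col].value_counts().to_dict(),
    }

# Made-up data: three Benign (0) rows and one Malignant (1) row.
toy = pd.DataFrame({
    'radius_mean': [12.0, 13.0, 13.5, 20.0],
    'area_mean': [450.0, 500.0, 520.0, 1200.0],
    'target': [0, 0, 0, 1],
})

summary = quick_eda_summary(toy, 'target')
print(summary)
```

On the real data, the same call would be `quick_eda_summary(breast_cancer_data, 'target')` once the “target” column has been created.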

NOTE: The EDA is not completed yet. We have to do some Data Visualization tasks to understand the data better. Those topics will be covered in the next post.

