Exploratory Data Analysis: Four Must-Know Techniques

Benjamin Bennett Alexander

Published: May 18, 2024

Introduction

Exploratory data analysis (EDA) is a crucial step in the data analysis process. Data analysts use EDA to get a deeper understanding of the structure, patterns, and potential issues of the data. In this article, I want to share with you four techniques that every data analyst should master: univariate analysis, bivariate analysis, multivariate analysis, and feature engineering. We'll break down the technical terms behind these techniques and see how they can be applied to a data analysis process. First, let's load the libraries and the humans.csv dataset that we are going to use in this article.
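The loading step might look like the sketch below. Because humans.csv itself is not reproduced here, the snippet reads a few made-up rows from an in-memory string. The column names "Height", "Weight", and "Age" and their units (cm, kg, years) are assumptions based on the variables discussed in this article. The snippet also runs a quick missing-value check, which is worth doing before any EDA.

```python
import io

import pandas as pd

# Hypothetical stand-in for humans.csv: invented rows, assumed column names.
csv_text = """Height,Weight,Age
170,70,34
165,62,28
182,85,45
158,55,23
175,78,39
168,66,51
"""

# With the real file on disk this would be: df = pd.read_csv("humans.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first rows, to confirm the data parsed as expected
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # number of missing values per column
```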

Univariate Analysis

A DataFrame is made up of different variables, which are represented by columns. Univariate analysis involves examining the distribution and statistical properties of each variable (column) in isolation. This is important when you want to get a deep understanding of the values in a column. Univariate analysis includes summary statistics, visualizations, and outlier detection. For example, we can use a box plot for outlier detection. Outliers are data points that are significantly different from the rest of the dataset. Outliers can significantly impact the results and insights derived from the data. Identifying outliers allows you to verify their validity and determine if they are genuine data points or errors that need correction. In the example below, we use the box plot to catch outliers in the "Height" column.
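A univariate look at a single column could be sketched as follows. The heights are invented for illustration (assumed to be in cm) and include four deliberately extreme values. They are not the rows of humans.csv.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the script runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up heights in cm (assumed unit); four values are deliberately extreme.
heights = [140, 145, 162, 165, 167, 168, 169, 170,
           170, 171, 172, 173, 175, 178, 200, 210]
df = pd.DataFrame({"Height": heights})

# Summary statistics: count, mean, std, min, quartiles, max.
print(df["Height"].describe())

# Box plot of the single column.
fig, ax = plt.subplots(figsize=(4, 6))
box = ax.boxplot(df["Height"].to_numpy(), patch_artist=True)
ax.set_title("Height")
ax.set_ylabel("Height (cm)")
# fig.savefig("height_boxplot.png")  # or plt.show() in an interactive session

# Points drawn beyond the whiskers are the candidate outliers.
flier_values = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(flier_values)
```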

In this example, the box (the orange box) represents the middle 50% of the data in the "Height" column, and the line inside the box represents the median value. The whiskers extend from the edges of the box to the smallest and largest values that lie within 1.5 times the interquartile range (IQR) of the first and third quartiles. Anything beyond the whiskers is considered an outlier. In this example, we can consider 200 and 210 as upper outliers and 145 and 140 as lower outliers, as they fall outside the range of the whiskers.
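The 1.5 x IQR rule that the box plot applies can also be computed directly, which gives you the outlier values as data rather than as dots on a chart. This is a minimal sketch on invented heights (assumed to be in cm), not on the actual humans.csv column.

```python
import pandas as pd

# Made-up heights in cm (assumed unit), including four extreme values.
heights = pd.Series(
    [140, 145, 162, 165, 167, 168, 169, 170,
     170, 171, 172, 173, 175, 178, 200, 210],
    name="Height",
)

q1 = heights.quantile(0.25)   # first quartile (bottom edge of the box)
q3 = heights.quantile(0.75)   # third quartile (top edge of the box)
iqr = q3 - q1                 # interquartile range (height of the box)

# Values beyond these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = heights[(heights < lower_fence) | (heights > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```

Flagged values still need a judgment call: a height of 210 cm may be a genuine measurement, while a value such as 17 would more likely be a data-entry error.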

Build the Confidence to Tackle Data Analysis Projects (40% OFF)

To build a successful data analysis project, one must have skills in data cleaning and preprocessing, visualization, modeling, EDA, and so forth. The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end, you should be confident enough to take on any data analysis project with Python. Start your 50-day challenge now. Click here to get 40% off.

Other Resources

Want to learn Python fundamentals the easy way? Check out Master Python Fundamentals: The Ultimate Guide for Beginners.

Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.

Looking for Python tips and tricks? Check out Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.

Bivariate Analysis

Once you are done analyzing individual columns, you also want to explore the relationship between two variables. This is known as bivariate analysis. The significance of this analysis is that you want to explore correlations, associations, or differences between pairs of variables. For example, you may want to know how the height of the person impacts their weight. This means that you will have to analyze the "Height" and "Weight" columns for correlations. We can use a scatter plot to analyze if there is a correlation between the two variables:
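A scatter plot of the two columns could be drawn as in the sketch below. The data is synthetic: heights and weights are drawn independently around 170 cm and 70 kg, so the values, the seed, and the sample size are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: independent draws, so no built-in relationship.
rng = np.random.default_rng(seed=42)
n = 200
df = pd.DataFrame({
    "Height": rng.normal(loc=170, scale=8, size=n),   # cm (assumed unit)
    "Weight": rng.normal(loc=70, scale=10, size=n),   # kg (assumed unit)
})

# One point per person: x is height, y is weight.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["Height"], df["Weight"], alpha=0.6)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Height vs. Weight")
# fig.savefig("height_vs_weight.png")  # or plt.show() in an interactive session
```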

You can see that most of the data points are concentrated in the middle of the plot, indicating that a large number of individuals in the dataset have heights around 170 cm and weights around 70 kg. But we do not see any strong signs of correlation between the two variables. We can safely conclude that the scatter plot shows a weak or non-existent correlation between height and weight in the dataset.
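A scatter plot is read by eye, so it can help to back the visual impression with a number. Pearson's correlation coefficient r ranges from -1 to 1, and values near 0 indicate little or no linear relationship. The rows below are made up and arranged so that r comes out at zero, mirroring the reading above. They are not the article's data.

```python
import pandas as pd

# Invented rows (cm and kg assumed), arranged so height and weight are uncorrelated.
df = pd.DataFrame({
    "Height": [160, 170, 180, 170, 165, 175, 175, 165],
    "Weight": [70, 60, 70, 80, 65, 75, 65, 75],
})

# Pearson correlation between the two columns (the default method of Series.corr).
r = df["Height"].corr(df["Weight"])
print(f"Pearson r = {r:.3f}")

# Full correlation matrix, useful once more numeric columns are involved.
corr_matrix = df.corr()
print(corr_matrix)
```

Keep in mind that r only measures linear association, so a value near zero does not rule out a curved relationship.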

Multivariate Analysis

Apart from analyzing two variables, you can also analyze relationships between more than two variables. This type of analysis is known as multivariate analysis. A good starting point for such an analysis is a pairplot, which tackles the challenge of visualizing relationships between multiple variables simultaneously. Each panel of a pairplot is itself a bivariate analysis, since it examines two variables at a time, but the pairplot as a whole is a tool for multivariate analysis because it lets you see the relationships between multiple variables in a single visualization. Here is a pairplot of the three variables in the dataset.
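A grid of this kind could be produced as sketched below. seaborn's pairplot function is the usual way to draw one. To avoid an extra dependency, this sketch swaps in pandas' scatter_matrix, which draws the same layout: scatter plots off the diagonal and histograms on it. The data is synthetic, and the column names and units are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for the dataset: three independent columns.
rng = np.random.default_rng(seed=0)
n = 150
df = pd.DataFrame({
    "Height": rng.normal(170, 8, n),    # cm (assumed unit)
    "Weight": rng.normal(70, 10, n),    # kg (assumed unit)
    "Age": rng.integers(18, 65, n),     # years (assumed unit)
})

# One panel per pair of columns; histograms of each column on the diagonal.
axes = scatter_matrix(df, diagonal="hist", figsize=(7, 7), alpha=0.6)
print(axes.shape)
```

With seaborn installed, the equivalent call would be `sns.pairplot(df)` after `import seaborn as sns`.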

You can see that a pairplot provides both a comprehensive view of the relationships between multiple variables (bivariate analysis) and a univariate view of each variable. The diagonal elements of the plot (histograms) show the distribution of the data for each variable independently. You can use the graphs to assess the shape of the distribution (e.g., normal, skewed, or uniform). The scatter plots in the pairplot help you visualize the relationship between two variables, revealing patterns, correlations, and potential outliers. This pairplot does not show strong relationships between height, weight, and age in this dataset. This is evident from the scatter plots, which do not show any discernible patterns or trends.

Feature Engineering

Feature engineering can also be used during EDA to gain deeper insights into the data. It transforms data into a more informative and usable format, either by adding new variables to the dataset or by transforming existing ones. Let's perform feature engineering on our dataset by adding the body mass index (BMI) column.
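BMI is weight in kilograms divided by the square of height in meters. A minimal sketch of adding the column follows, assuming "Height" is stored in centimeters and "Weight" in kilograms. The rows are invented for illustration.

```python
import pandas as pd

# Invented rows; units are assumptions (cm and kg).
df = pd.DataFrame({
    "Height": [200, 160, 175],
    "Weight": [80, 64, 70],
})

# Convert cm to m, then apply BMI = weight / height^2.
height_m = df["Height"] / 100
df["BMI"] = (df["Weight"] / height_m ** 2).round(1)

print(df)
```

If the heights in your data are already in meters, skip the division by 100. Otherwise every BMI value will be off by a factor of 10,000.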

The column "BMI" has been added to the DataFrame. So, by creating new features or transforming existing ones, you might uncover features that are more relevant to your analysis and remove redundant or irrelevant ones.

Conclusion

These are just a few examples of the various types of EDA techniques used in data analysis. Depending on the nature of the dataset and the specific goals of the analysis, you may use different combinations of these techniques to gain a comprehensive understanding of the data. The book "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners" provides a comprehensive set of challenges to help you learn various types of EDA. Join this LinkedIn group for Python students and professionals to learn more about Python-related topics.

Newsletter Sponsorship

You can reach a highly engaged audience of over 260,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at benjaminbennettalexander@gmail.com today to learn more about the sponsorship opportunities.
