Data Analysis with Python: Handling Missing Values with Pandas and Scikit-Learn
Data is rarely perfect, and missing values are a common challenge when working with real-world datasets. These gaps can arise from human error, incomplete observations, or data collection issues. In Python, missing data is typically represented by NaN values. Before diving into exciting tasks like exploratory analysis or modeling, it is essential to identify these missing values and decide how to handle them.
Dropping Missing Values with Pandas
Sometimes the best strategy for dealing with missing values is simply to drop them. This is the simplest approach. If the amount of missing data is small and the affected data is not critical to the analysis, removing rows or columns with missing values can be a valid choice. As a rule of thumb, when the dataset is large and the proportion of missing data is small (e.g., less than 5%), it may make sense to drop the affected rows or columns. Let's look at an example of how we can drop missing values with Pandas. We are going to load a simple dataset with missing values:
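The original dataset is not shown in this excerpt, so here is a minimal reconstruction that matches the counts discussed below: no missing values in 'Animal', three in 'Age', and a few elsewhere. The 'Weight' column and all specific values are illustrative assumptions, not the article's actual data:

```python
import numpy as np
import pandas as pd

# Illustrative animal dataset with NaN gaps (values are assumed).
# 'Animal' is complete; 'Age' has 3 NaNs; 'Weight' has 2;
# 'Habitat' and 'Endangered' have 1 each, all in different rows.
df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})
print(df)
```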
To check for missing values in a dataset, we can use the isnull() method. This method returns a boolean value of True for missing values and False for non-missing values. By applying the sum() function to the result, we can count the total number of missing values in each column, as True values are treated as 1 and False values as 0. This provides a quick overview of how many missing values exist in each column of the dataset:
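A sketch of this check, using the illustrative dataset reconstructed above (its values are assumptions, but the per-column counts match those discussed in this section):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# isnull() marks each missing cell as True; sum() counts the
# True values per column (True counts as 1, False as 0).
print(df.isnull().sum())
# Animal        0
# Age           3
# Weight        2
# Habitat       1
# Endangered    1
```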
This gives us the total number of missing values in each column. The 'Animal' column has no missing values, but every other column does. The 'Age' column has the most missing values (3).
Let's say a decision has been made to drop all rows with missing values. To drop the rows with missing values, we can use the Pandas dropna() method:
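A sketch of dropping rows, again using the illustrative dataset assumed earlier (seven of its ten rows contain at least one NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# axis=0 (the default) drops every ROW containing at least one NaN
df_dropped = df.dropna(axis=0)
print(df_dropped)
print(f"Rows lost: {1 - len(df_dropped) / len(df):.0%}")  # Rows lost: 70%
```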
In this code, because we want to drop rows with missing values, we use the dropna() method with the axis=0 parameter (which is the default). However, in this case, dropping rows has led to a significant loss of data (70%), which could negatively impact the analysis. This does not seem like a good idea.
Now, if we want to drop all columns with missing values instead, we can pass axis=1 to the dropna() method. Here is the code:
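A sketch of dropping columns instead, on the same assumed dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# axis=1 drops every COLUMN containing at least one NaN
df_cols = df.dropna(axis=1)
print(df_cols)  # only the 'Animal' column survives
```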
This does not seem to be a good idea either, as we end up losing every column except 'Animal', the only one without missing values. You can see that while this method is simple, it can lead to the loss of valuable data if many rows or columns are dropped. Use it carefully, especially if your dataset is small.
An effective strategy to avoid a huge loss of data is to use a condition and drop only the columns with very few missing values. For example, we can drop the columns that have exactly one missing value. See below:
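A sketch of this conditional drop on the assumed dataset, where exactly two columns ('Habitat' and 'Endangered') have a single missing value each:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# Identify columns with exactly one missing value, then drop only those
cols_to_drop = df.columns[df.isnull().sum() == 1]
df_reduced = df.drop(columns=cols_to_drop)
print(list(cols_to_drop))  # ['Habitat', 'Endangered']
print(df_reduced.columns.tolist())
```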
Here, instead of dropping all columns with missing data, we first identified the columns with exactly one missing value and dropped only those from the dataset. This means only two columns have been dropped, which reduces the amount of lost data.
Build the Confidence to Tackle Data Analysis Projects in 2024
Ready to dive in and do some real data analysis? The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end, you should be confident enough to take on any data analysis project with Python. Start your journey with "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners."
Other Resources
My new Python course on classes and functions will help you master these important fundamentals: Check out: Master Python Fundamentals: Classes and Functions
Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.
Pick up 100 Python tips and tricks with Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.
Imputation (Filling Missing Data with Pandas)
Instead of dropping missing values, we can fill them in based on a strategy. Common imputation techniques include filling missing values with the mean, the median, the most frequent value (mode), or a constant value. Let's apply these strategies to our dataset using Pandas' fillna() method:
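A sketch of these strategies with fillna(), using the assumed dataset: mean for 'Age', median for 'Weight', mode for 'Habitat', and the constant True for 'Endangered':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

df_filled = df.copy()
# Numeric columns: mean for roughly symmetric data, median for skewed data
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())
df_filled["Weight"] = df_filled["Weight"].fillna(df_filled["Weight"].median())
# Categorical column: most frequent value (mode)
df_filled["Habitat"] = df_filled["Habitat"].fillna(df_filled["Habitat"].mode()[0])
# Boolean column: a constant chosen from domain knowledge
df_filled["Endangered"] = df_filled["Endangered"].fillna(True)
print(df_filled)
```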
You can see in the output that the numeric columns have been filled with the mean and the median, respectively. In practice, the mean strategy is recommended for normally distributed (bell-shaped) data, as it provides a central tendency that reflects the average value of the data. However, the mean is vulnerable to outliers, extreme values that can significantly distort the average. When the data is skewed, the median strategy is recommended instead.
For the categorical column 'Habitat', we fill the missing value with the most frequent value (mode) in the column. This is a suitable strategy for categorical data. For the boolean column 'Endangered', we fill the missing value with True. This assumes that missing values in this column might indicate an endangered species. This basically demonstrates how you can use pandas to deal with missing data using different strategies.
Imputation (Filling Missing Data with Sklearn)
In the previous example, we used pandas to fill missing values. While the fillna() method in Pandas works effectively, it's not the only option for handling missing data. Sklearn offers the SimpleImputer class, which provides a flexible and efficient way to manage missing values. Let’s reimplement the same strategies using Sklearn to demonstrate how this powerful library can be used for the task:
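A sketch of the same strategies with SimpleImputer, on the assumed dataset. Note that SimpleImputer expects 2D input, so each column is passed as a one-column DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# One imputer per strategy
mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value=True)

df_imputed = df.copy()
df_imputed[["Age"]] = mean_imputer.fit_transform(df_imputed[["Age"]])
df_imputed[["Weight"]] = median_imputer.fit_transform(df_imputed[["Weight"]])
df_imputed[["Habitat"]] = mode_imputer.fit_transform(df_imputed[["Habitat"]])
df_imputed[["Endangered"]] = constant_imputer.fit_transform(df_imputed[["Endangered"]])
print(df_imputed)
```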
In this code, we first create four imputers using the SimpleImputer class. Each imputer applies a specific strategy for filling missing values. After creating the imputers, we use the fit_transform() method to apply these strategies to the appropriate columns. As you can see in the output, the results are identical to those achieved using Pandas' fillna() method.
When should you use Sklearn instead of pandas? The choice largely depends on your specific task. If you're building a pipeline for training machine learning models, SimpleImputer integrates well into the pipeline, making it more suitable for machine learning workflows. On the other hand, pandas is excellent for general data manipulation and analysis tasks. So, while pandas is perfect for exploratory data analysis and preprocessing, Sklearn's SimpleImputer shines in machine learning projects.
Final Thoughts
These examples illustrate how Pandas and Scikit-Learn can be used to handle missing data and the different strategies you can employ. Handling missing data is a critical aspect of data analysis and machine learning, as it directly impacts the quality and accuracy of insights drawn from your dataset.
The choice between Pandas and Scikit-Learn depends on the nature of the task. For exploratory analysis and basic data manipulation, Pandas is often the ideal tool. However, for machine learning projects that require more structured preprocessing and seamless integration into modeling pipelines, Scikit-Learn’s tools, such as SimpleImputer, are more advantageous.
If you want to further develop your skills, the book "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners" provides valuable insights and challenges to deepen your understanding of how to effectively handle missing data using Python. Thanks for reading.