20 Methods for Handling Missing Values in Python Pandas

Ali Soltanhosseini

Data Analyst | SQL, Python, Power BI | Optimizing Data for Actionable Business Insights

Published: August 10, 2024

Identifying and managing missing data is vital for accurate analysis and model performance, and it is a core part of preparing any dataset. Pandas provides a rich set of methods for this: checking a DataFrame for missing values, counting them per column or per row, locating nulls in specific columns, and filling gaps with fillna. In this article, I explore 20 methods for uncovering missing values in a dataset using Pandas. These techniques make the data cleaning process more systematic and lead to more reliable analysis and modeling.

1. Loading A Dataset

Start by loading your dataset into a Pandas DataFrame:

import pandas as pd

# Load dataset
df = pd.read_csv('your_dataset.csv')

When you work with a large dataset, pandas truncates the printed output by default. You can make it display every row and column with:

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

You can also replace None with the maximum number of rows and columns you want displayed:

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 10)
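
The settings above stay in effect for the rest of the session. If you only need the wider display for a single print, a minimal sketch using pd.option_context looks like this (wide_df is a hypothetical DataFrame made up for illustration):

```python
import pandas as pd

# Hypothetical wide DataFrame, used only for illustration
wide_df = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

before = pd.get_option("display.max_columns")

# Lift the column limit only inside the with-block
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide_df.head())

# The previous setting is restored automatically on exit
after = pd.get_option("display.max_columns")
```

This keeps the change local, so later output in the same notebook is not flooded with columns.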

2. Basic Overview of Missing Values

Get a quick summary of missing values:

missing_values = df.isna().sum() 
print(missing_values)

If this reports missing values in a particular column, for example one called Region, you can run the following code to find the affected rows:

missing_region = df['Region'].isna()
missing_row = df[missing_region]
print(missing_row)

Here missing_region is a boolean Series marking which rows have NaN in Region, and missing_row contains only those rows.

For example:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'Country': ['USA', 'Canada', 'Mexico'],
    'Region': ['North America', np.nan, 'Central America']
})

missing_region = df1['Region'].isna()
missing_row = df1[missing_region]
print(missing_row)

  Country Region
1  Canada    NaN

3. Percentage of Missing Values

Calculate the percentage of missing values per column:

missing_percentage = df.isna().mean() * 100
print(missing_percentage)
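
As a small worked example, the sketch below applies the same calculation to a made-up DataFrame (sample_df is an assumption, not part of your dataset) and keeps only the columns that actually have gaps, sorted from worst to best:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# isna() gives booleans; their mean is the fraction of missing values
missing_percentage = sample_df.isna().mean() * 100

# Keep only columns with at least one gap, worst first
report = missing_percentage[missing_percentage > 0].sort_values(ascending=False)
print(report)
```

Column B is missing half of its values and A a quarter, while C is complete and drops out of the report.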

4. Visualize Missing Data

Use the missingno library to visualize missing values:

import missingno as msno

# Matrix plot
msno.matrix(df)

Note that you need to install the library first with the following command:

pip install missingno

The msno.matrix(df) function generates a matrix plot that visualizes the pattern of missing data in your DataFrame df.

Missingno is a useful library for visualizing missing data in pandas DataFrames. For example:

import pandas as pd
import numpy as np
import missingno as msno

# Example DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Visualize the missing data
msno.matrix(df)

5. Heatmap of Missing Data

Generate a heatmap to visualize missing data patterns:

# Heatmap plot 
msno.heatmap(df)
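
The heatmap shows how strongly the presence of a missing value in one column is correlated with a missing value in another. If you want the numbers rather than the picture, a rough pandas-only sketch is shown below. It is an approximation of what the plot conveys, not missingno's own code, and sample_df is made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

nullity = sample_df.isna()

# Columns that are always present or always missing have no variance,
# so their correlation is undefined; leave them out
varying = nullity.loc[:, nullity.nunique() > 1]

# Correlation between the 0/1 "is missing" indicators
nullity_corr = varying.astype(int).corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together; a value near -1 means that when one is missing the other is usually present.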

6. Dendrogram of Missing Data

Explore hierarchical clustering of missing data with a dendrogram:

# Dendrogram plot
msno.dendrogram(df)

7. Check for Missing Values in Specific Rows

Identify rows with any missing values:

rows_with_na = df[df.isna().any(axis=1)]
print(rows_with_na)
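
For instance, on a small made-up DataFrame (sample_df is an assumption for illustration), the mask selects the two rows that contain at least one NaN:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# True for every row that contains at least one NaN
row_mask = sample_df.isna().any(axis=1)

rows_with_na = sample_df[row_mask]
print(rows_with_na)
print(rows_with_na.index.tolist())
```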

8. Check for Missing Values in Specific Columns

Identify columns with missing values:

columns_with_na = df.columns[df.isna().any()].tolist()
print(columns_with_na)

9. Count of Missing Values per Row

Count missing values for each row:

row_missing_count = df.isna().sum(axis=1)
print(row_missing_count)
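
The per-row count is also handy as a filter. The sketch below, on made-up data, keeps only the rows that are missing two or more values; the cutoff of 2 is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# Number of NaN values in each row
row_missing_count = sample_df.isna().sum(axis=1)

# Rows missing at least two values (arbitrary cutoff)
mostly_empty = sample_df[row_missing_count >= 2]
print(mostly_empty)
```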

10. Count of Missing Values per Column

Count missing values for each column:

column_missing_count = df.isna().sum()
print(column_missing_count)

11. Summary of Non-Null Values

Get a summary of non-null values using the info() method:

df.info()
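
info() prints its report rather than returning it. If you need the non-null counts as data, count() returns them as a Series, and info() can write into a text buffer through its buf parameter. A minimal sketch on made-up data:

```python
import io

import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# Capture the printed report as a string
buffer = io.StringIO()
sample_df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Non-null values per column, as a Series you can compute with
non_null = sample_df.count()
print(non_null)
```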

12. Display Rows with All Missing Values

Find rows where all values are missing:

rows_all_na = df[df.isna().all(axis=1)]
print(rows_all_na)

13. Display Columns with All Missing Values

Find columns where all values are missing:

columns_all_na = df.columns[df.isna().all()].tolist()
print(columns_all_na)
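
The two checks above can be tried on a small made-up DataFrame (df_gaps is an assumption for illustration). The last line shows one possible follow-up, dropping the completely empty rows and columns with dropna(how="all"):

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 and column B are entirely missing
df_gaps = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, np.nan],
    "C": [7.0, np.nan, 9.0],
})

# Rows where every value is NaN
rows_all_na = df_gaps[df_gaps.isna().all(axis=1)]

# Columns where every value is NaN
columns_all_na = df_gaps.columns[df_gaps.isna().all()].tolist()

# Drop fully empty rows first, then fully empty columns
cleaned = df_gaps.dropna(how="all").dropna(axis=1, how="all")
print(cleaned)
```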

14. Use describe() Method for Missing Data

Get a statistical summary of the numerical columns. The count row reports non-null values, so a count lower than the number of rows signals missing data:

desc = df.describe()
print(desc)
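
Because the count row only counts non-null entries, subtracting it from the number of rows gives the missing values per numeric column. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

desc = sample_df.describe()

# count holds non-null values, so the difference is the number of gaps
missing_from_describe = len(sample_df) - desc.loc["count"]
print(missing_from_describe)
```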

15. Use Boolean Indexing for Conditional Missing Values

Find missing values based on specific conditions:

conditional_missing = df[df['some_column'].isna() & (df['another_column'] > threshold)]
print(conditional_missing)
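
Here is a runnable version of the same idea. The orders DataFrame, its column names, and the threshold of 100 are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: some orders have no recorded discount
orders = pd.DataFrame({
    "discount": [0.1, np.nan, np.nan, 0.2],
    "amount": [50, 500, 20, 900],
})

threshold = 100  # arbitrary value for this example

# The parentheses around the comparison are required because & binds
# more tightly than > in Python
conditional_missing = orders[
    orders["discount"].isna() & (orders["amount"] > threshold)
]
print(conditional_missing)
```

Only the order at index 1 matches: its discount is missing and its amount exceeds the threshold.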

16. Check for Missing Values in a Subset of Columns

Identify missing values in selected columns:

subset_missing = df[['column1', 'column2']].isna().sum()
print(subset_missing)

17. Count Unique Missing Value Patterns

Count unique patterns of missing values across rows:

unique_patterns = df.isna().astype(int).groupby(df.columns.tolist()).size()
print(unique_patterns)
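
On a small made-up DataFrame, the result has one entry per distinct combination of present and missing columns. The sketch also shows DataFrame.value_counts() as an alternative that produces the same counts, sorted by frequency:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# 1 marks a missing value, 0 a present one
unique_patterns = (
    sample_df.isna().astype(int).groupby(sample_df.columns.tolist()).size()
)
print(unique_patterns)

# Alternative: count identical rows of the boolean mask directly
pattern_counts = sample_df.isna().value_counts()
print(pattern_counts)
```

Here two rows are complete, one row is missing only B, and one row is missing both A and B.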

18. Analyze Missing Data by Group

Examine missing data within specific groups:

grouped_missing = df.groupby('group_column').apply(lambda x: x.isna().sum())
print(grouped_missing)
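
An alternative sketch that avoids apply() is to build the missing-value mask first and then group it by the grouping column. The sales DataFrame and its column names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "units": [10, np.nan, np.nan, np.nan],
    "price": [1.5, 2.0, np.nan, 3.0],
})

value_columns = ["units", "price"]

# Group the boolean mask by region and sum the True values
grouped_missing = sales[value_columns].isna().groupby(sales["region"]).sum()
print(grouped_missing)
```

The result has one row per group and one column per value column, so it is easy to spot a group where a field is rarely filled in.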

19. Use dropna() to Identify Missing Data Removal

Check how many rows would remain if dropna() were applied, and compare the result with df.shape:

print(df.dropna().shape)
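
The shape alone does not say which rows or columns are affected. A minimal sketch on made-up data that lists them by comparing labels before and after the call:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
sample_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# Rows that survive dropna()
complete = sample_df.dropna()

# Index labels present before but not after: the rows that would be lost
dropped_index = sample_df.index.difference(complete.index)
rows_lost = len(sample_df) - len(complete)

# Same idea for columns: which ones survive dropna(axis=1)
cols_kept = sample_df.dropna(axis=1).columns.tolist()

print(dropped_index.tolist(), rows_lost, cols_kept)
```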

20. Impute Missing Values and Compare

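Impute missing values in the numeric columns using each column's mean, then check what is still missing:

# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
df_imputed = df.fillna(df.mean(numeric_only=True))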
print(df_imputed.isna().sum())
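
To make the comparison explicit, the sketch below puts the missing-value counts before and after imputation side by side. df_mixed is made-up data with one text column and one numeric column. Mean imputation only fills the numeric column here, so the text column still shows its gap:

```python
import numpy as np
import pandas as pd

# Made-up data: one text column and one numeric column
df_mixed = pd.DataFrame({
    "city": ["Oslo", None, "Rome"],
    "temp": [10.0, np.nan, 20.0],
})

before = df_mixed.isna().sum()

# Column means are computed for numeric columns only
df_imputed = df_mixed.fillna(df_mixed.mean(numeric_only=True))

after = df_imputed.isna().sum()

comparison = pd.DataFrame({"before": before, "after": after})
print(comparison)
```

Non-numeric columns need a different strategy, such as a placeholder value or the most frequent category.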

Conclusion

Pandas offers a comprehensive set of tools for identifying and managing missing values, from counting nulls per column or row to locating them under specific conditions and filling them with fillna. Applying these techniques before analysis gives you a clear picture of where the gaps are and how much data would be lost or altered by each cleaning choice. Using them routinely, in Jupyter Notebook or any other Python environment, streamlines the data cleaning process and makes the resulting insights more reliable.

Article source: https://www.databizex.com/handling-missing-values-in-python-pandas/
