20 Methods for Handling Missing Values in Python Pandas
Ali Soltanhosseini
Data Analyst | SQL, Python, Power BI | Optimizing Data for Actionable Business Insights
Identifying and managing missing data is vital for accurate analysis and model performance, and it is a core step in preparing any dataset. Pandas provides a rich set of methods for this: checking a DataFrame for missing values, counting them per column or per row, locating nulls in specific columns, and filling gaps with fillna. In this article, I explore 20 methods for uncovering missing values in a dataset using Pandas, so that data cleaning becomes more systematic and the resulting analysis and modeling more reliable.
1. Loading A Dataset
Start by loading your dataset into a Pandas DataFrame:
import pandas as pd
# Load dataset
df = pd.read_csv('your_dataset.csv')
When a DataFrame is large, pandas truncates its printed output. To display every row and column, set the display limits to None:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
You can also replace None with the number of rows and columns you want displayed:
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 10)
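Because pd.set_option changes the setting for the rest of the session, a temporary override can be handy. The following is a minimal sketch using pd.option_context; the demo DataFrame is hypothetical and stands in for your own data.

```python
import pandas as pd

# Hypothetical stand-in for a long DataFrame
demo = pd.DataFrame({"x": range(100)})

before = pd.get_option("display.max_rows")

# Lift the row limit only inside this block
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")
    shown_lines = len(repr(demo).splitlines())  # header plus every row

# The previous limit is restored automatically on exit
after = pd.get_option("display.max_rows")
print(inside, after, shown_lines)
```

If you have already changed a setting globally, pd.reset_option('display.max_rows') returns it to the default.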
2. Basic Overview of Missing Values
Get a quick summary of missing values:
missing_values = df.isna().sum()
print(missing_values)
If the summary shows missing values in a particular column, for example a column called Region, you can run the following code to find the affected rows:
missing_region = df['Region'].isna()
missing_row = df[missing_region]
print(missing_row)
Here, missing_region is a boolean Series marking which rows of Region are NaN, and missing_row contains only the rows where Region is missing.
For example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'Country': ['USA', 'Canada', 'Mexico'],
'Region': ['North America', np.nan, 'Central America']
})
missing_region = df1['Region'].isna()
missing_row = df1[missing_region]
print(missing_row)
  Country Region
1  Canada    NaN
3. Percentage of Missing Values
Calculate the percentage of missing values per column:
missing_percentage = df.isna().mean() * 100
print(missing_percentage)
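As a small worked example, the sketch below applies the same calculation to a hypothetical four-row DataFrame and sorts the result so the columns with the most gaps come first. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: A has 1 of 4 values missing, B has 2 of 4, C has none
sample = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# isna() gives True/False; the mean of booleans is the fraction of True values
pct = sample.isna().mean() * 100

# Columns with the most missing data first
pct_sorted = pct.sort_values(ascending=False)
print(pct_sorted.round(1))
```

Sorting is optional, but on wide datasets it makes the columns that need attention easy to spot.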
4. Visualize Missing Data
Use the missingno library to visualize missing values:
import missingno as msno
# Matrix plot
msno.matrix(df)
Note that missingno must be installed first with the following command:
pip install missingno
The msno.matrix(df) function generates a matrix plot that visualizes the pattern of missing data in your DataFrame df.
Missingno is a useful library for visualizing missing data in pandas DataFrames. For example:
import pandas as pd
import numpy as np
import missingno as msno
# Example DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})
# Visualize the missing data
msno.matrix(df)
5. Heatmap of Missing Data
Generate a heatmap to visualize missing data patterns:
# Heatmap plot
msno.heatmap(df)
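The heatmap shows nullity correlation, that is, how strongly the missingness of one column is tied to the missingness of another. To see the kind of numbers behind such a plot, here is a pandas-only sketch on hypothetical data. It approximates the idea and is not missingno's exact implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data
sample = pd.DataFrame({
    "A": [1, np.nan, 3, np.nan, 5],
    "B": [1, np.nan, 3, np.nan, 5],        # missing in the same rows as A
    "C": [np.nan, 2, np.nan, 4, np.nan],   # missing exactly where A is present
    "D": [1, 2, 3, 4, 5],                  # complete, so it carries no signal
})

mask = sample.isna()

# Keep only columns that have both missing and present values;
# a constant mask has no variance and cannot be correlated
varying = mask.loc[:, mask.nunique() > 1]

# Correlate the 0/1 missingness indicators
nullity_corr = varying.astype(int).corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together, and a value near -1 means one tends to be present when the other is missing.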
6. Dendrogram of Missing Data
Explore hierarchical clustering of missing data with a dendrogram:
# Dendrogram plot
msno.dendrogram(df)
7. Check for Missing Values in Specific Rows
Identify rows with any missing values:
rows_with_na = df[df.isna().any(axis=1)]
print(rows_with_na)
8. Check for Missing Values in Specific Columns
Identify columns with missing values:
columns_with_na = df.columns[df.isna().any()].tolist()
print(columns_with_na)
9. Count of Missing Values per Row
Count missing values for each row:
row_missing_count = df.isna().sum(axis=1)
print(row_missing_count)
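Per-row counts become more useful once you filter on them. The sketch below, on hypothetical data, keeps only the rows that are missing two or more values. The cutoff of 2 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data
sample = pd.DataFrame({
    "A": [1, np.nan, np.nan],
    "B": [np.nan, np.nan, 6],
    "C": [7, 8, 9],
})

# Number of missing values in each row
row_missing = sample.isna().sum(axis=1)

# Rows missing at least 2 values (arbitrary cutoff)
sparse_rows = sample[row_missing >= 2]
print(sparse_rows)
```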
10. Count of Missing Values per Column
Count missing values for each column:
column_missing_count = df.isna().sum()
print(column_missing_count)
11. Summary of Non-Null Values
Get a summary of non-null values using the info() method:
df.info()
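Since info() prints its report rather than returning it, it can help to get the same non-null counts as a Series you can work with. A minimal sketch on hypothetical data, using count() and the buf parameter of info():

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data
sample = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
})

# count() returns the non-null count per column as a Series
non_null = sample.count()

# info() writes to stdout by default; buf= captures the text instead
buffer = io.StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()

print(non_null)
```

Subtracting non_null from len(sample) gives the missing count per column, which matches isna().sum().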
12. Display Rows with All Missing Values
Find rows where all values are missing:
rows_all_na = df[df.isna().all(axis=1)]
print(rows_all_na)
13. Display Columns with All Missing Values
Find columns where all values are missing:
columns_all_na = df.columns[df.isna().all()].tolist()
print(columns_all_na)
14. Use describe() Method for Missing Data
Get a statistical summary of the numerical columns. The count row reports non-null values, so a count lower than the number of rows indicates missing data:
desc = df.describe()
print(desc)
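To make the link between describe() and missing data explicit, the sketch below derives missing counts from the count row. The data is hypothetical, and by default describe() only covers numeric columns, so other columns would need a separate check.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data
sample = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [5.0, np.nan, np.nan, 8.0],
})

desc = sample.describe()

# The count row holds non-null counts; the gap to len(sample) is what is missing
missing_from_desc = len(sample) - desc.loc["count"]
print(missing_from_desc)
```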
15. Use Boolean Indexing for Conditional Missing Values
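Find rows where one column is missing and another meets a condition (some_column, another_column, and threshold are placeholders for your own column names and value):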
conditional_missing = df[df['some_column'].isna() & (df['another_column'] > threshold)]
print(conditional_missing)
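Here is a runnable version of the same pattern with the placeholders filled in. The data and the threshold of 100 are hypothetical. Note the parentheses around the comparison: & binds more tightly than >, so they are required.

```python
import numpy as np
import pandas as pd

# Hypothetical data
orders = pd.DataFrame({
    "some_column": [np.nan, 10.0, np.nan, 7.0],
    "another_column": [120, 300, 40, 90],
})

threshold = 100  # hypothetical cutoff

# Rows where some_column is missing AND another_column exceeds the threshold
conditional_missing = orders[
    orders["some_column"].isna() & (orders["another_column"] > threshold)
]
print(conditional_missing)
```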
16. Check for Missing Values in a Subset of Columns
Identify missing values in selected columns:
subset_missing = df[['column1', 'column2']].isna().sum()
print(subset_missing)
17. Count Unique Missing Value Patterns
Count unique patterns of missing values across rows:
unique_patterns = df.isna().astype(int).groupby(df.columns.tolist()).size()
print(unique_patterns)
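An alternative sketch for the same idea uses DataFrame.value_counts() on the missingness mask, which also sorts the patterns from most to least frequent. The data is hypothetical; 1 marks a missing value and 0 a present one.

```python
import numpy as np
import pandas as pd

# Hypothetical data
sample = pd.DataFrame({
    "A": [1, np.nan, 3, np.nan],
    "B": [np.nan, np.nan, 6, np.nan],
    "C": [7, 8, 9, 10],
})

# Each distinct combination of 0/1 flags across columns is one pattern
patterns = sample.isna().astype(int).value_counts()
print(patterns)
```

The most common pattern appears first, which makes it easy to see whether gaps cluster in the same columns.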
18. Analyze Missing Data by Group
Examine missing data within specific groups:
grouped_missing = df.groupby('group_column').apply(lambda x: x.isna().sum())
print(grouped_missing)
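The same per-group count can also be written without apply, by grouping the missingness mask directly. This is a sketch of one alternative on hypothetical data, and group_column stands in for your own grouping column.

```python
import numpy as np
import pandas as pd

# Hypothetical data
sales = pd.DataFrame({
    "group_column": ["east", "east", "west", "west"],
    "price": [1.0, np.nan, np.nan, np.nan],
    "qty": [1, 2, np.nan, 4],
})

# Build the True/False mask for the value columns,
# then sum it within each group to count missing values
grouped_missing = (
    sales.drop(columns="group_column")
    .isna()
    .groupby(sales["group_column"])
    .sum()
)
print(grouped_missing)
```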
19. Use dropna() to Preview Missing Data Removal
Check how many rows would be dropped if dropna() were applied by comparing the shape before and after:
print(df.shape)
print(df.dropna().shape)
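The shape tells you how much would be removed, but not what. The sketch below, on hypothetical data, lists the row labels and column names that dropna() would discard by comparing indexes before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical data
sample = pd.DataFrame({
    "A": [1, np.nan, 3],
    "B": [4, 5, 6],
    "C": [np.nan, 8, 9],
})

# Row labels present before dropna() but absent afterwards
rows_dropped = sample.index.difference(sample.dropna().index)

# Column names that dropna(axis=1) would remove
cols_dropped = sample.columns.difference(sample.dropna(axis=1).columns)

print(list(rows_dropped), list(cols_dropped))
```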
20. Impute Missing Values and Compare
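Impute missing values in the numeric columns using the column mean and compare results (numeric_only=True is needed in recent pandas versions when the DataFrame also contains non-numeric columns):
df_imputed = df.fillna(df.mean(numeric_only=True))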
print(df_imputed.isna().sum())
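To illustrate what the comparison can reveal, here is a minimal sketch on hypothetical data with one numeric and one text column. Mean imputation fills only the numeric column, and it reduces the spread of that column while leaving its mean unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data
sample = pd.DataFrame({
    "score": [10.0, np.nan, 30.0],
    "city": ["Oslo", None, "Lima"],
})

# Column means for numeric columns only
means = sample.mean(numeric_only=True)

# fillna with a Series fills each column using the matching label
imputed = sample.fillna(means)

missing_before = sample.isna().sum()
missing_after = imputed.isna().sum()
print(missing_before, missing_after, sep="\n")

# The mean is preserved, but the standard deviation shrinks
print(sample["score"].std(), imputed["score"].std())
```

The text column still has its gap afterwards, so non-numeric columns need a different strategy, such as filling with the most frequent value.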
Conclusion
Pandas offers a comprehensive set of tools for identifying and managing missing values. Counting nulls per column and per row, locating them in specific columns or groups, visualizing their patterns, and filling them with fillna all help you understand where the gaps are before you decide how to treat them. Applying these methods, in Jupyter Notebook or any other Python environment, streamlines the data cleaning process and makes the resulting insights more reliable.