Filling the Gaps: A Beginner's Guide to Handling Missing Data in Python


Missing values aren’t the end of the story; they’re just the start of a cleaner, more insightful narrative.

Hello, folks! Hope you’re all doing well. Recently, while working on a project, I encountered a significant challenge with missing data. This inspired me to share my approach and insights through an article on handling missing data effectively using Python. I believe this will be helpful for anyone dealing with similar issues in their data analysis or machine learning workflows.

When working with large datasets, you may encounter different types of missing values. These are typically categorized into three types:

Missing Completely at Random (MCAR):

  • What it means: Data is missing for no specific reason. It doesn’t depend on the data values or any other factors in the dataset.
  • Example: Suppose you're surveying people, and some responses are missing because a few forms accidentally got lost. This missing data has nothing to do with the respondents or their answers—it’s purely random.

Missing at Random (MAR):

  • What it means: Data is missing, but it’s related to other observed information in the dataset, not the missing values themselves.
  • Example: In a medical study, older participants are less likely to report their income. The missing income data is related to the age of the participants (which is available), but not to the actual income values.

Missing Not at Random (MNAR):

  • What it means: Data is missing due to reasons that are directly related to the missing values themselves.
  • Example: In a survey asking about alcohol consumption, people who drink a lot may skip answering this question because they’re uncomfortable sharing it. The missing data is related to their actual drinking habits.
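
These mechanisms are easier to tell apart when simulated. Below is a minimal sketch (using a synthetic survey rather than the Titanic data, with made-up thresholds and probabilities) of how each type of missingness can arise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Toy survey: age and income for n respondents
df = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(50_000, 15_000, size=n),
})

# MCAR: 10% of income values vanish purely at random
mcar = df['income'].mask(rng.random(n) < 0.10)

# MAR: older respondents (an *observed* column) are more
# likely to skip the income question
mar = df['income'].mask((df['age'] > 60) & (rng.random(n) < 0.50))

# MNAR: high earners skip the question *because* of their income,
# the very value that goes missing
mnar = df['income'].mask((df['income'] > 70_000) & (rng.random(n) < 0.50))

print(mcar.isna().sum(), mar.isna().sum(), mnar.isna().sum())
```

In real data you can rarely prove which mechanism applies: MCAR and MAR are usually diagnosed by checking whether missingness correlates with observed columns, while MNAR generally requires domain knowledge.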


Working on the Sample Dataset

For this tutorial, I will be working with the famous Titanic dataset. I’ll demonstrate some beginner-friendly techniques to handle missing values effectively, making it easier for you to apply them in your own projects.

Loading the Titanic Dataset

The Titanic dataset is a classic dataset widely used for learning data analysis and machine learning. It contains details about the passengers aboard the RMS Titanic, which sank in 1912. This dataset is often used for classification tasks, such as predicting whether a passenger survived based on features like age, class, and gender.

How to Load the Dataset

The Titanic dataset is readily available in Python’s Seaborn library. You can load it with the following code:

import seaborn as sns
import pandas as pd

# Load Titanic dataset
df = sns.load_dataset('titanic')        

Basic Dataset Exploration

Once the dataset is loaded, you can explore it using these commands:

1. View Dataset Content

  • Top 5 rows:

df.head()        

  • Last 5 rows:

df.tail()        

  • 5 Random rows:

df.sample(5)        

2. Dataset Shape

  • To see the number of rows and columns:

df.shape        

3. Dataset Information

  • To check column names, data types, and non-null counts:

df.info()        

4. Statistical Summary

  • To get the summary statistics (mean, min, max, etc.) for numerical columns:

df.describe()        

5. Check for Missing Values

  • To identify columns with missing values and their counts:

df.isnull().sum()        

6. Check for Duplicate Values

  • To find the number of duplicate rows:

df.duplicated().sum()        

Techniques to Handle Missing Values

A dataset with missing values is like a puzzle—solve it wisely, and the picture becomes clear.

If you’re new to data analysis and machine learning workflows, here are two beginner-friendly techniques to handle missing values effectively:

Solution 1: Delete Rows or Data Points with Missing Values

  • Description: Simply remove the rows or columns containing missing data. This is straightforward but should be used cautiously.
  • When to Use: When the proportion of missing data is very small (e.g., less than 5-10%) and removing those rows won’t significantly impact your analysis or model.
  • Drawback: Can lead to loss of valuable data, especially if many rows or columns are removed.

# Drop rows with any missing value (returns a new DataFrame; df itself is unchanged)
df.dropna().shape

# Drop columns with any missing value
df.dropna(axis=1).shape
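
Before dropping anything, it helps to quantify how much each column would lose. Here is a small helper (an illustration, not part of the Titanic workflow above) that reports the percentage of missing values per column so you can apply the 5-10% rule of thumb:

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    pct = df.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False).round(2)

# Toy frame standing in for the real data
demo = pd.DataFrame({
    'age':  [22, None, 38, None, 35],
    'fare': [7.25, 71.28, 7.92, 53.1, 8.05],
    'deck': [None, 'C', None, 'C', None],
})

print(missing_report(demo))
# deck is 60% missing (a candidate for dropping the whole column),
# age is 40% missing (too much to safely drop rows)
```

Columns with a very high missing percentage are often better dropped entirely, while columns with only a few gaps are good candidates for the imputation techniques below.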

Solution 2: Imputation Techniques

Instead of removing data, you can fill in missing values using imputation. Here are three common techniques:

Mean Imputation:

  • Description: Replace missing values with the mean of the column.
  • When to Use: When the data is normally distributed (symmetrical with no extreme outliers).
  • Example: If the "Age" column has missing values and is normally distributed, replace the missing values with the average age of the dataset.

df['age_mean'] = df['age'].fillna(df['age'].mean())

df[['age_mean', 'age']]

df[['age_mean', 'age']].isnull().sum()        
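
Whether the mean or the median is the better fill value depends on the column's shape. One quick, rough check is `Series.skew()`: values near 0 suggest a fairly symmetric distribution (mean imputation is reasonable), while a large positive or negative value points to the median. A sketch with toy columns:

```python
import pandas as pd

# Toy examples: a roughly symmetric column vs. one with a big outlier
symmetric = pd.Series([28, 30, 31, 29, 32, 30, None])
skewed = pd.Series([30_000, 32_000, 35_000, 31_000, 500_000, None])

print(symmetric.skew())  # close to 0 -> mean imputation is reasonable
print(skewed.skew())     # large -> prefer the median

# The median barely moves under the outlier; the mean is dragged upward
print(skewed.mean(), skewed.median())
```

There is no universal skewness cutoff; a common rough heuristic is to treat an absolute skewness above about 1 as "skewed enough" to favor the median.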

Median Imputation:

  • Description: Replace missing values with the median of the column.
  • When to Use: When the data is skewed or contains outliers (e.g., income or house prices).
  • Example: If a column contains outliers, such as very high salaries, median imputation is more robust than mean imputation.

df['age_median'] = df['age'].fillna(df['age'].median())

df[['age_median', 'age']]

df[['age_median', 'age']].isnull().sum()        

Mode Imputation:

  • Description: Replace missing values with the mode (most frequent value) of the column.
  • When to Use: For categorical data, such as "Gender" or "Embarked" in the Titanic dataset.
  • Example: If "Embarked" has missing values, replace them with the most common port of embarkation.

# Series.mode() ignores NaN values by default, so no extra filtering is needed
mode = df['embarked'].mode()[0]

df['embarked_mode'] = df['embarked'].fillna(mode)

df[['embarked_mode', 'embarked']].isnull().sum()        
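
If you want to apply all three ideas in one pass, they can be bundled into a small helper. This is a sketch under the assumption that every numeric column gets the median and every other column gets the mode; in practice you would choose per column, as discussed above:

```python
import pandas as pd

def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Median-fill numeric columns, mode-fill everything else."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame standing in for the real data
demo = pd.DataFrame({
    'age': [22.0, None, 38.0, 26.0],
    'embarked': ['S', 'C', None, 'S'],
})

clean = impute_simple(demo)
print(clean.isnull().sum().sum())  # 0 -> every gap filled
```

Working on a copy keeps the original DataFrame intact, which mirrors the pattern used throughout this article of writing imputed values into new columns for comparison.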

That’s all for today! If you face any challenges while applying the techniques mentioned above, feel free to reach out—I’m always here to help. Also, if you’re looking for support with your projects, I’m available for freelance work and data analysis consultancy. Let’s connect and collaborate!

Lastly, I’d love to hear your thoughts and experiences after reading this article. If you try implementing these techniques, please share your results or any feedback in the comments. Your insights might help others in their data journey too!

More articles by Pravin Tiwari
