Filling the Gaps: A Beginner's Guide to Handling Missing Data in Python
Pravin Tiwari
Passionate Machine Learning Engineer | Eager to Solve Complex Problems and Drive Data-Driven Innovation
Missing values aren’t the end of the story; they’re just the start of a cleaner, more insightful narrative.
Hello, folks! Hope you’re all doing well. Recently, while working on a project, I encountered a significant challenge with missing data. This inspired me to share my approach and insights through an article on handling missing data effectively using Python. I believe this will be helpful for anyone dealing with similar issues in their data analysis or machine learning workflows.
When working with large datasets, you may encounter different types of missing values. These are typically categorized into three types:
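Missing Completely at Random (MCAR): The chance that a value is missing is unrelated to any data, observed or unobserved. The gaps are effectively random.
Missing at Random (MAR): The chance that a value is missing depends only on other observed columns, not on the missing value itself.
Missing Not at Random (MNAR): The chance that a value is missing depends on the missing value itself or on factors that were never recorded.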
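To make the three categories concrete, here is a minimal sketch that simulates each mechanism on synthetic data. The columns `age` and `income`, the sample size, and the missingness rates are all assumptions chosen for illustration; none of this comes from the Titanic dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical, fully observed data to knock holes into
data = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of going missing
mcar = data["income"].mask(rng.random(n) < 0.2)

# MAR: missingness depends on another observed column (age), not on income
mar = data["income"].mask((data["age"] > 60) & (rng.random(n) < 0.5))

# MNAR: missingness depends on the value itself (the highest incomes are hidden)
mnar = data["income"].mask(data["income"] > data["income"].quantile(0.8))

summary = pd.DataFrame(
    {
        "missing_share": [mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean()],
        "observed_mean": [mcar.mean(), mar.mean(), mnar.mean()],
    },
    index=["MCAR", "MAR", "MNAR"],
)
print(summary.round(2))
```

In the MNAR case the mean of the observed values is lower than the true mean, because the largest values are exactly the ones that were hidden. That is why the missingness mechanism matters when choosing how to handle the gaps.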
Working on the sample dataset:
For this tutorial, I will be working with the famous Titanic dataset. I’ll demonstrate some beginner-friendly techniques to handle missing values effectively, making it easier for you to apply them in your own projects.
Loading the Titanic Dataset
The Titanic dataset is a classic dataset widely used for learning data analysis and machine learning. It contains details about the passengers aboard the RMS Titanic, which sank in 1912. This dataset is often used for classification tasks, such as predicting whether a passenger survived based on features like age, class, and gender.
How to Load the Dataset
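The Titanic dataset is readily available through Python’s Seaborn library. Note that load_dataset downloads the data the first time you call it, so you need an internet connection. You can load it with the following code: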
import seaborn as sns
import pandas as pd
# Load Titanic dataset
df = sns.load_dataset('titanic')
Basic Dataset Exploration
Once the dataset is loaded, you can explore it using these commands:
1. View Dataset Content
df.head()
df.tail()
df.sample(5)
2. Dataset Shape
df.shape
3. Dataset Information
df.info()
4. Statistical Summary
df.describe()
5. Check for Missing Values
df.isnull().sum()
6. Check for Duplicate Values
df.duplicated().sum()
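Raw counts from step 5 are easier to judge next to percentages. The sketch below is one possible way to build such a report. The helper name `missing_report` and the tiny `toy` DataFrame are hypothetical stand-ins so the example runs without downloading anything; you could pass the Titanic `df` instead.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the Titanic data
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

def missing_report(frame):
    """Return count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": (frame.isna().mean() * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return report[report["missing"] > 0].sort_values("percent", ascending=False)

print(missing_report(toy))
```

A column that is mostly empty is usually a candidate for dropping, while a column with only a few gaps is a better fit for imputation.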
Techniques to handle missing values:
A dataset with missing values is like a puzzle—solve it wisely, and the picture becomes clear.
If you’re new to data analysis and machine learning workflows, here are two beginner-friendly techniques to handle missing values effectively:
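Solution 1: Delete Rows or Columns with Missing Values
# dropna() returns a new DataFrame; df itself is left unchanged
# Drop every row that has at least one missing value and check the resulting shape
df.dropna().shape
# Drop every column that has at least one missing value and check the resulting shape
df.dropna(axis=1).shape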
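Dropping every row or column with any gap can throw away a lot of data. As a rough sketch, `dropna` also accepts `subset` and `thresh` to make the deletion more selective. The `toy` DataFrame below is a hypothetical stand-in with Titanic-like column names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", None, "S"],
})

# Drop a row only when 'embarked' is missing; gaps elsewhere are tolerated
rows_kept = toy.dropna(subset=["embarked"])

# Keep only the columns that have at least 2 non-missing values
cols_kept = toy.dropna(axis=1, thresh=2)

print(rows_kept.shape)
print(list(cols_kept.columns))
```

Here the mostly empty `deck` column is removed, while `age` and `embarked` survive because they still carry enough observed values.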
Solution 2: Imputation Techniques
Instead of removing data, you can fill in missing values using imputation. Here are three common techniques:
Mean Imputation:
df['age_mean'] = df['age'].fillna(df['age'].mean())
df[['age_mean', 'age']]
df[['age_mean', 'age']].isnull().sum()
Median Imputation:
df['age_median'] = df['age'].fillna(df['age'].median())
df[['age_median', 'age']]
df[['age_median', 'age']].isnull().sum()
Mode Imputation:
mode = df['embarked'].mode()[0]  # mode() skips missing values by default
df['embarked_mode'] = df['embarked'].fillna(mode)
df[['embarked_mode', 'embarked']].isnull().sum()
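The three techniques above can be combined into one small helper. This is only a sketch under my own assumptions: the function name `impute_simple` is hypothetical, it uses the median for numeric columns and the mode for everything else, and it runs on a made-up `toy` DataFrame rather than the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric gap and one categorical gap
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 80.0],
    "embarked": ["S", "C", None, "S", "S"],
})

def impute_simple(frame):
    """Fill numeric columns with the median and other columns with the mode."""
    filled = frame.copy()  # leave the caller's DataFrame untouched
    for col in filled.columns:
        if filled[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled

clean = impute_simple(toy)
print(clean)
```

In this toy example the median of the observed ages is 30.5, while the mean is 40.75 because the single value of 80 pulls it upward. That is the usual reason to prefer the median when a column has outliers or a skewed distribution.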
That’s all for today! If you face any challenges while applying the techniques mentioned above, feel free to reach out—I’m always here to help. Also, if you’re looking for support with your projects, I’m available for freelance work and data analysis consultancy. Let’s connect and collaborate!
Lastly, I’d love to hear your thoughts and experiences after reading this article. If you try implementing these techniques, please share your results or any feedback in the comments. Your insights might help others in their data journey too!