Filling the Gaps: A Beginner's Guide to Handling Missing Data in Python


Missing values aren’t the end of the story; they’re just the start of a cleaner, more insightful narrative.

Hello, folks! Hope you’re all doing well. Recently, while working on a project, I encountered a significant challenge with missing data. This inspired me to share my approach and insights through an article on handling missing data effectively using Python. I believe this will be helpful for anyone dealing with similar issues in their data analysis or machine learning workflows.

When working with large datasets, you may encounter different types of missing values. These are typically categorized into three types:

Missing Completely at Random (MCAR):

  • What it means: Data is missing for no specific reason. It doesn’t depend on the data values or any other factors in the dataset.
  • Example: Suppose you're surveying people, and some responses are missing because a few forms accidentally got lost. This missing data has nothing to do with the respondents or their answers—it’s purely random.

Missing at Random (MAR):

  • What it means: Data is missing, but it’s related to other observed information in the dataset, not the missing values themselves.
  • Example: In a medical study, older participants are less likely to report their income. The missing income data is related to the age of the participants (which is available), but not to the actual income values.

Missing Not at Random (MNAR):

  • What it means: Data is missing due to reasons that are directly related to the missing values themselves.
  • Example: In a survey asking about alcohol consumption, people who drink a lot may skip answering this question because they’re uncomfortable sharing it. The missing data is related to their actual drinking habits.
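
These mechanisms are easier to tell apart when simulated. Below is a minimal sketch (using a synthetic survey rather than the Titanic data, with made-up thresholds and probabilities) of how each type of missingness can arise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Toy survey: age and income for n respondents
df = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(50_000, 15_000, size=n),
})

# MCAR: 10% of income values vanish purely at random
mcar = df['income'].mask(rng.random(n) < 0.10)

# MAR: older respondents (an *observed* column) are more
# likely to skip the income question
mar = df['income'].mask((df['age'] > 60) & (rng.random(n) < 0.50))

# MNAR: high earners skip the question *because* of their income,
# the very value that goes missing
mnar = df['income'].mask((df['income'] > 70_000) & (rng.random(n) < 0.50))

print(mcar.isna().sum(), mar.isna().sum(), mnar.isna().sum())
```

In real data you can rarely prove which mechanism applies: MCAR and MAR are usually diagnosed by checking whether missingness correlates with observed columns, while MNAR generally requires domain knowledge.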


Working on the Sample Dataset

For this tutorial, I will be working with the famous Titanic dataset. I’ll demonstrate some beginner-friendly techniques to handle missing values effectively, making it easier for you to apply them in your own projects.

Loading the Titanic Dataset

The Titanic dataset is a classic dataset widely used for learning data analysis and machine learning. It contains details about the passengers aboard the RMS Titanic, which sank in 1912. This dataset is often used for classification tasks, such as predicting whether a passenger survived based on features like age, class, and gender.

How to Load the Dataset

The Titanic dataset is readily available in Python’s Seaborn library. You can load it with the following code:

import seaborn as sns
import pandas as pd

# Load Titanic dataset
df = sns.load_dataset('titanic')        

Basic Dataset Exploration

Once the dataset is loaded, you can explore it using these commands:

1. View Dataset Content

  • Top 5 rows:

df.head()        

  • Last 5 rows:

df.tail()        

  • 5 Random rows:

df.sample(5)        

2. Dataset Shape

  • To see the number of rows and columns:

df.shape        

3. Dataset Information

  • To check column names, data types, and non-null counts:

df.info()        

4. Statistical Summary

  • To get the summary statistics (mean, min, max, etc.) for numerical columns:

df.describe()        

5. Check for Missing Values

  • To identify columns with missing values and their counts:

df.isnull().sum()        

6. Check for Duplicate Values

  • To find the number of duplicate rows:

df.duplicated().sum()        

Techniques to Handle Missing Values

A dataset with missing values is like a puzzle—solve it wisely, and the picture becomes clear.

If you’re new to data analysis and machine learning workflows, here are two beginner-friendly techniques to handle missing values effectively:

Solution 1: Delete Rows or Data Points with Missing Values

  • Description: Simply remove the rows or columns containing missing data. This is straightforward but should be used cautiously.
  • When to Use: When the proportion of missing data is very small (e.g., less than 5-10%) and removing those rows won’t significantly impact your analysis or model.
  • Drawback: Can lead to loss of valuable data, especially if many rows or columns are removed.

# Drop rows with any missing value (returns a new DataFrame; df itself is unchanged)
df.dropna().shape

# Drop columns with any missing value
df.dropna(axis=1).shape
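
Before dropping anything, it helps to quantify how much each column would lose. Here is a small helper (an illustration, not part of the Titanic workflow above) that reports the percentage of missing values per column so you can apply the 5-10% rule of thumb:

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    pct = df.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False).round(2)

# Toy frame standing in for the real data
demo = pd.DataFrame({
    'age':  [22, None, 38, None, 35],
    'fare': [7.25, 71.28, 7.92, 53.1, 8.05],
    'deck': [None, 'C', None, 'C', None],
})

print(missing_report(demo))
# deck is 60% missing (a candidate for dropping the whole column),
# age is 40% missing (too much to safely drop rows)
```

Columns with a very high missing percentage are often better dropped entirely, while columns with only a few gaps are good candidates for the imputation techniques below.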

Solution 2: Imputation Techniques

Instead of removing data, you can fill in missing values using imputation. Here are three common techniques:

Mean Imputation:

  • Description: Replace missing values with the mean of the column.
  • When to Use: When the data is normally distributed (symmetrical with no extreme outliers).
  • Example: If the "Age" column has missing values and is normally distributed, replace the missing values with the average age of the dataset.

df['age_mean'] = df['age'].fillna(df['age'].mean())

df[['age_mean', 'age']]

df[['age_mean', 'age']].isnull().sum()        
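
Whether the mean or the median is the better fill value depends on the column's shape. One quick, rough check is `Series.skew()`: values near 0 suggest a fairly symmetric distribution (mean imputation is reasonable), while a large positive or negative value points to the median. A sketch with toy columns:

```python
import pandas as pd

# Toy examples: a roughly symmetric column vs. one with a big outlier
symmetric = pd.Series([28, 30, 31, 29, 32, 30, None])
skewed = pd.Series([30_000, 32_000, 35_000, 31_000, 500_000, None])

print(symmetric.skew())  # close to 0 -> mean imputation is reasonable
print(skewed.skew())     # large -> prefer the median

# The median barely moves under the outlier; the mean is dragged upward
print(skewed.mean(), skewed.median())
```

There is no universal skewness cutoff; a common rough heuristic is to treat an absolute skewness above about 1 as "skewed enough" to favor the median.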

Median Imputation:

  • Description: Replace missing values with the median of the column.
  • When to Use: When the data is skewed or contains outliers (e.g., income or house prices).
  • Example: If a column contains outliers, such as very high salaries, median imputation is more robust than mean imputation.

df['age_median'] = df['age'].fillna(df['age'].median())

df[['age_median', 'age']]

df[['age_median', 'age']].isnull().sum()        

Mode Imputation:

  • Description: Replace missing values with the mode (most frequent value) of the column.
  • When to Use: For categorical data, such as "Gender" or "Embarked" in the Titanic dataset.
  • Example: If "Embarked" has missing values, replace them with the most common port of embarkation.

# Series.mode() ignores NaN values by default, so no extra filtering is needed
mode = df['embarked'].mode()[0]

df['embarked_mode'] = df['embarked'].fillna(mode)

df[['embarked_mode', 'embarked']].isnull().sum()        
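
If you want to apply all three ideas in one pass, they can be bundled into a small helper. This is a sketch under the assumption that every numeric column gets the median and every other column gets the mode; in practice you would choose per column, as discussed above:

```python
import pandas as pd

def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Median-fill numeric columns, mode-fill everything else."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame standing in for the real data
demo = pd.DataFrame({
    'age': [22.0, None, 38.0, 26.0],
    'embarked': ['S', 'C', None, 'S'],
})

clean = impute_simple(demo)
print(clean.isnull().sum().sum())  # 0 -> every gap filled
```

Working on a copy keeps the original DataFrame intact, which mirrors the pattern used throughout this article of writing imputed values into new columns for comparison.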

That’s all for today! If you face any challenges while applying the techniques mentioned above, feel free to reach out—I’m always here to help. Also, if you’re looking for support with your projects, I’m available for freelance work and data analysis consultancy. Let’s connect and collaborate!

Lastly, I’d love to hear your thoughts and experiences after reading this article. If you try implementing these techniques, please share your results or any feedback in the comments. Your insights might help others in their data journey too!

More articles by Pravin Tiwari
