Data Analysis with Python: Handling Missing Values with Pandas and Scikit-Learn
Data is rarely perfect, and missing values are a common challenge when working with real-world datasets. These gaps can arise from human error, incomplete observations, or data collection issues. In Python, missing data is typically represented by NaN values. Before diving into exciting tasks like exploratory analysis or modeling, it is essential to identify these missing values and decide how to handle them.
Dropping Missing Values with Pandas
Sometimes the best strategy for dealing with missing values is simply to drop them. This is the simplest approach. If the amount of missing data is small and the affected data is not critical to the analysis, removing rows or columns with missing values can be a valid choice. As a rule of thumb, when the dataset is large and the proportion of missing data is small (e.g., less than 5%), it may make sense to drop the affected rows or columns. Let's look at an example of how we can drop missing values with Pandas. We are going to load a simple dataset with missing values:
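The original dataset is not shown in this excerpt, so here is a minimal reconstruction that matches the counts discussed below: no missing values in 'Animal', three in 'Age', and a few elsewhere. The 'Weight' column and all specific values are illustrative assumptions, not the article's actual data:

```python
import numpy as np
import pandas as pd

# Illustrative animal dataset with NaN gaps (values are assumed).
# 'Animal' is complete; 'Age' has 3 NaNs; 'Weight' has 2;
# 'Habitat' and 'Endangered' have 1 each, all in different rows.
df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})
print(df)
```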
To check for missing values in a dataset, we can use the isnull() method. This method returns a boolean value of True for missing values and False for non-missing values. By applying the sum() function to the result, we can count the total number of missing values in each column, as True values are treated as 1 and False values as 0. This provides a quick overview of how many missing values exist in each column of the dataset:
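A sketch of this check, using the illustrative dataset reconstructed above (its values are assumptions, but the per-column counts match those discussed in this section):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# isnull() marks each missing cell as True; sum() counts the
# True values per column (True counts as 1, False as 0).
print(df.isnull().sum())
# Animal        0
# Age           3
# Weight        2
# Habitat       1
# Endangered    1
```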
This gives us the total number of missing values in each column. The 'Animal' column has no missing values, but every other column does. The 'Age' column has the most missing values (3).
Let's say a decision has been made to drop all rows with missing values. To drop the rows with missing values, we can use the Pandas dropna() method:
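A sketch of dropping rows, again using the illustrative dataset assumed earlier (seven of its ten rows contain at least one NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# axis=0 (the default) drops every ROW containing at least one NaN
df_dropped = df.dropna(axis=0)
print(df_dropped)
print(f"Rows lost: {1 - len(df_dropped) / len(df):.0%}")  # Rows lost: 70%
```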
In this code, because we want to drop rows with missing values, we use the dropna() method with the axis=0 parameter (which is the default). However, in this case, dropping rows has led to a significant loss of data (70%), which could negatively impact the analysis. This does not seem like a good idea.
Now, if we want to drop all columns with missing values instead, we can pass axis=1 to the dropna() method. Here is the code:
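A sketch of dropping columns instead, on the same assumed dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# axis=1 drops every COLUMN containing at least one NaN
df_cols = df.dropna(axis=1)
print(df_cols)  # only the 'Animal' column survives
```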
This does not seem to be a good idea either, as we end up losing every column except 'Animal', the only one without missing values. You can see that while this method is simple, it can lead to the loss of valuable data if many rows or columns are dropped. Use it carefully, especially if your dataset is small.
An effective strategy to avoid a huge loss of data is to use a condition and drop only the columns with very few missing values. For example, we can drop the columns that have exactly one missing value. See below:
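A sketch of this conditional drop on the assumed dataset, where exactly two columns ('Habitat' and 'Endangered') have a single missing value each:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# Identify columns with exactly one missing value, then drop only those
cols_to_drop = df.columns[df.isnull().sum() == 1]
df_reduced = df.drop(columns=cols_to_drop)
print(list(cols_to_drop))  # ['Habitat', 'Endangered']
print(df_reduced.columns.tolist())
```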
Here, instead of dropping all columns with missing data, we first identified the columns with exactly one missing value and dropped only those from the dataset. This means only two columns have been dropped, which reduces the amount of lost data.
Build the Confidence to Tackle Data Analysis Projects in 2024
Ready to dive in and do some real data analysis? The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end, you should be confident enough to take on any data analysis project with Python. Start your journey with "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners."
Other Resources
My new Python course on classes and functions will help you master these important fundamentals: Check out: Master Python Fundamentals: Classes and Functions
Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.
Pick up 100 Python tips and tricks with Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.
Imputation (Filling Missing Data with Pandas)
Instead of dropping missing values, we can fill them in based on a strategy. Common imputation techniques include filling missing values with the mean, the median, the most frequent value (mode), or a constant value. Let's apply these strategies to our dataset using Pandas' fillna() method:
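A sketch of these strategies with fillna(), using the assumed dataset: mean for 'Age', median for 'Weight', mode for 'Habitat', and the constant True for 'Endangered':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

df_filled = df.copy()
# Numeric columns: mean for roughly symmetric data, median for skewed data
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())
df_filled["Weight"] = df_filled["Weight"].fillna(df_filled["Weight"].median())
# Categorical column: most frequent value (mode)
df_filled["Habitat"] = df_filled["Habitat"].fillna(df_filled["Habitat"].mode()[0])
# Boolean column: a constant chosen from domain knowledge
df_filled["Endangered"] = df_filled["Endangered"].fillna(True)
print(df_filled)
```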
You can see in the output that the numeric columns have been filled with the mean and the median, respectively. In practice, the mean strategy is recommended for normally distributed (bell-shaped) data, as it provides a central tendency that reflects the average value of the data. However, the mean is vulnerable to outliers, extreme values that can significantly distort the average. When the data is skewed, the median strategy is recommended instead.
For the categorical column 'Habitat', we fill the missing value with the most frequent value (mode) in the column. This is a suitable strategy for categorical data. For the boolean column 'Endangered', we fill the missing value with True. This assumes that missing values in this column might indicate an endangered species. This basically demonstrates how you can use pandas to deal with missing data using different strategies.
Imputation (Filling Missing Data with Sklearn)
In the previous example, we used pandas to fill missing values. While the fillna() method in Pandas works effectively, it's not the only option for handling missing data. Sklearn offers the SimpleImputer class, which provides a flexible and efficient way to manage missing values. Let’s reimplement the same strategies using Sklearn to demonstrate how this powerful library can be used for the task:
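A sketch of the same strategies with SimpleImputer, on the assumed dataset. Note that SimpleImputer expects 2D input, so each column is passed as a one-column DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Zebra", "Penguin",
               "Kangaroo", "Panda", "Tiger", "Koala", "Dolphin"],
    "Age": [8, np.nan, 5, np.nan, 3, 6, np.nan, 4, 7, 10],
    "Weight": [np.nan, 5400.0, 800.0, 380.0, 23.0,
               np.nan, 110.0, 220.0, 14.0, 150.0],
    "Habitat": ["Savanna", "Savanna", "Savanna", "Savanna", "Antarctica",
                "Grassland", "Forest", np.nan, "Forest", "Ocean"],
    "Endangered": [False, True, False, False, True,
                   False, True, True, False, np.nan],
})

# One imputer per strategy
mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value=True)

df_imputed = df.copy()
df_imputed[["Age"]] = mean_imputer.fit_transform(df_imputed[["Age"]])
df_imputed[["Weight"]] = median_imputer.fit_transform(df_imputed[["Weight"]])
df_imputed[["Habitat"]] = mode_imputer.fit_transform(df_imputed[["Habitat"]])
df_imputed[["Endangered"]] = constant_imputer.fit_transform(df_imputed[["Endangered"]])
print(df_imputed)
```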
In this code, we first create four imputers using the SimpleImputer class. Each imputer applies a specific strategy for filling missing values. After creating the imputers, we use the fit_transform() method to apply these strategies to the appropriate columns. As you can see in the output, the results are identical to those achieved using Pandas' fillna() method.
When should you use Sklearn instead of pandas? The choice largely depends on your specific task. If you're building a pipeline for training machine learning models, SimpleImputer integrates well into the pipeline, making it more suitable for machine learning workflows. On the other hand, pandas is excellent for general data manipulation and analysis tasks. So, while pandas is perfect for exploratory data analysis and preprocessing, Sklearn's SimpleImputer shines in machine learning projects.
Final Thoughts
These examples illustrate how Pandas and Scikit-Learn can be used to handle missing data and the different strategies you can employ. Handling missing data is a critical aspect of data analysis and machine learning, as it directly impacts the quality and accuracy of insights drawn from your dataset.
The choice between Pandas and Scikit-Learn depends on the nature of the task. For exploratory analysis and basic data manipulation, Pandas is often the ideal tool. However, for machine learning projects that require more structured preprocessing and seamless integration into modeling pipelines, Scikit-Learn’s tools, such as SimpleImputer, are more advantageous.
If you want to further develop your skills, the book "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners" provides valuable insights and challenges to deepen your understanding of how to effectively handle missing data using Python. Thanks for reading.