Data Cleaning with Python: Handling Duplicates with Pandas
One of the most common issues anyone who works with data has to deal with is duplicates: records that appear more than once in a dataset. Duplicates can arise for a variety of reasons, often depending on the nature of the data collection process, the structure of the dataset, or the way the data is handled and processed. Understanding why duplicates occur helps you devise better strategies to prevent, detect, and handle them, and handling them properly protects the integrity of your data. The pandas library has great tools for detecting and handling duplicates. In this article, I will show you how to use pandas to detect duplicates and the different strategies you can employ to handle them once they are found.
The Duplicated Method
Pandas uses the duplicated() method to identify duplicate rows within a DataFrame. It returns a boolean Series where True indicates a duplicate row and False indicates a unique row. By default, the duplicated() method marks the first occurrence of a duplicated value as False and all subsequent occurrences as True. This means that if there are 2 identical rows in the DataFrame, the first row will be marked False as a non-duplicate, and the second will be marked True as a duplicate. Below, we have a DataFrame loaded with data from Kaggle, and we are going to use the duplicated() method to check for duplicates in the data.
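A minimal sketch of the loading step, assuming the Kaggle file has been downloaded locally (the file name powerlifting.csv is a placeholder for the actual dataset file):

```python
import pandas as pd

# Load the dataset downloaded from Kaggle (the file name is illustrative)
df = pd.read_csv('powerlifting.csv')

# Preview the first few rows
print(df.head())
```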
Now that we have loaded the dataset, let's check for duplicates in the DataFrame. We are going to call the duplicated() method:
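A sketch of that call, printed so the boolean Series is visible:

```python
# True flags a row that repeats an earlier row; False flags a unique row
print(df.duplicated())
```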
You can see that all the rows shown above have been marked as False, which simply means that those rows are not duplicated. However, since our dataset has 3000 rows and we can only see a handful of them, this output alone cannot confirm that there are no duplicates. To confirm whether there are any duplicates in the dataset, we can use the sum() function in combination with the duplicated() method. By passing the output of duplicated() to sum(), we can easily count the number of duplicate rows.
The logic behind this is straightforward: a boolean value of True is equivalent to one (1), and a boolean value of False is equivalent to zero (0). Therefore, when the sum() function encounters a True value (indicating a duplicate), it adds one to the total count. Conversely, False values are treated as zeros and do not contribute to the sum. The final result will give you the total number of duplicates in the dataset. Here is the code below:
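Something like this, assuming the same DataFrame as above:

```python
# True counts as 1 and False as 0, so the sum is the number of duplicate rows
print(df.duplicated().sum())
```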
So, we can confirm that we only have one row in the dataset that is duplicated. The problem is that we do not know which row this is. To view duplicated rows, we have to filter the DataFrame using the boolean Series returned by the duplicated() method. See below:
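A minimal sketch of that filtering step:

```python
# Use the boolean Series returned by duplicated() to filter the DataFrame
duplicated_rows = df[df.duplicated()]
print(duplicated_rows)
```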
Here, the result is a DataFrame (duplicated_rows) that contains only the rows where the duplicated() method returned True. Since there is only one duplicated row in the dataset, we expected this to return a single row, and it has not disappointed us: we can now see the duplicated row in the dataset.
By default, the duplicated() method compares entire rows across all columns. However, we can also check for duplicates in specific columns. The duplicated() method has a subset parameter, and we can pass the column (or columns) we want to check for duplicates as an argument to it. If we want to check how many duplicates are in the 'Lifter Name' column, here is how we can do it:
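Assuming the column is named exactly 'Lifter Name' as in the article:

```python
# Count duplicates in the 'Lifter Name' column only
print(df.duplicated(subset=['Lifter Name']).sum())
```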
So we have a staggering 2990 duplicate rows in the 'Lifter Name' column.
Build the Confidence to Tackle Data Analysis Projects
Ready to dive in and do some real data analysis? The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end, you should be confident enough to take on any data analysis project with Python. Start your journey with "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners."
Other Resources
Want to learn Python fundamentals the easy way? Check out Master Python Fundamentals: The Ultimate Guide for Beginners.
Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.
Want 100 Python tips and tricks? Check out Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.
Handling Duplicates
When you discover duplicates in a dataset, there are several strategies you can use to handle them. Understanding why the duplicates occurred is the first step, but once they have been detected, you still have to decide what to do with them. Let's look at some of the ways we can handle duplicates:
Removing duplicates with drop_duplicates(): Once you've identified the duplicates, you may want to remove them to clean your data. The drop_duplicates() method allows you to remove all duplicate rows from your DataFrame, keeping only the first occurrence. Let's say we want to drop all the duplicates in the 'Lifter Name' and 'Age' columns. We are going to pass the 'Lifter Name' and 'Age' columns as arguments to the subset parameter of the drop_duplicates() method.
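A sketch of that call, using the column names mentioned above:

```python
# Drop rows that repeat the same 'Lifter Name' and 'Age',
# keeping only the first occurrence of each pair
df_unique = df.drop_duplicates(subset=['Lifter Name', 'Age'])
print(len(df_unique))
```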
Our dataset has gone from a staggering 3000 rows to just 470 rows because we have dropped duplicates in the 'Lifter Name' and 'Age' columns. This is one strategy that you can employ if you decide that duplicates may interfere with your analysis.
Marking duplicates instead of dropping them: In some cases, you might not want to remove duplicates immediately but instead mark them for further investigation. It is possible that the values may not be duplicates at all, which is why marking them instead of dropping them may be the best strategy. You can create a new column that flags whether a row is a duplicate. Here is an example:
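A minimal sketch, flagging duplicates in a new column (the column name suspected_duplicate follows the article; the rest is illustrative):

```python
# Flag suspected duplicates instead of dropping them
df['suspected_duplicate'] = df.duplicated()

# Inspect the flagged rows before deciding what to do with them
print(df[df['suspected_duplicate']])
```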
Here we have created a column called 'suspected_duplicate', which marks every duplicate as True. Once we have marked the suspected duplicates, we can investigate them to confirm whether they really are duplicates.
Dropping the most duplicated value: Depending on the goal of your analysis, you may decide to drop only the most duplicated value in the data. This is because if a particular value is highly duplicated, it can dominate the analysis, leading to biased results. For example, if one lifter name appears repeatedly in a dataset, they might disproportionately influence the analysis. Another advantage of removing the most duplicated element is that it can reduce the size of the dataset, making it more manageable and easier to work with. Here is an example:
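A sketch of such a function, assuming we target the 'Lifter Name' column (the function name is illustrative):

```python
def drop_most_duplicated(df, column):
    # mode() returns the most frequent value(s); if there are ties it
    # returns several, so .iloc[0] takes the first one
    most_common = df[column].mode().iloc[0]
    # Keep only the rows that do not contain that value
    return df[df[column] != most_common]

df_reduced = drop_most_duplicated(df, 'Lifter Name')
```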
In this function, we use the mode() method to find the most frequent name in the column. Since mode() might return multiple values if there are ties, .iloc[0] selects the first one. Once we have this name, we drop all rows containing it from the DataFrame using boolean filtering.
Conclusion
Handling duplicates and ensuring they are managed effectively is an essential skill for anyone working with data. The pandas library offers powerful tools for dealing with duplicates, making it crucial to become proficient with these methods if you're using Python for data analysis. To truly master the duplicated() and drop_duplicates() methods, it's important to practice using them on real datasets. Thanks for reading.
Newsletter Sponsorship
You can reach a highly engaged audience of over 300,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at [email protected] today to learn more about the sponsorship opportunities.