Data Cleaning with Python: Handling Duplicates with Pandas
One of the most common issues anyone who works with data has to deal with is duplicates: records that appear more than once in a dataset. Duplicates can arise for a variety of reasons, often depending on the nature of the data collection process, the structure of the dataset, or the way the data is handled and processed. Understanding why duplicates occur helps you devise better strategies to prevent, detect, and handle them, and handling them properly protects the integrity of your data. The pandas library has great tools for detecting and handling duplicates. In this article, I will show you how to use pandas to detect duplicates and the different strategies you can employ to handle them once they are found.
The Duplicated Method
Pandas uses the duplicated() method to identify duplicate rows within a DataFrame. It returns a boolean Series where True indicates a duplicate row and False indicates a unique row. By default, the duplicated() method marks the first occurrence of a duplicated value as False and all subsequent occurrences as True. This means that if there are 2 identical rows in the DataFrame, the first row will be marked False as a non-duplicate, and the second will be marked True as a duplicate. Below, we have a DataFrame loaded with data from Kaggle, and we are going to use the duplicated() method to check for duplicates in the data.
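A minimal sketch of the loading step, assuming the Kaggle file has been downloaded locally (the file name powerlifting.csv is a placeholder for the actual dataset file):

```python
import pandas as pd

# Load the dataset downloaded from Kaggle (the file name is illustrative)
df = pd.read_csv('powerlifting.csv')

# Preview the first few rows
print(df.head())
```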
Now that we have loaded the dataset, let's check for duplicates in the DataFrame. We are going to call the duplicated() method:
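A sketch of that call, printed so the boolean Series is visible:

```python
# True flags a row that repeats an earlier row; False flags a unique row
print(df.duplicated())
```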
You can see that all the rows shown above have been marked as False, which simply means that those rows are not duplicated. However, since our dataset has 3000 rows and we can only see a handful of them, this output alone cannot confirm that there are no duplicates. To confirm whether there are any duplicates in the dataset, we can use the sum() function in combination with the duplicated() method. By passing the output of duplicated() to sum(), we can easily count the number of duplicate rows.
The logic behind this is straightforward: a boolean value of True is equivalent to one (1), and a boolean value of False is equivalent to zero (0). Therefore, when the sum() function encounters a True value (indicating a duplicate), it adds one to the total count. Conversely, False values are treated as zeros and do not contribute to the sum. The final result will give you the total number of duplicates in the dataset. Here is the code below:
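Something like this, assuming the same DataFrame as above:

```python
# True counts as 1 and False as 0, so the sum is the number of duplicate rows
print(df.duplicated().sum())
```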
So, we can confirm that we only have one row in the dataset that is duplicated. The problem is that we do not know which row this is. To view duplicated rows, we have to filter the DataFrame using the boolean Series returned by the duplicated() method. See below:
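A minimal sketch of that filtering step:

```python
# Use the boolean Series returned by duplicated() to filter the DataFrame
duplicated_rows = df[df.duplicated()]
print(duplicated_rows)
```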
Here, the result is a DataFrame (duplicated_rows) that contains only the rows where the duplicated() method returned True. Since there is only one duplicated row in the dataset, we expected this to return a single row, and it has not disappointed us: we can now see the duplicated row in the dataset.
By default, the duplicated() method compares entire rows across all columns. However, we can also check for duplicates in specific columns. The duplicated() method has a subset parameter, and we can pass the column (or columns) we want to check for duplicates as an argument to it. If we want to check how many duplicates are in the 'Lifter Name' column, here is how we can do it:
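Assuming the column is named exactly 'Lifter Name' as in the article:

```python
# Count duplicates in the 'Lifter Name' column only
print(df.duplicated(subset=['Lifter Name']).sum())
```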
So we have a staggering 2990 duplicate rows in the 'Lifter Name' column.
Build the Confidence to Tackle Data Analysis Projects
Ready to dive in and do some real data analysis? The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end, you should be confident enough to take on any data analysis project with Python. Start your journey with "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners."
Other Resources
Want to learn Python fundamentals the easy way? Check out Master Python Fundamentals: The Ultimate Guide for Beginners.
Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.
Want 100 Python tips and tricks? Check out Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.
Handling Duplicates
When you discover duplicates in a dataset, there are several strategies you can use to handle them. Understanding why the duplicates occurred is the first step, but once they have been detected, you still have to decide what to do with them. Let's look at some of the ways we can handle duplicates:
Removing duplicates with drop_duplicates(): Once you've identified the duplicates, you may want to remove them to clean your data. The drop_duplicates() method allows you to remove all duplicate rows from your DataFrame, keeping only the first occurrence. Let's say we want to drop all the duplicates in the 'Lifter Name' and 'Age' columns. We are going to pass the 'Lifter Name' and 'Age' columns as arguments to the subset parameter of the drop_duplicates() method.
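A sketch of that call, using the column names mentioned above:

```python
# Drop rows that repeat the same 'Lifter Name' and 'Age',
# keeping only the first occurrence of each pair
df_unique = df.drop_duplicates(subset=['Lifter Name', 'Age'])
print(len(df_unique))
```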
Our dataset has gone from a staggering 3000 rows to just 470 rows because we have dropped duplicates in the 'Lifter Name' and 'Age' columns. This is one strategy that you can employ if you decide that duplicates may interfere with your analysis.
Marking duplicates instead of dropping them: In some cases, you might not want to remove duplicates immediately but instead mark them for further investigation. It is possible that the values may not be duplicates at all, which is why marking them instead of dropping them may be the best strategy. You can create a new column that flags whether a row is a duplicate. Here is an example:
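A minimal sketch, flagging duplicates in a new column (the column name suspected_duplicate follows the article; the rest is illustrative):

```python
# Flag suspected duplicates instead of dropping them
df['suspected_duplicate'] = df.duplicated()

# Inspect the flagged rows before deciding what to do with them
print(df[df['suspected_duplicate']])
```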
Here we have created a column called 'suspected_duplicate', which marks every duplicate as True. Once we have marked the suspected duplicates, we can investigate them to confirm whether they really are duplicates.
Dropping the most duplicated value: Depending on the goal of your analysis, you may decide to drop only the most duplicated value in the data. This is because if a particular value is highly duplicated, it can dominate the analysis, leading to biased results. For example, if one lifter name appears repeatedly in a dataset, they might disproportionately influence the analysis. Another advantage of removing the most duplicated element is that it can reduce the size of the dataset, making it more manageable and easier to work with. Here is an example:
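A sketch of such a function, assuming we target the 'Lifter Name' column (the function name is illustrative):

```python
def drop_most_duplicated(df, column):
    # mode() returns the most frequent value(s); if there are ties it
    # returns several, so .iloc[0] takes the first one
    most_common = df[column].mode().iloc[0]
    # Keep only the rows that do not contain that value
    return df[df[column] != most_common]

df_reduced = drop_most_duplicated(df, 'Lifter Name')
```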
In this function, we use the mode() method to find the most frequent name in the column. Since mode() might return multiple values if there are ties, .iloc[0] selects the first one. Once we have this name, we drop all rows containing it from the DataFrame using boolean filtering.
Conclusion
Handling duplicates and ensuring they are managed effectively is an essential skill for anyone working with data. The pandas library offers powerful tools for dealing with duplicates, making it crucial to become proficient with these methods if you're using Python for data analysis. To truly master the duplicated() and drop_duplicates() methods, it's important to practice using them on real datasets. Thanks for reading.
Newsletter Sponsorship
You can reach a highly engaged audience of over 300,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at [email protected] today to learn more about the sponsorship opportunities.