Pandas - Duplicate Row Detection and Grouping
Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand my network
Explore my profile: Head to my profile to see more about my work, skills, and experience.
If you're feeling generous: Repost this article with your network and help spread the word!
Description:
You are a data scientist working for an e-commerce company. The marketing team has collected customer data from various sources, including website interactions, social media, and customer surveys. However, due to the diverse sources, there are duplicate records in the dataset.
Task:
Your task is to identify and combine duplicate rows based on specific criteria, and calculate the total spend for each unique customer.
Bonus Question:
# import libraries
import pandas as pd
import numpy as np
Generate the data
Here is a tiny dataset composed of 12 rows that represents customer information, including their ID, name, email, and amount spent.
领英推荐
Columns:
# sample data placed in a dictionary
data = {
'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006'],
'Name': ['John', 'Mary', 'David', 'John', 'Mary', 'Emily', 'Michael', 'David', 'Sarah'],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600]
}
# create the dataframe
df = pd.DataFrame(data)
# introduce duplicates
duplicates = pd.DataFrame({'CustomerID': ['C001', 'C002', 'C003'], 'Name': ['John', 'Mary', 'David'], 'Email': ['[email protected]', '[email protected]', '[email protected]'], 'Spent': [100, 200, 300]})
# combine the dataframes
df = pd.concat([df, duplicates], ignore_index=True)
df
Identify Duplicates
df[df.duplicated()].sort_values(by='CustomerID')
# Intentionally not removing duplicates, as they represent additional payments from the same customer
Total Spent per Customer
I probably would have removed the duplicate rows. In this example, we are treating the duplicates as additional payments received from the customer. Remember, we are collecting data from various sources.
group = df.groupby(['CustomerID','Name','Email'])
# calculate the sum
group.sum()
Can you solve the BONUS question?
17+ years in Tech | Follow me for posts on Data Wrangling
5 个月?? Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq