Pandas - Duplicate Row Detection and Grouping

Pandas - Duplicate Row Detection and Grouping

Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand my network
Explore my profile: Head to my profile to see more about my work, skills, and experience.
If you're feeling generous: Repost this article with your network and help spread the word!


Description:

You are a data scientist working for an e-commerce company. The marketing team has collected customer data from various sources, including website interactions, social media, and customer surveys. However, due to the diverse sources, there are duplicate records in the dataset.

Task:

Your task is to identify and combine duplicate rows based on specific criteria, and calculate the total spend for each unique customer.

  • Identify duplicate rows based on CustomerID, Name, and Email.
  • Combine duplicate rows into a single row, adding up the values in the Spent column.
  • Calculate the total spend for each unique customer.


Bonus Question:

  • What is the average spend per customer for the top 3 customers with the highest total spend? (Answer: 700.00)


# import libraries
import pandas as pd
import numpy as np        


Generate the data

Here is a tiny dataset composed of 12 rows that represents customer information, including their ID, name, email, and amount spent.

Columns:

  • CustomerID (string): unique customer identifier
  • Name (string): customer name
  • Email (string): customer email
  • Spent (integer): amount spent by the customer


# sample data placed in a dictionary
data = {
'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006'],
'Name': ['John', 'Mary', 'David', 'John', 'Mary', 'Emily', 'Michael', 'David', 'Sarah'],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600]
}

# create the dataframe
df = pd.DataFrame(data)

# introduce duplicates
duplicates = pd.DataFrame({'CustomerID': ['C001', 'C002', 'C003'], 'Name': ['John', 'Mary', 'David'], 'Email': ['[email protected]', '[email protected]', '[email protected]'], 'Spent': [100, 200, 300]})

# combine the dataframes
df = pd.concat([df, duplicates], ignore_index=True)

df        

Identify Duplicates

df[df.duplicated()].sort_values(by='CustomerID')

# Intentionally not removing duplicates, as they represent additional payments from the same customer        


Total Spent per Customer

I probably would have removed the duplicate rows. In this example, we are treating the duplicates as additional payments received from the customer. Remember, we are collecting data from various sources.

group = df.groupby(['CustomerID','Name','Email'])

# calculate the sum
group.sum()        


Can you solve the BONUS question?

  • What is the average spend per customer for the top 3 customers with the highest total spend? (Answer:?700.00)

David Rojas, E.I.

17+ years in Tech | Follow me for posts on Data Wrangling

5 个月

?? Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq

回复

要查看或添加评论,请登录

David Rojas, E.I.的更多文章

  • Optimizing Santas Workshop

    Optimizing Santas Workshop

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

    1 条评论
  • Tourism Trends

    Tourism Trends

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

  • Customer Purchase Analysis for a Fashion Retailer

    Customer Purchase Analysis for a Fashion Retailer

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

  • Data Cleaning Job

    Data Cleaning Job

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

    3 条评论
  • Pandas - GroupBy and Plot

    Pandas - GroupBy and Plot

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

  • Challenge: "Sales Analysis"

    Challenge: "Sales Analysis"

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

  • Movie Madness

    Movie Madness

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

  • How to Export Excel Cells into Text Files

    How to Export Excel Cells into Text Files

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

  • Analyzing Student Performance

    Analyzing Student Performance

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

  • Election Insights: Uncovering Voter Trends

    Election Insights: Uncovering Voter Trends

    Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand…

社区洞察

其他会员也浏览了