Pandas - Duplicate Row Detection and Grouping

David Rojas, E.I.

17+ years in Tech | Follow me for posts on Data Wrangling

Published: June 19, 2024

Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand my network

Explore my profile: Head to my profile to see more about my work, skills, and experience.

If you're feeling generous: Repost this article with your network and help spread the word!

Description:

You are a data scientist working for an e-commerce company. The marketing team has collected customer data from various sources, including website interactions, social media, and customer surveys. However, due to the diverse sources, there are duplicate records in the dataset.

Task:

Your task is to identify and combine duplicate rows based on specific criteria, and calculate the total spend for each unique customer.

Identify duplicate rows based on CustomerID, Name, and Email.
Combine duplicate rows into a single row, adding up the values in the Spent column.
Calculate the total spend for each unique customer.

Bonus Question:

What is the average spend per customer for the top 3 customers with the highest total spend? (Answer: 700.00)

# import libraries
import pandas as pd
import numpy as np

Generate the data

Here is a tiny dataset composed of 12 rows that represents customer information, including their ID, name, email, and amount spent.

Columns:

CustomerID (string): unique customer identifier
Name (string): customer name
Email (string): customer email
Spent (integer): amount spent by the customer

# sample data placed in a dictionary
data = {
    'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006'],
    'Name': ['John', 'Mary', 'David', 'John', 'Mary', 'Emily', 'Michael', 'David', 'Sarah'],
    'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600]
}

# create the dataframe
df = pd.DataFrame(data)

# introduce duplicates
duplicates = pd.DataFrame({'CustomerID': ['C001', 'C002', 'C003'], 'Name': ['John', 'Mary', 'David'], 'Email': ['[email protected]', '[email protected]', '[email protected]'], 'Spent': [100, 200, 300]})

# combine the dataframes
df = pd.concat([df, duplicates], ignore_index=True)

df
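
Before looking for duplicates, it can help to confirm that the combined frame matches the description above: 12 rows, 4 columns, and an integer Spent column. The following is a minimal, self-contained sketch of that check. The e-mail addresses are placeholders (an assumption, since the real addresses are redacted in the source), and the `names` lookup is only a compact way to rebuild the same rows.

```python
import pandas as pd

# Rebuild the article's 12 rows: 9 original rows plus 3 appended duplicates.
# NOTE: the e-mail addresses are hypothetical placeholders.
ids = ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006',
       'C001', 'C002', 'C003']
names = {'C001': 'John', 'C002': 'Mary', 'C003': 'David',
         'C004': 'Emily', 'C005': 'Michael', 'C006': 'Sarah'}
spent = [100, 200, 300, 100, 200, 400, 500, 300, 600, 100, 200, 300]

df = pd.DataFrame({
    'CustomerID': ids,
    'Name': [names[i] for i in ids],
    'Email': [f'{names[i].lower()}@example.com' for i in ids],
    'Spent': spent,
})

n_rows, n_cols = df.shape                      # rows and columns in the frame
n_customers = df['CustomerID'].nunique()       # distinct customer IDs
spent_is_int = pd.api.types.is_integer_dtype(df['Spent'])

print(n_rows, n_cols, n_customers, spent_is_int)
```

Fewer distinct customers than rows is the first hint that some customers appear more than once.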

Identify Duplicates

df[df.duplicated()].sort_values(by='CustomerID')

# Intentionally not removing duplicates, as they represent additional payments from the same customer
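
Note that `df.duplicated()` compares every column and only flags the second and later occurrences of a row. The task asks for duplicates based on CustomerID, Name, and Email, so here is a hedged sketch that restricts the comparison with `subset` and uses `keep=False` to show every row in a duplicate group. The e-mail addresses are placeholders (an assumption, since the real ones are redacted in the source).

```python
import pandas as pd

# Rebuild the article's 12 rows; e-mail addresses are hypothetical placeholders.
ids = ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006',
       'C001', 'C002', 'C003']
names = {'C001': 'John', 'C002': 'Mary', 'C003': 'David',
         'C004': 'Emily', 'C005': 'Michael', 'C006': 'Sarah'}
df = pd.DataFrame({
    'CustomerID': ids,
    'Name': [names[i] for i in ids],
    'Email': [f'{names[i].lower()}@example.com' for i in ids],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600, 100, 200, 300],
})

key = ['CustomerID', 'Name', 'Email']

# Default behaviour: all columns compared, first occurrence not flagged.
repeats_only = df[df.duplicated()]

# Key columns only, and every member of a duplicate group is flagged.
all_dupes = df[df.duplicated(subset=key, keep=False)].sort_values(by='CustomerID')

# How many rows each customer contributes.
rows_per_customer = df.groupby(key).size()

print(len(repeats_only), len(all_dupes))
print(rows_per_customer)
```

With `subset`, two rows for the same customer with different Spent values would still count as duplicates, which matters once payments are not all identical.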

Total Spent per Customer

In practice, I would probably have removed the duplicate rows. In this example, however, we treat each duplicate as an additional payment from the same customer. Remember, the data was collected from various sources.

group = df.groupby(['CustomerID','Name','Email'])

# calculate the sum
group.sum()
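
If you prefer a flat table over a MultiIndex result, one option is named aggregation with `as_index=False`. This is a sketch, not the only way to do it: the output column names `TotalSpent` and `Payments` are my own choices, and the e-mail addresses are placeholders (an assumption, since the real ones are redacted in the source).

```python
import pandas as pd

# Rebuild the article's 12 rows; e-mail addresses are hypothetical placeholders.
ids = ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006',
       'C001', 'C002', 'C003']
names = {'C001': 'John', 'C002': 'Mary', 'C003': 'David',
         'C004': 'Emily', 'C005': 'Michael', 'C006': 'Sarah'}
df = pd.DataFrame({
    'CustomerID': ids,
    'Name': [names[i] for i in ids],
    'Email': [f'{names[i].lower()}@example.com' for i in ids],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600, 100, 200, 300],
})

# One row per customer: total spend plus the number of payments combined.
totals = (
    df.groupby(['CustomerID', 'Name', 'Email'], as_index=False)
      .agg(TotalSpent=('Spent', 'sum'), Payments=('Spent', 'count'))
)

print(totals)
```

Keeping a payment count next to the total makes it easy to see which totals come from combined rows.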

Can you solve the BONUS question?

What is the average spend per customer for the top 3 customers with the highest total spend? (Answer: 700.00)
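
Spoiler ahead: one possible approach is to sum Spent per customer, take the three largest totals with `nlargest`, and average them. This is a sketch under the same assumptions as the rest of the example: duplicates count as extra payments, and the e-mail addresses are placeholders because the real ones are redacted in the source.

```python
import pandas as pd

# Rebuild the article's 12 rows; e-mail addresses are hypothetical placeholders.
ids = ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006',
       'C001', 'C002', 'C003']
names = {'C001': 'John', 'C002': 'Mary', 'C003': 'David',
         'C004': 'Emily', 'C005': 'Michael', 'C006': 'Sarah'}
df = pd.DataFrame({
    'CustomerID': ids,
    'Name': [names[i] for i in ids],
    'Email': [f'{names[i].lower()}@example.com' for i in ids],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600, 100, 200, 300],
})

# Total spend per unique customer.
totals = df.groupby(['CustomerID', 'Name', 'Email'])['Spent'].sum()

# Three highest totals (900, 600, 600), then their mean.
top3 = totals.nlargest(3)
average_top3 = top3.mean()

print(f'{average_top3:.2f}')  # 700.00
```

Two customers tie at 600, but both land in the top three, so the tie does not affect the answer.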

Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
