Pandas - The Donor Data Debacle

David Rojas, E.I.

17+ years in Tech | Follow me for posts on Data Wrangling

Published: June 25, 2024

Let's connect! Send me a connection invitation. I regularly share Jupyter Notebooks on Pandas and would love to expand my network.

Explore my profile: Head to my profile to see more about my work, skills, and experience.

If you're feeling generous: Repost this article with your network and help spread the word!

Description:

You're a data analyst for a non-profit organization, and you've been tasked with cleaning up a messy dataset of donations. The data is a bit of a disaster, with missing values, duplicates, and inconsistent formatting. Your mission is to use your Pandas skills to wrangle the data into shape.

Tasks:

Clean up the Mess: Remove duplicates, handle missing values, and ensure data types are correct.
Standardize the data: Normalize the 'Donation Amount' column and convert the date column to a standard format.
Data quality check: Identify and correct any inconsistent or invalid data.

The Data

The columns below represent information about individual donations, the date they were made, and the campaign that drove the donation. The goal is to clean, transform, and prepare this data for analysis.

Here's a breakdown of what each column in the sample data represents:

Donor ID: A unique identifier for each donor
Donation Amount: The amount donated by each donor (initially in a mix of numeric and string formats, requiring cleanup)
Date: The date each donation was made
Campaign: The marketing campaign or channel that led to the donation

Important Note about the Donation Amount Column:

The logic below will generate a mix of:

Numeric values (e.g., 10.50, 500.00)
String values with words (e.g., "10 thousand", "5 dollars and 25 cents")
String values with currency symbols (e.g., "$50", "$1000")

Your task will be to clean up this column by converting all values to a standard numeric format, handling the various string formats, and dealing with any potential errors or inconsistencies. Good luck!
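The generating logic itself did not survive in this copy of the article. As a rough stand-in, the sketch below builds a DataFrame with the same four columns and the same kinds of messy donation amounts. The campaign names, the value ranges, the share of missing values, and the helper name messy_amount are all assumptions made for illustration, so the result will not match the author's data row for row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sample is reproducible
n_rows = 10_000


def messy_amount(i):
    """Hypothetical helper: return an amount in one of several messy formats."""
    value = int(rng.integers(1, 100))
    kind = i % 6
    if kind == 0:
        return round(float(rng.uniform(1, 1000)), 2)  # plain numeric value
    if kind == 1:
        return f"{value} thousand"                    # string with words
    if kind == 2:
        cents = int(rng.integers(0, 100))
        return f"{value} dollars and {cents} cents"   # string with words
    if kind == 3:
        return f"${value}"                            # string with currency symbol
    return None                                       # missing value


# Random dates in 2022, stored as strings, with roughly 10% blanked out.
day_offsets = rng.integers(0, 365, size=n_rows)
dates = (pd.Timestamp("2022-01-01") + pd.to_timedelta(day_offsets, unit="D")).strftime("%Y-%m-%d")
date_col = pd.Series(dates, dtype="object").where(rng.random(n_rows) >= 0.1)

df = pd.DataFrame({
    "donor_id": rng.integers(1, 2_000, size=n_rows).astype(str),
    "date": date_col,
    "campaign": rng.choice(["Email", "Social Media", "Direct Mail"], size=n_rows),
    "donation_amount": [messy_amount(i) for i in range(n_rows)],
})

print(df.head())
```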

10000 rows × 4 columns

Let's start by looking at the datatypes.

As you might expect, Pandas is treating all of the columns as strings (the object dtype). Let the cleanup process begin.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   donor_id         10000 non-null  object
 1   date             10000 non-null  object
 2   campaign         10000 non-null  object
 3   donation_amount  6666 non-null   object
dtypes: object(4)
memory usage: 312.6+ KB
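A summary like the one above comes from DataFrame.info(). Here is a minimal sketch on a tiny made-up frame (the three sample rows are assumptions, not the author's data). The exact dtype label for text columns can differ between pandas versions.

```python
import pandas as pd

# Tiny hypothetical sample with the same four columns as the article's data.
df = pd.DataFrame({
    "donor_id": ["101", "102", "103"],
    "date": ["2022-03-07", "2022-11-08", "2022-04-07"],
    "campaign": ["Email", "Social Media", "Direct Mail"],
    "donation_amount": ["$50", None, "76 thousand"],
})

df.info()                       # column names, non-null counts, and dtypes
non_null = df.notna().sum()     # the same non-null counts as a Series
print(non_null)
```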

Clean up the Mess:

Remove duplicates, handle missing values, and ensure data types are correct.

If we assume the missing donation amounts cannot be recovered, we might as well remove those rows from the data.

<class 'pandas.core.frame.DataFrame'>
Index: 6666 entries, 1 to 9998
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   donor_id         6666 non-null   object
 1   date             6666 non-null   object
 2   campaign         6666 non-null   object
 3   donation_amount  6666 non-null   object
dtypes: object(4)
memory usage: 260.4+ KB
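One way to get this result is drop_duplicates() followed by dropna() restricted to the donation_amount column. This is a sketch on made-up rows, not necessarily the exact calls the author used. Note that the original index labels are kept, which is why the summary above runs from 1 to 9998 rather than being renumbered.

```python
import pandas as pd

# Hypothetical sample: row 2 duplicates row 1, row 0 has no donation amount.
df = pd.DataFrame({
    "donor_id": ["1", "2", "2", "3"],
    "date": ["2022-03-07", "2022-11-08", "2022-11-08", None],
    "campaign": ["Email", "Social Media", "Social Media", "Email"],
    "donation_amount": [None, "$50", "$50", "76 thousand"],
})

cleaned = (
    df.drop_duplicates()                    # keep the first copy of identical rows
      .dropna(subset=["donation_amount"])   # drop rows with no donation amount
)

print(cleaned)
```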

The marketing manager told us to replace any missing dates with '1970-01-01' so we can identify these and deal with them later.

Here is where we set the dates to 1970-01-01.

6666 rows × 4 columns
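Filling the missing dates with the placeholder can be done with fillna(). This is a minimal sketch assuming the missing dates are real nulls (None or NaN). If they are stored as empty strings or some other marker instead, they would need to be replaced with a null first.

```python
import pandas as pd

# Hypothetical sample with one missing date.
df = pd.DataFrame({
    "donor_id": ["1", "2", "3"],
    "date": ["2022-03-07", None, "2022-10-20"],
    "campaign": ["Email", "Social Media", "Direct Mail"],
    "donation_amount": ["$50", "81", "76 thousand"],
})

# Assign the result back to the column; the values are still strings here.
df["date"] = df["date"].fillna("1970-01-01")

print(df)
```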

Although we successfully filled in the missing dates, the date column is still stored as strings.

Convert string column to a datetime object.

1      2022-03-07
2      2022-11-08
4      2022-04-07
5      1970-01-01
7      2022-10-20
          ...    
9992   2022-01-13
9994   2022-07-21
9995   1970-01-01
9997   2022-08-24
9998   2022-10-07
Name: date, Length: 6666, dtype: datetime64[ns]
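The conversion can be sketched with pd.to_datetime(). The explicit format string assumes every date is written as YYYY-MM-DD, as in the output above. The index labels in the sample mirror the first few rows shown.

```python
import pandas as pd

# Hypothetical sample of date strings, including the 1970-01-01 placeholder.
dates = pd.Series(
    ["2022-03-07", "2022-11-08", "2022-04-07", "1970-01-01"],
    index=[1, 2, 4, 5],
    name="date",
)

# Parse the strings; a value that does not match the format raises an error.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

print(parsed)
```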

This morning, for some reason I can't get these datatypes to behave... the code below did not work.

We can also take care of the Donor ID pretty easily.

This also did not work...

This did the trick for me to get the data types to be represented correctly.

<class 'pandas.core.frame.DataFrame'>
Index: 6666 entries, 1 to 9998
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   donor_id         6666 non-null   Int64         
 1   date             6666 non-null   datetime64[ns]
 2   campaign         6666 non-null   string        
 3   donation_amount  6666 non-null   string        
dtypes: Int64(1), datetime64[ns](1), string(2)
memory usage: 266.9 KB
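The code behind these attempts is not shown here, so the following is only one possible way to arrive at the dtypes above, not necessarily what the author ran. It builds each converted column and replaces the originals with assign(). One possible explanation for the failed attempts (an assumption, since that code is missing) is that in recent pandas versions an assignment such as df.loc[:, "date"] = ... writes into the existing column and can leave its dtype unchanged, whereas df["date"] = ... or assign() swaps in the new column.

```python
import pandas as pd

# Hypothetical sample: every column starts out as text.
df = pd.DataFrame({
    "donor_id": ["101", "102", "103"],
    "date": ["2022-03-07", "1970-01-01", "2022-10-20"],
    "campaign": ["Email", "Social Media", "Direct Mail"],
    "donation_amount": ["6 dollars and 98 cents", "76 thousand", "81"],
})

# assign() returns a new frame in which each listed column is replaced.
typed = df.assign(
    donor_id=pd.to_numeric(df["donor_id"]).astype("Int64"),   # nullable integer
    date=pd.to_datetime(df["date"], format="%Y-%m-%d"),       # datetime
    campaign=df["campaign"].astype("string"),                 # pandas string dtype
    donation_amount=df["donation_amount"].astype("string"),   # still text for now
)

print(typed.dtypes)
```

Keeping all conversions in one assign() call makes it easy to see, in a single place, which dtype each column is meant to have.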

Donation Amount Cleanup

Remove the dollar sign
Apply a custom function to convert the values to a numeric format

1    6 dollars and 98 cents
2               76 thousand
4                        81
5         77.55500350452421
7               76 thousand
Name: donation_amount, dtype: string

1                 6.98
2                76000
4                   81
5    77.55500350452421
7                76000
Name: donation_amount, dtype: string
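The custom function is not included in this copy, so here is a sketch of what the two steps might look like. The function name parse_amount and the regular expressions are assumptions. They cover only the formats named in this article ("N thousand", "N dollars and M cents", and a leading dollar sign), and anything else passes through unchanged.

```python
import re

import pandas as pd


def parse_amount(text):
    """Hypothetical helper: rewrite a messy amount as a plain numeric string."""
    cleaned = text.replace("$", "").strip().lower()   # step 1: remove the dollar sign

    match = re.fullmatch(r"(\d+) thousand", cleaned)
    if match:
        return str(int(match.group(1)) * 1000)        # "76 thousand" becomes "76000"

    match = re.fullmatch(r"(\d+) dollars and (\d+) cents", cleaned)
    if match:
        dollars, cents = int(match.group(1)), int(match.group(2))
        return f"{dollars}.{cents:02d}"               # "6 dollars and 98 cents" becomes "6.98"

    return cleaned                                    # already numeric, or an unknown format


amounts = pd.Series([
    "6 dollars and 98 cents",
    "76 thousand",
    "81",
    "77.55500350452421",
    "$50",
])

converted = amounts.map(parse_amount)                 # step 2: apply the custom function
print(converted)
```

The result is still text at this point, which matches the string dtype in the output above. The numeric conversion is a separate step.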

Now let's fix the datatype for the donation amount.

<class 'pandas.core.frame.DataFrame'>
Index: 6666 entries, 1 to 9998
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   donor_id         6666 non-null   Int64         
 1   date             6666 non-null   datetime64[ns]
 2   campaign         6666 non-null   string        
 3   donation_amount  6666 non-null   float64       
dtypes: Int64(1), datetime64[ns](1), float64(1), string(1)
memory usage: 266.9 KB
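A sketch of the final conversion, assuming the cleaned amounts are numeric strings. It uses pd.to_numeric() with errors="coerce", which turns anything that still fails to parse into NaN instead of raising. Whether the author used this call or astype(float) is not shown. The "oops" value is made up to demonstrate the coercion.

```python
import pandas as pd

# Hypothetical cleaned amounts, plus one value that cannot be parsed.
amounts = pd.Series(["6.98", "76000", "81", "77.55500350452421", "oops"], dtype="object")

# Unparseable values become NaN; astype makes the float64 target explicit.
numeric = pd.to_numeric(amounts, errors="coerce").astype("float64")

print(numeric)
print("values that failed to convert:", int(numeric.isna().sum()))
```

Counting the NaN values afterwards is a quick way to see how many rows the cleanup function did not handle.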

OK, so we have taken care of a lot here.

The donor_id column is now in integer format
The date column is now in the correct format
The donation_amount column has been successfully cleaned up and converted to the correct numeric format
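The data quality check from the task list is not worked through above. One possible sketch, on made-up rows, flags the 1970-01-01 placeholder dates and any non-positive amounts so they can be reviewed later. Which rules actually count as "invalid" for this dataset is an assumption.

```python
import pandas as pd

# Hypothetical cleaned sample with one placeholder date and one negative amount.
df = pd.DataFrame({
    "donor_id": pd.array([101, 102, 103, 104], dtype="Int64"),
    "date": pd.to_datetime(["2022-03-07", "1970-01-01", "2022-10-20", "2022-04-07"]),
    "campaign": pd.array(["Email", "Social Media", "Email", "Direct Mail"], dtype="string"),
    "donation_amount": [6.98, 76000.0, -5.0, 81.0],
})

# One boolean column per rule makes it easy to see why a row was flagged.
checks = pd.DataFrame({
    "placeholder_date": df["date"] == pd.Timestamp("1970-01-01"),
    "non_positive_amount": df["donation_amount"] <= 0,
})

flagged = df[checks.any(axis=1)]   # rows that fail at least one rule
print(flagged)
```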

Data Gaze

I recommend getting this data into Microsoft Excel for a quick glance. Excel does a much better job of letting you scan the data on a nice, big monitor.

If you find a better way to update datatypes, please share it with me.
