DoorDash Data Analysis
Stuart Walker
Fraud Prevention Analyst @ M&G PLC | Data Analyst | Data Scientist | Python | SQL | Machine Learning | Data Analytics | Excel | Tableau | Power BI | R
In the world we now live in we have a multitude of options to stay at home and get hot delicious food and groceries delivered directly to your door.?These tend to vary country to country and in my location I have the option to order from multiple businesses including but not limited to Just Eat, Uber Eats, Scoffable, Deliveroo and I could go on and on.?I was presented with an opportunity to analyse a dataset using Excel from the American company DoorDash which I found exciting to see what insights I could find.
*Please Note: This project is actually modified from an iFood job interview case study given by the Brazilian equivalent of DoorDash, iFood.?The data is 98% real, but slightly modified for educational purposes.?
Using the dataset I was provided with, I was tasked to:
I would do this by using various Excel features including the following:
DATA
As previously mentioned the data comes from an iFood Dataset which can be found here on Kaggle.?The dataset includes customer information for annual income, the total amount spent, acceptance of offer per campaign, customer age, total number of days as a customer, date joined, and lots more from DEC 2014 to NOV 2016.?There is data pertaining to currency within the dataset but it does not confirm which.?Therefore for the purposes of this project I have used USD$.
ANALYSIS
After uploading the data to Excel, the first action I made was to give each of the customer’s present on the dataset a unique id which would show as CustomerID on Column A.
Next I used various aggregate functions such as COUNT, MAX, MIN, AVERAGE, and SUM to gain a brief overview and insight into the data.?I also added a new column, PctIncome which calculates the percentage the customers spent on using the service annually from their yearly income.?After the column was in place I was curious to see who was spending the most from their income using the service.?After sorting this new column, I noticed the amount spent and related percentage for customer 21 was astronomically high compared to all other customer’s within the dataset.?Now this could be down to a data entry error and could be entirely innocent.?However with my experience working in a fraud background I am unable to ignore the fact that this could be fraud related and that customer 21 could be using stolen debit, credit or gift cards to make multiple purchases.?Either way this anomaly certainly needs to be investigated further.
For the next part of my analysis I wanted to check the correlation between how much each customer spent in relation to their annual income, for this I made a scatter plot.?Once I created the scatter plot I added a trendline and the R2 value to the plot.?According to the R2 value generated from the trendline (75%) I am very confident that there is a strong correlation between customer income and total spent.?Also during my analysis I spotted two outliers on the graph (highlighted by the red arrows).?One of the outliers is customer 21 who was mentioned in length above.?The other outlier on the scatter plot shows that the income was $113,371.00 and the amount spent was $274.00.?Using these data points I was able to ascertain that the outlier belonged to customer 1873.?My findings showed that customer 1873 is a relatively new customer (just over two months).?I was also able to confirm that they are 75 years old and the majority of their spend was on sweets.?There is a fairly high probability that customer 1873 only used the service to buy sweets for their grandchildren. The other possibility is that their spend was low due to the short time spent using the service.
I was curious to see what the age demographic is within the dataset as I felt this could create some useful insights.?To do this I created a new column named AgeGroup and wrote a nested IF statement to assign each customer into an age group.?I then created a pie chart to visualise the age distribution which shows that most of the customers fall within the 35-50 age range which was followed fairly closely by the 51-65 age range.?I could see that the least amount of customers was within the 24-35 age range which was surprising as I would have thought this range would have had a lot more if not the most.??
领英推荐
After checking the age demographic of the dataset I wanted to then check the respectful income and amount spent by age group. Both of the above bar charts clearly show that the 35-50 age range not only earn the most at $45,756,663.00 but they also spent the most at $459,264.00 followed very closely by the 51-65 age range who earn $40,892,975.00 and the spent $450,634.00.?They are then followed by the 66+ group who earn $18,292,647.00 and spent $218,659.00 and finally trailing behind is the 24-34 age group who earn the least at $8,884,434.00 and also spent the least at $112,339.00.?The bar charts and numbers clearly reinforce my analysis of the above scatter plot and R2 value that there is a strong correlation between income and total spend.
To continue my analysis of money spent I decided to create a histogram to show the amount spent by the customers.?To provide more insight I decided to organise the amount spent by each customer into six separate bins which did provide some good insight.?It showed that over half of the customers spent between $4.00 & $418.5 (the lowest value bin) and 19 customers spent between $2076.5 & 2491.00 (the highest value bin).?Even though over half of the customers spent between $4.00 & $418.5 the average amount spent is higher at $562.76.
After delving deep into the age, income and spend data I then decided to have a look at when customers joined the service.?We can see from the above bar chart that most customer joined in January with 223 and the least amount joined in December with 137.?More people may have joined at the beginning of the year for a few reasons, the first of which is after a very busy Christmas/Holiday period they may have decided to take it easy and decide to use a service which can deliver food directly to your door rather than having to go shopping.?Another reason could be because a lot of people received gift cards over the Christmas/Holiday period and signed up to make use of them.?On the flip side we can see the least amount of sign ups in December with 137 and the number of sign ups starts to decrease from October onwards.?This could be down to the fact that customers are starting to tighten their belts in the run up to the Christmas/Holiday period.?Overall though we can see that the number of sign ups from January to December remained fairly consistent with an average number of sign ups being 183 per month.
One of my main tasks was to see which customers purchased (or said yes) to a specific marketing campaign (specifically campaign 6 but more on that shortly) and that’s where the above bar chart comes in.?From the above bar chart it is clearly obvious that campaign 6 was the most successful at 333 and campaign 2 has the worst uptake with only 30 saying yes.
Now as part of my task I needed to analyse Campaign 6 more closely to capture more insights from that specific campaign.?To do this I decided to create a pivot table to capture the information pertaining to Campaign 6.?The above snippet shows the ages represented (red arrow) and those who said yes (blue arrow).?The total amount spent by each age (yellow arrow) and the average amount spent per age (orange arrow).?The data was able to tell me that customers aged 50 spent the most at $14,558.00 and had the most customers at 16.?However the highest average spend came from customer aged 32 with $2048.00 (there were only 2 customers for this age).
I was also tasked to check the amount of customers who have children and accepted campaign 6.?To do this I made a few changes to my pivot table which now shows how many customers have children and how it effects their spend.?We can see the amount of children represented within the dataset (red arrow), the total amount spent (yellow arrow), the amount of customers with 1, 2 or 0 children (blue arrow) and the average amount spent (orange arrow).?With all the data present we can clearly see that the majority of customer who accepted campaign 6 have no children and they also spent the most.?This could be that campaign 6 is marketed more towards single people or couples with no children.?The other probability is that customers with no children have more disposiable income due to not having children to look after.
MY INSIGHTS
MY RECOMMENDATIONS
Thank you for taking the time and reading my project. I welcome any feedback and if you enjoyed this project please feel free to connect as I will be posting more projects very soon.
Data Analyst | SQL & Tableau Specialist | Transforming Data into Actionable Insights.
3 个月Nicely done! The pie chart was my favorite!
Business Data Analyst | Administrative Assistant @ Life Safety Consultants
3 个月Enjoyed reading this- I think Excel is underestimated; you really showcased its capabilities! Your comments about the outliers potentially pointing to fraud in some cases caught my attention. Looking forward to reading your Fraud Detection Using XGBoost post!
Data Analyst specializing in | SQL I Tableau I Python | Transforming data into strategic insights for business growth ? Committed to continuous learning
1 年Wow! Your analysis blows my mind! It's so detailed and very appealing to the eyes. I'm impressed by your 3D visuals. Did you use Python to create those? How easy was writing articles on GitHub compared with LinkedIn? This is something I need to learn. Good job, Stuart!
building Fahm | MSDS @ UT | Huth Lab
1 年Love your video at the start! Very well written, and cool graphs, thanks for sharing.
Compassionate and Skilled Registered Imaging Professional Saving Lives One Mammogram at a Time. Mammography: Screening & Diagnostic | Radiology | Healthcare | Hologic | Breast Biopsy | QC | MQSA | EPIC | PACS
2 年Great article, Stuart! I loved the 3-D effect on your visuals and your cover video really catches attention.