An analysis of four years of Amazon shopping data
Online shopping continues to grow in popularity and has become part of everyday life for millions of people. Amazon is obviously one of the biggest players in the market and has grown from being an online bookstore to an Aladdin's lamp when it comes to checking off the shopping list.
Without getting into my personal opinion on Amazon or how I think the online shopping sector promotes consumption beyond need, let me just say that earlier this year I asked a friend if I could access her Amazon account to analyse some of her shopping data and see if I could find something interesting. This friend has requested, nay forcefully insisted, that she remain anonymous.
Summary of the dataset:
- The data extracted covers the four-year period from 1 Jan 2016 to 31 Dec 2019.
- The data relates only to her Amazon shopping, although I know for a fact that she has shopped on Facebook Marketplace, Gumtree and eBay in parallel!
- The data covers 407 products, purchased through 185 unique transactions over this four-year period.
- For reasons that are quite obvious, I’m certain that Amazon collects an immense amount of data from each user. However, in the UK at least, one cannot export order history from an account, though I have read on a few forums that it is possible to do so in the US. I also don’t know how to use SQL or how to scrape the data from an account, so I took a very tedious route to capture her shopping history. Instead of going through her Amazon account every time, I searched her mailbox for all emails received between 1 Jan 2016 and 31 Dec 2019 from the address '[email protected]'. This returned only the order confirmation emails sent between those dates, and I manually entered the data from the summary in each email.
- The fields that I captured for every purchase were the date of purchase, the time of purchase, the product description, the product category and the purchase amount.
- I segregated purchased products into the following categories: household items, furniture, tech, toiletries, educational, clothing, medication, food, gifts for others and products for baby (she had a baby during this time!)
- I grouped product pricing into low price (between £0.01 and £19.99), mid price (between £20.00 and £49.99) and high price (£50.00 or more).
- I grouped purchase time into four slots: morning (between 6am and 12pm), afternoon (between 12pm and 6pm), evening (between 6pm and 12am) and late night (between 12am and 6am).
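The price bands and time slots above can be written as two small lookup functions. This is a minimal Python sketch rather than the spreadsheet formulas actually used. The function names `price_band` and `time_slot` are hypothetical, and it assumes that each boundary value (for example exactly £50.00, or exactly 12pm) belongs to the higher band or later slot.

```python
from datetime import time
from decimal import Decimal


def price_band(amount: Decimal) -> str:
    """Map a purchase amount in GBP to the low / mid / high price bands."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount < Decimal("20.00"):
        return "low"
    if amount < Decimal("50.00"):
        return "mid"
    return "high"  # assumption: exactly £50.00 counts as high price


def time_slot(t: time) -> str:
    """Map a time of day to one of the four six-hour slots."""
    # Assumption: slots are half-open, so 12:00 is afternoon and 18:00 is evening.
    if t.hour < 6:
        return "late night"
    if t.hour < 12:
        return "morning"
    if t.hour < 18:
        return "afternoon"
    return "evening"


if __name__ == "__main__":
    print(price_band(Decimal("12.99")), time_slot(time(1, 30)))
```

Using `Decimal` rather than floats keeps comparisons at the £19.99 / £20.00 boundary exact.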
Ideally, I would have liked to export her order history directly into a spreadsheet, which would have saved me X number of hours (where X is a ridiculous number), but Amazon UK doesn’t allow you to do that.
Limitations and constraints:
- There is no way to find out how long my friend was considering the purchase of an item (e.g. by studying stay time on a product page). This would have been insightful in trying to find a possible correlation between purchase consideration time and eventual purchase decision.
- The purchase process on such platforms occurs in two stages: “send to basket” and “check out”. There is a certain probability that a product sent to the basket will be abandoned rather than checked out, but there is no way for a user to get data on that, which would otherwise also have been very insightful.
- I could have captured a few more variables if I wanted to, provided I was happy to spend Y number of hours (where Y was a number that I wasn’t willing to put in). Some of these variables could have been the aggregate rating of the purchased product by other customers; whether my friend ended up using the product purchased or throwing it away or selling it shortly afterwards; and what other comparable products (and their prices) were offered by Amazon’s creepy algorithm as alternatives every time she was on the page of a product that she eventually bought. Needless to say, the company captures much more information from every user and that is eventually used to "optimise the user experience" (best said in a slightly sarcastic tone).
- I used only MS Excel for my analysis, which restricts the complexity of the statistical computation, especially when it comes to advanced forms of regression.
Assumptions:
- I assumed that the purchase decision was actually made when my friend checked out her shopping basket and made the payment.
- I wanted to consider only the purchases made by her, so I removed all purchases that were made by her husband using her account or that were made for others.
- I also assumed that the order confirmation email was sent to her mailbox as soon as payment was processed, thereby using the email arrival time as a proxy for the time that she made the purchase decision. In reality, it is very likely that she put a number of products into the basket and processed the payment at another time (possibly when she’s considered her decision completely!).
Methodology of analysis
The analysis was done in a very simple manner so no snazzy analytics there. Gather the data, clean it up, format it, slice and dice multi-layered pivot tables and then run a multiple regression.
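For readers who would rather not use Excel, the pivot-table step can be approximated in a few lines of Python. This is only an illustrative sketch, not the workbook used for this analysis: the sample rows are made up, and the function name `weekday_shares` is hypothetical. It computes, for each weekday, the share of purchases by count and by value.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sample rows: (purchase date, amount in GBP). Not the real data.
purchases = [
    (date(2019, 3, 6), 12.50),   # a Wednesday
    (date(2019, 3, 6), 55.00),   # a Wednesday
    (date(2019, 3, 9), 8.00),    # a Saturday
    (date(2019, 3, 11), 24.50),  # a Monday
]


def weekday_shares(rows):
    """Return {weekday: (share of count, share of value)} for the given rows."""
    counts = defaultdict(int)
    values = defaultdict(float)
    for day, amount in rows:
        name = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
        counts[name] += 1
        values[name] += amount
    total_count = sum(counts.values())
    total_value = sum(values.values())
    return {
        name: (counts[name] / total_count, values[name] / total_value)
        for name in counts
    }


if __name__ == "__main__":
    for name, (by_count, by_value) in weekday_shares(purchases).items():
        print(f"{name}: {by_count:.0%} of purchases, {by_value:.0%} of value")
```

The same grouping works for months, time slots or product categories by swapping the key that is derived from each row.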
Findings:
- In terms of basic summary statistics:
- My friend purchased products worth £6,483 over the four-year period. An interesting comparison to this figure would be what she spent on high street shopping during this same period.
- The mean purchase was £15, against a median of £10. The gap indicates a right-skewed distribution: most products were cheap, and a small number of expensive ones pulled the mean up.
- The top 70 purchases (out of 407) accounted for 50% of all purchases by value. The remaining 50% came from the other 337 products. In fact, 81% of all purchases were less than £20 each.
- Over the four years, Wednesday was the busiest day, with 18% of all products purchased and 24% of the value of total purchases. Further, 30% of all high price purchases were made on a Wednesday, contributing 44% of the value of all high price purchases. This is interesting, as I would have expected the weekend to be the time when she would make most purchases. However, during 2016 and 2017 my friend was working in a role where she would often be back home early in the afternoon on Wednesdays. It's possible that she would come home and spend some time sorting her shopping list.
- On average, most purchases were made in Feb and May (11% and 12% respectively). The fewest purchases were made in January. I know that she has often travelled back to her home country in December, so perhaps that explains a dip in purchases while she is there.
- Across all price ranges, most purchases were made either in the afternoon or late at night. Of the high price purchases, 30% were made in the afternoon (42% of value) while 39% were made late at night (39% of value). It's interesting how little she purchased between 6pm and 12am, but it's also insightful how much shopping was done between 12am and 6am.
- The shopping habits changed dramatically in 2019, however, when half of all purchases (51% of value) were made late at night. Of the high price products purchased that year, 67% (62% by value) were bought between 12am and 6am. My friend had a baby in early 2019 and I wouldn't be surprised if she scrolled Amazon whilst Baby kept her up in the night.
- Out of the 185 transactions, the two categories purchased most often together were household items and baby products.
- Across the four years, the most frequently purchased category was household items (36% by count, 32% by value). In 2019 alone, however, baby products accounted for 56% of all purchases (60% of value) that year.
- Using the purchase amount as the dependent variable, I carried out a multiple OLS regression but found no statistically significant relationship between the purchase amount and the month, day or time of purchase (even at the 10% significance level). This is unsurprising, since there are probably other factors that affect the purchase amount. It's also possible that more complicated forms of regression (such as multinomial logistic) may be more insightful if, for example, the three price categories are regressed against the time slots, days, months and product categories.
- I did carry out another multiple OLS regression, keeping the time of purchase as the dependent variable and the month and day as the independent variables. I found a statistically significant relationship. This is probably because the time of purchase is, to some extent, linked to the day and month in which the purchase was made. Again, a more advanced form of regression may lead to more decisive results.
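Two of the summary findings above, the gap between mean and median and the concentration of spend in a small number of purchases, are easy to reproduce on any list of purchase amounts. The Python sketch below is illustrative only: the amounts are invented rather than taken from my friend's data, and the helper name `top_n_for_share` is hypothetical.

```python
from statistics import mean, median


def top_n_for_share(amounts, share=0.5):
    """Smallest number of largest purchases whose combined value
    reaches `share` of the total spend."""
    total = sum(amounts)
    running = 0.0
    for n, amount in enumerate(sorted(amounts, reverse=True), start=1):
        running += amount
        if running >= share * total:
            return n
    return len(amounts)


# Hypothetical purchase amounts in GBP. Not the real data.
amounts = [5.0, 6.0, 8.0, 10.0, 10.0, 12.0, 15.0, 60.0, 74.0]

if __name__ == "__main__":
    # A mean well above the median signals a right-skewed distribution.
    print("mean:", round(mean(amounts), 2), "median:", median(amounts))
    print("purchases needed for half the spend:", top_n_for_share(amounts))
    under_20 = sum(1 for a in amounts if a < 20) / len(amounts)
    print(f"share of purchases under £20: {under_20:.0%}")
```

In this toy list, a couple of large purchases dominate total spend while most items are cheap, which is the same pattern described in the findings.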
So what?
What actions can be taken depends on whom you ask.
Even with such a restricted amount and level of data, Amazon, its suppliers and brand advertisers can improve the way they sell their respective products and/or services to my friend. But they have much more data on her habits. They know where she's logged into her account from, what device she's using, when she's logged in, what kind of search words she's using, how long she stays on the product pages reached from a particular combination of search words, how long she's looking at a product and where it appears in her search results, and much more. Imagine being able to slice and dice all this data and personalise it in such a way that she turns to Amazon for every purchase need. Multiply that by the number of Amazon users and it doesn't seem shocking that the organisation's current market capitalisation is over $1 trillion ($1,000,000,000,000).
My friend now knows that, over the four years, she has been using her phone a lot between 12am and 6am and has been buying a lot of things during this unholy time slot. She has made a lot of low price purchases, which may seem harmless (and very convenient), but it means that her stay time on the platform may be substantially prolonged. The purchase categories are heavily skewed towards baby products, both in number of purchases and in value; combining these with purchases from other categories could save time, though it might dilute the price impact of a single product. And perhaps (hopefully) my seemingly futile effort to log her four-year purchase data may make her contemplate whether a purchase every 3.5 days (on average) justifies the amount of time that she has spent on Amazon's platform.
The insights gleaned from data of such limited depth and breadth are interesting and can offer a retrospective on one's mobile shopping habits. There's no doubt that with the right amount and type of data, one's purchasing can be made more effective and efficient (from the purchaser's point of view). This would, in essence, bring control back towards the buyer and away from the seller, where it currently resides.
Epilogue
Jeff Hammerbacher, once the head of Facebook's data team, is quoted as having said: “The best minds of my generation are thinking about how to make people click ads. That sucks.”