Data Science: Learnings from 91,000 rows of cashier log data

For an assignment in the "Data Science in the Business World" course at UNC-Chapel Hill, my team was given the cashier log of a real cafe operating in North Carolina. While I cannot disclose the cafe's name, it is a licensed outlet of a top-3 coffee brand. Our task was to come up with recommendations to boost the cafe's sales and increase its profits, both of which had plummeted over the previous few months due to COVID-19. Because we are bound by a non-disclosure agreement (NDA), I will describe my key takeaways and learnings from this project and how my team arrived at a business recommendation without revealing the actual dataset or data.

The Data Science Pipeline

We used the following Data Science pipeline as the underlying blueprint for arriving at pertinent business recommendations:

Image copyright: Wolfram Research

Question, Collect, Wrangle

Data science is all about problem-solving and answering questions. We started the process by framing a broad question: how can the coffee shop recover quickly, and what does the cafe's management need to do or change to ensure success?

At this juncture, we had not yet looked at the data provided to us, so we knew we would come back to the question and rephrase it later.

Next came the data collection and data wrangling stage, which was straightforward since the data was handed to us. It came as a Pandas DataFrame in pickled form, along with a PDF file listing the costs of making some (not all) of the items the coffee shop sold.
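For context, here is a minimal sketch of how such a pickled log can be loaded and inspected (the file name below is a placeholder, not the actual file we received):

import pandas as pd

df = pd.read_pickle('cashier_log.pkl')  # hypothetical file name

df.info()  # column names, dtypes, and non-null counts at a glance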

Using the df.info() method on the loaded DataFrame (df), we instantly got to know what kind of data we were dealing with. Here are the dataset's columns that we found important:

  1. The Item column (the name of the item purchased)
  2. The Category column (the type of the item purchased)
  3. The BasketID column (the unique ID for every group of items purchased together in a single order; see the footnote for a comment on why this was not the best choice)
  4. The Qty column (the count of the particular item purchased in that transaction). Qty values are positive when the transaction is a purchase, and negative when it is a refund.
  5. The Gross Sales column (the selling price for the transaction, before factoring in the discount). In the data cleaning step, we stripped this column of its minus signs and kept only absolute values, for reasons that will become clear in the "Explore" step.
  6. The Discounts column (the discount given to the customer if the transaction is a purchase, or the discount deducted from the amount returned if it is a return/cancellation). In the data cleaning step, we noticed that some Discounts values were negative (whenever the transaction was a purchase), so we stripped this column of its minus signs too and kept only absolute values, again for reasons that will become clear in the "Explore" step.
  7. The TimeStamp column, which we used as the index of the DataFrame after cleaning (this column records the exact date and time of the transaction)

Lesson 1: Carefully review an overview of the data before doing anything else, and always keep in mind the question you are trying to answer.

Data Cleaning

This was the stage where we spent most of our time. One might think that data from a software cashier log should be pretty clean, since the software records every transaction in a structured database. Well, that is only partially true. Although the cashier log gave us structured data, we found many records that were either missing data or had inconsistencies.

Many columns, including those supposed to be numerical columns, had stray characters in them. Those had to be stripped. Some items and categories were either duplicated or spelled slightly differently. We used the following code snippet, in general, to get rid of the string-related issues:

df['col'] = df['col'].str.replace('stray_char(s)', 'replacement_char(s)', regex=False)
# where col is the column name, and replacement_char(s) can be an empty string;
# regex=False treats the pattern as a literal string rather than a regular expression

There were transactions involving items with the same name categorized differently on the menu. Some price points were named 'Large', 'Medium', and 'Small'; others used ounce measures for the same item, such as '8 oz', '12 oz', '16 oz', and '20 oz'; and still others used labels like 'Regular' and 'Kids size', all for the same item name.

Some items, such as protein shakes and bars, were not made by the coffee shop but simply stocked and sold. All of them had a price point label of 'Regular', and we left them at that. Backed by empirical evidence, we relabeled all liquid items with ounce measures: for example, 'Large' became '20 oz' and 'Kids size' was mapped to '8 oz' (I later found out from my professor that this was a mistake on our part; please see the footnote for details). We wrote neat helper functions that created filters matching one or more items by substring, and then used those filters to modify such groups of records quickly.
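To give a flavor of these helpers, here is a minimal sketch (the function name, the column names 'Item' and 'Price Point Name', and the example values are illustrative, not our actual project code; pandas is imported as pd, as earlier):

def item_filter(df, substrings):
    # boolean mask that matches any of the given substrings in the Item column
    mask = pd.Series(False, index=df.index)
    for s in substrings:
        mask |= df['Item'].str.contains(s, case=False, na=False)
    return mask

# example: relabel 'Large' as '20 oz' for a group of drink items
drinks = item_filter(df, ['Latte', 'Mocha'])
df.loc[drinks & (df['Price Point Name'] == 'Large'), 'Price Point Name'] = '20 oz'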

We had to write many tricky bits of code, as explained above. My team had members working from different time zones, and we soon realized we needed to collaborate in a professional, well-documented manner to keep the code moving in the right direction.

Continuing with data cleaning, we turned to type-checking each column. Many columns were of type string when they would have been much more meaningful as, say, float64 or category. For those type changes, we used:

# we have imported the Pandas module as pd

df['col'] = df['col'].astype('float64')   # set the column's type to float64

df['col'] = df['col'].astype('category')  # set the column's type to category

df['col'] = pd.to_datetime(df['col'])     # set the column's type to datetime64


# there is a quicker way to set the same type for multiple columns:

df[['col_1', 'col_2']] = df[['col_1', 'col_2']].astype('float64')

Lesson 2: It is very important to create detailed documentation that captures the intention and thought process of the original programmer, especially when working in a team. Document enough that every team member can easily follow the sense you have already made of the data at hand.

Lesson 3: Data cleaning usually takes the most time, so plan your time allocation accordingly - unless you already have super clean data.

So far so good. Let's continue!

Explore, Model, Validate

Once we got the data clean enough, we started making visualizations. I first modeled a Revenue column from the Gross Sales and Discounts columns. This was pretty straightforward to code:

import numpy as np

# NB: this works because we first converted Gross Sales and Discounts into
# absolute values
df['Revenue'] = (df['Gross Sales'] - df['Discounts']) * np.sign(df['Qty'])

The formula uses the Qty column, which denotes the number of items purchased. But here came a surprise: careful exploration of the unique values, ranges, means, medians, and standard deviations of the numerical columns revealed that the Qty column had negative values as well. These corresponded to returns, for which the revenue must be subtracted, not added. We handled this by applying np.sign to the Qty column (we imported the Numpy module as np). The formula works because we had already converted the Gross Sales and Discounts columns to absolute values, so the sign of Qty alone decides whether a record is a purchase or a return.

So, as a rule of thumb, before going deep into any kind of modeling, write a couple of good old print statements, sit back, relax, and inspect the following while sipping a cup of coffee:

df['col'].describe()
# for numeric data, this returns the count, mean, std, min, max, upper and lower quartiles, and the median.

df['col'].value_counts()
# returns a series containing counts of unique values

df['col'].unique()
# returns the unique values in the column

(ref: Pandas documentation)

Lesson 4: Do not presume things will be fine and dandy before they actually are. Keep your eyes wide open for surprises, because data has a habit of pulling a rabbit or two out of the hat. If you spot or foresee issues, great, but that will not always be the case, so learn to accept and deal with the unexpected and keep moving ahead!

With the revenue model done, we began making plots of the monthly revenue. As expected, there was a decline in sales, and therefore revenue, due to COVID.

With the revenue trends figured out, we started modeling the profitability of the business. We wanted a model of the costs incurred by the business, from which we could then calculate profits. We found that three things in every record determined how much that particular transaction cost the business:

  1. The name of the item.
  2. The price point of the item (8 oz, 12 oz, 16 oz, 20 oz, Kids size, Regular, and so on)
  3. The modifications the customer asked for on that item (such as adding more flavor)

We created two columns, a 'Pre-modification costs' column and a 'Modifier costs' column. As with data cleaning, we wrote helper functions to do the heavy lifting (creating filters for given item names or substrings, and assigning different cost values for different price points and modifiers).
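As an illustration, here is a rough sketch of that cost assembly (the item names, cost values, and the 'Modifiers' column name are placeholders; the real figures are covered by the NDA):

# hypothetical per-item, per-size costs taken from the cost PDF
base_costs = {('Latte', '12 oz'): 0.85, ('Latte', '20 oz'): 1.10}
modifier_costs = {'Extra shot': 0.30, 'Flavor syrup': 0.20}

def lookup_base_cost(row):
    # fall back to 0.0 when the cost PDF did not list the item
    return base_costs.get((row['Item'], row['Price Point Name']), 0.0)

df['Pre-modification costs'] = df.apply(lookup_base_cost, axis=1)

# assumes at most one modifier per record; the real data may need a split-and-sum
df['Modifier costs'] = df['Modifiers'].map(modifier_costs).fillna(0.0)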

We ended up with two columns of meaningful data that we could use to calculate profits. With this critical piece of information available, it was time to refine our question to: "How can the coffee shop increase its profits?"

Now this was a solid question that had the interest and attention of my entire team. Our next step was data exploration. Exploring data is best done not by staring at long lists or tables of numbers, but through smart visualizations that leverage our brain's innate ability to recognize patterns and spot things that break from the norm (source 1, source 2).


Lesson 5: Good visualization is key when it comes to data exploration. Thus, for data exploration, one visualization is worth thousands of records.

Here are the plots we derived by resampling, slicing, and dicing our DataFrame in different ways (a sketch of this kind of aggregation code follows the list):

  • Most commonly clubbed (combined) orders (bar plot)
  • Total number of items sold, by hour (bar plot)
  • Aggregate revenue by time of day (bar plot)
  • Aggregate revenue by day of the week (bar plot)
  • Average daily profit by hour (bar with line plot)
  • Average daily profit by hour, pre-COVID (bar with line plot)
  • Profit margin vs. sales volume (scatter plot, with each plotted point corresponding to a menu item)
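Here is a minimal sketch of the kind of aggregation behind a couple of these plots (assuming the cleaned TimeStamp column is already the DataFrame index, as described earlier, and that matplotlib is available):

import matplotlib.pyplot as plt

# total items sold per hour of day
items_by_hour = df.groupby(df.index.hour)['Qty'].sum()
items_by_hour.plot(kind='bar', xlabel='Hour of day', ylabel='Items sold')
plt.show()

# aggregate revenue by day of the week
revenue_by_weekday = df.groupby(df.index.day_name())['Revenue'].sum()
revenue_by_weekday.plot(kind='bar', xlabel='Day of week', ylabel='Revenue')
plt.show()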

The last chart was our a-ha moment. We realized that some items were doing decent sales volumes (number of items sold) and that we could nudge their profit margins a bit while still keeping demand healthy, judging by other items at that same level.

We also found items that were lagging in sales yet still posted respectable numbers. These items had the potential to do better if marketed and presented better.

Further, we found that 70% of customers bought only a single item per visit. This information was a gold nugget: the potential to offer an additional, attractively priced, slow-moving item alongside was enormous.
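A quick way to check a statistic like this (a sketch, using the BasketID column described earlier and counting purchases only):

# keep purchase rows only (positive Qty), then count items per basket
purchases = df[df['Qty'] > 0]
items_per_basket = purchases.groupby('BasketID')['Qty'].sum()
single_item_share = (items_per_basket == 1).mean()
print(f"Share of single-item baskets: {single_item_share:.0%}")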

So we decided to go ahead with clustering. First, I created a co-occurrence matrix, which counts, for every pair of items, how often they were bought together in the same basket (i.e., order). Given that the data had just under 300 different items, we needed an algorithm that could take data of this dimensionality and produce something sensible from it. We settled on t-SNE. The result was a 2-D scatter plot showing clustered items as points in different colors, which helped us visualize which items sold well together. Not surprisingly, these groups were often a drink together with a food item. In Data Science/Business parlance, this is called Market Basket Analysis. We exported the t-SNE output to a CSV, which we then plotted in Tableau (see the redacted plot below). The sizes of the bubbles represent the relative revenue of the items, and the position of each point is simply the t-SNE output. Typically, items with the same color should sit close together, and that is what we see here.
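Here is a rough sketch of how such a co-occurrence matrix and t-SNE embedding can be built (we did use t-SNE, but the scikit-learn call below is illustrative rather than our exact code):

import numpy as np
from sklearn.manifold import TSNE

# list of distinct items per basket
basket_items = df.groupby('BasketID')['Item'].unique()

items = sorted(df['Item'].unique())
index = {item: i for i, item in enumerate(items)}

# count how often each pair of items appears in the same basket
cooc = np.zeros((len(items), len(items)))
for basket in basket_items:
    for a in basket:
        for b in basket:
            if a != b:
                cooc[index[a], index[b]] += 1

# project each item's ~300-dimensional co-occurrence row down to 2-D
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(cooc)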

[Image: redacted Tableau plot of the t-SNE item clusters]


From the profit margin vs. sales scatter plot, combined with our knowledge of the item co-occurrence clusters, we finally arrived at the business recommendation of bundling items together. We recommended two bundles, each a coffee drink plus a food item. The recommendation was to present the bundle smartly and price it attractively, especially relative to what the items would cost if bought separately, i.e., not in the same basket.

The combined profit from a bundle was greater than the profit from just one item. Although the bundle's overall profit margin was lower, that was largely notional, because sales of the second item were by and large non-existent before the combo deal was offered.

The bundled deal would make better use of the infrastructure, manpower, and overheads the business was already paying for, by generating greater revenue from the same cost base and thereby increasing profitability. Suffice it to say that such combo deals would not only boost customer satisfaction but also create a compelling value proposition, one that in turn generates repeat footfalls and incremental sales.

Further, the recommendation was backed by the idea that we could boost some lagging items by bundling them with the most popular (high-demand, high-sales) items, and we picked items that co-occurred in the same cluster. Here is a redacted (due to the NDA) image from our final business recommendation presentation that shows how we suggested our bundles:

[Image: redacted slide showing the suggested bundles]

To validate the idea, we built a simple numerical model in Google Sheets and assumed a 10% conversion to the bundle from both Drink A-only and Snack A-only buyers. We predicted an 11% increase in overall profitability and about 2,000 additional sales transactions.
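For readers who want the shape of that calculation, here is a back-of-envelope version in Python (all numbers below are placeholder inputs, not the NDA-protected figures from our actual sheet):

# hypothetical inputs
drink_only_orders = 10_000    # Drink A-only transactions in the period (placeholder)
snack_only_orders = 8_000     # Snack A-only transactions in the period (placeholder)
conversion_rate = 0.10        # assumed share of those buyers who switch to the bundle

single_item_profit = 1.20     # average profit per single-item sale (placeholder)
bundle_profit = 1.80          # average profit per bundle sold (placeholder)

converted = (drink_only_orders + snack_only_orders) * conversion_rate
extra_profit = converted * (bundle_profit - single_item_profit)
print(f"Converted transactions: {converted:.0f}, extra profit: ${extra_profit:,.2f}")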

Storytelling

This step is the culmination of all our efforts. We need to get the business owner or management, i.e., our client, to join the ride. From the point where we receive the raw data, through formulating the question, to finding and presenting the answers, we should try to involve the people who actually run the business. Working with the data alone carries the risk of missing out on the fantastic synergies that arise when the data scientist works in sync with the people who run the show.

We need to present everything in a way that is easy for the client to understand, which is why a data scientist needs to sharpen their presentation skills. Further, just as with data exploration, storytelling requires good visualization skills. As our professor for this class, Dr. Jack Snoeyink, says, "visualization and storytelling ... are both a skill and an art". Our team did the storytelling with an engaging Google Slides presentation.

We started by describing the current situation: the question we were trying to answer, the problems the business was facing, the average purchase value (which was low), and the share of customers buying only one item (which was high, at roughly 70%). This set the stage for introducing our ideas and recommendations, which we did after showing visualizations of how and what to bundle. Finally, to validate and back up our recommendation, we showed our validation work from Google Sheets.

I would like to add one last lesson from this project, one that comes not only from my team's work but also from the other teams that worked on the same assignment. It was surprising to see the diversity of ideas and methods the different teams applied, and the parts of the dataset they focused on, to arrive at and then back their business recommendations. Hence, here is the final lesson:

Final Lesson: There is so much you can infer from even a few columns of data! Be smart and intuitive with whatever data is available, and never complain about not having enough of it. As a data scientist, you need to develop a keen sense for playing with and analyzing whatever data you are given. It is the right perspective that wins the day for the business, and therefore for the scientist.

That's it, folks; this is how my team and I tried to help a real coffee shop! Thank you for reading. I hope you enjoyed the article; please let me know if you have any questions or suggestions. Any inputs or ideas to improve on this effort would be greatly appreciated, and I look forward to them in the comments section below.

Footnote

In the process of writing this article, my professor, Dr. Snoeyink, also brought to light the following insightful points (quoting him directly from our conversation):

Kids size in the Price Point Names column is actually 10 oz. Our team fell short on this - we assumed Kids size to be 8 oz. This information is hidden in the SKU column, a column not talked about in the article. The SKU column records the "Stock Keeping Unit" number or string for shelf retail items, and in some of the cases, it mentions 10 oz where for the same record the Price Point Name column value is Kids size (and this is how we were actually also supposed to find the sizes for Medium and Large, but luckily our guesstimates worked for those important price points).

PaymentID is better than BasketID because a canceled order will have one PaymentID, but two BasketIDs, one for the charge and one for the refund. It is only 0.13% of all transactions, so counting them twice won't be too bad.

I am deeply grateful for Dr. Snoeyink's support and inputs in the process of writing this article.
