Making Insights By Exploring Simple Market Basket Analysis
In this article I will give some examples of simple analyses that can be done when you have a dataset of purchases, and how they can be extrapolated to other uses. Part of the information was taken from the article “A Gentle Introduction on Market Basket Analysis — Association Rules”, which is a good introduction to market basket analysis from the point of view of association rules, although I won't use the methods demonstrated there. Instead, I will use a shallow and naive approach that, although it carries limited insight, makes it easier to grasp the underlying base concepts and to see other uses besides retail.
The dataset that I will use can be found here; it comes from an online retailer, with transactions ranging from 01/12/2010 to 09/12/2011. To analyse the data I will use an R script, making heavy use of readily available libraries, which keeps the code very terse. Because of this, some extra explanation may be needed about why a function is used and what it does.
Starting with the code
We begin with the libraries needed to query the data and make the necessary data transformations. In this case I will mostly use general-purpose libraries rather than specialized analysis packages that would do most of the work of finding associations between items for us.
library(tidyverse)
library(readxl)
library(lubridate)
library(dplyr)
library(pracma)
library(reshape2)
library(tibble)
The first step is to load the dataset and make some initial transformations that will help out later. This part follows the same loading pattern as the article linked previously.
retail <- read_excel("Data/Online retail.xlsx")
retail <- retail[complete.cases(retail), ]
retail <- retail %>% mutate(Description = as.factor(Description))
retail <- retail %>% mutate(Country = as.factor(Country))
retail$Date <- as.Date(retail$InvoiceDate)
retail$Time <- format(retail$InvoiceDate, "%H:%M:%S")
retail$InvoiceNo <- as.numeric(as.character(retail$InvoiceNo))
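As a quick, optional sanity check (not part of the original walkthrough), you can confirm the new columns and the date range before moving on:

# Minimal sanity check, assuming the transformations above ran without errors.
glimpse(retail)      # column names and types, including the new Date and Time columns
range(retail$Date)   # should span roughly 2010-12-01 to 2011-12-09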
The next step is to decide which data features to base the analysis on. Again, for didactic purposes this will be a very shallow analysis, so we will only choose CustomerID, InvoiceNo, and Description (the name of the product) as features. We will make an initial set of assumptions: it does not matter when a purchase was made, in what order it was made, or where the customer is from. We will only make a static analysis from the selected features. This is a gross simplification, but it makes it easier to introduce the concepts, and we can grow in complexity later.
The initial goal is to make pseudo-predictions: when a customer selects a set of products, we want an idea, based on the data, of which other products that customer would probably get in the same purchase order. So we start with the following data transformations:
retailUnique <- retail %>%
  select(InvoiceNo, Description, CustomerID) %>%
  distinct()

retaiCustomerlUnique <- retailUnique %>%
  select(CustomerID, Description) %>%
  distinct()

retailUniqProds <- retaiCustomerlUnique %>%
  select(Description) %>%
  distinct() %>%
  arrange(as.numeric(Description))

retaiCustomerlUnique <- retaiCustomerlUnique %>%
  mutate(Product = as.numeric(Description)) %>%
  arrange(CustomerID)
For the first iteration we will sample the whole dataset for invoices that contain at least n-1 of the chosen products. This gets a big enough number of invoices without limiting the sample too much; as I said previously, this is a shallow approach, and I am not calculating individual or joint probabilities of particular itemsets. For the first experiment the script will choose three products at random from the list of unique products.
prods <- retailUniqProds[sample(nrow(retailUniqProds), 3), ]

invoices <- retailUnique %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>%
  filter(n >= 2)

sampledData <- retailUnique %>% inner_join(invoices)
sampledDataRemoved <- sampledData %>% anti_join(prods)

sampledDataFreqs <- sampledDataRemoved %>%
  group_by(Description) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq))

head(sampledDataFreqs)
We first need to get our sampled invoices, and then remove any entries with the chosen products from those invoices, so that we can make a frequency count and order the table from most frequent to least.
You can execute this script several times and you will get a list of product frequencies that appear in invoices where at least two of the selected products are part of the purchase. This frequency reflects how many times each product appears in customer invoices. Sometimes the script will return a list, and sometimes it won't return anything; in that case the selected products don't have any pairings in the dataset's invoices.
One of the things that comes through is how difficult it is for n randomly selected items to be present in the same purchase. Increasing n lowers the sample of invoices you can work with, and even if the selected products appear in lots of invoices, they might not appear together in the same invoice.
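A small sketch of this effect, assuming the objects from the previous block (prods, retailUnique) are still in memory, is to count how many invoices survive as the required number of matched products grows:

# Count invoices containing at least k of the chosen products, for k = 1, 2, 3.
# Illustrative only; the counts depend on the randomly sampled prods above.
matchCounts <- retailUnique %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n())

sapply(1:3, function(k) sum(matchCounts$n >= k))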
As an example, we can have the following randomly selected products:
This will give the following results once we run the script:
Selecting your own sample
You can also explore other possible item sets by selecting your own list of products, as shown in the code example below.
prods_chosen <- c("WHITE HANGING HEART T-LIGHT HOLDER", "PEACE SMALL WOOD LETTERS", "BREAD BIN DINER STYLE RED ") prods <- retailUniqProds[retailUniqProds$Description %in% prods_chosen,] invoices <- retailUnique %>% inner_join(prods) %>% group_by(InvoiceNo, CustomerID) %>% summarise(n = n()) %>% filter(n >= 2) sampledData <- retailUnique %>% inner_join(invoices) sampledDataRemoved <- sampledData %>% anti_join(prods) sampledDataFreqs <- sampledDataRemoved %>% group_by(Description) %>% summarise(Freq = n()) %>% arrange(desc(Freq)) head(sampledDataFreqs)
This will return the following results:
In this particular case we got a higher count of invoices with at least two of the selected products, and as a consequence we can see the six most frequent products that are paired with our selected products.
As a test, if you change the filter in the invoice selection line to require 'n' greater than or equal to 3, you will get an empty result set back. So the probability of having all three items in an invoice, without knowing the most frequent pairings beforehand, is low.
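To see this, the only change needed is the filter in the invoice selection step; a hypothetical variant of the block above:

# Same pipeline, but requiring all three chosen products in the same invoice.
invoices_all3 <- retailUnique %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>%
  filter(n >= 3)

nrow(invoices_all3)  # expected to be 0 for this particular product set, as described above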
Now, let's profile the customers
We will now change the approach: instead of using all of the invoices, we will profile every customer by the products they bought and rank them by similarity with each other. And instead of working from an anonymous customer, we will pick a customer from the dataset. We will still choose n items, but this time we will take a different route: we will draw the sample from the most similar customer profiles whose invoices contain the chosen products.
We will first generate the customer profiles. These will be shallow profiles: a vector over all the products that the customer purchased, without regard for quantity or the number of invoice occurrences. As a similarity metric, we will compare each customer profile against every other using the cosine similarity index.
dfProfileTemplate <- retailUniqProds %>%
  mutate(Product = as.numeric(Description)) %>%
  select(Product) %>%
  t() %>%
  as.data.frame() %>%
  add_column(CustomerID = 0, .before = 1)

dfOrderedCustomers <- retaiCustomerlUnique %>%
  select(CustomerID) %>%
  distinct() %>%
  arrange(CustomerID)

dfProfileTemplate[2:(nrow(dfOrderedCustomers) + 1), 1] <- dfOrderedCustomers$CustomerID
dfProfileTemplate <- dfProfileTemplate[-1, ]
dfProfileTemplate[is.na(dfProfileTemplate)] <- 0

for (i in 1:nrow(dfOrderedCustomers)) {
  xcustomer <- dfOrderedCustomers[i, ]$CustomerID[[1]]
  customerProds <- retaiCustomerlUnique %>%
    filter(CustomerID == xcustomer) %>%
    select(Product) %>%
    arrange(Product)
  for (j in 1:nrow(customerProds)) {
    prod <- customerProds[j, ]$Product[[1]] + 1  # offset by 1 to skip the CustomerID column
    dfProfileTemplate[i, prod] <- 1
  }
}
This first part of the code generates the data frame with the indicator vectors of purchased products for each customer. It has the CustomerID column, as well as all of the 3877 available products listed as columns, and each of these product columns can only take 0 or 1 as a value.
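As an aside, and assuming a reasonably recent tidyr (1.1 or later, loaded as part of the tidyverse), the same 0/1 profile matrix can be sketched without the explicit double loop; note that the product columns may come out in a different order than in dfProfileTemplate:

# Alternative sketch: one row per customer, one 0/1 column per product.
profileWide <- retaiCustomerlUnique %>%
  select(CustomerID, Product) %>%
  distinct() %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Product, values_from = value,
              values_fill = 0, names_sort = TRUE) %>%
  arrange(CustomerID)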
The following code shows two functions for calculating the cosine similarity. The first version is the straightforward definition of the metric, and the second is a slightly optimized version that receives the vectors in a pre-digested form, with their squared norms already computed. This metric returns a real number between 0 and 1, where 0 means that the two vectors don't match at all and 1 means that they are exactly the same.
fnCosineSimilarity <- function(vect1, vect2) {
  similarity <- dot(vect1, vect2) / (sqrt(dot(vect1, vect1) * dot(vect2, vect2)))
  return(similarity)
}

# Slightly optimized version: the squared norms are pre-computed and passed in.
fnCosineSimilarity2 <- function(vect1, vect1Mod, vect2, vect2Mod) {
  similarity <- dot(vect1, vect2) / (sqrt(vect1Mod * vect2Mod))
  return(similarity)
}

numberRows <- nrow(dfOrderedCustomers)
nc <- ncol(dfProfileTemplate)
customers <- dfOrderedCustomers$CustomerID

lstVectors <- list()
lstVectorsMod <- list()
vectCustomerID1 <- c()
vectCustomerID2 <- c()
vectSimilarityIndex <- c()

# Pre-digest each profile: the product vector and its squared norm.
for (i in 1:numberRows) {
  lstVectors[[i]] <- as.numeric(dfProfileTemplate[i, 2:nc])
  lstVectorsMod[[i]] <- dot(lstVectors[[i]], lstVectors[[i]])
}

# Compare every pair of customers once (i < j) and keep only non-zero similarities.
for (i in 1:(numberRows - 1)) {
  cust1 <- customers[i]
  for (j in (i + 1):numberRows) {
    cust2 <- customers[j]
    simil <- fnCosineSimilarity2(lstVectors[[i]], lstVectorsMod[[i]],
                                 lstVectors[[j]], lstVectorsMod[[j]])
    if (simil > 0) {
      vectCustomerID1 <- append(vectCustomerID1, cust1)
      vectCustomerID2 <- append(vectCustomerID2, cust2)
      vectSimilarityIndex <- append(vectSimilarityIndex, simil)
    }
  }
}

dfProfileSimilarity <- data.frame(CustomerID1 = vectCustomerID1,
                                  CustomerID2 = vectCustomerID2,
                                  SimilarityIndex = vectSimilarityIndex)
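Before running the full pairwise loop, a quick check of the similarity function on a few made-up toy vectors illustrates the 0-to-1 range:

# Identical vectors give 1, disjoint vectors give 0, partial overlap falls in between.
fnCosineSimilarity(c(1, 0, 1, 0), c(1, 0, 1, 0))  # 1
fnCosineSimilarity(c(1, 0, 1, 0), c(0, 1, 0, 1))  # 0
fnCosineSimilarity(c(1, 0, 1, 0), c(1, 1, 0, 0))  # 0.5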
Using the pairwise loop above, we create a data frame that pairs each customer with every other customer and marks each pair with a similarity index based on their product profiles. This won't be the most efficient implementation, so expect it to take some time to compute all of the customers' similarity scores.
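If the loop turns out to be too slow, one possible vectorized sketch (assuming the whole profile table fits in memory as a numeric matrix) computes every pairwise similarity with a single matrix product:

# Rough vectorized alternative: cosine similarity of every customer pair in one shot.
M <- as.matrix(dfProfileTemplate[, -1])           # drop the CustomerID column
norms <- sqrt(rowSums(M^2))
simMatrix <- (M %*% t(M)) / outer(norms, norms)   # full customer-by-customer matrix
rownames(simMatrix) <- colnames(simMatrix) <- dfProfileTemplate$CustomerID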
Next, we need to pick a customer number, the similarity index threshold that will filter the customer invoices, and the n products that we want to check. In this case I chose customer 15313, for convenience, since it has a lot of other customers matching its profile. I chose a similarity threshold of 0.1, since I don't want to leave out too many invoices from the comparison process and I still want to get some frequencies back.
dfCustomersBySimilarity <- dfProfileSimilarity %>%
  filter(SimilarityIndex >= 0.1 & (CustomerID1 == 15313 | CustomerID2 == 15313)) %>%
  mutate(CustomerID = ifelse(CustomerID1 == 15313, CustomerID2, CustomerID1))

dfJustCustomerId <- dfCustomersBySimilarity %>% select(CustomerID)

prods_chosen <- c("STRAWBERRY CHARLOTTE BAG",
                  "JUMBO BAG BAROQUE BLACK WHITE",
                  "LUNCH BAG RED RETROSPOT")
prods <- retailUniqProds[retailUniqProds$Description %in% prods_chosen, ]

invoices <- retailUnique %>%
  inner_join(dfJustCustomerId) %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>%
  filter(n >= 3)

sampledData <- retailUnique %>% inner_join(invoices)
sampledDataRemoved <- sampledData %>% anti_join(prods)

sampledDataFreqs <- sampledDataRemoved %>%
  group_by(Description) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq))

head(sampledDataFreqs)
Once we execute the previous code we will get the following results:
As you can see, we were now able to get invoices containing all three products, although the frequencies are rather low. But at the same time we are matching invoices from customers that have a greater similarity to our target customer.
Final Thoughts
The methods shown in this article won't automatically find which products are most frequently paired; the idea is to enable exploration of the base concepts that lead down that path. If what you want is the product rule sets, then I recommend reading the article linked at the start of this post.
In the first part of this article, I tried to show how you can explore relations between products and what patterns can appear by mining the data of several online shoppers. This has a big obstacle from the start: for any given invoice, the conditional probability of most combinations of n random products being present is rather low. That is the reason why, instead of checking for all selected items, the check is whether n-1 of the selected items are present in the invoice.
In the second part, the idea is to explore how profiling customers can give more relevant results. Since birds of a feather tend to flock together, it makes sense to explore the idea of having product rule sets for customers that share similar tastes. Of course, there will be issues in getting a sample size that allows meaningful comparisons, and the same conditional probability issues are bound to appear in this case as well. But for customers that share a large degree of purchasing patterns with others, this does increase the odds in our favour.
In both explorations of this dataset, we didn't look at any of the time-related data features. This was a simplification; in reality the time dimension is quite important, since products might have a seasonal component, being purchased more often at Christmas, for Valentine's Day, or at the start of the summer holidays. Products might also be discontinued, or appear and disappear depending on fads and other consumer trends. Another aspect left out was the regional and geographic data features: customers from different countries might have slightly different shopping patterns.
The techniques for profiling users are also useful for finding ways to recommend products that match the previous patterns of similar customers. This might have the unintended side effect of increasing the convergence between customers in terms of purchasing patterns, making them even more similar, which might not be the intended result.