Making Insights By Exploring Simple Market Basket Analysis
In this article I will give some examples of simple analyses that can be done when you have a dataset of purchases, and how they can be extrapolated to other uses. Part of the information was taken from the article “A Gentle Introduction on Market Basket Analysis — Association Rules”, which is a good introduction to market basket analysis from the point of view of association rules, although I won't use the methods demonstrated there. Instead, I will use a shallow and naive approach that, although it carries limited insight, makes it easier to grasp the underlying base concepts and to see other uses besides retail.
The dataset that I will use can be found here; it comes from an online retailer, with transactions ranging from 01/12/2010 to 09/12/2011. To analyse the data I will use an R script, making heavy use of readily available libraries, which keeps the code very terse. Because of this, some extra explanation may be needed about why a function is used and what it does.
Starting with the code
We begin with the libraries needed to query the data and make the necessary data transformations. In this case I will mostly use general-purpose libraries rather than specialized analysis packages that would do most of the work of finding associations between items for us.
library(tidyverse)
library(readxl)
library(lubridate)
library(dplyr)
library(pracma)
library(reshape2)
library(tibble)
The first step is to load the dataset and make some initial transformations that will help out later. This part follows the same loading pattern as the article linked previously.
retail <- read_excel("Data/Online retail.xlsx")
retail <- retail[complete.cases(retail), ]
retail <- retail %>% mutate(Description = as.factor(Description))
retail <- retail %>% mutate(Country = as.factor(Country))
retail$Date <- as.Date(retail$InvoiceDate)
retail$Time <- format(retail$InvoiceDate, "%H:%M:%S")
retail$InvoiceNo <- as.numeric(as.character(retail$InvoiceNo))
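As a quick, optional sanity check (not part of the original walkthrough), you can confirm the new columns and the date range before moving on:

# Minimal sanity check, assuming the transformations above ran without errors.
glimpse(retail)      # column names and types, including the new Date and Time columns
range(retail$Date)   # should span roughly 2010-12-01 to 2011-12-09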
The next step is to decide which data features to base the analysis on. Again, for didactic purposes this will be a very shallow analysis, so we will only choose CustomerID, InvoiceNo, and Description (the name of the product) as features. We will make an initial set of assumptions: it does not matter when a purchase was made, in what order it was made, or where the customer is from. We will only make a static analysis from the selected features. This is a gross simplification, but it makes it easier to introduce the concepts, and we can grow in complexity later.
The initial goal is to make pseudo-predictions: when a customer selects a set of products, we want an idea, based on the data, of which other products that customer would probably get in the same purchase order. So we start with the following data transformations:
retailUnique <- retail %>%
  select(InvoiceNo, Description, CustomerID) %>%
  distinct()

retaiCustomerlUnique <- retailUnique %>%
  select(CustomerID, Description) %>%
  distinct()

retailUniqProds <- retaiCustomerlUnique %>%
  select(Description) %>%
  distinct() %>%
  arrange(as.numeric(Description))

retaiCustomerlUnique <- retaiCustomerlUnique %>%
  mutate(Product = as.numeric(Description)) %>%
  arrange(CustomerID)
For the first iteration we will sample the whole dataset for invoices that contain at least n-1 of the chosen products. This gets a big enough number of invoices without limiting the sample too much; as I said previously, this is a shallow approach, and I am not calculating individual or joint probabilities of particular itemsets. For the first experiment the script will choose three products at random from the list of unique products.
prods <- retailUniqProds[sample(nrow(retailUniqProds), 3), ]

invoices <- retailUnique %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>%
  filter(n >= 2)

sampledData <- retailUnique %>% inner_join(invoices)
sampledDataRemoved <- sampledData %>% anti_join(prods)

sampledDataFreqs <- sampledDataRemoved %>%
  group_by(Description) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq))

head(sampledDataFreqs)
We first need to get our sampled invoices, and then remove any entries with the chosen products from those invoices, so that we can make a frequency count and order the table from most frequent to least.
You can execute this script several times and you will get a list of product frequencies that appear in invoices where at least two of the selected products are part of the purchase. This frequency reflects how many times each product appears in customer invoices. Sometimes the script will return a list, and sometimes it won't return anything; in that case the selected products don't have any pairings in the dataset's invoices.
One of the things that comes through is how difficult it is for n randomly selected items to be present in the same purchase. Increasing n lowers the sample of invoices you can work with, and even if the selected products appear in lots of invoices, they might not appear together in the same invoice.
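A small sketch of this effect, assuming the objects from the previous block (prods, retailUnique) are still in memory, is to count how many invoices survive as the required number of matched products grows:

# Count invoices containing at least k of the chosen products, for k = 1, 2, 3.
# Illustrative only; the counts depend on the randomly sampled prods above.
matchCounts <- retailUnique %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n())

sapply(1:3, function(k) sum(matchCounts$n >= k))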
As an example, we can have the following randomly selected products:
This will give the following results once we run the script:
Selecting your own sample
You can also explore other possible item sets by selecting your own list of products, as shown in the code example below.
prods_chosen <- c("WHITE HANGING HEART T-LIGHT HOLDER", "PEACE SMALL WOOD LETTERS", "BREAD BIN DINER STYLE RED ") prods <- retailUniqProds[retailUniqProds$Description %in% prods_chosen,] invoices <- retailUnique %>% inner_join(prods) %>% group_by(InvoiceNo, CustomerID) %>% summarise(n = n()) %>% filter(n >= 2) sampledData <- retailUnique %>% inner_join(invoices) sampledDataRemoved <- sampledData %>% anti_join(prods) sampledDataFreqs <- sampledDataRemoved %>% group_by(Description) %>% summarise(Freq = n()) %>% arrange(desc(Freq)) head(sampledDataFreqs)
This will return the following results:
In this particular case we got a higher count of invoices with at least two of the selected products, and as a consequence we can see the six most frequent products that are paired with our selected products.
As a test, if you change the filter in the invoice selection line to require 'n' greater than or equal to 3, you will get an empty result set back. So the probability of having all three items in an invoice, without knowing the most frequent pairings beforehand, is low.
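To see this, the only change needed is the filter in the invoice selection step; a hypothetical variant of the block above:

# Same pipeline, but requiring all three chosen products in the same invoice.
invoices_all3 <- retailUnique %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>%
  filter(n >= 3)

nrow(invoices_all3)  # expected to be 0 for this particular product set, as described above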
Now, let's profile the customers
We will now change the approach: instead of using all of the invoices, we will profile every customer by the products they bought and rank them by similarity with each other. And instead of working from an anonymous customer, we will pick a customer from the dataset. We will still choose n items, but this time we will take a different route: we will draw the sample from the most similar customer profiles whose invoices contain the chosen products.
We will first generate the customer profiles. These will be shallow profiles: a vector over all the products that the customer purchased, without regard for quantity or the number of invoice occurrences. As a similarity metric, we will compare each customer profile against every other using the cosine similarity index.
dfProfileTemplate <- retailUniqProds %>%
  mutate(Product = as.numeric(Description)) %>%
  select(Product) %>%
  t() %>%
  as.data.frame() %>%
  add_column(CustomerID = 0, .before = 1)

dfOrderedCustomers <- retaiCustomerlUnique %>%
  select(CustomerID) %>%
  distinct() %>%
  arrange(CustomerID)

dfProfileTemplate[2:(nrow(dfOrderedCustomers) + 1), 1] <- dfOrderedCustomers$CustomerID
dfProfileTemplate <- dfProfileTemplate[-1, ]
dfProfileTemplate[is.na(dfProfileTemplate)] <- 0

for (i in 1:nrow(dfOrderedCustomers)) {
  xcustomer <- dfOrderedCustomers[i, ]$CustomerID[[1]]
  customerProds <- retaiCustomerlUnique %>%
    filter(CustomerID == xcustomer) %>%
    select(Product) %>%
    arrange(Product)
  for (j in 1:nrow(customerProds)) {
    prod <- customerProds[j, ]$Product[[1]] + 1  # offset by 1 to skip the CustomerID column
    dfProfileTemplate[i, prod] <- 1
  }
}
This first part of the code generates the data frame with the indicator vectors of purchased products for each customer. It has the CustomerID column, as well as all of the 3877 available products listed as columns, and each of these product columns can only take 0 or 1 as a value.
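As an aside, and assuming a reasonably recent tidyr (1.1 or later, loaded as part of the tidyverse), the same 0/1 profile matrix can be sketched without the explicit double loop; note that the product columns may come out in a different order than in dfProfileTemplate:

# Alternative sketch: one row per customer, one 0/1 column per product.
profileWide <- retaiCustomerlUnique %>%
  select(CustomerID, Product) %>%
  distinct() %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Product, values_from = value,
              values_fill = 0, names_sort = TRUE) %>%
  arrange(CustomerID)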
The following code shows two functions for calculating the cosine similarity. The first version is the straightforward definition of the metric, and the second is a slightly optimized version that receives the vectors in a pre-digested form, with their squared norms already computed. This metric returns a real number between 0 and 1, where 0 means that the two vectors don't match at all and 1 means that they are exactly the same.
fnCosineSimilarity <- function(vect1, vect2) {
  similarity <- dot(vect1, vect2) / (sqrt(dot(vect1, vect1) * dot(vect2, vect2)))
  return(similarity)
}

# Slightly optimized version: the squared norms are pre-computed and passed in.
fnCosineSimilarity2 <- function(vect1, vect1Mod, vect2, vect2Mod) {
  similarity <- dot(vect1, vect2) / (sqrt(vect1Mod * vect2Mod))
  return(similarity)
}

numberRows <- nrow(dfOrderedCustomers)
nc <- ncol(dfProfileTemplate)
customers <- dfOrderedCustomers$CustomerID

lstVectors <- list()
lstVectorsMod <- list()
vectCustomerID1 <- c()
vectCustomerID2 <- c()
vectSimilarityIndex <- c()

# Pre-digest each profile: the product vector and its squared norm.
for (i in 1:numberRows) {
  lstVectors[[i]] <- as.numeric(dfProfileTemplate[i, 2:nc])
  lstVectorsMod[[i]] <- dot(lstVectors[[i]], lstVectors[[i]])
}

# Compare every pair of customers once (i < j) and keep only non-zero similarities.
for (i in 1:(numberRows - 1)) {
  cust1 <- customers[i]
  for (j in (i + 1):numberRows) {
    cust2 <- customers[j]
    simil <- fnCosineSimilarity2(lstVectors[[i]], lstVectorsMod[[i]],
                                 lstVectors[[j]], lstVectorsMod[[j]])
    if (simil > 0) {
      vectCustomerID1 <- append(vectCustomerID1, cust1)
      vectCustomerID2 <- append(vectCustomerID2, cust2)
      vectSimilarityIndex <- append(vectSimilarityIndex, simil)
    }
  }
}

dfProfileSimilarity <- data.frame(CustomerID1 = vectCustomerID1,
                                  CustomerID2 = vectCustomerID2,
                                  SimilarityIndex = vectSimilarityIndex)
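Before running the full pairwise loop, a quick check of the similarity function on a few made-up toy vectors illustrates the 0-to-1 range:

# Identical vectors give 1, disjoint vectors give 0, partial overlap falls in between.
fnCosineSimilarity(c(1, 0, 1, 0), c(1, 0, 1, 0))  # 1
fnCosineSimilarity(c(1, 0, 1, 0), c(0, 1, 0, 1))  # 0
fnCosineSimilarity(c(1, 0, 1, 0), c(1, 1, 0, 0))  # 0.5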
Using the pairwise loop above, we create a data frame that pairs each customer with every other customer and marks each pair with a similarity index based on their product profiles. This won't be the most efficient implementation, so expect it to take some time to compute all of the customers' similarity scores.
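If the loop turns out to be too slow, one possible vectorized sketch (assuming the whole profile table fits in memory as a numeric matrix) computes every pairwise similarity with a single matrix product:

# Rough vectorized alternative: cosine similarity of every customer pair in one shot.
M <- as.matrix(dfProfileTemplate[, -1])           # drop the CustomerID column
norms <- sqrt(rowSums(M^2))
simMatrix <- (M %*% t(M)) / outer(norms, norms)   # full customer-by-customer matrix
rownames(simMatrix) <- colnames(simMatrix) <- dfProfileTemplate$CustomerID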
Next, we need to pick a customer number, the similarity index threshold that will filter the customer invoices, and the n products that we want to check. In this case I chose customer 15313, for convenience, since it has a lot of other customers matching its profile. I chose a similarity threshold of 0.1, since I don't want to leave out too many invoices from the comparison process and I still want to get some frequencies back.
dfCustomersBySimilarity <- dfProfileSimilarity %>%
  filter(SimilarityIndex >= 0.1 & (CustomerID1 == 15313 | CustomerID2 == 15313)) %>%
  mutate(CustomerID = ifelse(CustomerID1 == 15313, CustomerID2, CustomerID1))

dfJustCustomerId <- dfCustomersBySimilarity %>% select(CustomerID)

prods_chosen <- c("STRAWBERRY CHARLOTTE BAG",
                  "JUMBO BAG BAROQUE BLACK WHITE",
                  "LUNCH BAG RED RETROSPOT")
prods <- retailUniqProds[retailUniqProds$Description %in% prods_chosen, ]

invoices <- retailUnique %>%
  inner_join(dfJustCustomerId) %>%
  inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>%
  filter(n >= 3)

sampledData <- retailUnique %>% inner_join(invoices)
sampledDataRemoved <- sampledData %>% anti_join(prods)

sampledDataFreqs <- sampledDataRemoved %>%
  group_by(Description) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq))

head(sampledDataFreqs)
Once we execute the previous code we will get the following results:
As you can see, we were now able to get invoices containing all three products, although the frequencies are rather low. But at the same time we are matching invoices from customers that have a greater similarity to our target customer.
Final Thoughts
The methods shown in this article won't automatically find which products are most frequently paired; the idea is to enable exploration of the base concepts that lead down that path. If what you want is the product rule sets, then I recommend reading the article linked at the start of this post.
In the first part of this article, I tried to show how you can explore relations between products and what patterns can appear by mining the data of several online shoppers. This has a big obstacle from the start: for any given invoice, the conditional probability of most combinations of n random products being present is rather low. That is the reason why, instead of checking for all selected items, the check is whether n-1 of the selected items are present in the invoice.
In the second part, the idea is to explore how profiling customers can give more relevant results. Since birds of a feather tend to flock together, it makes sense to explore the idea of having product rule sets for customers that share similar tastes. Of course, there will be issues in getting a sample size that allows meaningful comparisons, and the same conditional probability issues are bound to appear in this case as well. But for customers that share a large degree of purchasing patterns with others, this does increase the odds in our favour.
In both explorations of this dataset, we didn't look at any of the time-related data features. This was a simplification; in reality the time dimension is quite important, since products might have a seasonal component, being purchased more often at Christmas, for Valentine's Day, or at the start of the summer holidays. Products might also be discontinued, or appear and disappear depending on fads and other consumer trends. Another aspect left out was the regional and geographic data features: customers from different countries might have slightly different shopping patterns.
The techniques for profiling users are also useful for finding ways to recommend products that match the previous patterns of similar customers. This might have the unintended side effect of increasing the convergence between customers in terms of purchasing patterns, making them even more similar, which might not be the intended result.