Market Basket Analysis using Power BI and R

As you may know, according to Wikipedia, R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. Power BI is a business analytics service provided by Microsoft. It provides interactive visualizations with self-service business intelligence capabilities, so end users can create reports and dashboards by themselves, without having to depend on information technology staff or database administrators.

Today I want to talk about integrating these two important tools to build a Market Basket Analysis solution. To do so, I have used one of the most useful data mining patterns, called "Association Rules". Association rule mining is the data mining process of finding the rules that may govern associations and causal relationships between sets of items. So, given transactions with multiple items, it tries to find the rules that govern how or why such items are often bought together.

Consider the following set of transaction data, with transactions numbered 1 to 5; each transaction shows the items bought together. You can see that Diaper is bought with Beer in three transactions. Similarly, Bread is bought with Milk in three transactions, making them both frequent item-sets. Here we have:

Support (s): Fraction of transactions that contain the item-set

Support(Bread => Milk) = 3/5 = 0.6 = 60%

Confidence (c): For a rule A => B, confidence shows the percentage of transactions containing A in which B is also bought: the number of transactions with both A and B, divided by the number of transactions containing A.

Confidence(Bread => Milk) = 3/4 = 0.75 = 75%

Lift: Lift gives the correlation between A and B in the rule A => B. It shows how much the presence of item-set A affects the presence of item-set B, relative to what independence would predict.

Lift(A => B) = Support(A ∪ B) / (Support(A) × Support(B)), where A ∪ B is the combined item-set, so the numerator is the joint probability of A and B.

Lift(Bread => Milk) = 0.6 / (0.8 × 0.8) = 0.6 / 0.64 = 0.9375

So, the higher the lift, the higher the chance of A and B occurring together: a lift greater than 1 means the two item-sets appear together more often than expected under independence, while a lift below 1 (as with Bread and Milk here) means they appear together slightly less often than independence would predict.
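The three metrics are easy to verify by hand. Below is a short Python sketch (the pipeline later in this article is in R, but plain Python keeps the arithmetic self-contained) that recomputes support, confidence, and lift for {Bread} => {Milk}. The five transactions are an assumption, reconstructed to be consistent with the counts quoted above.

```python
# Five example transactions (item contents assumed from the counts in the text:
# Bread+Milk together in 3 baskets, Bread in 4, Milk in 4).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / n

supp_both = support({"Bread", "Milk"})                       # 3/5 = 0.60
confidence = supp_both / support({"Bread"})                  # 0.6 / 0.8 = 0.75
lift = supp_both / (support({"Bread"}) * support({"Milk"}))  # 0.6 / 0.64 = 0.9375
```

Swapping in other item pairs lets you check any rule against the table the same way.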

Goal of Association Rule Mining

When you apply Association Rule Mining to a given set of transactions T, your goal is to find all rules with:

  1. Support greater than or equal to min_support
  2. Confidence greater than or equal to min_confidence

APRIORI Algorithm

Association Rule Mining is viewed as a two-step approach:

  1. Frequent Item-set Generation: Find all frequent item-sets with support greater than or equal to a predetermined min_support threshold.
  2. Rule Generation: List all association rules derivable from the frequent item-sets, calculate support and confidence for each rule, and prune the rules that fail the min_support and min_confidence thresholds.

Frequent Item-set Generation is the most computationally expensive step because it requires a full database scan.
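The two steps above can be sketched in a few lines. This is an illustrative brute-force, level-wise implementation in Python, not the optimized Apriori inside the arules package, and it reuses the same assumed five-transaction example:

```python
from itertools import combinations

# The five example transactions (contents assumed from the counts quoted earlier).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise frequent item-set generation."""
    n = len(transactions)
    candidates = [frozenset([i]) for i in set().union(*transactions)]
    frequent = {}
    while candidates:
        # One pass over the data counts the support of every candidate.
        supports = {c: sum(c <= t for t in transactions) / n for c in candidates}
        survivors = {c: s for c, s in supports.items() if s >= min_support}
        frequent.update(survivors)
        # Join step: (k+1)-candidates are unions of two surviving k-sets.
        keys = list(survivors)
        candidates = list({a | b for a in keys for b in keys
                           if len(a | b) == len(a) + 1})
    return frequent

def generate_rules(frequent, min_confidence):
    """Step 2: split each frequent item-set into antecedent => consequent."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[lhs]  # subsets are frequent too (downward closure)
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), supp, conf))
    return rules

freq = frequent_itemsets(transactions, min_support=0.6)
rules = generate_rules(freq, min_confidence=0.7)
```

The full database scan happens inside the `while` loop, once per level, which is why step 1 dominates the cost.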

Now let's begin the MAGIC

First we need to have a data-set in R and run the apriori algorithm on it to get some meaningful rules back. Install R on your operating system; I recommend the RStudio IDE for better interaction.

install.packages("Matrix")
install.packages("RODBC")
install.packages("arules")
install.packages("methods")

Then load the installed libraries:

library(Matrix)
library(RODBC)
library(arules)
library(methods)

We want to read our data from a SQL database. The data has to be in transaction format, meaning each row must contain all the items bought in one transaction in a single column, comma-delimited. To do so, you can use the STUFF function in SQL Server to prepare the data in the desired format.
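As a sketch of that preparation step, here is the common STUFF + FOR XML PATH pattern. The table and column names (dbo.SalesLines, TransactionID, ItemName) are hypothetical; adapt them to your own schema:

```sql
-- Hypothetical schema: one row per item sold, keyed by transaction.
SELECT  t.TransactionID,
        STUFF((SELECT ',' + s.ItemName
               FROM   dbo.SalesLines AS s
               WHERE  s.TransactionID = t.TransactionID
               FOR XML PATH('')), 1, 1, '') AS NameValues
FROM    dbo.SalesLines AS t
GROUP BY t.TransactionID;
```

The alias NameValues matches the column the R code below reads from the query result.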

Go to the ODBC Data Source Administrator and create a new System DSN connection to your database server and database; don't forget to set the default database. Now let's get back to R and read data from our source using the DSN name that we created.

con <- odbcConnect("****")                                   # open the System DSN connection
raw_data <- sqlQuery(con, "SELECT **** FROM dbo.********")   # read the prepared transaction rows
itemsn <- strsplit(as.character(raw_data$NameValues), ",")   # split each basket into items
Digi <- as(itemsn, "transactions")                           # coerce to arules transactions

In this part we have started a connection using odbcConnect and DSN name , then read data using sqlQuery and our query , then split data using strsplit , after that convert it to "transaction" form using "as" function.

Then we need to apply the apriori algorithm with two parameters, support and confidence. Set these two parameters so that you get back a usable number of rules; you can explore your data to find the best values for them.

m1 <- apriori(Digi, parameter = list(support = 0.002, confidence = 0.1, minlen = 2))

Then we need to save the output table in order to use it in Power BI.

output <- as(m1, "data.frame")                           # rules as a data frame
trash_1 <- sqlQuery(con, "TRUNCATE TABLE dbo.*******")   # clear the destination table
sqlSave(con, dat = output, tablename = "******", append = TRUE)  # write the rules back

First we have converted the rules into data-frame format, then truncated the destination table, and finally saved the data to it using the sqlSave function.

We can save all the R code in a .R file and execute it with Rscript from PowerShell, or even schedule it in a SQL Server Agent job to run on a daily basis.

Visualization

Now let's move to Power BI for the visualization. Open Power BI and select your destination table as the data source; you will get a table like this:

I've masked the rule names for security reasons, because they are real data from a big retail company located in Iran. Now you need to create some new columns to extract the source and target items from each rule; I've prepared some easy DAX for that. Close and apply, go to the Data view, and create two new columns using these two DAX formulas.

Source = SUBSTITUTE(SUBSTITUTE(LEFT(ArulesResult[rules],FIND("=>",ArulesResult[rules],1)-1) , "{" , "") , "}" , "")

Target = SUBSTITUTE(SUBSTITUTE(RIGHT(ArulesResult[rules] , LEN(ArulesResult[rules]) - FIND("=>",ArulesResult[rules],1) - 1) , "{" , "") , "}" , "")

With these two formulas you can extract "Milk" and "Bread" from the rule {Milk} => {Bread}.
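To see what the two formulas are doing, here is the same parsing logic as a tiny Python function, assuming the typical text format the arules data frame produces for its rules column, e.g. "{Bread} => {Milk}":

```python
# Mirror of the two DAX columns: split on "=>", then strip spaces and braces.
def split_rule(rule):
    lhs, rhs = rule.split("=>")
    strip_braces = lambda part: part.strip().strip("{}")
    return strip_braces(lhs), strip_braces(rhs)
```

Multi-item antecedents such as "{Beer,Diaper} => {Milk}" come through as the comma-joined string "Beer,Diaper", just as the DAX version leaves them.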

Now go back to the Report tab. In the Visualizations pane, click the ellipsis (...), choose "Import from marketplace", search for "Force-Directed Graph", and import it into your visuals.

Use the Source column as the source, the Target column as the target, and lift as the weight, since lift is the most trusted value for judging how important and reliable a link is.

You can also add two slicers to filter the Source and Target items. That'll be all.

Enjoy

I've removed item names again for security issues.

Sorry ;)

Sources: RADACAD, DataCamp, and my personal experience.

