Market Basket Analysis - Association Rule Mining, Apriori Algorithm
Abhi Sharma
Cloud Engineer @ Google | Ex-Founder | Building on Entrepreneurial Foundations
Market Basket Analysis is one of the most common and fundamental problems in the data science world. It is typically used for product recommendations in the e-commerce industry, but its applications extend well beyond e-commerce to various other fields. In this article, we are going to explore this technique in detail and discuss the metrics used to analyse the resulting recommendations.
Generally, Market Basket Analysis is used to recommend products to customers in an e-commerce setting, but it can also be used by a supermarket to arrange products on shelves, to identify stock that should be purchased together, to discover different purchase patterns, and to understand customer behaviour. There are many machine learning algorithms that cluster similar products or products that are sold together, but the most common way of identifying these relationships between products is Association Rule Learning. In this article, we are going to learn all about association rules and frequent item-sets.
The Dataset
The dataset used in this implementation is taken from Kaggle and is freely available. It consists of 7 columns - BillNo, Itemname, Quantity, Date, Price, CustomerId and Country. BillNo is the column that identifies a transaction: a single BillNo spans multiple records, one for each product sold in that particular transaction.
What are Association Rules?
An association rule is a relationship between two item-sets, together with a measure of how strong that relationship is. For example, if A and B are two sets of items, where A is the antecedent and B is the consequent, then the relationship between them is called an association rule and is denoted A->B.
Item-sets like A and B can contain any number of items, e.g. A = {'Banana', 'Milk'} and B = {'Vanilla Ice-cream'}; similarly, A could be {'Milk'} and B = {'Bread'}. Note that for an association rule A->B, A represents items that have already been bought together (something that has already happened) and B represents items that could be recommended (something that is likely to happen next).
There are several metrics that quantify the strength of the relationship between the item-sets in an association rule: support, confidence, lift, leverage and conviction. Let's discuss them in detail.
Support - The support of an item-set is the probability of the item-set occurring, i.e. the share of transactions containing the item-set out of all transactions. If A is an item-set then
support(A) = n(A)/n(transactions)
Confidence - For the association A->B, confidence is the probability of A and B occurring together across the transactions where A has already occurred.
confidence = n(A ∩ B)/n(A) = P(B|A)
Lift - Lift tells us about the strength of the relationship A->B. A lift greater than 1 indicates a positive association, a value close to 1 indicates that the item-sets are independent, and a value below 1 indicates a negative association.
lift = support(A ∩ B)/(support(A) x support(B)) = confidence(A->B)/support(B)
Leverage - Leverage is the difference between the observed support of A and B together, support(A->B), and the support expected if A and B were independent, support(A) x support(B). The value can vary between -1 and 1, and a leverage close to 0 indicates independence.
leverage = support(A->B) - support(A)*support(B)
Conviction - Conviction measures the dependence of the consequent on the antecedent. A value greater than 1 indicates a useful rule, and conviction can go up to infinity when the rule always holds.
conviction = (1 - support(B))/(1 - confidence(A->B))
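To make these definitions concrete, here is a small worked example in Python that computes all five metrics for the toy rule {Milk} -> {Bread} over five made-up transactions (the item names and data are purely illustrative, not from the Kaggle dataset):

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk"},
    {"Bread"},
    {"Butter"},
]
n = len(transactions)

A, B = {"Milk"}, {"Bread"}  # rule: A -> B

# Support: share of transactions containing the item-set
support_A  = sum(A <= t for t in transactions) / n         # 3/5 = 0.6
support_B  = sum(B <= t for t in transactions) / n         # 3/5 = 0.6
support_AB = sum((A | B) <= t for t in transactions) / n   # 2/5 = 0.4

confidence = support_AB / support_A                 # P(B|A) = 2/3
lift       = support_AB / (support_A * support_B)   # ~1.11: mild positive association
leverage   = support_AB - support_A * support_B     # 0.4 - 0.36 = 0.04
conviction = (1 - support_B) / (1 - confidence)     # 0.4 / (1/3) = 1.2

print(f"confidence={confidence:.3f} lift={lift:.3f} "
      f"leverage={leverage:.3f} conviction={conviction:.3f}")
```

Since lift is only slightly above 1 and leverage is close to 0, this toy rule is a weak one - exactly the kind of reading these metrics are meant to enable.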
What are frequent item-sets?
We defined item-sets above, but what are frequent item-sets? These are the item-sets that occur more often than a specified threshold, where the threshold is a minimum support value. Once we prepare transaction data containing all the item-sets, the next step is to identify the most common, or frequent, item-sets. For example -
Let's assume a dataset with around 100,000 transactions covering, say, 1,000 unique items. The number of possible item-sets is then 2^1000 - 1, an astronomically large number, and of course most of these combinations of items are of no interest to us. Now imagine millions of transactions happening every day. Therefore, we need to make sure we consider only those item-sets that satisfy a minimum support value, and these are called frequent item-sets.
There are two major algorithms used across industries to find frequent item-sets: Apriori and FP-Growth.
In this article, we are going to see the Apriori algorithm in action.
Prepare the dataset
The very first step is to prepare the dataset for further analysis. To implement association rule learning, we need to get the data ready: the Apriori algorithm requires a tabular structure where each row represents a single transaction and each column indicates whether an item was sold in that transaction.
For better understanding, refer to the table below.
In the table above, BillNo represents a transaction and there is a column for each product. A value of 1 indicates that the product was bought, whereas a 0 indicates that the product is not present in that particular transaction.
Note - If the data does not arrive in this format, it is our job as data scientists to prepare it for further analysis.
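As a sketch of that preparation step, assuming the raw data has the same shape as the Kaggle dataset described above (BillNo, Itemname, Quantity columns; the sample values here are made up), the pivot into one row per transaction can be done with pandas:

```python
import pandas as pd

# Hypothetical raw data in the shape of the Kaggle dataset
df = pd.DataFrame({
    "BillNo":   [1, 1, 2, 2, 2, 3],
    "Itemname": ["Milk", "Bread", "Milk", "Butter", "Bread", "Milk"],
    "Quantity": [1, 2, 1, 1, 3, 1],
})

# One row per transaction, one column per product, summed quantities
basket = (
    df.groupby(["BillNo", "Itemname"])["Quantity"]
      .sum()
      .unstack(fill_value=0)
)

# Apriori expects binary/boolean flags, not raw quantities
basket = basket > 0

print(basket.astype(int))
```

The resulting `basket` DataFrame has one boolean column per product, which is the format MLxtend's frequent-pattern functions expect.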
Find frequent item-sets using Apriori Algorithm
Once the data is prepared, we are all set to apply the Apriori algorithm and find frequent item-sets. To apply the algorithm, we use the MLxtend library in Python.
The code snippet below implements the algorithm on the prepared dataset and returns the frequent item-sets that satisfy the provided minimum support threshold.
In the above code, min_support is 0.01, i.e. 1% of all transactions in the dataset, and max_len is another parameter that caps the maximum number of items allowed in a frequent item-set.
The output should look something like this
Finding Association Rules from Frequent Item-sets
In our quest for association rules, the final step is to extract all the associations from the frequent item-sets. We are going to use another handy function from the MLxtend library to obtain all the association rules that satisfy a minimum threshold on the lift value.
In the above example, we used a minimum lift threshold of 1.5. The right threshold depends on your use-case, but generally any rule with a lift greater than 1.5 could be useful.
The above lines of code give us the association rules that can be used for recommendations. The output would look something like this.
Conclusion
In this article, we tried to understand what association rules and frequent item-sets are, and how to implement the Apriori algorithm in Python to find association rules. An industry implementation could be more complex; it may include many more data preparation steps and additional metric evaluation before final recommendations are served to the end-user.
For your reference, you can check this Kaggle notebook link and try to replicate or improve on the code on your own.
We will soon be back with another interesting topic next week for all the data enthusiasts. See you soon! Have a nice week ahead!
Author : Abhi Sharma - Linkedin