Market Basket Analysis - Association Rule Mining, Apriori Algorithm

Market Basket Analysis is one of the most common and fundamental problems in data science. It is typically used for product recommendations in the e-commerce industry, but its applications are not limited to e-commerce. In this article, we are going to explore this technique in detail and discuss the metrics used to evaluate the recommendations.

Generally, Market Basket Analysis is used to recommend products to customers in an e-commerce setting, but a supermarket can also use it to arrange products on shelves, stock items that tend to be bought together, identify different purchase patterns and understand customer behaviour. There are many machine learning algorithms that cluster similar products or products that are sold together, but the most common way of identifying these relationships is Association Rule Learning. In this article, we are going to learn all about association rules and frequent item-sets.

The Dataset

The dataset used in this implementation is taken from Kaggle and is freely available. It consists of 7 columns - BillNo, Itemname, Quantity, Date, Price, CustomerId and Country. BillNo is the column that defines a transaction: there are multiple records for a single BillNo, which together tell us what products were sold in that particular transaction.

What are Association Rules?

An association rule is a relationship between two item-sets, together with a measure of how strong that relationship is. For example, if A and B are two sets of different items where A is the antecedent and B is the consequent, then the relationship between them is called an association rule and is denoted by A->B.

Here, item-sets like A and B can contain any number of items, e.g. A = {'Banana', 'Milk'} and B = {'Vanilla Ice-cream'}. Similarly, A could also be {'Milk'} and B = {'Bread'}. Note that for an association rule A->B, A represents items that have already been bought together (something that has already happened) and B represents items that could be recommended (something that is likely to happen next).

There are several metrics that quantify the strength of the relationship between the item-sets in an association rule: support, confidence, lift, leverage and conviction. Let's discuss them in detail.

Support - The support of an item-set is the probability of the item-set occurring, i.e. the share of transactions containing the item-set out of all transactions. If A is an item-set, then

support(A) = n(A)/n(transactions)

Confidence - For the association A->B, confidence is the probability of A and B occurring together among the transactions in which A has already occurred.

confidence = n(A ∩ B)/n(A) = P(B|A)

Lift - Lift measures the strength of the relationship A->B. A value greater than 1 indicates a positive association, a value close to 1 indicates independence between the item-sets, and a value below 1 indicates a negative association.

lift = support(A ∩ B)/(support(A) x support(B)) = confidence(A->B)/support(B)

Leverage - Leverage is the difference between support(A->B) and the product support(A) x support(B). The value can vary between -1 and 1, and a value close to 0 indicates independence.

leverage = support(A->B) - support(A)*support(B)

Conviction - Conviction measures how strongly the consequent depends on the antecedent. A value of 1 indicates independence; for a useful rule it should be greater than 1, and it can go up to infinity (when confidence is 1).

conviction = (1 - support(B))/(1 - confidence(A->B))
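To make these definitions concrete, here is a small hand-computed sketch in Python. The transactions and item names are made up purely for illustration, and the rule evaluated is {'Milk'} -> {'Bread'}:

```python
# Hypothetical toy data: five transactions as sets of items.
transactions = [
    {'Milk', 'Bread'},
    {'Milk', 'Bread', 'Butter'},
    {'Milk'},
    {'Bread'},
    {'Eggs'},
]

n = len(transactions)
n_a = sum(1 for t in transactions if {'Milk'} <= t)            # antecedent count
n_b = sum(1 for t in transactions if {'Bread'} <= t)           # consequent count
n_ab = sum(1 for t in transactions if {'Milk', 'Bread'} <= t)  # both together

support_a = n_a / n                               # 3/5 = 0.6
support_b = n_b / n                               # 3/5 = 0.6
support_ab = n_ab / n                             # 2/5 = 0.4
confidence = n_ab / n_a                           # 2/3
lift = confidence / support_b                     # (2/3)/0.6 ~ 1.11
leverage = support_ab - support_a * support_b     # 0.4 - 0.36 = 0.04
conviction = (1 - support_b) / (1 - confidence)   # 0.4/(1/3) = 1.2

print(support_ab, confidence, lift, leverage, conviction)
```

With lift slightly above 1, this toy rule shows a weak positive association between Milk and Bread.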

What are frequent item-sets?

We defined item-sets above. Now, what are frequent item-sets? These are the item-sets that occur more often than a specified threshold, and this threshold is expressed as a minimum support value. Once we prepare transaction data containing all the item-sets, the next step is to identify the most common, or frequent, item-sets. For example -

Let's assume a dataset has around 100,000 transactions covering a few thousand distinct items. Since n distinct items can form 2^n - 1 possible item-sets, the number of candidate item-sets is astronomically large, and of course most of these combinations are not important for us. Now imagine millions of transactions happening every day. Therefore, we need to make sure we consider only those item-sets that satisfy a minimum support value, and these are called frequent item-sets.
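A quick back-of-the-envelope sketch of this explosion (pure Python, illustrative only):

```python
# With n distinct items, the number of possible non-empty item-sets
# is 2**n - 1, which grows exponentially in n.
for n in (10, 50, 100):
    print(n, "items ->", 2**n - 1, "possible item-sets")
```

Already at 100 distinct items the count exceeds 10^30, which is why we prune with a minimum support threshold instead of enumerating everything.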

There are two major algorithms used across industries to find frequent item-sets.

  1. Apriori : The Apriori algorithm scans the entire transaction dataset multiple times, looking for frequent item-sets. Item-sets whose support is greater than or equal to the minimum support threshold are selected.
  2. FP Growth : FP Growth, also known as the frequent pattern growth algorithm, uses a special data structure called an FP-Tree. It is a tree-based algorithm that maintains a tree of all the items and their associations with other items, which is then used to find frequent item-sets.

In this article, we are going to see the Apriori algorithm in action.

Prepare the dataset

The very first step is to prepare the dataset for further analysis. To implement association rule learning, we need to get the data ready. The Apriori algorithm requires a tabular structure in which each row represents a single transaction and the columns indicate which items were sold in that transaction.

For better understanding, refer to the table below.

[Image: a sample of the prepared dataset]

In the image above, each BillNo represents a transaction and there is a column for each product. A value of 1 indicates that the product was bought, whereas a 0 indicates that the product is not present in that particular transaction.

Note - If the data does not follow this format, it is our job as data scientists to prepare it for further analysis.
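As a sketch of this preparation step with pandas (the column names follow the dataset description above, but the sample rows here are made up):

```python
import pandas as pd

# Made-up sample rows in the shape of the Kaggle dataset:
# one row per (BillNo, Itemname) pair.
raw = pd.DataFrame({
    'BillNo':   [536365, 536365, 536366, 536366, 536366],
    'Itemname': ['WHITE HANGING HEART', 'GLASS STAR', 'GLASS STAR',
                 'RED WOOLLY HOTTIE', 'WHITE HANGING HEART'],
    'Quantity': [6, 8, 2, 3, 4],
})

# Sum quantities per bill and item, pivot items into columns,
# then encode presence/absence as 1/0.
basket = (raw.groupby(['BillNo', 'Itemname'])['Quantity']
             .sum().unstack(fill_value=0))
basket = (basket > 0).astype(int)
print(basket)
```

Each row of `basket` is now a transaction and each column a product, matching the table shown above.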

Find frequent item-sets using Apriori Algorithm

Once the data is prepared, we are all set to apply the Apriori algorithm and find frequent item-sets. For this, we use the MLxtend library in Python.

The code snippet below applies the algorithm to the prepared dataset and returns the frequent item-sets that satisfy the given minimum support threshold.

[Image: use of MLxtend to implement the Apriori algorithm]

In the code above, min_support is 0.01, i.e. 1% of all transactions in the dataset, and max_len is another parameter that caps the maximum number of items allowed in an item-set.

The output should look something like this

[Image: frequent item-sets discovered by the Apriori algorithm]

Finding Association Rules from Frequent Item-sets

In our quest for association rules, the final step is to identify all the associations among the frequent item-sets. We again use a handy function from the MLxtend library to get all the association rules that satisfy a minimum threshold on the lift value.

[Image: generating association rules with MLxtend]

In the example above, we used a minimum lift threshold of 1.5. The right threshold depends on your use-case, but generally any value greater than 1.5 could be useful.

The lines of code above give us the required association rules, which can be used for recommendations. The output would look something like this.

[Image: the resulting association rules]


Conclusion

In this article, we tried to understand what association rules and frequent item-sets are, and how to implement the Apriori algorithm in Python to find association rules. An industry implementation could be more complex and may involve many data preparation steps and the evaluation of various other metrics before final recommendations reach the end-user.

For your reference, you can check this Kaggle notebook link and try to replicate or improve on the code on your own.

We will soon be coming up with another interesting topic next week for all the data enthusiasts. See you soon! Have a nice week ahead!



Author : Abhi Sharma - Linkedin
