Market Basket Analysis - Association Rule Mining, Apriori Algorithm
Abhi Sharma
Cloud Engineer @ Google | Ex-Founder | Building on Entrepreneurial Foundations
Market Basket Analysis is one of the most common and fundamental problems in the data science world. It is typically used for product recommendations in the e-commerce industry, but its applications extend well beyond e-commerce to various other fields. In this article, we are going to explore this technique in detail and discuss the metrics used to analyse the resulting recommendations.
Generally, Market Basket Analysis is used to recommend products to customers in an e-commerce setting, but it can also be used by a supermarket to arrange products on shelves, to identify stock that should be purchased together, to discover different purchase patterns, and to understand customer behaviour. There are many machine learning algorithms that cluster similar products or products that are sold together, but the most common way of identifying these relationships between products is Association Rule Learning. In this article, we are going to learn all about association rules and frequent item-sets.
The Dataset
The dataset used in this implementation is taken from Kaggle and is freely available. It consists of 7 columns - BillNo, Itemname, Quantity, Date, Price, CustomerId and Country. BillNo is the column that identifies a transaction: a single BillNo spans multiple records, one for each product sold in that particular transaction.
What are Association Rules?
An association rule is a relationship between two item-sets, together with a measure of how strong that relationship is. For example, if A and B are two sets of items, where A is the antecedent and B is the consequent, then the relationship between them is called an association rule and is denoted A->B.
Item-sets like A and B can contain any number of items, e.g. A = {'Banana', 'Milk'} and B = {'Vanilla Ice-cream'}; similarly, A could be {'Milk'} and B = {'Bread'}. Note that for an association rule A->B, A represents items that have already been bought together (something that has already happened) and B represents items that could be recommended (something that is likely to happen next).
There are several metrics that quantify the strength of the relationship between the item-sets in an association rule: support, confidence, lift, leverage and conviction. Let's discuss them in detail.
Support - The support of an item-set is the probability of the item-set occurring, i.e. the share of transactions containing the item-set out of all transactions. If A is an item-set then
support(A) = n(A)/n(transactions)
Confidence - For the association A->B, confidence is the probability of A and B occurring together across the transactions where A has already occurred.
confidence = n(A ∩ B)/n(A) = P(B|A)
Lift - Lift tells us about the strength of the relationship A->B. A lift greater than 1 indicates a positive association, a value close to 1 indicates that the item-sets are independent, and a value below 1 indicates a negative association.
lift = support(A ∩ B)/(support(A) x support(B)) = confidence(A->B)/support(B)
Leverage - Leverage is the difference between the observed support of A and B together, support(A->B), and the support expected if A and B were independent, support(A) x support(B). The value can vary between -1 and 1, and a leverage close to 0 indicates independence.
leverage = support(A->B) - support(A)*support(B)
Conviction - Conviction measures the dependence of the consequent on the antecedent. A value greater than 1 indicates a useful rule, and conviction can go up to infinity when the rule always holds.
conviction = (1 - support(B))/(1 - confidence(A->B))
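To make these definitions concrete, here is a small worked example in Python that computes all five metrics for the toy rule {Milk} -> {Bread} over five made-up transactions (the item names and data are purely illustrative, not from the Kaggle dataset):

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk"},
    {"Bread"},
    {"Butter"},
]
n = len(transactions)

A, B = {"Milk"}, {"Bread"}  # rule: A -> B

# Support: share of transactions containing the item-set
support_A  = sum(A <= t for t in transactions) / n         # 3/5 = 0.6
support_B  = sum(B <= t for t in transactions) / n         # 3/5 = 0.6
support_AB = sum((A | B) <= t for t in transactions) / n   # 2/5 = 0.4

confidence = support_AB / support_A                 # P(B|A) = 2/3
lift       = support_AB / (support_A * support_B)   # ~1.11: mild positive association
leverage   = support_AB - support_A * support_B     # 0.4 - 0.36 = 0.04
conviction = (1 - support_B) / (1 - confidence)     # 0.4 / (1/3) = 1.2

print(f"confidence={confidence:.3f} lift={lift:.3f} "
      f"leverage={leverage:.3f} conviction={conviction:.3f}")
```

Since lift is only slightly above 1 and leverage is close to 0, this toy rule is a weak one - exactly the kind of reading these metrics are meant to enable.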
What are frequent item-sets?
We defined item-sets above, but what are frequent item-sets? These are the item-sets that occur more often than a specified threshold, where the threshold is a minimum support value. Once we prepare transaction data containing all the item-sets, the next step is to identify the most common, or frequent, item-sets. For example -
Let's assume a dataset with around 100,000 transactions covering, say, 1,000 unique items. The number of possible item-sets is then 2^1000 - 1, an astronomically large number, and of course most of these combinations of items are of no interest to us. Now imagine millions of transactions happening every day. Therefore, we need to make sure we consider only those item-sets that satisfy a minimum support value, and these are called frequent item-sets.
There are two major algorithms used across industries to find frequent item-sets: Apriori and FP-Growth.
In this article, we are going to see the Apriori algorithm in action.
Prepare the dataset
The very first step is to prepare the dataset for further analysis. To implement association rule learning, we need to get the data ready: the Apriori algorithm requires a tabular structure where each row represents a single transaction and each column indicates whether an item was sold in that transaction.
For better understanding, refer to the table below.
In the table above, BillNo represents a transaction and there is a column for each product. A value of 1 indicates that the product was bought, whereas a 0 indicates that the product is not present in that particular transaction.
Note - If the data does not arrive in this format, it is our job as data scientists to prepare it for further analysis.
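As a sketch of that preparation step, assuming the raw data has the same shape as the Kaggle dataset described above (BillNo, Itemname, Quantity columns; the sample values here are made up), the pivot into one row per transaction can be done with pandas:

```python
import pandas as pd

# Hypothetical raw data in the shape of the Kaggle dataset
df = pd.DataFrame({
    "BillNo":   [1, 1, 2, 2, 2, 3],
    "Itemname": ["Milk", "Bread", "Milk", "Butter", "Bread", "Milk"],
    "Quantity": [1, 2, 1, 1, 3, 1],
})

# One row per transaction, one column per product, summed quantities
basket = (
    df.groupby(["BillNo", "Itemname"])["Quantity"]
      .sum()
      .unstack(fill_value=0)
)

# Apriori expects binary/boolean flags, not raw quantities
basket = basket > 0

print(basket.astype(int))
```

The resulting `basket` DataFrame has one boolean column per product, which is the format MLxtend's frequent-pattern functions expect.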
Find frequent item-sets using Apriori Algorithm
Once the data is prepared, we are all set to apply the Apriori algorithm and find frequent item-sets. To apply the algorithm, we use the MLxtend library in Python.
The code snippet below implements the algorithm on the prepared dataset and returns the frequent item-sets that satisfy the provided minimum support threshold.
In the above code, min_support is 0.01, i.e. 1% of all transactions in the dataset, and max_len is another parameter that caps the maximum number of items allowed in a frequent item-set.
The output should look something like this
Finding Association Rules from Frequent Item-sets
In our quest for association rules, the final step is to extract all the associations from the frequent item-sets. We are going to use another handy function from the MLxtend library to obtain all the association rules that satisfy a minimum threshold on the lift value.
In the above example, we used a minimum lift threshold of 1.5. The right threshold depends on your use-case, but generally any rule with a lift greater than 1.5 could be useful.
The above lines of code give us the association rules that can be used for recommendations. The output would look something like this.
Conclusion
In this article, we tried to understand what association rules and frequent item-sets are, and how to implement the Apriori algorithm in Python to find association rules. An industry implementation could be more complex; it may include many more data preparation steps and additional metric evaluation before final recommendations are served to the end-user.
For your reference, you can check this Kaggle notebook link and try to replicate or improve on the code on your own.
We will soon be back with another interesting topic next week for all the data enthusiasts. See you soon! Have a nice week ahead!
Author : Abhi Sharma - Linkedin