Machine Learning 9: 'Sequential Rule Mining'
Sequential Rule Mining is a data mining technique that consists of discovering rules in sequences. It has many applications, for example analysing the behaviour of customers in a supermarket, users on a website, or passengers at an airport.
Discovering sequential patterns in sequences
An important data mining problem is to design algorithms for discovering hidden patterns in sequences. To understand the patterns in a sequential activity dataset, we can apply a range of sequential rule mining algorithms and pattern mining techniques. There has been a lot of research on this topic in the field of data mining, and various algorithms have been proposed.
A sequential pattern is a subsequence that appears in several sequences of a dataset. For example, the sequential pattern <{a}{c}{e}> appears in the first two sequences of our dataset. This pattern is quite interesting: it indicates that customers who bought {a} often bought {c} afterwards, followed by {e}.
Such a pattern is said to have a support of two because it appears in two sequences of the dataset. Several algorithms have been proposed for finding all sequential patterns in a dataset, such as GSP (an Apriori-based approach), SPADE, and PrefixSpan. These algorithms take as input a sequence dataset and a minimum support threshold (min-sup). They then output all sequential patterns having a support no less than min-sup. These patterns are called the frequent sequential patterns.
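To make this concrete, here is a minimal sketch of support counting for sequential patterns. The toy dataset and the helper is_subsequence are hypothetical, purely for illustration; real algorithms such as SPADE or PrefixSpan use far more efficient search strategies than this brute-force check.

```python
# Count the support of a sequential pattern in a toy sequence dataset.
# Each sequence is a list of itemsets (sets of items bought together).

def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` with its itemsets in order."""
    pos = 0
    for itemset in sequence:
        if pattern[pos] <= itemset:  # pattern itemset contained in this itemset
            pos += 1
            if pos == len(pattern):
                return True
    return False

def support(pattern, dataset):
    """Number of sequences in the dataset that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in dataset)

# Hypothetical dataset: four customer purchase sequences.
dataset = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "e"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]

pattern = [{"a"}, {"c"}, {"e"}]      # the pattern <{a}{c}{e}>
print(support(pattern, dataset))     # -> 2: it appears in the first two sequences
```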
Association Analysis
There are a couple of terms used in association analysis that are important to understand. Association rules are normally written like this: {Diapers} -> {Beer}, which means that there is a strong relationship between customers who purchased diapers and beer in the same transaction. In the above example, {Diapers} is the antecedent and {Beer} is the consequent. Both antecedents and consequents can have multiple items. In other words, {Diapers, Gum} -> {Beer, Chips} is a valid rule.
Support is the relative frequency with which the rule appears in the data. In many instances, you may want to look for high support in order to make sure the relationship is useful. However, a low support can be useful if you are trying to find “hidden” relationships.
Confidence is a measure of the reliability of the rule. A confidence of 0.5 in the example above would mean that in 50% of the cases where Diapers and Gum were purchased, the purchase also included Beer and Chips. For product recommendation, a 50% confidence may be perfectly acceptable, but in a medical situation this level may not be high enough.
Lift is the ratio of the observed support of the rule to the support expected if the antecedent and consequent were independent. The basic rule of thumb is that a lift value close to 1 means the two sides of the rule are independent. Lift values > 1 are generally more “interesting” and could be indicative of a useful rule pattern.
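These three metrics can be computed directly from transaction counts. Below is a minimal sketch using the diapers-and-beer example; the five-transaction dataset is made up purely for illustration.

```python
# Compute support, confidence, and lift for the rule {Diapers} -> {Beer}.
transactions = [
    {"Diapers", "Beer", "Chips"},
    {"Diapers", "Beer"},
    {"Diapers", "Gum"},
    {"Beer", "Chips"},
    {"Milk", "Gum"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Diapers"}, {"Beer"}

sup_rule = support(antecedent | consequent, transactions)  # P(A and C)
confidence = sup_rule / support(antecedent, transactions)  # P(C | A)
lift = confidence / support(consequent, transactions)      # P(C|A) / P(C)

print(f"support={sup_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.40 confidence=0.67 lift=1.11
```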
The Apriori algorithm is based on conditional probabilities and helps us determine the likelihood of items being bought together, based on a priori (historical) transaction data.
There are three important parameters: support, confidence, and lift.
Suppose there is a set of transactions and a rule item1 -> item2. The support of item1 is defined as n(item1) / n(total transactions). Confidence, on the other hand, is defined as n(item1 & item2) / n(item1). So confidence tells us the strength of the association, and support tells us the relevance of the rule: we don’t want to include rules about items that are seldom bought, or in other words, that have low support. Lift is the confidence divided by the support of the consequent, i.e. lift(item1 -> item2) = confidence(item1 -> item2) / support(item2). The higher the lift, the more significant the rule.
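For an end-to-end run of Apriori itself, the mlxtend library (used in the Market Basket Analysis resource listed below) provides apriori and association_rules functions. This sketch assumes mlxtend and pandas are installed; the thresholds min_support=0.4 and min_threshold=1.0 are arbitrary choices for this toy data.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Same toy transactions as above, one-hot encoded for mlxtend.
transactions = [
    ["Diapers", "Beer", "Chips"],
    ["Diapers", "Beer"],
    ["Diapers", "Gum"],
    ["Beer", "Chips"],
    ["Milk", "Gum"],
]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets with support >= 0.4, then rules ranked by lift.
itemsets = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```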
Please note, however, that the Apriori algorithm is computationally expensive: it makes repeated scans over the transaction database to count candidate itemsets.
More Resources to explore
§ Introduction to Market Basket Analysis in Python
§ Machine learning and Data Mining - Association Analysis with Python
§ Apriori Algorithm (Python 3.0)
§ Association rules and frequent itemset
§ Association Rules and the Apriori Algorithm: A Tutorial
§ How to Create Data Visualization for Association Rules in Data Mining
------------
Exercise:
------------
As for the practice for this week, you have to apply association mining algorithms to this Kaggle competition.