BxD Primer Series: FP-Growth Pattern Search Algorithm

Hey there!

Welcome to BxD Primer Series where we are covering topics such as Machine learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in 'one-post-one-topic' format. Today's post is on FP-Growth Pattern Search Algorithm. Let's get started:

The What:

Both FP-Growth and ECLAT are improvements over the Apriori algorithm for frequent itemset mining. We covered Apriori (check here) and ECLAT (check here) in previous editions. This edition builds on both of those editions, so please read them first.

The FP-Growth algorithm works by building a data structure called an FP-Tree (Frequent Pattern Tree) that represents the transactional database. The tree is built by scanning the database and counting the frequency of each item. Items are then sorted in descending order of frequency, and a path is created in the tree for each transaction, with nodes representing the items in that transaction.

Once the tree is built, the algorithm recursively searches the tree for frequent itemsets by traversing the paths in the tree. The algorithm uses a technique called conditional pattern base and conditional FP-Tree to avoid generating all possible combinations of itemsets.

The FP-Growth algorithm has several advantages over Apriori, including its ability to handle large datasets with many items and its faster execution time. Additionally, because the algorithm only needs to scan the database twice (once to build the tree and once to mine the frequent itemsets), it is more efficient than Apriori, which needs to scan the database multiple times.

Comparison with ECLAT:

FP-Growth builds an FP-Tree data structure to represent the transactional database, while ECLAT uses a vertical data format (a list of transaction IDs per item). As a result, ECLAT typically requires less memory than FP-Growth.

FP-Growth uses a divide-and-conquer approach to find frequent itemsets, recursively decomposing the database into smaller conditional sub-databases, whereas ECLAT uses a depth-first search approach. This makes FP-Growth more efficient for large datasets, while ECLAT performs better for smaller datasets.

FP-Growth generates fewer candidates compared to ECLAT because it only generates candidates from the frequent items in the tree, whereas ECLAT generates candidates by intersecting the transactions of each item.

In terms of performance, FP-Growth is generally faster than ECLAT for larger datasets with a higher number of items. However, ECLAT may be faster for larger datasets with fewer items.

The How:

In the FP-Growth algorithm, the minimum support threshold is specified as an input parameter. Only those itemsets that have a support value greater than or equal to the minimum support threshold are considered frequent and are included in the output of the algorithm.

Minimum Support Threshold is a number between 0 and 1, usually expressed as a percentage. It is chosen using both visual and qualitative approaches, which we already covered in the ECLAT algorithm edition (check here).
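As a minimal sketch of how the threshold is applied, the first filtering pass over a toy dataset might look as follows (the transactions and the 0.4 threshold are made-up example values):

```python
# Toy transactional database (each transaction is a set of items)
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
    {"b", "e"},
    {"c", "d"},
]
min_support = 0.4  # an item must appear in at least 40% of transactions

# Count how many transactions contain each item
counts = {}
for t in transactions:
    for item in t:
        counts[item] = counts.get(item, 0) + 1

# Keep only items whose support meets the threshold
n = len(transactions)
frequent = {item: c for item, c in counts.items() if c / n >= min_support}
# d (2/5 = 0.4) just meets the threshold; e (1/5 = 0.2) is dropped
```

Items that survive this filter are the ones that get entries in the header table and nodes in the FP-Tree.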

Header table is a data structure used in the FP-Growth algorithm to keep track of the support count and the location of each frequent item in the FP-Tree. It looks as below:

[Image: header table listing each frequent item with its support count and a link to its first node in the FP-Tree]

Building the FP-Tree: given an input dataset and a minimum support threshold, the algorithm works as follows:

  1. Scan the dataset and count the frequency of each item. Remove any item that does not meet the minimum support threshold. Sort the remaining items in descending order of frequency, and create the header table.
  2. Rewrite each transaction with its items in the header table's sort order.
  3. Create the root node of the FP-Tree, represented by null.
  4. Grow the FP-Tree one transaction at a time. Here is how the tree growth works:

[Image: step-by-step FP-Tree growth for the example transactions]

  • Each node stores the item it represents and a counter recording how many transactions share the path up to that node; a shared prefix is stored only once, which compresses the dataset.
  • As you can see in the figure, the first two transactions form two separate branches because they share no common prefix.
  • The dotted line from the b of the first transaction to the b of the second indicates that nodes for the same item are connected through a linked list, which makes their retrieval very fast.
  • When mapping the third transaction onto the FP-Tree, a was common to TID 1 and TID 3, so a's counter was incremented and a new branch was created from it to represent TID 3.
  • Following the same approach, the final FP-Tree is obtained after reading all 10 transactions.
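The tree-building steps above can be sketched in Python. This is a minimal, illustrative implementation: the `FPNode` class, the four example transactions, and the threshold of 2 are assumptions for the sketch, not taken from the figure.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}     # item -> child FPNode
        self.node_link = None  # next node for the same item (header table's linked list)

def build_fp_tree(transactions, min_count):
    # Scan 1: count item frequencies and keep only frequent items
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Header table order: descending frequency (ties broken alphabetically)
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    header = {item: None for item in order}  # item -> head of its node-link list
    root = FPNode(None, None)                # the null root

    # Scan 2: insert each transaction with items in header-table order
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Thread the new node onto the header table's linked list
                child.node_link = header[item]
                header[item] = child
            else:
                child.count += 1  # shared prefix: just bump the counter
            node = child
    return root, header, frequent

transactions = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b", "d"]]
root, header, freq = build_fp_tree(transactions, min_count=2)
# Three transactions share the prefix "a", so that node's counter is 3
```

Note how the counter on the shared `a` node replaces three separate copies of the prefix, which is exactly the compression the bullets above describe.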

Extracting frequent itemsets from the FP-Tree works as follows:

  1. For each item, starting from the bottom of the header table, count its occurrences in the tree. If the count is less than the minimum support threshold, remove that item altogether.
  2. For each remaining item, build its conditional pattern base: the set of prefix paths that lead to the item, along with the support count of each path (the number of times that path leads to the item). These prefix paths represent all the frequent patterns that contain the item, excluding the item itself.
  3. For each item, build the conditional FP-Tree from its conditional pattern base and recursively generate frequent itemsets from it, using the same process as for the original FP-Tree, starting with the leaf nodes and working up the tree.
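A simplified sketch of this mining step: instead of building a compressed conditional FP-Tree, each conditional pattern base is kept as a plain list of (prefix path, count) pairs, which keeps the recursion easy to follow. The transactions and the threshold of 2 are made-up example values.

```python
from collections import defaultdict

def mine(paths, min_count, suffix, results):
    # paths: list of (tuple_of_items, count) pairs — the conditional pattern base
    counts = defaultdict(int)
    for path, c in paths:
        for item in path:
            counts[item] += c
    for item, c in counts.items():
        if c < min_count:
            continue  # step 1: drop items below the support threshold
        itemset = suffix | {item}
        results[frozenset(itemset)] = c
        # step 2: build this item's conditional pattern base — the prefixes
        # of paths that lead to it, each carrying its path count
        cond_base = []
        for path, cnt in paths:
            if item in path:
                prefix = path[: path.index(item)]
                if prefix:
                    cond_base.append((prefix, cnt))
        # step 3: recurse on the conditional pattern base
        mine(cond_base, min_count, itemset, results)

# Each transaction as a path with count 1, items pre-sorted by frequency
transactions = [(("a", "b"), 1), (("a", "b", "c"), 1), (("a", "c"), 1), (("b",), 1)]
results = {}
mine(transactions, 2, frozenset(), results)
# results maps each frequent itemset to its support count
```

Because items appear in a fixed frequency order, each frequent itemset is generated exactly once; a real FP-Growth implementation applies the same recursion to compact conditional FP-Trees instead of raw path lists.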

The final outcome of FP-Growth looks as below:

[Image: table of the resulting frequent itemsets with their support counts]

This output can further be used to calculate the confidence of association rules; rules are then accepted or rejected based on a confidence threshold.
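For example, the confidence of a rule A → B is Support(A ∪ B) / Support(A). The itemset supports below are hypothetical values, not from a real dataset:

```python
# Hypothetical frequent-itemset supports (fractions of all transactions)
support = {
    frozenset({"bread"}): 0.5,
    frozenset({"butter"}): 0.4,
    frozenset({"bread", "butter"}): 0.3,
}

def confidence(antecedent, consequent):
    both = frozenset(antecedent) | frozenset(consequent)
    return support[both] / support[frozenset(antecedent)]

conf = confidence({"bread"}, {"butter"})  # 0.3 / 0.5 = 0.6
min_confidence = 0.5
accepted = conf >= min_confidence  # rule bread -> butter is accepted
```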

Confidence(A → B) = Support(A ∪ B) / Support(A)

Concept of Lift:

Lift is a measure of the strength of the association between two items in an association rule. It is calculated as the ratio of the observed support of both items occurring together to the expected support if they were independent.

Lift(A → B) = Support(A ∪ B) / (Support(A) × Support(B))

If the value of lift is greater than 1, it means that the two items are positively correlated (i.e., their occurrence is not random and they tend to appear together). A value of 1 indicates independence (i.e., the occurrence of one item does not affect the occurrence of the other), and a value less than 1 indicates negative correlation (i.e., the occurrence of one item decreases the likelihood of the occurrence of the other).
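A quick numeric check with made-up supports:

```python
# Illustrative supports (fractions of transactions), not from a real dataset
support_a = 0.5    # support of item A
support_b = 0.4    # support of item B
support_ab = 0.3   # support of A and B occurring together

# Lift: observed joint support vs. expected support under independence
lift = support_ab / (support_a * support_b)  # ~1.5 > 1: positively correlated
```

If A and B were independent, support_ab would be about 0.5 × 0.4 = 0.2; the observed 0.3 is 1.5× that, so the rule carries real signal.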

Selecting Minimum Support and Confidence Thresholds:

Minimum Support and Confidence Thresholds are numbers between 0 and 1, usually expressed as percentages. They are chosen using both visual and qualitative approaches, which we already covered in the ECLAT algorithm edition (check here).

The Why:

Some reasons to use FP-Growth for pattern search:

  • Faster performance on large datasets: FP-Growth can handle large datasets more efficiently than Apriori and ECLAT because it only needs to scan the dataset twice and avoids generating candidate itemsets, which can be time-consuming for large datasets.
  • Memory-efficient: FP-Growth uses a compact data structure (i.e., the FP-Tree) that requires less memory than the frequent itemset tables used by Apriori and ECLAT. This makes it more suitable for datasets with a high number of items or a large number of transactions.
  • Ability to handle both dense and sparse datasets: Unlike Apriori, FP-Growth can handle both dense and sparse datasets efficiently because it does not generate candidate itemsets.
  • Better performance with low minimum support thresholds: FP-Growth is more efficient than Apriori and ECLAT when the minimum support threshold is low, as it can avoid generating many infrequent itemsets that will not meet the minimum support threshold.

The Why-Not:

Some reasons to not use FP-Growth for pattern search:

  • Slower performance on small datasets: FP-Growth may not be as efficient as Apriori and ECLAT on small datasets because the cost of constructing the FP-Tree may be higher than the cost of generating candidate itemsets.
  • Difficult to parallelize: FP-Growth is more difficult to parallelize than Apriori and ECLAT because the FP-Tree construction process is inherently sequential.
  • Less flexibility in handling complex datasets: FP-Growth may not be as flexible as Apriori and ECLAT in handling complex datasets with varying itemsets, as it relies on a fixed data structure (i.e., the FP-Tree) that is optimized for a specific dataset.

Time for you to support:

  1. Reply to this article with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In the coming posts, we will start with dimensionality reduction models such as PCA, LSA, SVD, LDA, and t-SNE.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #fp #growth #pattern #search #primer
