Market Basket Analysis

Saying that someone who buys a box of cereal for breakfast has a high likelihood of buying milk in the same purchase seems obvious, since these two products are clearly related. But what if there are hidden, unobvious relationships between other products? Walmart, a North American multinational, discovered some interesting and unexpected relationships between products in its stores:

  • Young adult men frequently bought beer and baby diapers in the same purchase [1];
  • Before a hurricane, many purchases combined Strawberry Pop-Tarts with batteries and other essential items [2];
  • Customers who bought Barbie dolls had a good chance of also buying one of three specific chocolate bars [3].

Discovering these relationships makes it possible to change a store's layout, placing the most frequently associated products closer to each other. These relationships also drive better stock management. For example, a Walmart store could have in stock some chocolate bars (of the kind bought with Barbie dolls) approaching their expiration date. Since Walmart knows there is a relationship between Barbie dolls and those chocolate bars, it can run a promotional campaign in which buying a Barbie doll results in a discount on the chocolate bars.

It is also possible to restructure the way products are sold. For example, fast-food chains discovered early on that people who eat fast food tend to feel thirsty because of the salt in the food. Thanks to this, fast-food chains started to create menus containing both food and drinks [2].

These relationships are also a good tool to assist online shopping. When purchasing a product from an online store like Amazon, we can see a "Frequently Bought Together" section, which presents the product sets most frequently acquired in the same purchase [4].

Talking about product relationships naturally leads to a discussion of Association Rules, which are the foundation of Market Basket Analysis.

Algorithms and Association Rules

Before explaining the algorithms and Association Rules, it is necessary to define some key concepts: transaction, basket and itemset. A transaction is a single purchased item. A basket is a complete purchase, composed of one or more transactions. An itemset is a set of products that is part of a basket (an itemset need not contain all the products of a basket).

• Apriori Algorithm

To compute the Association Rules in Market Basket Analysis, the Apriori algorithm can be used. This algorithm finds frequent itemsets through a process called "candidate itemset generation". It is an iterative process in which k-itemsets are used to explore (k+1)-itemsets, where k is the number of items in an itemset [8].

To sum up, the algorithm first finds the set of frequent 1-itemsets. Then it finds the set of frequent 2-itemsets, and so on, until no more frequent k-itemsets exist.

The Apriori algorithm has some disadvantages, such as being computationally expensive and taking a long time to run. This happens because the algorithm needs to iterate many times over the whole dataset to generate the itemsets.
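
To make the level-wise process concrete, here is a minimal, illustrative Apriori sketch in plain Python (a toy implementation for small datasets, not the optimized algorithm; the item names are hypothetical):

```python
def apriori(baskets, min_support=2):
    """Level-wise Apriori: frequent k-itemsets seed the (k+1)-itemset candidates."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: count the 1-itemsets and keep the frequent ones.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    # Passes 2..k: extend each frequent k-itemset by one item, recount, filter.
    while frequent:
        items = set().union(*frequent)
        candidates = {s | {i} for s in frequent for i in items if i not in s}
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
    return all_frequent

freq = apriori([{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "jam"}])
# {bread}: 3, {milk}: 2, {milk, bread}: 2
```

Note how each pass re-scans every basket to count the candidates; this repeated scanning is exactly the cost that FP-Growth avoids.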

• FP-Growth Algorithm

To overcome the disadvantages of the Apriori algorithm mentioned above, an improved algorithm called FP-Growth was created, where "FP" stands for "Frequent Pattern" [5].

Instead of iterating over the whole dataset many times like Apriori, FP-Growth iterates over it only twice and uses a tree structure (the FP-Tree) to store the information. Each tree node represents an item and the number of baskets whose path passes through that node. Each connection between nodes represents an itemset.

For a better understanding of how the algorithm works, each phase is illustrated below. The following table presents a dataset in which each line represents a basket with its items.

Table 1 – Example of a dataset

Firstly, the algorithm iterates over the dataset and counts the absolute frequency of each item. The items are then sorted by frequency (support count). If two items have the same frequency and are identified by a textual description, they are ordered alphabetically. The table below presents the output of this phase.

Table 2 – Support count of the items contained in the dataset
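
This first pass can be sketched in a few lines of Python. The baskets below are purely illustrative stand-ins, since the article's Table 1 is only available as an image:

```python
from collections import Counter

# Illustrative baskets (not the article's exact dataset).
baskets = [["milk", "bread"], ["milk", "bread", "butter"], ["bread", "jam"], ["milk", "jam"]]

# First pass over the dataset: absolute frequency (support count) of each item.
support_count = Counter(item for basket in baskets for item in basket)

# Sort by descending support count, breaking ties alphabetically.
order = sorted(support_count, key=lambda item: (-support_count[item], item))
print(order)  # ['bread', 'milk', 'jam', 'butter']
```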

Then, a tree is built, starting with the root. The root does not contain any element (it is null). Items are added in descending order of support count, taking into account the content of each basket. The resulting tree is presented below.

Figure 1 – Tree relating the different products of the baskets
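
A minimal sketch of the tree construction is shown below, assuming the insertion order C, A, F, D taken from the example paths discussed next (the real FP-Growth also keeps header links between nodes of the same item, which are omitted here):

```python
class FPNode:
    def __init__(self, item):
        self.item = item
        self.count = 0      # number of baskets whose path passes through this node
        self.children = {}  # item -> FPNode

def build_fp_tree(baskets, order):
    """Insert each basket, with items sorted by the global support order, under a null root."""
    root = FPNode(None)
    for basket in baskets:
        node = root
        for item in sorted(set(basket), key=order.index):
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

# Two baskets that produce the paths C-A-D and C-A-F-D:
root = build_fp_tree([["C", "A", "D"], ["C", "A", "F", "D"]], order=["C", "A", "F", "D"])
```

Because both baskets share the prefix C-A, those two nodes accumulate a count of 2, while the branches D and F-D below them each count 1.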

The last step is to build a conditional patterns table and to perform frequent pattern generation for each item. This table is constructed considering the support count in ascending order, i.e., the first item will be item G, the second will be item D, and so on.

The first column, the Conditional Pattern Base of a given item, is obtained from the tree above by finding the paths from the root to the item in question, following the connections between nodes. In the case of item D, the possible paths are C-A-D and C-A-F-D, represented by {A, C: 1} and {A, C, F: 1} respectively. Item D is counted once in each of the paths.

The second column, called Conditional FP-Tree, sums the counts of each item across the discovered paths. In the case of item D, the Conditional FP-Tree is obtained by joining <A:2>, <C:2> and <F:1>. However, to obtain the strongest Association Rules, it is possible to define a minimum support count, chosen by the analyst. In our example, this value is 2. Thus, the Conditional FP-Tree for item D is <A:2, C:2>.

For the third column, called Frequent Pattern Generation, all possible combinations between item D and the items present in its Conditional FP-Tree are generated. In this case, item D is joined to each of the other items (i.e., items A and C) and, finally, the set joining all three items is obtained.

Table 3 – Conditional Patterns Table

The table above does not contain the Frequent Patterns of item G. This is because none of the items in G's conditional tree fulfilled the requirement of a count greater than or equal to the minimum support count. We can also note that the items directly connected to the tree root are not considered in the Frequent Pattern Generation process, because there is no intermediary node between the root and the nodes connected to it.

Thus, the Frequent Patterns that are found for sets of 2 items are {A, D:2}, {C, D:2}, {A, F:2}, {C, F:2}, {B, E:2} and {A, C:3}. For sets of 3 items the Frequent Patterns are {A, C, D:2} and {A, C, F:2}.

It is important to say that the Frequent Patterns do not need to be generated from a single item only. Although we did so in our example, the conditional patterns table could also be built considering, for example, the combinations between items G and F, or D and F, by looking at the tree structure.
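
The combination step for a single item can be sketched as follows, using item D's Conditional FP-Tree <A:2, C:2> from the table above (the count of a generated pattern is the minimum count among the items involved):

```python
from itertools import combinations

def frequent_patterns(suffix, conditional_fp_tree):
    """Join every combination of Conditional FP-Tree items with the suffix item."""
    items = sorted(conditional_fp_tree)
    patterns = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            count = min(conditional_fp_tree[i] for i in combo)
            patterns[frozenset(combo) | {suffix}] = count
    return patterns

# Item D with Conditional FP-Tree <A:2, C:2>:
frequent_patterns("D", {"A": 2, "C": 2})
# -> {A, D}: 2, {C, D}: 2, {A, C, D}: 2
```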

The rules are obtained from the generated Frequent Patterns. From the previous example, for item D, the obtained rules are A → D, D → A, C → D, D → C, (A, C) → D, (A, D) → C and (C, D) → A.
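
Each frequent pattern yields rules by splitting it into every possible antecedent/consequent pair, as this small sketch shows. Applied to the patterns {A, D}, {C, D} and {A, C, D}, these splits cover (and extend) the rules listed above:

```python
from itertools import combinations

def rules_from_pattern(pattern):
    """Every split of a frequent pattern into antecedent -> consequent."""
    items = set(pattern)
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            rules.append((set(antecedent), items - set(antecedent)))
    return rules

# The 3-item pattern {A, C, D} alone yields six candidate rules,
# e.g. {A, C} -> {D} and {D} -> {A, C}.
rules_from_pattern({"A", "C", "D"})
```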

The strength of these rules is evaluated using metrics that are explained in the next section.

• Metrics to evaluate the Association Rules

The three most used metrics to evaluate the strength of Association Rules are support (not to be confused with the support count), confidence and lift [6].

An Association Rule is established by:

Antecedent -> Consequent

Therefore, let X and Y be two items and let X → Y be the rule established between them. The considered metrics are computed as follows:

support(X → Y) = frq(X, Y) / N

confidence(X → Y) = frq(X, Y) / frq(X)

lift(X → Y) = confidence(X → Y) / (frq(Y) / N)

where frq(X, Y) represents the number of baskets that contain items X and Y at the same time, and N represents the total number of baskets. frq(X) and frq(Y) represent the number of baskets that contain item X and item Y individually, respectively. These frq(X, Y), frq(X) and frq(Y) values are a representation of Frequent Patterns.

Not all rules are really important or significant, and thus, in order to reduce time and the consumption of computational resources, it is necessary to establish minimum values for the evaluation metrics. Usually, these minimums are set on support and confidence. However, there is no well-established way to choose them.

If a rule has a low support value, there is not enough information about the relationship between the items that compose it. Confidence represents a conditional probability: the probability of Y given X. Typically, high confidence values are preferred; however, high confidence does not necessarily mean a strong Association Rule. That is where the lift metric comes in. Lift indicates how significant and strong a rule is: it is the increase in the probability of having Y given X over the support of Y. For a rule to be considered strong, its lift value should be well above 1. A lift value lower than 1 means that having X in a basket does not increase the likelihood of the occurrence of Y, even if the rule has high confidence.
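
A direct translation of these formulas into Python (X and Y are sets of items; the beer/diapers baskets below are illustrative, echoing the Walmart anecdote):

```python
def rule_metrics(baskets, X, Y):
    """Support, confidence and lift for the rule X -> Y."""
    N = len(baskets)
    frq_XY = sum(1 for b in baskets if X <= b and Y <= b)  # baskets with both X and Y
    frq_X = sum(1 for b in baskets if X <= b)
    frq_Y = sum(1 for b in baskets if Y <= b)
    support = frq_XY / N
    confidence = frq_XY / frq_X
    lift = confidence / (frq_Y / N)
    return support, confidence, lift

baskets = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
           {"beer"}, {"diapers"}, {"milk"}]
s, c, l = rule_metrics(baskets, {"beer"}, {"diapers"})
# support = 2/5 = 0.4, confidence = 2/3, lift = (2/3) / (3/5) > 1
```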

Use Case

In order to show the use of this technique, a small example applying Market Basket Analysis with the FP-Growth algorithm was developed in a Databricks Workspace. The dataset used is available at: https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=order_products__train.csv.

There are some Python packages that contain the FP-Growth algorithm, as well as approaches describing how to implement the Apriori algorithm from scratch. The FP-Growth algorithm can be found in the PySpark package, at the following link: https://spark.apache.org/docs/2.3.0/ml-frequent-pattern-mining.html.

To start, a Spark DataFrame was built in which each line represents a transaction of one item of a basket, "order_id" is the unique identifier of each basket and "product_name" is the name of a product in that basket. The code below shows how the two files were joined [7]. Figure 2 presents the resulting table.

from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

# header=True so the CSV column names (product_id, product_name, order_id) are read
productsDF = spark.read.csv("products.csv", header=True)
ordersDF = spark.read.csv("order_products__train.csv", header=True)

basketdata = ordersDF.join(productsDF, on=["product_id"], how="inner").select("order_id", "product_name")

Figure 2 – Part of the resulting table

The goal is to have a list of unique products for each basket. For that reason, all duplicate rows were removed from the table, which was then transformed so that one column holds the unique identifier of a basket and another holds the list of all unique products that make up that basket. Now, each row represents a basket. The code below shows how this was done [7]. Figure 3 shows the resulting table.

basketdata = basketdata.dropDuplicates(["order_id", "product_name"]).sort("order_id")
basketdata = basketdata.groupBy("order_id").agg(F.collect_list("product_name")).sort("order_id")

Figure 3 – Part of the resulting table, in which each line is a basket with the list of unique products that make up that basket

Then, minimum values for support and confidence were defined to make a first distinction between rules that are potentially relevant and rules that are of no interest. The minimum support value depends heavily on the available data, and there is no exact criterion for these choices [7]. In this case, the chosen minimums were 0.005 for support and 0.1 for confidence.

# choice of minimum support and confidence values and model fit
fpGrowth = FPGrowth(itemsCol="collect_list(product_name)", minSupport=0.005, minConfidence=0.1)
model = fpGrowth.fit(basketdata)

# display the most frequent itemsets
model.freqItemsets.display()
items = model.freqItemsets

# display the generated association rules
model.associationRules.display()
rules = model.associationRules

The 7 strongest rules, considering the lift metric, were the following:

Figure 5 – The 7 rules with the highest lift values

They show the already expected associations between fruits, or between garlic and onion. In lines 3 and 4 there is a curious association between Organic Cilantro and Limes (Organic Cilantro → Limes and Limes → Organic Cilantro).

It is possible to verify that the confidence metric does not show high values. However, the lift values are sufficiently greater than 1 to consider these rules strong. One factor that can drive a low confidence value (and also a low support value) is the wide variety of products in the dataset. In general, products are organized in a hierarchical structure with many levels; the greater the specificity, the more products exist at that level. To get around the wide variety of products, it is common to apply Market Basket Analysis at less specific levels, thus decreasing the variety.

To sum up

Market Basket Analysis is an Association Rule Mining technique widely used by companies interested in finding relationships between the products in their stores. Using transactional data, this type of analysis seeks to find the most frequent patterns, i.e., which products are most often purchased together. Applications range from stock and store management to supporting promotional campaigns and customer retention strategies.

Initially, the technique was performed with an algorithm called Apriori. Later, to mitigate the limitations that Apriori presents regarding time and computational resources, a new algorithm was developed: FP-Growth, which can be used through PySpark in a Big Data architecture.

[1] https://medium.com/@kroneii/visualizing-diaper-beer-relationship-in-graph-database-e2790819ad13

[2] https://medium.com/analytics-vidhya/association-rule-mining-7f06401f0601

[3] Berry, M. J. A. and Linoff, G. S. (2004). Data Mining Techniques For Marketing, Sales, and Customer Relationship Management (2nd ed.). Wiley Publishing, Inc.

[4] https://towardsdatascience.com/demystifying-customer-behavior-with-market-basket-analysis-87b5841def3a

[5] https://www.mygreatlearning.com/blog/understanding-fp-growth-algorithm/

[6] https://towardsdatascience.com/association-rules-2-aa9a77241654

[7] https://towardsdatascience.com/market-basket-analysis-using-pysparks-fpgrowth-55c37ebd95c0

[8] https://towardsdatascience.com/complete-guide-to-association-rules-2-2-c92072b56c84
