Association Rules explained using R

Association rule mining is a methodology for discovering interesting relations between variables in a dataset. The rules are nothing but if/then statements that uncover "links" between seemingly unrelated data features. An example: "If a customer buys a dozen eggs and bread, then they are likely to also purchase milk." This rule is written as {eggs, bread} => {milk}. Such information is widely used for retail basket analysis, as well as in other applications that look for associations between itemsets or between sets of attribute-value pairs.

Isn’t this interesting? Let’s find out how this can be achieved.

Association rules are found by mining the data for frequent if/then patterns and scoring them with three very important measures. Before we go into the implementation, let me explain these measures (a small worked example follows the list):

    • Support - an indication of how frequently the itemset appears in the dataset: supp(X) is the fraction of transactions that contain X.
    • Confidence - as the name suggests, how often the if/then statement has been found to be true: conf(X => Y) = supp(X ∪ Y) / supp(X).
    • Lift - lift(X => Y) captures the ratio of the observed support to that expected if X and Y were independent: lift(X => Y) = conf(X => Y) / supp(Y). A lift above 1 signals a positive association.
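
Here is a minimal sketch of the three measures, using hypothetical transactions (the items and numbers are made up for illustration, not taken from the weather data):

# Five hypothetical transactions for the {eggs, bread} => {milk} rule.
transactions = list(
  c("eggs", "bread", "milk"),
  c("eggs", "bread"),
  c("milk", "butter"),
  c("eggs", "bread", "milk"),
  c("bread", "milk")
)
n = length(transactions)
has = function(items) sapply(transactions, function(t) all(items %in% t))

support_xy = sum(has(c("eggs", "bread", "milk"))) / n  # supp(X ∪ Y) = 2/5 = 0.4
support_x  = sum(has(c("eggs", "bread"))) / n          # supp(X)     = 3/5 = 0.6
support_y  = sum(has("milk")) / n                      # supp(Y)     = 4/5 = 0.8
confidence = support_xy / support_x                    # 0.667
lift       = confidence / support_y                    # 0.833 (< 1: negative association here)
c(support = support_xy, confidence = confidence, lift = lift)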

 

We will use the famous Weather dataset in R to understand this technique.

 

Data munging activity (Data source)
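
How weather_sub is assembled is not shown; below is a plausible sketch, assuming the source is the weather dataset that ships with the rattle package (366 daily observations, which matches the str() output further down). The raw column names are assumptions based on that dataset:

library(arules)   # apriori(), inspect(), is.subset()
library(rattle)   # ships a 'weather' data frame of 366 daily observations

data(weather)

# Pick a handful of raw columns and rename them; the numeric ones are
# binned into the categorical levels seen below. WindGustSpeed is kept
# only to derive Windy and is dropped afterwards.
weather_sub = weather[, c("WindGustSpeed", "Humidity3pm",
                          "RainToday", "Cloud3pm", "Pressure3pm")]
names(weather_sub) = c("WindGustSpeed", "Humidity", "Rain", "Cloudy", "Pressure")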

weather_sub$Cloudy[weather_sub$Cloudy >= 5.7 & weather_sub$Cloudy < 9] <- "Overcast"
weather_sub$Windy = ifelse(as.numeric(weather_sub$WindGustSpeed) > 39, "True", "False")
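
Humidity and Pressure presumably get a similar binning (not shown above); a sketch with hypothetical cut points, plus the housekeeping apriori() needs:

# Hypothetical cut points, purely for illustration.
weather_sub$Humidity = cut(as.numeric(weather_sub$Humidity),
                           breaks = c(-Inf, 40, 70, Inf),
                           labels = c("Low", "Normal", "High"))
weather_sub$Pressure = ifelse(as.numeric(weather_sub$Pressure) > 1015, "High", "Low")

weather_sub$WindGustSpeed = NULL               # raw column no longer needed
weather_sub[] = lapply(weather_sub, factor)    # apriori() expects factor columns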

 

head(weather_sub)

  Windy Humidity Rain      Cloudy Pressure
  False      Low   No    Overcast      Low
  False   Normal  Yes Semi cloudy      Low
   True     High  Yes    Overcast      Low
   True   Normal  Yes    Overcast      Low
   True   Normal  Yes    Overcast     High
   True   Normal   No Semi cloudy     High

 

str(weather_sub)

'data.frame': 366 obs. of  5 variables:
 $ Windy   : Factor w/ 2 levels "False","True": 1 1 2 2 2 2 2 2 2 1 ...
 $ Humidity: Factor w/ 3 levels "High","Low","Normal": 2 3 1 3 3 3 3 3 3 2 ...
 $ Rain    : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 1 2 ...
 $ Cloudy  : Factor w/ 3 levels "Clear sky","Overcast",..: 2 3 2 2 2 3 2 2 2 1 ...
 $ Pressure: Factor w/ 2 levels "High","Low": 2 2 2 2 1 1 1 1 1 1 ...

 

Finding the association rules

# With no explicit parameters, apriori() uses its defaults: minimum
# support 0.1 and minimum confidence 0.8; the data frame of factors is
# coerced to transactions automatically.
decision_rules = apriori(weather_sub)

inspect(decision_rules)

 lhs                                 rhs               support   confidence lift    
 {}                               => {Rain=No}         0.8196721 0.8196721  1.000000
 {Humidity=High}                  => {Cloudy=Overcast} 0.1420765 0.9285714  2.360119
 {Humidity=Low}                   => {Rain=No}         0.3005464 0.9482759  1.156897
 {Cloudy=Clear sky}               => {Rain=No}         0.3579235 0.8733333  1.065467
 {Windy=False}                    => {Rain=No}         0.4480874 0.8677249  1.058624
 {Pressure=High}                  => {Rain=No}         0.5081967 0.8985507  1.096232
 {Humidity=Low,Cloudy=Clear sky}  => {Rain=No}         0.1775956 0.9420290  1.149275

 

Note the second rule above: {Humidity=High} => {Cloudy=Overcast} has a lift of 2.36, i.e. overcast skies are 2.36 times as likely when humidity is high as they are overall. To make the rules directly answer our question, we restrict the RHS (the response) to the Rain feature ("Yes"/"No"):

decision_rules_upd = apriori(weather_sub,
                             parameter  = list(minlen = 2, supp = 0.1, conf = 0.7),
                             appearance = list(rhs = c("Rain=No", "Rain=Yes"),
                                               default = "lhs"),
                             control    = list(verbose = F))

sorted_decision_rules = sort(decision_rules_upd, by = "lift")

inspect(sorted_decision_rules)

   lhs                                             rhs       support   confidence lift     
   {Windy=False,Humidity=Low}                   => {Rain=No} 0.1366120 0.9803922  1.1960784
   {Windy=False,Cloudy=Clear sky,Pressure=High} => {Rain=No} 0.1612022 0.9516129  1.1609677
   {Humidity=Low,Pressure=High}                 => {Rain=No} 0.1557377 0.9500000  1.1590000
   {Humidity=Low}                               => {Rain=No} 0.3005464 0.9482759  1.1568966
   {Humidity=Low,Pressure=Low}                  => {Rain=No} 0.1448087 0.9464286  1.1546429
   {Humidity=Low,Cloudy=Clear sky}              => {Rain=No} 0.1775956 0.9420290  1.1492754

 

Pruning the decision rules by eliminating redundant ones

subset_matrix = is.subset(sorted_decision_rules, sorted_decision_rules)

# Every rule is a subset of itself, and we only want to compare each rule
# against the higher-lift rules ranked above it, so blank out the diagonal
# and the lower triangle before counting.
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] = NA

# A rule is redundant if a more general rule (one whose items form a
# subset of its items) ranks above it in the lift-sorted list.
redundant = colSums(subset_matrix, na.rm = T) >= 1

pruned_decision_rules = sorted_decision_rules[!redundant]

inspect(pruned_decision_rules)

   lhs                                             rhs       support   confidence lift     
   {Windy=False,Humidity=Low}                   => {Rain=No} 0.1366120 0.9803922  1.1960784
   {Windy=False,Cloudy=Clear sky,Pressure=High} => {Rain=No} 0.1612022 0.9516129  1.1609677
   {Humidity=Low,Pressure=High}                 => {Rain=No} 0.1557377 0.9500000  1.1590000
   {Humidity=Low}                               => {Rain=No} 0.3005464 0.9482759  1.1568966
   {Windy=False,Humidity=Normal,Pressure=High}  => {Rain=No} 0.2103825 0.9390244  1.1456098
   {Windy=False,Cloudy=Clear sky}               => {Rain=No} 0.2076503 0.9382716  1.1446914
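
Recent versions of arules package a similar idea into is.redundant(), which flags a rule as redundant when a more general rule with at least the same confidence exists, so the pruning collapses to one line:

pruned_decision_rules = sorted_decision_rules[!is.redundant(sorted_decision_rules)]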

 

In this way we can steer the association rule algorithm and use it for classification on the response variable (Rain = Yes or No); the arulesCBA package builds on exactly this idea (classification based on association rules).

 

Visualizing association rules
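
The arulesViz companion package is one way to plot the rules; a minimal sketch:

library(arulesViz)   # install.packages("arulesViz") if needed

# Scatter plot (the default): one point per rule, placed by support and
# confidence and shaded by lift.
plot(pruned_decision_rules)

# Graph view: items and rules drawn as a network; readable for small rule sets.
plot(pruned_decision_rules, method = "graph")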

A well-known story from industry: Walmart reportedly used association rule learning to uncover a very interesting pattern. They identified that men who go to the store to buy diapers also tend to buy beer at the same time.

Happy learning!
