Association Rules explained using R

Association rule mining is a methodology for discovering interesting relations between variables in a dataset. The rules are nothing but if/then statements that uncover "links" between seemingly unrelated data features. An example: "If a customer buys a dozen eggs and bread, then they are likely to also purchase milk." This rule is written as {eggs, bread} => {milk}. Such information is widely used for retail basket analysis, as well as in other applications that look for associations between itemsets or between sets of attribute-value pairs.

Isn’t this interesting? Let’s find out how this can be achieved.

Association rules are found by mining the data for frequent if/then patterns and scoring them with three very important measures. Before we go into the implementation, let me explain these measures (a small worked example follows the list):

    • Support - an indication of how frequently the itemset appears in the dataset: supp(X) is the fraction of transactions that contain X.
    • Confidence - as the name suggests, how often the if/then statement has been found to be true: conf(X => Y) = supp(X ∪ Y) / supp(X).
    • Lift - lift(X => Y) captures the ratio of the observed support to that expected if X and Y were independent: lift(X => Y) = conf(X => Y) / supp(Y). A lift above 1 signals a positive association.
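
Here is a minimal sketch of the three measures, using hypothetical transactions (the items and numbers are made up for illustration, not taken from the weather data):

# Five hypothetical transactions for the {eggs, bread} => {milk} rule.
transactions = list(
  c("eggs", "bread", "milk"),
  c("eggs", "bread"),
  c("milk", "butter"),
  c("eggs", "bread", "milk"),
  c("bread", "milk")
)
n = length(transactions)
has = function(items) sapply(transactions, function(t) all(items %in% t))

support_xy = sum(has(c("eggs", "bread", "milk"))) / n  # supp(X ∪ Y) = 2/5 = 0.4
support_x  = sum(has(c("eggs", "bread"))) / n          # supp(X)     = 3/5 = 0.6
support_y  = sum(has("milk")) / n                      # supp(Y)     = 4/5 = 0.8
confidence = support_xy / support_x                    # 0.667
lift       = confidence / support_y                    # 0.833 (< 1: negative association here)
c(support = support_xy, confidence = confidence, lift = lift)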

 

We will use the famous Weather dataset in R to understand this technique.

 

Data munging activity (Data source)
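
How weather_sub is assembled is not shown; below is a plausible sketch, assuming the source is the weather dataset that ships with the rattle package (366 daily observations, which matches the str() output further down). The raw column names are assumptions based on that dataset:

library(arules)   # apriori(), inspect(), is.subset()
library(rattle)   # ships a 'weather' data frame of 366 daily observations

data(weather)

# Pick a handful of raw columns and rename them; the numeric ones are
# binned into the categorical levels seen below. WindGustSpeed is kept
# only to derive Windy and is dropped afterwards.
weather_sub = weather[, c("WindGustSpeed", "Humidity3pm",
                          "RainToday", "Cloud3pm", "Pressure3pm")]
names(weather_sub) = c("WindGustSpeed", "Humidity", "Rain", "Cloudy", "Pressure")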

weather_sub$Cloudy[weather_sub$Cloudy >= 5.7 & weather_sub$Cloudy < 9] <- "Overcast"
weather_sub$Windy = ifelse(as.numeric(weather_sub$WindGustSpeed) > 39, "True", "False")
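
Humidity and Pressure presumably get a similar binning (not shown above); a sketch with hypothetical cut points, plus the housekeeping apriori() needs:

# Hypothetical cut points, purely for illustration.
weather_sub$Humidity = cut(as.numeric(weather_sub$Humidity),
                           breaks = c(-Inf, 40, 70, Inf),
                           labels = c("Low", "Normal", "High"))
weather_sub$Pressure = ifelse(as.numeric(weather_sub$Pressure) > 1015, "High", "Low")

weather_sub$WindGustSpeed = NULL               # raw column no longer needed
weather_sub[] = lapply(weather_sub, factor)    # apriori() expects factor columns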

 

head(weather_sub)

  Windy Humidity Rain      Cloudy Pressure
  False      Low   No    Overcast      Low
  False   Normal  Yes Semi cloudy      Low
   True     High  Yes    Overcast      Low
   True   Normal  Yes    Overcast      Low
   True   Normal  Yes    Overcast     High
   True   Normal   No Semi cloudy     High

 

str(weather_sub)

'data.frame': 366 obs. of  5 variables:
 $ Windy   : Factor w/ 2 levels "False","True": 1 1 2 2 2 2 2 2 2 1 ...
 $ Humidity: Factor w/ 3 levels "High","Low","Normal": 2 3 1 3 3 3 3 3 3 2 ...
 $ Rain    : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 1 2 ...
 $ Cloudy  : Factor w/ 3 levels "Clear sky","Overcast",..: 2 3 2 2 2 3 2 2 2 1 ...
 $ Pressure: Factor w/ 2 levels "High","Low": 2 2 2 2 1 1 1 1 1 1 ...

 

Finding the association rules

# With no explicit parameters, apriori() uses its defaults: minimum
# support 0.1 and minimum confidence 0.8; the data frame of factors is
# coerced to transactions automatically.
decision_rules = apriori(weather_sub)

inspect(decision_rules)

 lhs                                 rhs               support   confidence lift    
 {}                               => {Rain=No}         0.8196721 0.8196721  1.000000
 {Humidity=High}                  => {Cloudy=Overcast} 0.1420765 0.9285714  2.360119
 {Humidity=Low}                   => {Rain=No}         0.3005464 0.9482759  1.156897
 {Cloudy=Clear sky}               => {Rain=No}         0.3579235 0.8733333  1.065467
 {Windy=False}                    => {Rain=No}         0.4480874 0.8677249  1.058624
 {Pressure=High}                  => {Rain=No}         0.5081967 0.8985507  1.096232
 {Humidity=Low,Cloudy=Clear sky}  => {Rain=No}         0.1775956 0.9420290  1.149275

 

Note the second rule above: {Humidity=High} => {Cloudy=Overcast} has a lift of 2.36, i.e. overcast skies are 2.36 times as likely when humidity is high as they are overall. To make the rules directly answer our question, we restrict the RHS (the response) to the Rain feature ("Yes"/"No"):

decision_rules_upd = apriori(weather_sub,
                             parameter  = list(minlen = 2, supp = 0.1, conf = 0.7),
                             appearance = list(rhs = c("Rain=No", "Rain=Yes"),
                                               default = "lhs"),
                             control    = list(verbose = F))

sorted_decision_rules = sort(decision_rules_upd, by = "lift")

inspect(sorted_decision_rules)

   lhs                                             rhs       support   confidence lift     
   {Windy=False,Humidity=Low}                   => {Rain=No} 0.1366120 0.9803922  1.1960784
   {Windy=False,Cloudy=Clear sky,Pressure=High} => {Rain=No} 0.1612022 0.9516129  1.1609677
   {Humidity=Low,Pressure=High}                 => {Rain=No} 0.1557377 0.9500000  1.1590000
   {Humidity=Low}                               => {Rain=No} 0.3005464 0.9482759  1.1568966
   {Humidity=Low,Pressure=Low}                  => {Rain=No} 0.1448087 0.9464286  1.1546429
   {Humidity=Low,Cloudy=Clear sky}              => {Rain=No} 0.1775956 0.9420290  1.1492754

 

Pruning the decision rules by eliminating redundant ones

subset_matrix = is.subset(sorted_decision_rules, sorted_decision_rules)

# Every rule is a subset of itself, and we only want to compare each rule
# against the higher-lift rules ranked above it, so blank out the diagonal
# and the lower triangle before counting.
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] = NA

# A rule is redundant if a more general rule (one whose items form a
# subset of its items) ranks above it in the lift-sorted list.
redundant = colSums(subset_matrix, na.rm = T) >= 1

pruned_decision_rules = sorted_decision_rules[!redundant]

inspect(pruned_decision_rules)

   lhs                                             rhs       support   confidence lift     
   {Windy=False,Humidity=Low}                   => {Rain=No} 0.1366120 0.9803922  1.1960784
   {Windy=False,Cloudy=Clear sky,Pressure=High} => {Rain=No} 0.1612022 0.9516129  1.1609677
   {Humidity=Low,Pressure=High}                 => {Rain=No} 0.1557377 0.9500000  1.1590000
   {Humidity=Low}                               => {Rain=No} 0.3005464 0.9482759  1.1568966
   {Windy=False,Humidity=Normal,Pressure=High}  => {Rain=No} 0.2103825 0.9390244  1.1456098
   {Windy=False,Cloudy=Clear sky}               => {Rain=No} 0.2076503 0.9382716  1.1446914
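
Recent versions of arules package a similar idea into is.redundant(), which flags a rule as redundant when a more general rule with at least the same confidence exists, so the pruning collapses to one line:

pruned_decision_rules = sorted_decision_rules[!is.redundant(sorted_decision_rules)]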

 

In this way we can steer the association rule algorithm and use it for classification on the response variable (Rain = Yes or No); the arulesCBA package builds on exactly this idea (classification based on association rules).

 

Visualizing association rules
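
The arulesViz companion package is one way to plot the rules; a minimal sketch:

library(arulesViz)   # install.packages("arulesViz") if needed

# Scatter plot (the default): one point per rule, placed by support and
# confidence and shaded by lift.
plot(pruned_decision_rules)

# Graph view: items and rules drawn as a network; readable for small rule sets.
plot(pruned_decision_rules, method = "graph")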

A well-known story from industry: Walmart reportedly used association rule learning to uncover a very interesting pattern. They identified that men who go to the store to buy diapers also tend to buy beer at the same time.

Happy learning!
