Association Rules explained using R
Association rule mining is a method for discovering interesting relations between variables in a dataset. The rules are if/then statements that uncover links between seemingly unrelated data features. For example: “If a customer buys a dozen eggs and bread, then he is likely to also purchase milk.” This rule is written as {eggs, bread} => {milk}. Such information is widely used for retail basket analysis, as well as in other applications, to find associations between item-sets and between sets of attribute-value pairs.
Isn’t this interesting? Let’s find out how this can be achieved.
Association rules are created by analyzing the data for frequent if/then patterns and scoring those patterns with three important measures. Before we go into the implementation, let me explain these measures:
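- Support - An indication of how frequently an itemset appears in the dataset, i.e. the fraction of records that contain it.
- Confidence - As the name suggests, this is how often the if/then statement has been found to be true: of the records that contain X, the fraction that also contain Y.
- Lift - Lift (X -> Y) is a computed measure that captures the ratio of the observed support of X and Y together to the support expected if X and Y were independent. A lift above 1 means X and Y occur together more often than independence would predict.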
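To make the three measures concrete, here is a minimal base R sketch that computes them by hand for the rule {eggs, bread} => {milk}. The five baskets are made up for illustration and are not part of the Weather data used later.

```r
# Toy basket data (hypothetical): each row is a transaction, each column an item.
baskets <- data.frame(
  eggs  = c(TRUE,  TRUE,  TRUE,  FALSE, TRUE),
  bread = c(TRUE,  TRUE,  FALSE, TRUE,  TRUE),
  milk  = c(TRUE,  TRUE,  FALSE, TRUE,  FALSE)
)

lhs <- baskets$eggs & baskets$bread   # transactions containing X = {eggs, bread}
rhs <- baskets$milk                   # transactions containing Y = {milk}

support    <- mean(lhs & rhs)             # P(X and Y)
confidence <- sum(lhs & rhs) / sum(lhs)   # P(Y | X)
lift       <- confidence / mean(rhs)      # P(Y | X) / P(Y)

print(c(support = support, confidence = confidence, lift = lift))
```

In this toy data the rule has support 0.4, confidence 2/3 and a lift of about 1.11, so milk shows up slightly more often in egg-and-bread baskets than in baskets overall.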
We will use the famous Weather dataset in R to understand this technique.
Data munging activity (Data source)
weather_sub$Cloudy[weather_sub$Cloudy >= 5.7 & weather_sub$Cloudy <= 9] <- "Overcast"
weather_sub$Windy = ifelse(as.numeric(weather_sub$WindGustSpeed) > 39,"True","False")
head(weather_sub)
  Windy  Humidity  Rain  Cloudy       Pressure
  False  Low       No    Overcast     Low
  False  Normal    Yes   Semi cloudy  Low
  True   High      Yes   Overcast     Low
  True   Normal    Yes   Overcast     Low
  True   Normal    Yes   Overcast     High
  True   Normal    No    Semi cloudy  High
str(weather_sub)
'data.frame': 366 obs. of 5 variables:
$ Windy : Factor w/ 2 levels "False","True": 1 1 2 2 2 2 2 2 2 1 ...
$ Humidity: Factor w/ 3 levels "High","Low","Normal": 2 3 1 3 3 3 3 3 3 2 ...
$ Rain : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 1 2 ...
$ Cloudy : Factor w/ 3 levels "Clear sky","Overcast",..: 2 3 2 2 2 3 2 2 2 1 ...
$ Pressure: Factor w/ 2 levels "High","Low": 2 2 2 2 1 1 1 1 1 1 ...
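The apriori algorithm works on categorical items, so every numeric weather reading has to be binned into labels first, as the munging lines above do for Cloudy and Windy. A minimal sketch of the same idea with base R's cut() follows. The readings are invented, and only the 5.7 cut-off comes from the article; the 2.5 threshold is an assumption for illustration.

```r
# Hypothetical cloud-cover readings; only the 5.7 cut-off comes from the article.
cloud_cover <- c(0.5, 3.2, 5.7, 7.9, 2.0, 8.0)

cloudy <- cut(
  cloud_cover,
  breaks = c(-Inf, 2.5, 5.7, Inf),   # 2.5 is an assumed lower threshold
  labels = c("Clear sky", "Semi cloudy", "Overcast"),
  right  = FALSE                      # intervals are [low, high), so 5.7 falls in "Overcast"
)

print(table(cloudy))
```

Using cut() returns a factor directly, which is the column type that str(weather_sub) reports for each feature.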
Finding the association rules
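library(arules)  # provides apriori() and inspect()
decision_rules = apriori(weather_sub)  # defaults: minimum support 0.1, minimum confidence 0.8
inspect(decision_rules)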
lhs                                rhs                support    confidence  lift
{}                              => {Rain=No}          0.8196721  0.8196721   1.000000
{Humidity=High}                 => {Cloudy=Overcast}  0.1420765  0.9285714   2.360119
{Humidity=Low}                  => {Rain=No}          0.3005464  0.9482759   1.156897
{Cloudy=Clear sky}              => {Rain=No}          0.3579235  0.8733333   1.065467
{Windy=False}                   => {Rain=No}          0.4480874  0.8677249   1.058624
{Pressure=High}                 => {Rain=No}          0.5081967  0.8985507   1.096232
{Humidity=Low,Cloudy=Clear sky} => {Rain=No}          0.1775956  0.9420290   1.149275
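The columns of this output are tied together by the definitions given earlier, which makes for a quick sanity check. The sketch below recomputes the lift of {Humidity=Low} => {Rain=No} from numbers copied out of the output; the variable names are just illustrative.

```r
# Values copied from the inspect() output above.
support_rule <- 0.3005464   # support of {Humidity=Low} => {Rain=No}
confidence   <- 0.9482759   # confidence of the same rule
support_rhs  <- 0.8196721   # support of {Rain=No}, read off the rule with an empty LHS

# lift = confidence(X -> Y) / support(Y)
lift <- confidence / support_rhs
print(round(lift, 6))

# Implied support of the LHS alone: support(X and Y) / confidence(X -> Y)
support_lhs <- support_rule / confidence
print(round(support_lhs * 366))   # approximate number of Humidity=Low days out of 366
```

The recomputed lift agrees with the reported 1.156897, and the rule with the empty LHS simply states the base rate: about 82% of the 366 days had no rain.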
To make the rules easier to interpret, we restrict the RHS (the response) to the Rain feature, i.e. “Rain=Yes” or “Rain=No”.
decision_rules_upd = apriori(weather_sub, parameter = list(minlen=2, supp=0.1, conf=0.7), appearance = list(rhs=c("Rain=No", "Rain=Yes"), default="lhs"), control = list(verbose=F))
sorted_decision_rules = sort(decision_rules_upd, by = "lift")
inspect(sorted_decision_rules)
lhs                                             rhs        support    confidence  lift
{Windy=False,Humidity=Low}                   => {Rain=No}  0.1366120  0.9803922   1.1960784
{Windy=False,Cloudy=Clear sky,Pressure=High} => {Rain=No}  0.1612022  0.9516129   1.1609677
{Humidity=Low,Pressure=High}                 => {Rain=No}  0.1557377  0.9500000   1.1590000
{Humidity=Low}                               => {Rain=No}  0.3005464  0.9482759   1.1568966
{Humidity=Low,Pressure=Low}                  => {Rain=No}  0.1448087  0.9464286   1.1546429
{Humidity=Low,Cloudy=Clear sky}              => {Rain=No}  0.1775956  0.9420290   1.1492754
Pruning the decision rules by eliminating redundant ones
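subset_matrix = as.matrix(is.subset(sorted_decision_rules, sorted_decision_rules))
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] = NA  # ignore self-matches and later-vs-earlier pairs
redundant = colSums(subset_matrix, na.rm = TRUE) >= 1  # a higher-lift rule is already a subset of this one
pruned_decision_rules = sorted_decision_rules[!redundant]
inspect(pruned_decision_rules)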
lhs                                             rhs        support    confidence  lift
{Windy=False,Humidity=Low}                   => {Rain=No}  0.1366120  0.9803922   1.1960784
{Windy=False,Cloudy=Clear sky,Pressure=High} => {Rain=No}  0.1612022  0.9516129   1.1609677
{Humidity=Low,Pressure=High}                 => {Rain=No}  0.1557377  0.9500000   1.1590000
{Humidity=Low}                               => {Rain=No}  0.3005464  0.9482759   1.1568966
{Windy=False,Humidity=Normal,Pressure=High}  => {Rain=No}  0.2103825  0.9390244   1.1456098
{Windy=False,Cloudy=Clear sky}               => {Rain=No}  0.2076503  0.9382716   1.1446914
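Pruning by subset matrix hinges on one detail: every rule is trivially a subset of itself, so the diagonal has to be masked out before counting, otherwise every rule looks redundant. Here is a minimal base R sketch of that logic on a hand-built matrix for three hypothetical rules; it does not call arules at all.

```r
# Hypothetical rules, already sorted by lift:
#   r1: {A} => {Y},  r2: {A, B} => {Y},  r3: {C} => {Y}
# m[i, j] is TRUE when rule i is a subset of rule j (every rule is a subset of itself).
m <- matrix(
  c(TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  FALSE,
    FALSE, FALSE, TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("r1", "r2", "r3"), c("r1", "r2", "r3"))
)

naive <- colSums(m) >= 1               # counts the diagonal too, so everything is flagged

m[lower.tri(m, diag = TRUE)] <- NA     # keep only "earlier rule is a subset of later rule"
redundant <- colSums(m, na.rm = TRUE) >= 1

print(naive)
print(redundant)
```

Only r2 is flagged after masking, because the more general rule r1 already ranks above it. The arules package also ships an is.redundant() function that may be worth a look as an alternative, although it uses its own definition of redundancy.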
This way we can adapt the association rule algorithm and use it for classification on the response variable (Rain = Yes or No).
Visualizing association rules
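One common way to plot rules in R is the arulesViz package. The sketch below is an assumption about tooling rather than the article's original plotting code, and it mines the Groceries transactions bundled with arules so that it runs on its own; the support and confidence thresholds are illustrative as well.

```r
library(arules)      # apriori(), sort(), the Groceries data
library(arulesViz)   # plot() methods for rules

data("Groceries")    # example transactions bundled with arules (not the article's Weather data)

rules <- apriori(
  Groceries,
  parameter = list(supp = 0.001, conf = 0.8),   # illustrative thresholds
  control   = list(verbose = FALSE)
)
top_rules <- head(sort(rules, by = "lift"), 10)

plot(rules)                         # scatter plot: support vs. confidence, shaded by lift
plot(top_rules, method = "graph")   # items and rules drawn as a network
```

The same two plot() calls should work on pruned_decision_rules from the Weather example, since plot() accepts any rules object.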
An often-told story from the industry - Walmart is said to have used association rule learning to uncover a surprising pattern: men who go to the store to buy diapers also tend to buy beer on the same trip.
Happy learning!