Segmenting with Mixed Type Data - A Case Study Using K-Medoids on Subscription Data

With the new year, I started to look for new employment opportunities and even managed to land a handful of final-stage interviews before it all ground to a halt following the coronavirus pandemic. Invariably, as part of the selection process I was asked to analyse a set of data and compile a number of data-driven recommendations to present in my final meeting.

In this post I retrace the steps I took for one of the take-home analyses I was tasked with and revisit clustering, one of my favourite analytic methods. Only this time the setup is a lot closer to a real-world situation in that the data I had to analyse came with a mix of categorical and numerical features. Simply put, this could not be tackled with a bog-standard K-means algorithm, as that is based on pairwise Euclidean distances and has no direct application to categorical data.


The data

library(tidyverse)
library(readxl)
library(skimr)
library(knitr) 
library(janitor)
library(cluster)
library(Rtsne)

The data represents all online acquisitions in February 2018 and their subscription status 3 months later (9th June 2018) for a fictional news aggregator subscription business. It contains a number of parameters describing each account, like account creation date, campaign attributed to the acquisition, payment method and length of product trial. A full description of the variables can be found in the Appendix.

I got hold of this dataset in the course of a recruitment selection process, as I was asked to carry out an analysis and present results and recommendations that could help improve their product sign-ups in my final meeting. Although fictitious in nature, I thought it best to further anonymise the dataset by changing names and values of most of the variables, as well as removing several features that were of no use to the analysis.

The data was raw and required some cleansing and manipulation before it could be used for analysis. You can find the anonymised dataset on my GitHub profile, along with the scripts I used to cleanse the data and perform the analysis.


Data Exploration

This is a vital part of any data analysis study, often overlooked in favour of the fancier and “sexier” modelling/playing with the algos. In fact, I find it one of the most creative phases of any project, where you “join the dots” among the different variables and start formulating interesting hypotheses to test later on.

data_clean <- readRDS("../00_data/data_clean.rds")

Here I’m loading the cleaned data before I explore a selection of the features, run through a number of considerations to set up the analysis and show you some of the interesting insights I gathered. You can find the post covering data cleansing and formatting on my webpage: Segmenting with Mixed Type Data - Initial data inspection and manipulation.
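
Before zooming in on individual features, it’s worth getting a quick structural overview. Here’s a minimal sketch using glimpse and skim (from the tidyverse and skimr libraries loaded above):

# quick structural overview of the cleaned data
data_clean %>% glimpse()   # compact view of columns and their types
data_clean %>% skim()      # per-variable summaries and missing-value counts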

In this project I’m also testing quite a few of the adorn_* and tabyl functions from the janitor library, a family of functions created to help expedite initial data exploration and cleaning, and also very useful for creating and formatting summary tables.


Country of Residence

Let’s start with the geographic distribution of subscribers, the overwhelming majority of whom are UK based.

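Here’s a minimal sketch of how to tabulate it. I’m assuming the cleaned data holds the country of the account holder in a country column, named as per the Appendix:

data_clean %>% 
   # assumes a `country` column, as listed in the Appendix
   tabyl(country) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   arrange(desc(n)) %>% 
   kable(align = 'c')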

Cancellations

The dynamics of cancellations are definitely something that warrants further investigation (outside the scope of this study). In fact, failed payments represent roughly 10% of total subscriptions and a staggering 38% when looking at cancellations alone.

data_clean %>% 
   tabyl(canc_reason, status) %>%
   adorn_totals(c("row", "col")) %>% 
   adorn_percentages("col") %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   adorn_ns() %>%
   adorn_title("combined") %>%
   kable(align = 'c')

canc_reason/status	    Active	     Cancelled	        Total
-	                100.0% (3510)	 18.1% (224)	 78.6% (3734)
Competitor	          0.0% (0)	      1.5% (18)	      0.4% (18)
Editorial	          0.0% (0)	      3.6% (44)	      0.9% (44)
Failed Payment	      0.0% (0)	     37.9% (469)	  9.9% (469)
Lack of time	      0.0% (0)	     19.5% (242)	  5.1% (242)
Other	              0.0% (0)	      6.5% (81)	      1.7% (81)
Price	              0.0% (0)	      8.2% (101)	  2.1% (101)
UX Related	          0.0% (0)	      4.8% (60)	      1.3% (60)
Total	            100.0% (3510)	100.0% (1239)	100.0% (4749)

RECOMMENDATION 1: Review causes of high failed payment rate as it could result in a 10% boost in overall subscriptions


Payment Methods

Credit Card is the preferred payment method for over three fifths (61%) of subscribers.

data_clean %>% 
   tabyl(payment_method) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   arrange(desc(n)) %>%  
   kable(align = 'c')

payment_method    n    percent 
Credit Card     2905    61.2% 
PayPal          1230    25.9% 
Direct Debit     614    12.9% 


Campaigns

Three acquisition campaigns drove the bulk of subscriptions in February 2018.


The analysis is going to focus on these top 3 campaigns, which account for over 73% of total acquisitions. Adding additional campaigns to the analysis would likely see them all lumped into one “hybrid” cluster and result in less focused insight.

data_clean %>% 
   tabyl(campaign_code) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   arrange(desc(n)) %>% 
   top_n(10, wt = n) %>% 
   kable(align = 'c')

campaign_code    n    percent 
    57572      1546    32.6% 
    56472      1061    22.3% 
    56972       885    18.6% 
    51372       178     3.7% 
    51472       154     3.2% 
    57272       141     3.0% 
    47672       104     2.2% 
    53872       104     2.2% 
    48272        95     2.0% 
    52972        92     1.9%


Trial Length

Potential subscribers have a number of trial options available, but the analysis is going to take into account only customers that took the 1-month trial before signing up.

Given that all acquisitions refer to a particular month and their current status 3 MONTHS LATER, this will ensure a clear-cut window for the analysis AND cover the vast majority of subscriptions (nearly 90%).

data_clean %>% 
   tabyl(trial_length) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   select(-valid_percent) %>% 
   kable(align = 'c')

trial_length    n    percent 
    1M        4233    89.1% 
    2M          1      0.0% 
    3M         95      2.0% 
    6M          7      0.1% 
    NA        413      8.7%


Product Pricing

Zooming in on the top 3 campaigns for clarity, we can see the two price points of £6.99 for the Standard product and £15 for the Premium one.

Surprisingly, campaign 56972 is attributed to the acquisition of both products. It is good practice to keep acquisition campaigns more closely aligned to a single product as it would ensure better accountability and understanding of acquisition dynamics.

data_clean %>% 
   filter(campaign_code %in% c('57572', '56472', '56972')) %>% 
   filter(trial_length == '1M') %>% 
   group_by(product_group, campaign_code,
            contract_monthly_price) %>% 
   count() %>% 
   ungroup() %>%
   arrange(campaign_code) %>% 
   kable(align = 'c')

product_group  campaign_code  contract_monthly_price   n 
  Standard        56472              6.99            1061 
   Premium        56972             15.00             264 
  Standard        56972              6.99             587 
  Standard        57572              6.99            1521


Subscription Rate

For the majority of the top 10 campaigns, sign-up after trial hovers in the 74-76% range. This offers a good benchmark for the general subscription levels we can expect after trial.

data_clean %>% 
   # selecting top 10 campaigns by number of subscriptions
   filter(campaign_code %in% c(
      '57572', '56472', '56972', '51372', '57272',
      '51472', '47672', '53872', '52972', '54072'
                               )) %>% 
   filter(trial_length == '1M') %>%
   group_by(campaign_code, status) %>% 
   count() %>% 
   ungroup() %>% 
   # lag to align past (lagging) obs to present obs
   mutate(lag_1 = lag(n, n = 1)) %>% 
   # sort out NAs
   tidyr::fill(lag_1, .direction = "up") %>% 
   # calculate active to total (subscription) rate
   mutate(act_to_tot = 
            (lag_1 / (n + lag_1)))  %>% 
   filter(status == 'Cancelled') %>%
   select(campaign_code, act_to_tot) %>% 
   arrange(desc(campaign_code)) %>% 
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   kable(align = 'c')

campaign_code   act_to_tot 
    57572         75.5% 
    57272         75.6% 
    56972         72.5% 
    56472         75.8% 
    54072         86.4% 
    53872         71.3% 
    52972         73.9% 
    51472         48.1% 
    51372         68.8% 
    47672         74.8%

The low subscription rate for campaign 51472 may be due to the higher contract price after trial (contract_monthly_price) of £9.65 compared to £6.99 seen for other Standard product contracts.

data_clean %>% 
   filter(campaign_code == '51472') %>% 
   filter(trial_length == '1M') %>% 
   group_by(contract_monthly_price, product_group, campaign_code) %>% 
   count() %>% 
   ungroup() %>% 
   kable(align = 'c')

contract_monthly_price  product_group  campaign_code   n 
         9.65             Standard         51472      128

RECOMMENDATION 2: Investigate different price points for the Standard product, as a lower price point could boost take-up


Analysis Structure

In this analysis I want to understand what factors make a customer subscribe after a trial. To do so, I need to identify the approach best suited to the type and cut of data I have at my disposal, and to define a measure of success.

  • APPROACH: The data is a snapshot in time and as such does not lend itself to any time-based type of analysis. In such cases, one of the best suited approaches to extract insight is clustering, which is especially good when you have no prior domain knowledge to guide you. However, the fact that data is a mix of categorical and numerical features requires a slightly different approach, which I discuss in the next section.
  • MEASURE: The perfect candidate to measure success is Subscription Rate, defined as Active / Total Subscribers. The focus on 1-month trial subscribers also ensures a clear-cut window for analysis, as their status reflects whether they signed up 3 MONTHS AFTERWARDS.


Methodology

When I started to research cluster analysis with mixed categorical and numerical data, I came across an excellent post on Towards Data Science entitled Clustering on mixed type data, from which I borrowed the core analysis code and adjusted it to my needs. In his article, Thomas Filaire shows how to use the PAM clustering algorithm (Partitioning Around Medoids) to perform the clustering and the silhouette coefficient to select the optimal number of clusters.

K-medoids, also known as PAM, is a clustering algorithm similar to the more popular K-means. The two work in similar ways in that they create groups in your data and operate on distances (often referred to as dissimilarities), ensuring that elements in each group are very similar to one another by minimising the distance within each cluster. However, K-medoids has the advantage of working with arbitrary dissimilarity measures, which makes it well suited to analysing mixed-type data that includes both numerical and categorical features.

I’m going to calculate the dissimilarities between observations with the help of the daisy function from the cluster library, which lets you choose between a number of metrics (“euclidean”, “manhattan” and “gower”) to run the calculations. Gower’s distance (1971) is of particular interest to us as it computes dissimilarities on a [0, 1] range regardless of whether the input is numerical or categorical, hence making them comparable.
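
To get a feel for the mechanics, here’s a toy sketch (not part of the analysis) of Gower dissimilarities on a tiny mixed-type data frame:

# toy mixed-type data: one numerical and one categorical feature
toy <- tibble(
   spend   = c(10.5, 12.0, 95.0),
   channel = factor(c("web", "web", "app"))
)

# numerical features are range-normalised, categorical ones contribute
# 0 (match) or 1 (mismatch); daisy averages the per-feature scores
daisy(toy, metric = "gower")

Rows 1 and 2 share a channel and have similar spend, so their dissimilarity sits close to 0; row 3 differs on both features and scores close to 1.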

Clustering

I’m going to segment the subscription data by the following 5 dimensions:

  • Status: Active / Cancelled
  • Product Group: Standard / Premium
  • Top 3 Campaigns by Subscription Numbers: 56472 / 56972 / 57572
  • Payment Method: Credit Card / Direct Debit / PayPal
  • Contract Monthly Price: £6.99 / £15

I’ll start with selecting the cut of the data I need for the clustering. I’m keeping account_id for my reference but will not pass it to the algorithm.

clust_data <-
   data_clean %>% 
      # filtering by the top 3 campaigns
   filter(campaign_code %in% c('57572', '56472', '56972')) %>%
      # selecting 1M trial subscriptions
   filter(trial_length == '1M') %>% 
      # select features to cluster by
   select(account_id, status, product_group, campaign_code, 
          contract_monthly_price, payment_method
          ) %>% 
      # setting all features as factors
   mutate(
      account_id             = account_id %>% as_factor(),
      status                 = status %>% as_factor(),
      product_group          = product_group %>% as_factor(),
      campaign_code          = campaign_code %>% as_factor(),
      contract_monthly_price = contract_monthly_price %>% as_factor(),
      payment_method         = payment_method %>% as_factor()
   )

Then, I compute the Gower distance with the daisy function from the cluster package. Gower’s distance would be selected automatically if some features in the data were not numeric, but I prefer to spell it out anyway.

gower_dist <- 
  clust_data %>% 
    # de-select account_id
    select(2:6) %>%   
    daisy(metric = "gower")

And that’s that! You’re set to go!
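
Before moving on, a quick sanity check I like to run (a minimal sketch) is to pull out and eyeball the most similar and most dissimilar pairs of observations in the dissimilarity matrix:

gower_mat <- as.matrix(gower_dist)

# most similar pair of observations (excluding the zero diagonal)
clust_data[which(gower_mat == min(gower_mat[gower_mat > 0]),
                 arr.ind = TRUE)[1, ], ]

# most dissimilar pair of observations
clust_data[which(gower_mat == max(gower_mat), arr.ind = TRUE)[1, ], ]

If the most similar pair agrees on virtually every feature and the most dissimilar pair disagrees on most of them, the distance is behaving as intended.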

Assessing the Clusters

There are a number of methods to establish the optimal number of clusters, but in this study I’m using the silhouette coefficient, which contrasts each element’s average distance to the other elements in its own cluster, a(i), with its average distance to the elements of the nearest other cluster, b(i): s(i) = (b(i) - a(i)) / max(a(i), b(i)). In other words, each additional cluster should add “compactness” to the individual segments, bringing their elements closer together, whilst “moving” the clusters further apart from one another.

First, I’m calculating the Partitioning Around Medoids (a.k.a. PAM) using the pam function from the cluster library. The one number to keep an eye on is the average silhouette width (or avg.width), which I’m storing away in the sil_width vector.

sil_width <- c(NA)

for (i in 2:8) {  
  pam_fit <- pam(gower_dist, diss = TRUE, k = i)  
  sil_width[i] <- pam_fit$silinfo$avg.width  
}

In a business context, we want the number of clusters to be both meaningful and easy to handle (i.e. 2 to 8), and the 5-cluster configuration seems a good starting point to investigate.

sil_width %>% 
   as_tibble() %>% 
   rowid_to_column() %>% 
   filter(rowid %in% c(2:8)) %>% 
   ggplot(aes(rowid, value)) +
   geom_line(colour  = 'black', size = 0.7) +
   geom_point(colour = 'black', size = 1.3) +
   theme_minimal() +
   labs(title = 'Silhouette Widths of k-medoid Clusters',
        x     = "Number of clusters",
        y     = 'Silhouette Width') +
   theme(plot.title = element_text(hjust = 0.5))

When deciding on the optimal number of clusters, DO NOT rely exclusively on the output of mathematical methods. Make sure you combine it with domain knowledge and your own judgement as often adding an extra segment (say, going from 5 to 6) may only add complexity for no extra insight.


Visualising the Segments with t-SNE

Now that I have a potential optimal number of clusters, I want to visualise them. To do so, I use t-SNE (t-distributed stochastic neighbour embedding), a dimensionality reduction technique that assists with cluster visualisation in a similar way to Principal Component Analysis and UMAP.

First, I pull the partitioning data for the 5-cluster configuration from gower_dist.

pam_fit <- 
  gower_dist %>% 
       # diss = TRUE to treat argument as dissimilarity matrix
    pam(k = 5, diss = TRUE)
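
A nice perk of K-medoids is that every cluster is represented by an actual observation, its medoid. Here’s a quick sketch to inspect them (the id.med element of the pam object holds the medoids’ row numbers):

# the 5 medoids, i.e. the most "central" subscriber of each segment
clust_data[pam_fit$id.med, ]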

Then I construct the 2D projection of my 5-dimensional space with Rtsne and plot it.

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)


tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering)) %>% 

  # plot
  ggplot(aes(x = X, y = Y, colour = cluster)) +
  geom_point() +
  theme_light() +
  labs(title       = 't-SNE 2D Projections of k-medoid Clusters')  +
  theme(plot.title = element_text(hjust = 0.5))

Aside from a few elements, there is generally good separation between clusters, as well as closeness of elements within clusters, which confirms the relevance of the segmentation.


Evaluating the Clusters

At this stage of a project you get to appreciate the main difference between supervised and unsupervised approaches. With the latter, it is the algorithm that does all the legwork, but it’s still up to you to interpret the results and understand what the algorithm has found.

To properly evaluate the clusters, you want to first pull all information into one tibble that you can easily manipulate. I start by appending the cluster number from the pam_fit list to the clust_data.

pam_results <- clust_data %>%
   # append the cluster information onto clust_data
  mutate(cluster = pam_fit$clustering) %>% 
   # sort out variable order
  select(account_id, cluster, everything()) %>% 
   # attach some extra data features from the data_clean file
  left_join(select(data_clean, 
                   c(account_id, canc_reason)), by = 'account_id') %>%
  mutate_if(is.character, as.factor)

NOTE THAT I’m also bringing in canc_reason from data_clean, a dimension that I did NOT use in the calculations. It does not always work but sometimes it may reveal patterns you have not explicitly considered.

From experience I found that a very good exercise is to print out the summaries (on screen or paper) of each single cluster - here’s an example for cluster 2.

Lay all the summaries before you and let your inner Sherlock Holmes come out to play! The story usually unfolds before your very eyes when you start comparing the differences between clusters.

pam_results %>% 
  filter(cluster == 2) %>%
  select(-c(account_id , cluster, contract_monthly_price)) %>% 
  summary() 

##        status     product_group campaign_code      payment_method
##  Cancelled:202   Standard:900   56472:  0     Direct Debit:  0   
##  Active   :698   Premium :  0   57572:900     Credit Card :900   
##                                 56972:  0     PayPal      :  0   
##                                                                  
##                                                                  
##                                                                  
##                                                                  
##          canc_reason 
##  -             :738  
##  Failed Payment: 81  
##  Lack of time  : 46  
##  Price         : 16  
##  Editorial     :  8  
##  Other         :  5  
##  (Other)       :  6
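
To lay out all five summaries in one go rather than re-running the snippet above for each cluster, a simple loop does the trick (a minimal sketch):

# print the same summary for each cluster in turn
for (i in sort(unique(pam_results$cluster))) {
   cat("\n--- Cluster", i, "---\n")
   pam_results %>% 
      filter(cluster == i) %>% 
      select(-c(account_id, cluster, contract_monthly_price)) %>% 
      summary() %>% 
      print()
}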

When you have your story, you may want to frame it in a nice table or graph. Here I’m showing the code I used to assemble my own overall summary and then discuss the results but feel free to tailor it to your own taste and needs!

First I calculate subscription rate (my measure of success) and failed payment rate for each cluster…

# subscription rate - my measure of success... 
subscr <- 
   pam_results %>%
   group_by(cluster, status) %>% 
   count() %>% 
   ungroup() %>% 
      # lag to align past (lagging) obs to present obs
   mutate(lag_1 = lag(n, n = 1)) %>% 
      # sort out NAs
   tidyr::fill(lag_1, .direction = "up") %>% 
      # calculate active to total (subscription) rate
   mutate(sub_rate =  (n / (n + lag_1)))  %>% 
   filter(status == 'Active') %>%
   select(Cluster = cluster, Subsc.Rate = sub_rate) %>% 
   adorn_pct_formatting(rounding = "half up", digits = 1)

# ... and failed payment rate
fail_pymt <-
   pam_results %>%
   group_by(cluster , canc_reason) %>% 
   count() %>% 
   ungroup() %>%
   group_by(cluster) %>% 
   mutate(tot = sum(n)) %>% 
      # calculate failed payment rate to total 
   mutate(fail_pymt_rate = (n / tot)) %>% 
   ungroup() %>%
   filter(canc_reason == 'Failed Payment') %>%
   select(Cluster = cluster, Failed.Pymt.Rate = fail_pymt_rate) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1)

Then, I bring it all together in a handy table so that we can have a closer look.

pam_results %>% 
   group_by(Cluster   = cluster, 
            Campaign  = campaign_code, 
            Product   = product_group, 
            Mth.Price = contract_monthly_price) %>% 
   summarise(Cluster.Size = n()) %>% 
   ungroup() %>%  
   arrange(Campaign) %>% 
     # attach subscription rate from subscr
   left_join(select(subscr, c(Cluster, Subsc.Rate)), by = 'Cluster') %>% 
     # attach failed payment rate from fail_pymt
   left_join(select(fail_pymt, 
                     c(Cluster, Failed.Pymt.Rate)), by = 'Cluster') %>% 
   kable(align = 'c')


Clus  Camp    Prod   Mth.Price  Clus.Size  Subsc.Rate  Fail.Pymt.Rate 
  1  56472  Standard   6.99        1061      75.8%         11.0% 
  2  57572  Standard   6.99        900       77.6%          9.0% 
  5  57572  Standard   6.99        621       72.5%         15.0% 
  3  56972  Standard   6.99        587       76.0%          9.9% 
  4  56972   Premium    15         264       64.8%          9.5%

Focusing on the top 3 campaigns was a good tactic and resulted in clear separation of the groups; it also meant that the K-medoids algorithm discovered sub-groups within 2 of the campaigns.

Let’s take a look at campaign 57572 first:

  • cluster 5 has a lower Subscription Rate than most campaigns (74-76%, remember?) AND a higher Failed Payment Rate
  • Further inspection reveals that the lower subscription rate is associated with Direct Debit and PayPal payments


pam_results %>% 
   group_by(Cluster     = cluster, 
            Pymt.Method = payment_method) %>% 
   summarise(Cluster.Size = n()) %>% 
   ungroup() %>%  
   filter(Cluster %in% c(2,5)) %>%
   # attach subscription rate from subscr
   left_join(select(subscr, c(Cluster, Subsc.Rate)), by = 'Cluster') %>%
   kable(align = 'c')


Cluster    Pymt.Method    Cluster.Size    Subsc.Rate 
   2       Credit Card        900            77.6% 
   5       Direct Debit       165            72.5% 
   5         PayPal           456            72.5%

RECOMMENDATION 3: there may be potential to incentivise Credit Card payments as they seem to associate with higher Subscription Rate

Let’s move on to campaign 56972:

  • We already found that this campaign is attributed to the acquisition of both products, and has a lower overall subscription rate of 72.5% compared to our benchmark of 74-76%
  • K-medoids split this out into two groups, revealing that the Premium product has a lower subscription rate than its Standard counterpart:

Clus  Camp    Prod   Mth.Price  Clus.Size  Subsc.Rate  Fail.Pymt.Rate 
  1  56472  Standard   6.99        1061      75.8%         11.0% 
  2  57572  Standard   6.99        900       77.6%          9.0% 
  5  57572  Standard   6.99        621       72.5%         15.0% 
  3  56972  Standard   6.99        587       76.0%          9.9%
  4  56972   Premium    15         264       64.8%          9.5%


RECOMMENDATION 4: there may be potential to investigate different price points for Premium product that could help boost take up

RECOMMENDATION 5: acquisition campaigns should be more closely aligned to a single product to ensure accountability and better understanding of acquisition dynamics

Summary of recommendations

  • Review and solve high failed payment rate - This could result in a 10% potential boost in overall subscriptions
  • Incentivise credit card payments - Subscriptions associated with this payment method have a higher sign-up rate than other payment methods
  • Investigate different price points for both standard & premium products - Lower price point could boost take up
  • Keep campaigns more closely aligned to product - This would ensure better accountability and understanding of acquisition dynamics

Closing thoughts

In this project I revisited clustering, one of my favourite analytic methods, to explore and analyse a real-world dataset that included a mix of categorical and numerical features. This required a different approach from the classical K-means algorithm, which cannot be directly applied to categorical data.

Instead, I used the K-medoids algorithm, also known as PAM (Partitioning Around Medoids), which has the advantage of working with arbitrary dissimilarity measures and lends itself well to analysing mixed-type data.

The silhouette coefficient helped to establish the optimal number of clusters, whilst t-SNE (t-distributed stochastic neighbour embedding), a dimensionality reduction technique akin to Principal Component Analysis and UMAP, revealed good separation between clusters as well as closeness of elements within clusters, confirming the relevance of the segmentation.

Finally, I condensed the insight generated from the analysis into a number of actionable, data-driven recommendations that, applied correctly, could help improve product sign-ups.

Code Repository

The full R code and all relevant files can be found on my GitHub profile @ K Medoid Clustering

Appendix

Table 1 – Variable Definitions

ATTRIBUTE              DESCRIPTION

Account ID	           Unique account ID
Created Date	       Date of original account creation
Country	               Country of account holder
Status	               Current status - active/inactive
Product Group	       Product type
Payment Frequency	   Most subscriptions are a 1 year contract term
Campaign Code	       Unique identifier for campaign 
Start Date	           Start date of the trial
End Date	           Scheduled end of term
Cancellation Date	   Date of instruction to cancel
Cancellation Reason	   Reason given for cancellation
Monthly Price	       Current monthly price of subscription
Contract Monthly Price Price after promo period
Trial Length	       Length of trial
Trial Monthly Price    Price during trial
Payment Method         Payment method



Originally published on https://diegousai.io on 22nd April 2020


