Segmenting with mixed scale data – a comparison

Background

In 2021 Joseph White and I compared several programs for segmenting respondents using metric basis variables (you can find the paper here, on pages 215-226: https://sawtoothsoftware.com/resources/technical-papers/conferences/sawtooth-software-conference-2021). Frequently, however, we see a mix of variable types: some of our basis variables are metric (counts, percentages, rating scales) while others (particularly demographics) are categorical. In cases like this our alternatives for creating segments are more limited. If we want to stay in the world of distance-based clustering, we can calculate Gower distances between our cases, to put metric and categorical variables on a level playing field, then use a clustering method like partitioning around medoids (PAM). Alternatively, we can use finite mixture models (FMMs), which include model-based clustering and latent class analysis. (Note for Sawtooth users: this is not the style of latent class MNL analysis provided in Lighthouse Studio, which involves respondents’ answers to a CBC or MaxDiff survey; instead it’s what’s often called “latent class clustering.”)
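
For readers who want to try the distance-based route, here is a minimal sketch in R using the cluster package, where daisy() computes Gower distances and pam() runs partitioning around medoids. The data frame and variable names are hypothetical placeholders, not data from the study:

    # Minimal sketch of Gower distances + PAM on mixed-type data.
    # The data frame and column names are illustrative only.
    library(cluster)

    set.seed(42)
    dat <- data.frame(
      rating  = sample(1:10, 100, replace = TRUE),   # metric
      percent = runif(100, 0, 100),                  # metric
      region  = factor(sample(c("N", "S", "E", "W"), 100, replace = TRUE)),  # categorical
      owner   = factor(sample(c("yes", "no"), 100, replace = TRUE))          # categorical
    )

    gower_d <- daisy(dat, metric = "gower")  # mixed-type dissimilarities in [0, 1]
    pam_fit <- pam(gower_d, k = 4)           # 4-cluster PAM solution
    table(pam_fit$clustering)                # cluster sizes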

Research Topic

Because Joseph and I found PAM to perform poorly compared to other distance-based methods in our 2021 paper, I’ve used FMMs when I have mixed variable types. In the past I relied on the convenient, commercially available Latent Gold package. Recently, however, some interesting-looking R packages have appeared that also handle mixed variable types. The easiest of these to use, VarSelLCM, can run FMMs on mixed scale data, with or without variable selection. Wondering whether VarSelLCM or Latent Gold does a better job of clustering mixed scale data, I decided to try them both out on some artificial data sets where I knew how many segments I had and which respondents belonged to which segments.
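
As a sketch of what the VarSelLCM workflow looks like (assuming a mixed-type data frame like dat from the sketch above, with metric columns stored as numerics and categorical columns as factors; the gvals range shown here is arbitrary):

    library(VarSelLCM)

    # Fit mixed-type latent class models for 2-6 segments, with variable
    # selection turned on; the package compares models by information criterion.
    res <- VarSelCluster(dat, gvals = 2:6, vbleSelec = TRUE)

    summary(res)               # chosen number of segments and retained variables
    membership <- fitted(res)  # modal segment assignment for each respondent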

Research Design

Each of 20 data sets contained 1,000 respondents. The first 10 contained four segments of approximately equal size, while data sets 11-20 each contained four segments of 100, 200, 300, and 400 respondents. For each of five categorical variables I sampled from a different randomly selected segment-specific nominal distribution. For each of five metric variables I drew values from a normal distribution (standard deviation = 3.5) centered on a randomly selected segment-specific mean between 1 and 10 (with standard deviations smaller than 3.5 the segments were too easy to predict and both methods performed equally and extremely well, so I increased the standard deviation to make the classification job harder).
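
A minimal sketch of how one of the unequal-size data sets could be generated in R; the segment-specific level probabilities and means were drawn at random in the study, so the number of categorical levels and the placeholders below are illustrative assumptions:

    set.seed(1)
    sizes <- c(100, 200, 300, 400)   # unequal-size design (data sets 11-20)
    seg   <- rep(1:4, times = sizes)
    n     <- sum(sizes)

    # Five categorical variables: each segment gets its own nominal distribution
    # (three levels per variable is an assumption for illustration)
    cat_vars <- lapply(1:5, function(j) {
      probs <- matrix(runif(4 * 3), nrow = 4)   # 4 segments x 3 levels
      probs <- probs / rowSums(probs)           # normalize each segment's distribution
      factor(sapply(seg, function(s) sample(1:3, 1, prob = probs[s, ])))
    })
    names(cat_vars) <- paste0("c", 1:5)

    # Five metric variables: normal with sd = 3.5 around a segment-specific
    # mean drawn between 1 and 10
    met_vars <- lapply(1:5, function(j) {
      mu <- runif(4, min = 1, max = 10)
      rnorm(n, mean = mu[seg], sd = 3.5)
    })
    names(met_vars) <- paste0("m", 1:5)

    sim <- data.frame(cat_vars, met_vars, true_seg = seg)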

Results

Each method identified the right number of segments in nine of the 20 data sets. Each got the number of segments right six times out of 10 for the equal-sized segments but only three times out of 10 for the unequal-sized segments (replicating the common finding that segmentation methods struggle with true segments of unequal size). In all but one case, the misses underestimated the number of segments.

Table 1 – Number of segments identified by method

Next let’s look at how well each method put the right respondents into the right segments. For this analysis we specified a 4-segment solution for each method and each data set. For comparison we use a measure called the adjusted Rand index (ARI). The ARI measures classification similarity: it equals 1 when two partitions agree perfectly and sits near 0 (it can even dip slightly negative) when they agree no better than chance. In the first two columns of Table 2 we see the ARIs comparing VarSelLCM and Latent Gold to the known segment membership, while the third column shows the ARI of VarSelLCM compared to Latent Gold.
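
Computing the ARI is a one-liner in R via the mclust package. A sketch of this step, assuming the sim data frame from the design sketch and the VarSelCluster()/fitted() interface noted earlier (Latent Gold assignments would be exported from that program and compared the same way):

    library(VarSelLCM)
    library(mclust)

    # Force a 4-segment solution, as in Table 2, and compare to the truth
    fit4 <- VarSelCluster(sim[, 1:10], gvals = 4, vbleSelec = FALSE)
    adjustedRandIndex(sim$true_seg, fitted(fit4))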


Table 2 – ARI by method

VarSelLCM and Latent Gold have very similar average ARIs (0.60 and 0.61). In six of the 20 data sets Latent Gold outperformed VarSelLCM by more than a percentage point. The fourth column shows that in three of the data sets with equal-sized segments VarSelLCM and Latent Gold produced exactly the same answer. In fact, the two produced very similar answers across the board: in every case the two produced answers more similar to one another than either was to the true segment memberships.

We see maybe a hint that Latent Gold put respondents into the right segments more successfully than did VarSelLCM, but in this small study of 20 data sets we can’t be sure this result would generalize – and in any case, both methods got the number of segments wrong equally often.

In summary, the Latent Gold and VarSelLCM methods appear to have performed about equally well at segmenting mixed scale data.

Future Research

This was a small test I put together in a couple of hours. I can imagine several ways to extend this work to make it more robust:

  • My clusters were hyperspheres, but FMMs can also handle segments with elliptical covariance structures, so how well do the two programs perform with those?
  • I tested a case with five independent metric variables and five independent categorical variables. What if we had more of one kind of variable than the other?
  • What if the variables were somewhat correlated? My way of making them kept them independent, an ideal case unlikely to occur in practice.
  • Finally, there are a handful of other methods in R that handle mixed variable types in distance-based segmentation (using the Gower distance metric with PAM or hierarchical clustering) or using FMM methods (kamila, clustMD, Rmixmod). A more complete study could include some or all of those as well.

Extending the analysis in this way might make an interesting paper for a Sawtooth Research Conference one day (hint hint).
