Segmenting with mixed scale data – a comparison
Background
In 2021 Joseph White and I compared several programs for segmenting respondents using metric basis variables (you can find the paper here, on pages 215-226: https://sawtoothsoftware.com/resources/technical-papers/conferences/sawtooth-software-conference-2021). Frequently, however, we see a mix of variable types: some of our basis variables are metric (counts, percentages, rating scales) while others (particularly demographics) are categorical. In cases like this our alternatives for creating segments are more limited. If we want to stay in the world of distance-based clustering, we can calculate Gower distances between our cases, to put metric and categorical variables on a level playing field, then use a clustering method like partitioning around medoids (PAM). Alternatively, we can use finite mixture models (FMMs), which include model-based clustering and latent class analysis. (Note for Sawtooth users: this is not the style of latent class MNL analysis provided in Lighthouse Studio, which involves respondents' answers to a CBC or MaxDiff survey; instead it's what's often called "latent class clustering.")
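For readers who want to try the distance-based route, here is a minimal R sketch (my own illustration, not from the 2021 paper) using the cluster package; mixed_df stands in for a hypothetical data frame whose factor columns are the categorical basis variables and whose numeric columns are the metric ones:

    library(cluster)

    # daisy() with metric = "gower" computes Gower distances, which rescale
    # each variable so categorical and metric columns contribute comparably
    gower_dist <- daisy(mixed_df, metric = "gower")

    # partitioning around medoids (PAM) on the Gower distance matrix,
    # here asking for 4 segments
    pam_fit <- pam(gower_dist, k = 4, diss = TRUE)
    pam_fit$clustering  # segment assignment for each respondent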
Research Topic
Because Joseph and I found PAM to perform poorly compared to other distance-based methods in our 2021 paper, I've used FMMs when I have mixed variable types. In the past I relied on the convenient, commercially available Latent Gold package. Recently, however, some interesting-looking R packages have appeared that also handle mixed variable types. The easiest of these to use, VarSelLCM, can run FMMs on mixed scale data, with or without variable selection. Wondering whether VarSelLCM or Latent Gold does a better job of clustering mixed scale data, I decided to try them both out on some artificial data sets where I knew how many segments I had and which respondents belonged to which segments.
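To give a flavor of how little code this takes, here is a minimal sketch of fitting VarSelLCM to the same hypothetical mixed_df (based on my reading of the package interface, so check the documentation); gvals gives the candidate numbers of segments and vbleSelec toggles variable selection:

    library(VarSelLCM)

    # fit finite mixture models for 1 to 6 components, with variable
    # selection turned on (vbleSelec = FALSE keeps all variables)
    fit <- VarSelCluster(mixed_df, gvals = 1:6, vbleSelec = TRUE)

    summary(fit)       # chosen number of segments and retained variables
    head(fitted(fit))  # most likely segment for each respondent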
Research Design
Each of 20 data sets contained 1,000 respondents. The first 10 contained four segments of approximately equal size, while data sets 11-20 each contained four segments of 100, 200, 300 and 400 respondents. For each of five categorical variables I sampled from a different randomly selected segment-specific nominal distribution. For each of five metric variables I added normally distributed noise (mean = 0, standard deviation = 3.5) to a randomly selected segment-specific mean between 1 and 10. (With standard deviations smaller than 3.5 the segments were too easy to predict and both methods performed equally and extremely well, so I increased the standard deviation to make the classification job harder.)
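Here is a sketch of that data-generating process in R (my reconstruction of the recipe above, not the script I actually used):

    set.seed(42)
    seg_sizes <- c(250, 250, 250, 250)  # or c(100, 200, 300, 400) for data sets 11-20
    segment   <- rep(1:4, times = seg_sizes)
    n         <- sum(seg_sizes)

    # 5 categorical variables: each segment draws from its own random nominal distribution
    cats <- as.data.frame(setNames(lapply(1:5, function(j) {
      probs <- matrix(rgamma(4 * 3, shape = 1), nrow = 4)  # 4 segments x 3 levels
      probs <- probs / rowSums(probs)                      # normalize rows to probabilities
      factor(sapply(segment, function(s) sample(3, 1, prob = probs[s, ])))
    }), paste0("cat", 1:5)))

    # 5 metric variables: normal noise (sd = 3.5) around a segment-specific mean in [1, 10]
    mets <- as.data.frame(setNames(lapply(1:5, function(j) {
      mu <- runif(4, min = 1, max = 10)                    # one mean per segment
      rnorm(n, mean = mu[segment], sd = 3.5)
    }), paste0("met", 1:5)))

    mixed_df <- data.frame(cats, mets)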
Results
Each method identified the right number of segments in nine of the 20 data sets. Each got the number of segments right six times out of ten for the equal-sized segments but only three times out of ten for the unequal-sized segments (replicating the common finding that segmentation methods struggle to recover true segments of unequal size). In all but one case, the misses underestimated the number of segments.
Next let's look at how well each method put the right respondents into the right segments. For this analysis we specified a 4-segment solution for each method and each data set. For comparison we use a measure called the adjusted Rand index (ARI). The ARI measures classification similarity: it equals 1 when two classifications match perfectly and falls near 0 when they agree no better than chance, with higher numbers indicating more similar classifications. In the first two columns of Table 2 we see the ARIs comparing VarSelLCM and Latent Gold to the known segment membership, while the third column shows the ARI of VarSelLCM compared to Latent Gold.
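For anyone computing these at home, the ARI is a single function call in R, for example with the mclust package; true_segment and the two fitted partitions below are hypothetical vectors of segment labels:

    library(mclust)

    adjustedRandIndex(true_segment, varsellcm_segment)        # VarSelLCM vs. truth
    adjustedRandIndex(true_segment, latentgold_segment)       # Latent Gold vs. truth
    adjustedRandIndex(varsellcm_segment, latentgold_segment)  # method vs. method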
VarSelLCM and Latent Gold have very similar average ARIs (0.60 and 0.61, respectively). In six of the 20 data sets Latent Gold outperformed VarSelLCM by more than 0.01 on the ARI scale. The fourth column shows that in three of the data sets with equal-sized segments, VarSelLCM and Latent Gold produced exactly the same answer. In fact, the two produced very similar answers across the board: in every case their answers were more similar to one another than either was to the true segment memberships.
We see perhaps a hint that Latent Gold put respondents into the right segments more successfully than VarSelLCM did, but in this small study of 20 data sets we can't be sure this result would generalize – and in any case, both methods got the number of segments wrong equally often.
In summary, it appears that Latent Gold and VarSelLCM performed about equally well at segmenting mixed scale data.
Future Research
This was a small test I put together in a couple of hours, and I can imagine several ways to extend this work to make it more robust. Extending the analysis along those lines might make an interesting paper for a Sawtooth Research Conference one day (hint hint).