Segmenting with mixed scale data – a comparison

Background

In 2021 Joseph White and I compared several programs for segmenting respondents using metric basis variables (you can find the paper here, on pages 215-226: https://sawtoothsoftware.com/resources/technical-papers/conferences/sawtooth-software-conference-2021). Frequently, however, we see a mix of variable types: some of our basis variables are metric (counts, percentages, rating scales) while others (particularly demographics) are categorical. In cases like this our alternatives for creating segments are more limited. If we want to stay in the world of distance-based clustering, we can calculate Gower distances between our cases, to put metric and categorical variables on a level playing field, then use a clustering method like partitioning around medoids (PAM). Alternatively, we can use finite mixture models (FMMs), which include model-based clustering and latent class analysis. (Note for Sawtooth users: this is not the style of latent class MNL analysis provided in Lighthouse Studio, which involves respondents’ answers to a CBC or MaxDiff survey; instead it’s what’s often called “latent class clustering.”)
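
For readers who want to try the distance-based route, here is a minimal sketch in R using the cluster package, where daisy() computes Gower distances and pam() runs partitioning around medoids. The data frame and variable names are hypothetical placeholders, not data from the study:

    # Minimal sketch of Gower distances + PAM on mixed-type data.
    # The data frame and column names are illustrative only.
    library(cluster)

    set.seed(42)
    dat <- data.frame(
      rating  = sample(1:10, 100, replace = TRUE),   # metric
      percent = runif(100, 0, 100),                  # metric
      region  = factor(sample(c("N", "S", "E", "W"), 100, replace = TRUE)),  # categorical
      owner   = factor(sample(c("yes", "no"), 100, replace = TRUE))          # categorical
    )

    gower_d <- daisy(dat, metric = "gower")  # mixed-type dissimilarities in [0, 1]
    pam_fit <- pam(gower_d, k = 4)           # 4-cluster PAM solution
    table(pam_fit$clustering)                # cluster sizes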

Research Topic

Because Joseph and I found PAM to perform poorly compared to other distance-based methods in our 2021 paper, I’ve used FMMs when I have mixed variable types. In the past I relied on the convenient, commercially available Latent Gold package. Recently, however, some interesting-looking R packages have appeared that also handle mixed variable types. The easiest of these to use, VarSelLCM, can run FMMs on mixed scale data, with or without variable selection. Wondering whether VarSelLCM or Latent Gold does a better job of clustering mixed scale data, I decided to try them both out on some artificial data sets where I knew how many segments I had and which respondents belonged to which segments.
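
As a sketch of what the VarSelLCM workflow looks like (assuming a mixed-type data frame like dat from the sketch above, with metric columns stored as numerics and categorical columns as factors; the gvals range shown here is arbitrary):

    library(VarSelLCM)

    # Fit mixed-type latent class models for 2-6 segments, with variable
    # selection turned on; the package compares models by information criterion.
    res <- VarSelCluster(dat, gvals = 2:6, vbleSelec = TRUE)

    summary(res)               # chosen number of segments and retained variables
    membership <- fitted(res)  # modal segment assignment for each respondent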

Research Design

Each of 20 data sets contained 1,000 respondents. The first 10 contained four segments of approximately equal size, while data sets 11-20 each contained four segments of 100, 200, 300, and 400 respondents. For each of five categorical variables I sampled from a different randomly selected segment-specific nominal distribution. For each of five metric variables I drew values from a normal distribution (standard deviation = 3.5) centered on a randomly selected segment-specific mean between 1 and 10 (with standard deviations smaller than 3.5 the segments were too easy to predict and both methods performed equally and extremely well, so I increased the standard deviation to make the classification job harder).
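
A minimal sketch of how one of the unequal-size data sets could be generated in R; the segment-specific level probabilities and means were drawn at random in the study, so the number of categorical levels and the placeholders below are illustrative assumptions:

    set.seed(1)
    sizes <- c(100, 200, 300, 400)   # unequal-size design (data sets 11-20)
    seg   <- rep(1:4, times = sizes)
    n     <- sum(sizes)

    # Five categorical variables: each segment gets its own nominal distribution
    # (three levels per variable is an assumption for illustration)
    cat_vars <- lapply(1:5, function(j) {
      probs <- matrix(runif(4 * 3), nrow = 4)   # 4 segments x 3 levels
      probs <- probs / rowSums(probs)           # normalize each segment's distribution
      factor(sapply(seg, function(s) sample(1:3, 1, prob = probs[s, ])))
    })
    names(cat_vars) <- paste0("c", 1:5)

    # Five metric variables: normal with sd = 3.5 around a segment-specific
    # mean drawn between 1 and 10
    met_vars <- lapply(1:5, function(j) {
      mu <- runif(4, min = 1, max = 10)
      rnorm(n, mean = mu[seg], sd = 3.5)
    })
    names(met_vars) <- paste0("m", 1:5)

    sim <- data.frame(cat_vars, met_vars, true_seg = seg)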

Results

Each method identified the right number of segments in nine of the 20 data sets. Each got the number of segments right six times out of 10 for the equal-sized segments but only three times out of 10 for the unequal-sized segments (replicating the common finding that segmentation methods struggle with true segments of unequal size). In all but one case, the misses underestimated the number of segments.

Table 1 – Number of segments identified by method

Next let’s look at how well each method put the right respondents into the right segments. For this analysis we specified a 4-segment solution for each method and each data set. For comparison we use a measure called the adjusted Rand index (ARI). The ARI measures classification similarity: it equals 1 when two partitions agree perfectly and sits near 0 (it can even dip slightly negative) when they agree no better than chance. In the first two columns of Table 2 we see the ARIs comparing VarSelLCM and Latent Gold to the known segment membership, while the third column shows the ARI of VarSelLCM compared to Latent Gold.
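
Computing the ARI is a one-liner in R via the mclust package. A sketch of this step, assuming the sim data frame from the design sketch and the VarSelCluster()/fitted() interface noted earlier (Latent Gold assignments would be exported from that program and compared the same way):

    library(VarSelLCM)
    library(mclust)

    # Force a 4-segment solution, as in Table 2, and compare to the truth
    fit4 <- VarSelCluster(sim[, 1:10], gvals = 4, vbleSelec = FALSE)
    adjustedRandIndex(sim$true_seg, fitted(fit4))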


Table 2 – ARI by method

VarSelLCM and Latent Gold have very similar average ARIs (0.60 and 0.61). In six of the 20 data sets Latent Gold outperformed VarSelLCM by more than a percentage point. The fourth column shows that in three of the data sets with equal-sized segments VarSelLCM and Latent Gold produced exactly the same answer. In fact, the two produced very similar answers across the board: in every case the two produced answers more similar to one another than either was to the true segment memberships.

We see maybe a hint that Latent Gold put respondents into the right segments more successfully than did VarSelLCM, but in this small study of 20 data sets we can’t be sure this result would generalize – and in any case, both methods got the number of segments wrong equally often.

In summary, the Latent Gold and VarSelLCM methods appear to have performed about equally well at segmenting mixed scale data.

Future Research

This was a small test I put together in a couple of hours. I can imagine several ways to extend this work to make it more robust:

  • My clusters were hyperspheres, but FMMs can also handle segments with elliptical covariance structures, so how well do the two programs perform with those?
  • I tested a case with five independent metric variables and five independent categorical variables. What if we had more of one kind of variable than the other?
  • What if the variables were somewhat correlated? My way of making them kept them independent, an ideal case unlikely to occur in practice.
  • Finally, there are a handful of other methods in R that handle mixed variable types in distance-based segmentation (using the Gower distance metric with PAM or hierarchical clustering) or using FMM methods (kamila, clustMD, Rmixmod). A more complete study could include some or all of those as well.

Extending the analysis in this way might make an interesting paper for a Sawtooth Research Conference one day (hint hint).
