Synthetic Data: Modern Day Alchemy, or the Voice of the Under-Represented?

Just to be clear, I’m focusing on synthetic data as an augmentation of a regular sample, not on synthetic respondents/personas. They’re two different things, and the latter is an interesting area with exciting applications. (There's a good discussion on this courtesy of the MRS here.)

Synthetic (augmented) data is being hailed as a solution to sample-quality issues, and an easy way to bring a voice to hard-to-reach consumer groups. But is it too good to be true?

Why it's a big thing...

Regardless of whether it’s credible or not, it’s clear that the growth of augmented/synthetic data represents a threat to existing business models, especially those focused on data collection:

  • It’s quick and cheap, giving it instant appeal
  • The debate raging around quality of online sample leaves the door wide open
  • Once clients compromise on human sample, it’ll be easy to move to automated outputs and DIY solutions. It's a slippery slope once we trade off some of the fundamentals
  • Good agencies bring the consumer to life and showcase human truths. It’s hard to do this (credibly) when many of your data points aren’t actually people!
  • It’ll answer anything, even bad questions. This doesn’t help drive the improvements in survey design needed to increase respondent engagement
  • It will always look (superficially) credible as it's, in large part, just mirroring the small amount of real data you collect

This makes it a big deal, and one we need to explore in more depth.

Scene-setters

  • Having explored several suppliers, I’m calling this as I see it at this point in time (personal opinion). It’s possible I have the wrong end of the stick, and I know the solutions will improve...which isn't hard based on what we've seen! I’m keen to learn and open to changing my mind, so by all means add comments. Note: a proper peer-reviewed white paper will be more convincing than a cherry-picked case study
  • I’m not a technophobe. In fact, I’m all-in for AI in the right application
  • FMCG/CPG focus: I’m only offering an opinion about augmented data in the context of FMCG research. It’s different in medical, financial and minority-group applications, where genuine confidentiality issues are at play
  • Alchemy: sounds nice, but a reminder of the definition: ‘the medieval forerunner of chemistry, concerned with the transmutation of matter, in particular with attempts to convert base metals into gold’. Or, less flatteringly: ‘a seemingly magical process of transformation, creation, or combination’

What’s being promised?

  • ‘Boost under-represented sub-groups’
  • ‘Help fill hard-to-reach samples’
  • ‘Enhance the power of your data by increasing the sample size’
  • ‘Enable data to be collected faster’

What’s not to like? On the surface, nothing.

But from what I’ve seen, there doesn’t seem to be anything particularly clever going on, and that's what worries me!

Scratch below the shiny surface and it looks like glorified weighting at best…and in some cases worse! It smacks of 'AI washing': the sort of AI that was an algorithm or a macro not so long ago!

Pertinent questions?

18 months ago, if you’d asked a statistician, “We’ve only found 50 young Hispanic males, but we promised the client 150. Can you turn that 50 into 150?”, they’d have looked at you like you’d hacked into their Fantasy Football team and swapped out Erling Haaland.

Imagine a pre-Gen-AI conversation with a client along the lines of “Don’t worry, we can’t find the people you’ve paid for, but we’ll use our algorithm to boost the numbers.” Not sure that would have landed well!

If you start with, let's say, 30 people, how representative of the population you’re sampling is that? And does it really become more representative if you turn those 30 into 60 or 90 via augmented sampling? What if 3 or 4 of the 30 were outliers? They've just become 2 or 3 times more prominent in your total sample.
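To make the outlier point concrete, here's a toy sketch (all numbers invented): 30 scores, 3 of them extreme, crudely "boosted" to 90 by copying.

```python
# Toy illustration (all numbers invented): 30 respondents, 3 of whom are
# outliers scoring far above the rest.
real = [5.0] * 27 + [50.0] * 3

# The crudest form of boosting - copying the sample - triples every
# respondent, outliers included.
augmented = real * 3

print(sum(1 for x in real if x > 10))       # 3 outliers in the real data
print(sum(1 for x in augmented if x > 10))  # 9 in the "boosted" data
```

The proportion of outliers is unchanged, but every quirk of the original 30 is now baked in three times over, and any generation method anchored to those 30 people inherits the same quirks.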

Are we just getting a bit too impatient? In most cases it's not that hard to get quality feedback from enough real people in harder-to-reach groups. In FMCG/CPG work we're not talking about sufferers of a rare illness, or users of a super-complicated insurance product. You have to be prepared to pay more than $5 a response (imagine!), and perhaps wait a touch longer. I know augmented data saves us from having to do this, but is that a compromise worth making?


What we expected to find

We’ve done quite a bit of exploration and were expecting to see:

  • Advanced ways of creating new respondents (personas) from the existing data. These new ‘respondents’ would complete the survey as fresh ‘people’ (a digital sibling, if not a twin). Maybe taking this a stage further, with external sources layering in more information about the sub-group of interest to boost performance
  • Highly advanced AI/stats to ensure the structure of the data, at a respondent level, remains intact

Either of these will be possible (i.e. this will get better). But it’s not what we’ve found to date, which should be a worry.

So what did we find?

Feed the data you’ve got into the system and define the sub-group of interest. This is the “training data”, which sounds rather grand!

There’s no requirement to provide any information on the structure of the data, or the purpose of the test.

'AI modelling/statistical methods' turn whatever base size you had into a bigger number: actual data records are created (which is the big sell, of course). Up to 2 times more is recommended by some, more by others, although in our trials we were provided with extra rows of data and advised to randomly select the right number (a minimum order quantity). Sitting comfortably, statisticians?

You end up with a bigger dataset that mirrors what you put in. But, unlike weighting, you now have extra data rows to play with. Not since Paul Daniels waved his wand at the lovely Debbie McGee has someone produced such a magic trick!

On getting the data back, we were surprised:

  1. We found a number (a large number) of completely duplicated rows. Maybe I'm being old-school, but back in the day duplicate data was the sort of thing that got agencies struck off client rosters, wasn’t it?
  2. Lots of new data rows were created that ensured the resulting sub-group means and counts pretty much mirrored the data we supplied

So it’s sort of self-fulfilling, obviously “it works”!
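Auditing for exactly this is straightforward, and worth running on any augmented file a supplier returns. A pure-Python sketch with invented rows (the columns here stand for age, gender and two ratings):

```python
# Invented "original" respondents: (age, gender, rating_1, rating_2).
original = [
    (34, "M", 7, 6),
    (29, "F", 8, 8),
    (41, "M", 5, 4),
]

# Invented "augmented" file as returned by a hypothetical supplier.
augmented = [
    (34, "M", 7, 6),   # exact copy of an original respondent
    (34, "M", 7, 6),   # ...and copied again
    (29, "F", 8, 8),   # another exact copy
    (36, "F", 7, 7),   # a genuinely new row
]

orig_set = set(original)
copies_of_original = sum(1 for row in augmented if row in orig_set)
within_file_dupes = len(augmented) - len(set(augmented))

print(copies_of_original)  # 3 rows are verbatim copies of real people
print(within_file_dupes)   # 1 row is duplicated within the augmented file
```

Two counts, a few lines of code, and you know immediately how much of the "new" sample is actually new.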

But we’ve typically seen the augmented sub-group mean move a little closer to that of the total sample, suggesting the algorithms are playing it safe, leading to less difference between sub-groups than might exist in reality. (My stats colleagues tell me this is quite normal: common techniques like SMOTE use adjacent data points to create new points in between the actual ones, so the mean can only come inwards.)

Note: those same stats colleagues felt these techniques were really designed for situations where there's a high understanding of the relationships in the 'system' being predicted. This isn't the case with how consumers react to a new food and drink idea/product, which is much more variable - and that's why you need to speak to (and observe) more people to build an accurate picture.
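The "mean can only come inwards" point can be seen in a minimal 1-D sketch of SMOTE-style interpolation (this is a deliberately simplified stand-in for the real algorithm, and all the data is invented): each synthetic point sits between a real point and its nearest neighbour, so no synthetic value can ever fall outside the observed range.

```python
import random
import statistics

random.seed(42)

real = [2.0, 3.0, 4.0, 9.0, 10.0]   # invented scores for a small sub-group

def smote_like(points, n_new):
    """Create n_new points, each interpolated between two adjacent real points."""
    pts = sorted(points)
    new = []
    for _ in range(n_new):
        i = random.randrange(len(pts) - 1)
        a, b = pts[i], pts[i + 1]                  # a point and its neighbour
        new.append(a + random.random() * (b - a))  # somewhere in between
    return new

combined = real + smote_like(real, 50)

# The extremes can never be extended, and the spread typically shrinks:
# synthetic points pull everything towards the middle of the distribution.
print(min(combined) >= min(real) and max(combined) <= max(real))
print(round(statistics.pstdev(real), 2), round(statistics.pstdev(combined), 2))
```

However many points you generate this way, the synthetic sub-group can only ever look like a smoothed, slightly compressed version of the handful of real respondents it came from.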

A big problem with the new data we've seen (not the duplicate rows obviously!) is that the structure breaks down at an individual respondent level.

  • We assume the AI does (or will) account for the halo/horn effects that exist when someone likes/dislikes an object. That should be easy enough to handle, although we weren’t convinced by the examples we’ve seen
  • But, and this is particularly the case with product testing, we tend to find lots of measures are loosely linked (correlated), and this is where we’ve found the augmented respondents fall foul. E.g. if you rate strength of flavour as too strong, chances are you’ll also rate other aspects (sweetness etc.) as too strong. We could argue that missing this structure doesn’t really matter if you're just performing over-arching sub-group analysis and the means all play out OK. But it really does matter if you’re segmenting or using other advanced statistical modelling techniques, where it ruins the solution almost entirely

The question to ask suppliers isn't whether the augmented data's means match up (self-fulfilling), but what happens when you replicate advanced statistical analysis on the new data.
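One concrete way to run that check: compare the correlation between linked measures in the real data with the same correlation in the augmented rows. A sketch with simulated data (everything here is invented, and the "augmentation" shown is a deliberately naive one that preserves each column's distribution but builds rows independently):

```python
import random
import statistics

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient, stdlib only."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented "real" respondents: flavour-strength and sweetness ratings are
# correlated, as tends to happen in product testing.
flavour = [random.gauss(5, 1.5) for _ in range(50)]
sweet = [f + random.gauss(0, 0.7) for f in flavour]

# Naive augmentation: each column's distribution survives perfectly, but the
# rows are assembled independently, so the per-respondent link is destroyed.
aug_flavour = random.sample(flavour, len(flavour))
aug_sweet = random.sample(sweet, len(sweet))

print(round(pearson(flavour, sweet), 2))          # strongly positive
print(round(pearson(aug_flavour, aug_sweet), 2))  # near zero
```

The means and distributions of both files match, so a "means match up" validation passes, yet any segmentation or driver analysis run on the augmented rows is working with a relationship that no longer exists.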

Higher levels of confidence?

In the past, the research buyer had to balance confidence levels with budget. Now, it seems we can perform alchemy, meaning this choice doesn’t need to be made.

In theory we’ve created new 'people'. But your new augmented sample consists only of those you interviewed properly, plus a stack of data created either as duplicates or surrogates of those people.

So yes, you’ll see higher levels of confidence when you conduct statistical testing...

...But the real question is one of actual confidence, not statistical confidence. Are you really more confident in the decisions you’ll make?

It’s a bit like deciding what to buy based on Amazon reviews. There are 5 reviews of a product you’re considering, which doesn’t seem like many. If you duplicate those 5, and create another 5 from a combination of the originals, you can now read 15 reviews. But are you really more confident in your choice?

Or say I manage to get a ticket to watch Villa in the Champions League (give me that one!) and then get it elaborately photocopied. Have I created a second ticket? If I sell it, or try to use it, I think they call that counterfeiting!
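The confidence inflation is easy to demonstrate with a toy example (all numbers invented): duplicate every respondent and the means and variances barely move, but the sample size does, so the standard error shrinks and a non-significant gap turns "significant".

```python
import statistics

def welch_t(a, b):
    # Welch t statistic: difference in means over its standard error.
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / ((va / na + vb / nb) ** 0.5)

# Two concepts with a modest gap, n = 30 each (invented ratings).
concept_a = [6, 7, 5, 8, 6, 7, 6, 5, 7, 6] * 3
concept_b = [6, 7, 5, 7, 6, 6, 5, 6, 6, 6] * 3

t_real = welch_t(concept_a, concept_b)             # ~1.47: not significant

# "Augment" by tripling every respondent: same gap, three times the n,
# so the t statistic inflates by roughly sqrt(3).
t_boosted = welch_t(concept_a * 3, concept_b * 3)  # ~2.57: now "significant"

print(round(t_real, 2), round(t_boosted, 2))
```

At the usual 1.96 threshold the original comparison isn't significant and the duplicated one is, yet we have learned nothing new about either concept.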

Why be worried?

I get why it’s attractive to increase the base size: balancing up samples, making fieldwork quicker and cheaper, bringing a higher weight to the voice of hard-to-reach groups, etc.

I also get why people might favour this over traditional weighting - it removes some of the 'lumpiness' you can find with weighted samples, and gives you more records to play with in stats analysis. But, based on our experience, this comes at the expense of the integrity of the data's relationships. Is that a trade-off worth making?

To me it feels like glorified weighting at best. But that certainly isn’t how it’s being sold and positioned. And bear in mind we’re paying for this augmentation, unlike when you weight data!

A further concern is that it’s an easy area in which to get away with things! I remember the look on the faces of some young execs when I showed them you could change the total sample mean by weighting males and females differently, without changing either of the sub-group means. It was like some form of witchcraft. It’s easy to bamboozle with stats, impressive-sounding models and the magical mention of AI!
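That weighting "witchcraft" is just arithmetic. A sketch with invented sub-group means:

```python
# Invented sub-group means: males rate the product 6.0, females 8.0.
male_mean, female_mean = 6.0, 8.0

# Change only the weights - the sub-group means never move...
equal_split = 0.5 * male_mean + 0.5 * female_mean   # total = 7.0
male_heavy = 0.8 * male_mean + 0.2 * female_mean    # total = 6.4

# ...yet the total sample mean shifts from 7.0 to 6.4.
print(equal_split, round(male_heavy, 1))
```

Nothing about either gender's opinion changed; only the mix did. That's the kind of simple mechanism that, wrapped in model-speak and a mention of AI, can look like magic.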

I’m sure this can and will all get better, but for the time being it’s important that we ask the right questions of our suppliers. All that glitters (to get back to the alchemy point) isn’t gold!


As I said at the start, I’m all for exploring simulated respondents/personas: the sort of application where we create digital twins, or fully synthetic consumers, who then take part in research. If that’s done well, with the personas created from solid base data and a degree of freedom added to the AI's completion of the task, then that’s a much more explainable and justifiable use of synthetics than anything we’ve seen to date in the synthetic/augmented data space.


On balance

I’m not a fan of augmented data – at least based on what we’ve seen so far!

I don't think we can divorce this from the ongoing debate around sample quality either. If research buyers are questioning the quality of the data we get from online panels, that opens the door for synthetic data (i.e. the “is it any worse?” argument). As such, it’s hard to think this doesn’t represent a serious threat to the business models of research agencies who’ve prided themselves on collecting representative data and gleaning insights from real people.

Perhaps, as an industry, we'd be better off focusing on why we're struggling to find people who want to take part in our research. Crack this and we have the option of cost-effectively talking to enough real people to negate some of the potential issues presented by augmented data. For example, 18-29 male Indonesians might be a HARD-TO-REACH group, but they aren't a niche group; there are roughly 20,000,000 of them! The problem is that it's hard to access them via online panels, and we're told they don't want to take part in research. Any survey is self-selecting to an extent, but beyond that I don't buy that most harder-to-reach groups don't want to provide feedback. The TikTok generation certainly don't want to complete a 30+ minute survey about a topic they'd rarely even think about, though, especially one that's (obviously) been written by some old guy who lives half-way around the world!

Anyway, that's a different argument.

Back to the main point, as buyers of augmented data we need to be asking the right questions and focusing on actual not just statistical confidence. We could sleepwalk into a big industry issue if this snowballs.

Next time you’re offered a free dinner, it pays to ask what the catch is!

What are your thoughts on synthetic data in market research? Comments welcome...

#syntheticdata #augmenteddata #ai #mrx #marketresearch #restech

Jon Puleston

Chief Methodologist @ IPSOS

9 months

Good thoughts Mat. We have spent a few months exploring the statistical reliability of synthetic data using various machine learning techniques and have reached similar conclusions. I think there is some potential for larger, tracker-style projects where you have enough data to sample-boost for some types of questions... but for smaller samples, at the scale of a typical survey of say 600 responses, the noise and signal are so difficult to differentiate that the boosted sample data is very vulnerable to overfit. So, worse than weighting, boosted sample may amplify errors. I am speaking at the ASC conference in 2 weeks' time about this, where I will be sharing some of these learnings.

Ben Leet

Investor, advisor, fractional CXO. Owner Stratify Consulting. Ex CEO Delineate, ex GM YouGov, ex MD Instantly. Generalist. GTM strategy, finance & planning, digital transformation.

9 months

Good article Mat Lintern and definitely thought provoking. I think the analogy to weighting is right, but if it’s done correctly you shouldn’t have duplicates in your data sets and “noise” should be minimised. Leonardo Valente Agustín Elissondo it’s worth sharing your white paper / experience with Mat on this.

Heidi Grimmer

Managing Director @ ABBC Pty Ltd | Innovation, Consumer Research, R&D

9 months

I'd rather have fewer but authentic interactions. Agree, augmented data is tricky.

Phil Sutcliffe

Managing Partner, Nexxt Intelligence : Deeper insight, faster

9 months

This is an excellent, really informative and important critique Mat Lintern, many thanks for sharing. (Btw, once you've cracked augmenting Villa CL tickets, if you can turn your skills to Liverpool tickets I'd be grateful!)

David Thomson

Founder & Chairman at MMR Research Worldwide & Annandale Distillery

9 months

Couldn't agree more. Double the sample size with AI generated data and suddenly 'no significant difference' becomes 'significant'. Magic! Remember the old adage... "If it seems too good to be true, then it probably is!"
