Will synthetic data replace real survey respondents in thought leadership?
Rob Mitchell
CEO at FT Longitude; NED; Adviser on communications strategy to global brands
If you have been following the marketing press in recent months, you'd have found it hard to avoid the discussion about AI-based synthetic data. The concept is not entirely new – market researchers have used synthetic data for several years in sectors like healthcare, where the privacy of real-world patient data is critically important. But the hype around the concept has certainly accelerated, with some predictions even suggesting the complete disruption of the market research industry as we know it.
Although there are various versions and sub-branches, synthetic data essentially uses models and algorithms to mimic real survey data and generate human-style responses. At first glance, this has several obvious advantages. It is much quicker and cheaper than traditional market research and should, in theory, bypass any privacy concerns. Certainly, some early adopters have seen impressive results. In one example, professional services firm EY engaged a synthetic data company to re-run its huge brand survey of CEOs, CFOs and other business leaders. When it compared the real and synthetic datasets, EY found that the synthetically generated answers had a 95% correlation with the “real” survey data.
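To make that kind of comparison concrete, here is a minimal sketch of how a real-versus-synthetic check might be run. Everything in it is invented for illustration – the question labels, the scores and the approach – and it is not EY's published methodology, just one plausible shape for the check.

```python
# Minimal sketch: how closely does a synthetic panel track a real one?
# All question labels and scores are invented for illustration; this is
# not EY's actual methodology.
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical per-question agreement rates (% of respondents agreeing)
# from the real survey and its synthetic re-run.
real = pd.Series({
    "brand_trust": 62.0,
    "value_for_money": 48.5,
    "innovation": 55.0,
    "esg_commitment": 41.0,
    "digital_leadership": 58.5,
})
synthetic = pd.Series({
    "brand_trust": 60.5,
    "value_for_money": 50.0,
    "innovation": 57.0,
    "esg_commitment": 44.0,
    "digital_leadership": 56.0,
})

# Pearson correlation across questions: values near 1.0 mean the
# synthetic panel scores the questions much as real respondents did.
r, _ = pearsonr(real, synthetic)
print(f"Correlation between real and synthetic results: {r:.2f}")
```

A headline figure like "95% correlation" is, in effect, the output of a check along these lines, aggregated across many questions and audience segments.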
Not all experiments have yielded such promising results, however. Researchers from Kantar compared survey responses from 5,000 real respondents about attitudes to a luxury product with synthetically generated equivalents. On simpler questions, for example about pricing, the synthetic and real data were similar. But on more nuanced questions, such as the emotions people attach to the product, the deviation between the two was much more marked.
It's easy to understand why B2B brands are excited about the potential of synthetic data. Running surveys, especially of very senior, busy people like CEOs and CFOs, is time-consuming and expensive. The idea that this process can be replaced with a synthetic approach, providing results in hours rather than weeks and at a fraction of the cost, is clearly appealing.
For now, though, most companies are only dipping their toes in the water. The view among many marketing directors is that synthetic data can supplement, rather than replace, traditional market research techniques. For example, companies may use synthetic data to boost samples of under-represented groups, fill in gaps in surveys where respondents have not fully completed them, or generate personas that can be used to determine the scope of the research and provide early indicators of key trends. These personas can be jumping-off points for discussion, as opposed to something on which a company might base expensive business decisions.
But if we’ve learned anything from the explosion of AI in recent years, we know that this could change quickly. Market research firms that use traditional panels to complete surveys should rightly feel at risk of disruption – it’s probably a question of “when”, not “if”, their business model comes under threat. What marketing director, if offered the choice, would not want frequent, near-instant perspectives on their customers that could be used to inform product decisions and marketing investments with reliable accuracy? The idea that the brand survey, usually run annually and at great expense, could be replaced with monthly, or even weekly, versions of the same research using synthetic data opens up huge possibilities.
In the world of thought leadership, there is still a fair degree of scepticism about using synthetic data for surveys among most of the industry practitioners with whom I have spoken. Perhaps the biggest challenge is that thought leadership should, by definition, produce fresh, original insight. That is difficult to envisage with synthetic data, because the insight it generates is based on what is already out there. It is derivative and retrospective, rather than unearthing insight that is genuinely new. If companies are all fishing in the same pool of data, then surely they will all end up saying similar things, which is not what thought leadership producers want.
There are other problems with synthetic data. An excellent report from the Market Research Society, called Using Synthetic Participants for Market Research, highlights several issues. First, there is the question of how real people respond to questions, as opposed to synthetic personas. Although psychologists’ views differ, many argue that we do not answer questions in a computer-like way, but instead bring together different traces of information in a way that is constantly evolving. This can lead to valuable new insights and changes in perspective that synthetic data will find hard to capture. Training data also tends to have a strong bias towards English-language, and especially US, sentiment, which is not ideal if your thought leadership campaign needs to represent global perspectives. There are also legal and privacy concerns, because a lot of the data comes from individuals’ social media posts, which could be personally identifiable when used as part of a synthetic dataset.
The report also references research from Angelina Wang, a fellow at Stanford University, who has argued that large language models struggle to represent marginalised groups and often produce homogenised responses – a phenomenon she calls “group flattening”. Nuances get brushed aside and the data can feel one-dimensional and predictable. That is not particularly useful in a thought leadership context, where one of the key goals is to unearth fresh perspectives or point to emerging shifts in how executives perceive different issues.
Does this mean that synthetic data has no role in thought leadership at present? Not at all. While I would not, as things stand, recommend replacing traditional surveys with synthetic data as a research input, I can see real value in using it in several contexts. As part of the planning around thought leadership campaigns, we spend a lot of time thinking about the pain points and key issues facing the audiences our clients want to reach. Synthetic data could play a useful role in identifying those pain points, which can then be used to shape research questions and core messaging.
I can also envisage, in the not-too-distant future, companies wanting to boost samples using synthetic respondents. It is hugely expensive for companies to get representative samples across all the sectors and regions that are relevant to them. So it is possible to see how synthetic data could round out these datasets, particularly if there is real data against which to benchmark it and provide confidence in its accuracy.
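As a rough illustration of what that benchmarking might look like, here is a sketch in which a thin real sample is topped up with synthetic respondents only after a holdout check. The function generate_synthetic_responses, the segment name and the 0.3 threshold are all hypothetical placeholders, not a real provider's API.

```python
# Hypothetical sketch of validating synthetic sample boosting against
# held-back real responses. `generate_synthetic_responses` is a
# placeholder for whatever synthetic-data provider a company might use;
# it is not a real library call.
import random

def generate_synthetic_responses(segment: str, n: int) -> list[int]:
    """Placeholder provider: returns n Likert-style scores (1-5)
    for the requested respondent segment."""
    rng = random.Random(segment)  # deterministic stand-in
    return [rng.choice([2, 3, 3, 4, 4, 5]) for _ in range(n)]

# Suppose only 40 real CFO respondents exist in one region -- too few
# to report on -- and we want to know whether synthetic respondents
# are a credible top-up.
rng = random.Random(42)
real_segment = [rng.choice([2, 3, 4, 4, 5]) for _ in range(40)]

holdout = real_segment[:20]  # real answers kept aside for the check
synthetic = generate_synthetic_responses("regional CFOs", n=20)

mean_real = sum(holdout) / len(holdout)
mean_synth = sum(synthetic) / len(synthetic)

# Simple sanity check: only boost the sample if the synthetic mean
# lands close to the real holdout mean (the 0.3 threshold is arbitrary).
if abs(mean_real - mean_synth) < 0.3:
    boosted = real_segment + synthetic
    print(f"Boosted sample size: {len(boosted)}")
else:
    print("Synthetic responses diverge from the real holdout; do not boost.")
```

A real programme would compare full response distributions rather than a single mean, but even this crude gate captures the principle: synthetic respondents earn their place only by demonstrably tracking real ones.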
Boosting samples in that way would likely be the first step in using synthetic data more widely in thought leadership surveys. If those experiments go well, companies may go on to use synthetic data to generate more topical data between large thought leadership campaigns to fuel ongoing content programmes. Whichever way this plays out, it will take time for confidence to grow, and these early experiences will be critical in getting companies to the point where they feel these datasets have genuine value.
For now, though, I think the traditional market research companies are safe, at least from a thought leadership perspective. The clients we work with crave authenticity – and if conclusions are drawn from synthetic respondents, that authenticity is eroded. Like many other AI use cases, synthetic data is a useful complement, or supplement, rather than a replacement. It will certainly become more widely used in thought leadership, but it will take time for the scepticism, and the attachment to existing ways of doing things, to subside.
This is a timely topic! The potential of synthetic data in enhancing market research is fascinating. What trends are you seeing in how companies are integrating this technology into their survey processes?
Global Marketing Communications Director | Thought Leadership | Campaigns | Brand | Content | Reputation | McKinsey & Company | Freshfields | Linklaters
4 months ago
Rob Mitchell I think you are spot on with the comment “synthetic data is a useful complement, or supplement, rather than a replacement.” I do wonder, though, whether there may be quicker and earlier success in measuring client satisfaction, with sentiment at “large” clients being extrapolated from smaller groups that have participated in a survey? Similarly, could AI help on the same basis when it comes to understanding relationship depth?
Partner at New Narrative Ltd.
4 months ago
Very interesting Rob Mitchell - I wonder if there's been research into the effect of the proliferation of GenAI content, which will inevitably get rolled into synthetic respondent training data - causing the kind of distortions you used to get when a low-res photo was photocopied over and over. Like flattening, I suppose, but introducing weird artifacts and distortions rather than just tending towards simple averaging of responses. Likert scales often tend towards boring 3-on-a-scale-of-5 answers anyway, so I wonder what they'd look like with only synthetic data...