Is Synthetic Sampling Worth It?
Anil K Pandit
Executive Vice President-Publicis Media Services - Digital | Data | Tech | Privacy | Programmatic | MMA Council Member-AI & Data and Martech | Guest Lecturer | Speaker | IAB-Working Group Member
The concept of "synthetic sampling," or using generative AI to mimic human responses in market research, has garnered significant interest. Companies like Kantar and Emporia have tested this with promising yet imperfect results. AI-generated responses—though efficient and scalable—often exhibit a strong positive bias and lack the nuance, variability, and sensitivity to sub-group distinctions that real human responses provide.
Key Points on Synthetic Sampling in Market Research
Example: In Kantar's study, GPT-4 rated luxury product experiences significantly more positively than actual human respondents. While human feedback varied widely, the AI's responses clustered toward high satisfaction. Brands relying on such data could overestimate customer satisfaction and miss genuine areas for improvement.
Example: When Kantar segmented responses by income levels, GPT-4 struggled to reflect how lower-income respondents viewed product price differently from higher-income ones. This shows that while AI can approximate general responses, it fails to capture the diversity within specific groups. Such a limitation could skew a product’s market fit assessment in targeted demographics.
Example: Emporia found that AI-generated responses for job satisfaction among IT decision-makers had a “herd mentality.” The synthetic personas were overwhelmingly “strongly satisfied,” unlike the varied responses from real people. This lack of variation could be misleading in B2B research where individual motivations and career challenges are critical insights.
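The two failure modes described above, positive bias and "herd mentality," are both measurable once you have synthetic and human ratings side by side. A minimal sketch of such a check, using hypothetical 1–5 Likert-scale data (not Kantar's or Emporia's actual figures):

```python
import statistics

# Hypothetical 1-5 satisfaction ratings; illustrative only,
# not the actual study data referenced above.
human_ratings = [1, 2, 3, 3, 4, 5, 2, 4, 5, 3]
synthetic_ratings = [4, 5, 5, 4, 5, 5, 4, 5, 5, 4]

def positive_bias(synthetic, human):
    """Difference in mean rating: a positive value means the
    synthetic sample skews more favourable than the human one."""
    return statistics.mean(synthetic) - statistics.mean(human)

def herding_ratio(synthetic, human):
    """Ratio of standard deviations: values well below 1 mean the
    synthetic responses cluster together ('herd mentality')."""
    return statistics.stdev(synthetic) / statistics.stdev(human)

bias = positive_bias(synthetic_ratings, human_ratings)
herd = herding_ratio(synthetic_ratings, human_ratings)
print(f"positive bias: {bias:+.2f}")  # > 0: synthetic skews positive
print(f"herding ratio: {herd:.2f}")   # < 1: less spread than humans
```

Running both checks before trusting a synthetic panel is a cheap way to flag exactly the distortions Kantar and Emporia observed.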
Example: Off-the-shelf AI models are trained on generic data, making them unreliable for niche product testing or context-specific questions. Kantar’s analysis showed that generic AI responses missed key attitudes unique to specific product categories, leading to less accurate predictions of customer behavior.
Example: Using synthetic sampling for high-volume, low-complexity questions—like generic product feedback or brand sentiment—could speed up data collection. AI could provide initial responses on broad questions and leave detailed insights to human analysis, ensuring that core brand messages resonate across diverse groups.
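That hybrid split can be expressed as a simple triage rule: route high-volume, low-complexity topics to synthetic sampling and reserve nuanced questions for human panels. The topic labels below are assumptions for the sketch, not a standard taxonomy:

```python
# Illustrative triage for a hybrid research workflow. Which topics
# count as "low complexity" is an assumption a research team would
# calibrate for itself.
LOW_COMPLEXITY = {"brand_awareness", "ad_recall", "generic_feedback"}

def route_question(topic: str) -> str:
    """Return which sampling channel a research question should use."""
    return "synthetic" if topic in LOW_COMPLEXITY else "human_panel"

questions = ["brand_awareness", "purchase_motivation", "generic_feedback"]
plan = {q: route_question(q) for q in questions}
print(plan)
```

The point of the sketch is the design choice, not the code: synthetic responses are a first pass at scale, and anything touching individual motivation stays with real respondents.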
These findings underscore a pivotal truth: AI is not yet a reliable substitute for authentic human responses, especially in qualitative insights. However, it could become a valuable supplement if fine-tuned with proprietary data and context. As models evolve, blending human and synthetic sampling could enhance research, particularly for scaling generic data or expanding response types where variability isn't critical.
Ultimately, while synthetic sampling is a fascinating prospect, the present over-reliance on AI-driven data might risk undermining the very authenticity and granularity that make market research insights meaningful. As we progress, it’s clear that AI needs further refinement and thoughtful integration to serve as a robust research tool rather than a shortcut.
Anil Pandit
Executive Vice President
Publicis Media
*Disclaimer: This post is for informational purposes only and does not endorse or disapprove of any specific tools, platforms, or technologies. The views and opinions expressed in this article are those of the author and do not reflect the official policy or position of his employer.