Human vs. Synthetic Data: Unlocking the Potential of AI in Market Research.
Photo by Pavel Danilyukoto


AI technologies are sending shockwaves through the market research industry, with the potential to transform traditional methodologies and open up new possibilities. The obvious application of AI in market research lies in streamlining research processes and parsing vast amounts of human-generated data, but another frontier is emerging: the generation of “artificial” or “synthetic” data, that is, AI-generated data that mimics human behavior and preferences.


The Rise of Synthetic Data

Synthetic data can be generated through a variety of techniques. One commonly used method is “agent-based simulation”, where data is produced by simulating the behavior and interactions of individual agents within a given system or environment. This technique is particularly useful where complex dynamics and interactions need to be captured, such as in the social sciences or in economic systems, where the behavior of individual agents influences the system as a whole. By defining rules and parameters, these simulations can generate realistic data that reflects the behaviors observed in real-world scenarios.

For example, let's consider a scenario where a city is planning to introduce a new public transportation system and wants to evaluate its potential effects on traffic congestion and travel times. To model this situation using an agent-based approach, researchers would create a simulation where individual agents (that we could call “personas”) represent commuters who make travel choices based on their preferences and priorities: some prefer faster travel times, while others prioritize cost or convenience. These agents would choose their mode of transportation, departure time, and route, and would interact with road networks, public transit stations, and traffic signals.

By running these simulations, researchers can observe how the new system changes the travel behavior of the agents. Metrics such as traffic congestion levels, average travel times, and the share of sustainable travel choices then let them assess the impact on the overall transportation network.
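The commuter example can be sketched in a few lines of Python. This is a minimal illustration under made-up assumptions, not a calibrated transport model: the `Commuter` class, the travel times, fares, and preference ranges are all hypothetical. Each agent weighs travel time against cost, and car travel time rises with the share of agents driving, which gives the feedback loop that agent-based models are meant to capture.

```python
import random
from dataclasses import dataclass

# All numbers below are illustrative assumptions, not real transport data.
TRANSIT_TIME = 35.0   # minutes, assumed constant
TRANSIT_FARE = 2.0    # cost units per trip
CAR_COST = 6.0        # cost units per trip (fuel, parking)

@dataclass
class Commuter:
    time_weight: float   # how much the agent dislikes one minute of travel
    cost_weight: float   # how much the agent dislikes one unit of cost
    mode: str = "car"

def car_travel_time(n_cars, n_agents, free_flow=20.0, congestion_penalty=40.0):
    """Car travel time in minutes grows with the share of agents who drive."""
    return free_flow + congestion_penalty * (n_cars / n_agents)

def simulate(transit_available, n_agents=1000, days=30, seed=0):
    rng = random.Random(seed)
    agents = [Commuter(rng.uniform(0.5, 1.5), rng.uniform(0.5, 3.0))
              for _ in range(n_agents)]
    for _ in range(days):
        n_cars = sum(a.mode == "car" for a in agents)
        car_time = car_travel_time(n_cars, n_agents)
        for agent in agents:
            # Only about 20% of agents reconsider their mode on a given day.
            if not transit_available or rng.random() > 0.2:
                continue
            car_disutility = agent.time_weight * car_time + agent.cost_weight * CAR_COST
            transit_disutility = (agent.time_weight * TRANSIT_TIME
                                  + agent.cost_weight * TRANSIT_FARE)
            agent.mode = "car" if car_disutility < transit_disutility else "transit"
    n_cars = sum(a.mode == "car" for a in agents)
    return {"car_share": n_cars / n_agents,
            "car_time": car_travel_time(n_cars, n_agents)}

baseline = simulate(transit_available=False)
with_transit = simulate(transit_available=True)
print("Without transit:", baseline)
print("With transit:   ", with_transit)
```

Comparing the two runs gives the kind of before-and-after metrics described above (mode share and car travel time). A real study would replace the invented parameters with observed commuter data.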

Quality assurance, software testing, and machine learning training have become the main users of synthetic data, with companies like Gretel, Mostly AI, Tonic.ai, and Genrocket providing services. But what about the application of synthetic data to market research?


Synthetic Data in Market Research

Just like in the city-planning example, by harnessing AI techniques like agent-based models, researchers can create “synthetic agents or personas” that simulate the behaviors, opinions, and preferences of human individuals. This opens up a world of possibilities for market research, offering a powerful tool to explore scenarios, test hypotheses, and derive insights without relying solely on real-world data collection.

As a result, synthetic data is gaining some traction, and early initiatives have already been launched in this space. Persona Panels and Synthetic Users, for example, analyze vast amounts of data to create personas that characterize the behaviors, attitudes, and attributes of a human audience. This opens up new avenues for generating insights with speed and precision.

While the value and reliability of this kind of data are yet to be seen, and researchers may be reluctant to trust predictions based on it, its potential to overcome the limitations of human data collection (speed, cost) will drive its adoption.

Specifically, agent-based modeling is particularly useful when interrelatedness, reciprocity, and feedback loops are known or suspected to exist. The main advantages are:

  • Cost-effectiveness: Generating synthetic data can be less expensive than collecting and labeling real-world data, especially in cases where data collection is challenging or time-consuming. Synthetic data can be generated in large quantities, allowing for scalability in training machine learning models without relying on limited real-world data.
  • Privacy and security: Synthetic data can be generated to preserve real-world data's privacy and confidentiality. This makes it valuable in situations where sensitive information needs to be protected.
  • Control over data characteristics: Synthetic data allows for control over various data characteristics such as distribution, diversity, and complexity. This can be beneficial for testing and validating models under different scenarios.

However, we could argue that synthetic data cannot fully replace human data. While synthetic data has its advantages, such as privacy protection and cost-effectiveness, it still has limitations in replicating the complexity and authenticity of human data. Some disadvantages:

  • Input quality: The reliability and quality of synthetic data heavily depend on the quality of the input data and the model used to generate it. Biases present in the input data may be reflected in the synthetic data, necessitating thorough validation and verification processes.
  • Model validity: Because the goal is to mimic real-world data, manual quality checks become critical when dealing with complex, algorithmically generated datasets. Ensuring the correctness and accuracy of synthetic data before using it in machine learning models is crucial. This can be difficult, because finding reference parameters against which to assess the validity of the model is itself a demanding task.
  • Outlier replication: Synthetic data may not accurately replicate outliers present in real-world data, potentially leading to suboptimal performance in models that rely on such outliers for accurate predictions.


Running a Conjoint Analysis Using GPT-3

Another example of synthetic data for market research purposes is the use of large language models (LLMs) such as GPT-3, which are artificial intelligence systems designed to comprehend and generate human-like language. These models are trained on vast amounts of text data, enabling them to grasp the patterns and structures of natural language effectively. The paper titled "Using GPT for Market Research" by James Brand, Ayelet Israeli, and Donald Ngwe delves into the potential of one such LLM.

One significant finding of this paper is that GPT-3 can generate responses to a conjoint analysis survey that align with economic theory and established consumer behavior patterns. Conjoint analysis is a research technique used to understand how people value the different attributes (features) that make up a product or service. It involves presenting people with a series of hypothetical product or service profiles, each with a different combination of attribute levels, and asking them to choose which one they prefer. By analyzing the choices people make, researchers can estimate the relative importance of each attribute and how much people are willing to pay for each level of each attribute.
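To make the mechanics concrete, here is a minimal sketch of how conjoint choice tasks can be assembled. The attributes and levels in `ATTRIBUTES` are hypothetical, and the question wording is an assumption rather than the design used in the paper. The sketch only builds the survey text, which could then be shown to a human respondent or sent to a language model; it does not call any model.

```python
import itertools
import random

# Hypothetical attributes and levels, chosen only for illustration.
ATTRIBUTES = {
    "brand": ["Brand A", "Brand B"],
    "fluoride": ["with fluoride", "without fluoride"],
    "price": ["$2", "$3", "$4"],
}

def all_profiles(attributes):
    """Full factorial design: every combination of attribute levels."""
    names = list(attributes)
    return [dict(zip(names, levels))
            for levels in itertools.product(*attributes.values())]

def make_choice_tasks(profiles, n_tasks, rng):
    """Each task pairs two different profiles drawn at random."""
    return [rng.sample(profiles, 2) for _ in range(n_tasks)]

def format_task(task):
    """Render one choice task as plain survey text."""
    lines = ["Which toothpaste would you buy?"]
    for label, profile in zip("AB", task):
        lines.append(f"Option {label}: " + ", ".join(profile.values()))
    lines.append("Answer with A or B.")
    return "\n".join(lines)

profiles = all_profiles(ATTRIBUTES)
tasks = make_choice_tasks(profiles, n_tasks=3, rng=random.Random(1))
print(format_task(tasks[0]))
```

Real conjoint studies usually use more careful experimental designs than random pairing, but the structure (profiles, choice sets, recorded choices) is the same.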

In the paper, the authors used conjoint analysis to evaluate the realism of model-based estimates of willingness-to-pay (WTP) generated by GPT-3. They focused on choices of toothpaste and used the queried responses to estimate a multinomial logit model similar to the kind that market researchers use to estimate preferences in standard conjoint analysis.
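As a rough illustration of that estimation step, the sketch below fits a multinomial logit model to simulated choices and derives WTP as the ratio of an attribute coefficient to the negative of the price coefficient. Everything here is an assumption made for the example: the two attributes (price and a fluoride indicator), the "true" coefficients, and the plain gradient-ascent fit. It is not the paper's data or code, and the choices are simulated rather than obtained from GPT-3 or human respondents.

```python
import math
import random

def choice_probs(beta, alternatives):
    """Multinomial logit probabilities for one choice task."""
    utils = [sum(b * x for b, x in zip(beta, alt)) for alt in alternatives]
    top = max(utils)                      # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]

def simulate_tasks(true_beta, n_tasks, rng):
    """Simulated respondents choose between two profiles: (price, has_fluoride)."""
    tasks = []
    for _ in range(n_tasks):
        alts = [(rng.uniform(1.0, 5.0), float(rng.randint(0, 1))) for _ in range(2)]
        probs = choice_probs(true_beta, alts)
        chosen = rng.choices(range(len(alts)), weights=probs)[0]
        tasks.append((alts, chosen))
    return tasks

def fit_mnl(tasks, n_features, lr=1.0, iters=200):
    """Maximum likelihood via gradient ascent on the average log-likelihood."""
    beta = [0.0] * n_features
    for _ in range(iters):
        grad = [0.0] * n_features
        for alts, chosen in tasks:
            probs = choice_probs(beta, alts)
            for j, alt in enumerate(alts):
                resid = (1.0 if j == chosen else 0.0) - probs[j]
                for k in range(n_features):
                    grad[k] += resid * alt[k]
        beta = [b + lr * g / len(tasks) for b, g in zip(beta, grad)]
    return beta

rng = random.Random(42)
# Assumed "true" preferences: price lowers utility, fluoride raises it (true WTP = 2.0).
tasks = simulate_tasks(true_beta=(-1.0, 2.0), n_tasks=2000, rng=rng)
beta_price, beta_fluoride = fit_mnl(tasks, n_features=2)
wtp_fluoride = -beta_fluoride / beta_price
print(f"beta_price={beta_price:.2f}, beta_fluoride={beta_fluoride:.2f}, "
      f"WTP for fluoride={wtp_fluoride:.2f}")
```

In practice a statistics package would be used for the fit and would also report standard errors. The point of the sketch is only to show how recorded choices translate into a willingness-to-pay estimate.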

The results suggested that GPT-3 can provide consistent and reliable insights that match those of human consumers.

The paper also presents various future directions for research. One suggestion is to prompt GPT-3 to generate artificial data that may better simulate realistic scenarios. This approach could capture emergent properties and potentially yield more accurate results than conventional simulated data.


A theoretical exercise

Let's imagine the agent-based approach for pricing optimization in the context of introducing a new fast-moving consumer goods (FMCG) product. Here's how it could be done:

  1. Define agents: In this case, the agents represent consumers or buyers in the market. Each agent has different preferences and priorities when it comes to purchasing FMCG products. These preferences can include factors such as price, brand loyalty, product features, and availability.
  2. Create a simulation environment: Set up a simulation environment that represents the market for FMCG products. This environment should include competing products, distribution channels, pricing strategies, and consumer demand patterns.
  3. Define agent behavior: Program the agents to make purchasing decisions based on their preferences and priorities. Agents should consider factors like features, brand loyalty, perceived quality, advertising, and availability.
  4. Implement pricing optimization: Introduce the pricing model into the simulation. The pricing model should consider factors such as production costs, competitor prices, market demand, and price elasticity of demand. Agents will evaluate the price of the new product alongside other factors and make purchasing decisions accordingly. The agents may also take into account external factors such as market trends or promotional offers.
  5. Run simulations: Run multiple simulations with different pricing scenarios for the new FMCG product. Explore various price points, promotional offers, and pricing strategies to observe how these factors influence the agents' purchasing behavior.
  6. Collect and analyze results: Collect data on metrics such as sales volume, market share, consumer preferences, and revenue for each simulation. Analyze the simulation results to understand how different pricing scenarios impact the market dynamics, consumer behavior, and the performance of the new product.
  7. Optimize pricing: Based on the simulation results, identify the pricing scenarios that lead to desirable outcomes, such as increased market share or revenue.
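The steps above can be sketched as a small simulation. This is a deliberately simplified illustration with invented numbers: the `Shopper` attributes, the quality scores, the incumbent price, and the unit cost are all hypothetical, and agents here choose independently, without the interactions (word of mouth, competitor reactions) a fuller agent-based model would include.

```python
import random
from dataclasses import dataclass

@dataclass
class Shopper:
    price_sensitivity: float      # utility lost per unit of price
    quality_weight: float         # utility gained per unit of perceived quality
    loyalty_to_incumbent: float   # extra utility for the familiar brand

def make_shoppers(n, rng):
    """Step 1: define agents with heterogeneous preferences (assumed ranges)."""
    return [Shopper(rng.uniform(0.5, 1.5), rng.uniform(0.5, 2.0), rng.uniform(0.0, 1.0))
            for _ in range(n)]

def buys_new_product(s, price, rng, new_quality=1.2,
                     incumbent_price=3.0, incumbent_quality=1.0):
    """Steps 2-4: the agent compares the new product with one incumbent."""
    u_new = s.quality_weight * new_quality - s.price_sensitivity * price + rng.gauss(0, 0.3)
    u_old = (s.quality_weight * incumbent_quality - s.price_sensitivity * incumbent_price
             + s.loyalty_to_incumbent + rng.gauss(0, 0.3))
    return u_new > u_old

def run_scenario(price, shoppers, seed, unit_cost=1.5):
    """Steps 5-6: run one pricing scenario and collect its metrics."""
    rng = random.Random(seed)  # same seed per scenario: identical noise, fair comparison
    units = sum(buys_new_product(s, price, rng) for s in shoppers)
    return {"price": price, "share": units / len(shoppers),
            "revenue": units * price, "profit": units * (price - unit_cost)}

def sweep(prices, n_shoppers=2000, seed=0):
    shoppers = make_shoppers(n_shoppers, random.Random(seed))
    return [run_scenario(p, shoppers, seed=seed + 1) for p in prices]

results = sweep([2.0, 2.5, 3.0, 3.5, 4.0])
for r in results:
    print(r)

# Step 7: pick the scenario with the best outcome, here simulated profit.
best = max(results, key=lambda r: r["profit"])
print("Best price by simulated profit:", best["price"])
```

Reusing the same random seed in every scenario means each shopper receives identical noise at every price point, so differences between scenarios come from the price alone. Which metric to optimize (profit, revenue, or share) is a business decision rather than something the simulation determines.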

In conclusion

While synthetic data cannot replace real-world or human data in the short term, it offers new possibilities when combined with traditional research methods. As the era of synthetic data emerges, caution, transparency, and disclosure are crucial.

Accurate reflection of real-world behaviors and human preferences is vital in designing and validating synthetic data models. It is up to us to navigate its challenges and embrace its potential for shaping the future of market research.

Dmitry Gaiduk

CPO at RIWI | Entrepreneur | Maximizing Impact by Decoding Customer Behaviour | AI & Neuroscience | Research Technologies

1 year ago

I agree, synthetic and augmented samples will impact industry a lot. Good article, Enric!

Dharmendra Jain

Founder & CEO | AI/Gen-AI | Market Research Tech | Insight250 Winner | ESOMAR Council Member | ESOMAR AI Taskforce | Thought Leadership | Speaker

1 year ago

Fascinating!

At Livepanel we work with Synthetic data from real users
