Faking it so it's real: synthetic data
Lonnie Miller
Consulting & Advisory Leader Driving Growth & Innovation Using Market Insights & Technology
Think back to a time when you had to confide in someone about another person’s situation (or maybe you were really talking about yourself without letting the other person know for sure). Either way, you couldn’t reveal all of the specifics without betraying that person’s trust. Perhaps struggling, you had to convey just enough detail to characterize the situation accurately, in hopes of getting good advice from the person in whom you were confiding. Can you relate to this? I’ve gone through this, and my efforts to mask the details didn’t always leave me confident that I was giving my friend enough to offer a nuanced piece of advice.
So let’s shift from an interpersonal situation to an analytical one, and you find yourself butting up against why synthetic data has business merit. From an economics and compliance view, this speaks to efficiencies and trust. Synthetic data generation holds merit for companies that have to predict future, representative outcomes based on limited information while avoiding introducing biases or spilling the privacy beans from past cases.
Synthetic data supports better data sharing by removing private or sensitive information while preserving the critical elements needed to inform predictive models. Synthetic data also supports creating more cases where past observations are sparse. In turn, synthetic data generation supports scalable access to more representative data.
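To make that a bit more concrete, here is a minimal sketch of the idea in Python. It is not how any particular vendor does it. The records, the field names and the `synthesize` helper are all made up for illustration, and I am assuming each numeric column can be approximated by a normal distribution fitted on its own. The sketch drops the identifying field entirely and draws new rows from the fitted means and standard deviations.

```python
import random
import statistics

def synthesize(records, numeric_fields, n, seed=0):
    """Draw n synthetic rows from per-column normal fits.

    Assumption: each column is treated independently, so any
    correlation between columns is NOT preserved. Identifier
    fields such as names are never copied into the output.
    """
    rng = random.Random(seed)
    fits = {}
    for field in numeric_fields:
        values = [row[field] for row in records]
        fits[field] = (statistics.mean(values), statistics.stdev(values))
    return [
        {field: rng.gauss(mu, sigma) for field, (mu, sigma) in fits.items()}
        for _ in range(n)
    ]

# Hypothetical "real" records containing a private field (name).
real = [
    {"name": "Customer A", "age": 34, "spend": 120.0},
    {"name": "Customer B", "age": 45, "spend": 310.0},
    {"name": "Customer C", "age": 29, "spend": 95.0},
    {"name": "Customer D", "age": 52, "spend": 280.0},
    {"name": "Customer E", "age": 41, "spend": 150.0},
]

synthetic = synthesize(real, ["age", "spend"], n=1000)
```

Five real customers become a thousand synthetic ones with similar averages and spread, and no names attached. Production tools go further than this, for example by preserving relationships between columns, but the trade of specifics for statistical shape is the same.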
Zero in on multiplying images from pictures that need to feed a computer vision model. I really like this video since it sets up some of the challenges while providing clear examples of creating derivative, yet representative, images for use in training computer vision models. At one point, you’ll see how one original image spawns four others which may be used in a training data set. It makes me think of a past chat with one of my friends in the automotive industry who told me: “We have to basically train an autonomous car to decide, on the fly, if that small beige object 75 yards ahead of the car is a small child or a plastic grocery bag blowing in the lane.”
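The "one image spawns four others" idea can be sketched in a few lines of Python. The specific transformations below (two flips and two rotations) are my own assumption, since I am not restating which ones the video uses, and the tiny grid of numbers is a stand-in for real pixel data.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Return four derivative images from one original (illustrative choice)."""
    return [hflip(img), vflip(img), rot90(img), rot90(rot90(img))]

# A toy 2x2 "image"; real images would be much larger pixel grids.
original = [
    [1, 2],
    [3, 4],
]
derived = augment(original)
```

Whether a given transformation is safe depends on the task: a flipped grocery bag is still a grocery bag, but a flipped road sign with text on it may no longer be a realistic example.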
From HR to Marketing to Manufacturing, there are viable uses for synthetic data. Drawing from my past, economic gains from this technology can look like:
… third parties safely evaluating another supplier’s data to see if the data are worth buying. By sharing data that are holistic, unbiased and representative of future cases, the supplier increases its representation to potential buyers for use by others.
… marketers stumbling into next best offer recommendations after doing internal analyses that reveal few but highly valuable purchase scenarios that are realistic among their target audience. In this case, synthetic data lends a hand in deriving different purchase permutations from what resides in a data warehouse. Early in my career, I had to design market research studies that tested reactions to new product offerings. We used conjoint analyses, and we struggled with thinking through all of the high-value combinations to test with a sample of respondents.
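For the conjoint example, the "thinking through all of the combinations" step is something a few lines of Python can do mechanically. This is only a sketch of enumerating every combination (a full factorial design). The attributes and levels are hypothetical, and a real study would usually trim this list down to a manageable subset before showing it to respondents.

```python
from itertools import product

# Hypothetical product attributes and their levels.
attributes = {
    "trim": ["base", "sport", "luxury"],
    "warranty_years": [3, 5],
    "financing": ["lease", "loan", "cash"],
}

def full_factorial(attrs):
    """List every combination of attribute levels as a product profile."""
    names = list(attrs)
    return [dict(zip(names, combo)) for combo in product(*attrs.values())]

profiles = full_factorial(attributes)
print(len(profiles))  # 3 * 2 * 3 = 18 candidate offerings
```

Even three small attributes produce 18 profiles, which is why deciding which combinations are high-value, rather than listing them, was the hard part.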
While analyst communities project continued growth in this space, I find that having a sensible understanding of what this new type of data does at ground level is always useful. I hope this helps you in a pragmatic, conversational way.
Nice article Lonnie. We knew that all data were not created equal - we had mode, variance and outliers that gave us a sense of the relative position. But with GPTs now churning out more info, I wonder when there will be an inflection between raw (original) and synthetic data? Does this make a case for knowing the source of the data (like the good old days of having a citation / reference)?