Generated Data vs Monte-Carlo Simulations: What are the Differences?
I sometimes get asked this question: could you use simulations instead of synthetizations? Below is my answer, focusing on some particular aspects of data synthetization that differentiate it from other techniques.
Simulations do not simulate joint distributions
Sure, if all your features behave like a mixture of multivariate normal distributions, you can use GMMs (Gaussian mixture models) for synthetization. This is akin to Monte-Carlo simulation. The parameters of the mixture — the number of clusters, the covariance matrix attached to each Gaussian distribution (one per cluster), and the mixture proportions — can be estimated using the EM algorithm. Estimation is subject to model identifiability issues, but it works.
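As a sketch of the GMM approach described above, the snippet below fits a two-component mixture to a toy two-cluster dataset (a stand-in for real features) using scikit-learn's EM-based `GaussianMixture`, then samples synthetic records from the fitted mixture. The dataset, seed, and number of components are illustrative choices, not from the article.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "real" dataset: two Gaussian clusters in 2D, standing in for actual features.
real = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.8, size=(500, 2)),
])

# Fit the mixture with EM. The number of components is a modeling choice,
# which is where the identifiability caveat mentioned above comes in.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(real)

# Synthesize new records by sampling from the fitted mixture.
synthetic, component_labels = gmm.sample(1000)
print(synthetic.shape)  # (1000, 2)
```

In practice you would select the number of components by a criterion such as BIC rather than fixing it by hand.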
If the interdependence structure among the features is essentially linear — in other words, well captured by the correlation matrix — you can decorrelate the features using a linear transform such as PCA to remove cross-correlations, then sample each feature separately using standard simulation techniques, and finally apply the inverse transform to add the correlations back. This is similar to what the copula method accomplishes. Each decorrelated feature can be modeled using a parametric metalog distribution, which is flexible enough to fit a wide variety of shapes, akin to Monte-Carlo simulations.
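The decorrelate-sample-recorrelate pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration on toy correlated data; for simplicity it bootstraps each decorrelated component from its empirical marginal instead of fitting a parametric metalog distribution, which is where a metalog fit would slot in.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correlated data: the second feature depends linearly on the first.
x = rng.normal(size=2000)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=2000)])

# Step 1: decorrelate via PCA (eigendecomposition of the covariance matrix).
mean = data.mean(axis=0)
centered = data - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
scores = centered @ eigvecs  # decorrelated components

# Step 2: sample each decorrelated component independently.
# Bootstrapping the empirical marginal here; a metalog fit could replace it.
n = 1000
samples = np.column_stack([
    rng.choice(scores[:, j], size=n, replace=True)
    for j in range(scores.shape[1])
])

# Step 3: invert the transform to restore the linear correlation structure.
synthetic = samples @ eigvecs.T + mean

c_real = np.corrcoef(data, rowvar=False)[0, 1]
c_syn = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(round(c_real, 2), round(c_syn, 2))  # the two correlations should be close
```

Note that this only reproduces the linear (second-order) dependence; nonlinear interdependencies are exactly what this method, like the copula method, can miss.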
Read the full article here, including my answers to the following questions:
Co-Founder, BondingAI.io
For the story, the picture here is the Pigne d'Arolla in Switzerland. I picked that easy mountain (the way around the big wall is easy), hoping I could get my girlfriend to the top, with a guide. I had an engagement ring in my pocket during that 2-day "hike". I delivered it at the summit! We've been married for 23 years now. Apparently, it's less easy these days because a lot of the snow has melted.
AI & Data Strategy Leader | AI-Powered Business Transformation | B2B Growth through AI & Automation | Helping Businesses Leverage AI for Scalable Revenue Growth
Great article! I have some thoughts, though. It seems to me that synthetic data, while it can capture nonlinear correlations across features and model observed data more closely, lacks context and is more likely to miss causal dependencies. A simulation, by contrast, may capture those dependencies, depending on how you simulate. If we're talking about Monte Carlo simulation, which it seems you are, then yes, even that does not adequately capture causal relationships. However, I believe agent-based simulations would be much more robust, though slower to generate data and computationally expensive. Fundamentally, it depends on how well you understand a process and whether you're modeling something that is sparse in terms of data. Rare events with limited data, or unobserved events, would benefit from an agent-based model, as such models can extrapolate better with causal dependencies. Data that is abundant but has privacy issues, as is often the case in healthcare, would benefit from data synthetization. Perhaps that's a broad generalization, though.