Generated Data vs Monte-Carlo Simulations: What are the Differences?
I sometimes get asked this question: could you use simulations instead of synthetizations? Below is my answer, focusing on some particular aspects of data synthetization that differentiate it from other techniques.
Simulations do not simulate joint distributions
Sure, if all your features behave like a mixture of multivariate normal distributions, you can use GMMs (Gaussian mixture models) for synthetization. This is akin to Monte-Carlo simulation. The parameters of the mixture — the number of clusters, the covariance matrix attached to each Gaussian distribution (one per cluster), and the mixture proportions — can be estimated using the EM algorithm. Estimation is subject to model identifiability issues, but it works.
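As a sketch of the GMM approach described above, the snippet below fits a two-component mixture to a toy two-cluster dataset (a stand-in for real features) using scikit-learn's EM-based `GaussianMixture`, then samples synthetic records from the fitted mixture. The dataset, seed, and number of components are illustrative choices, not from the article.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "real" dataset: two Gaussian clusters in 2D, standing in for actual features.
real = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.8, size=(500, 2)),
])

# Fit the mixture with EM. The number of components is a modeling choice,
# which is where the identifiability caveat mentioned above comes in.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(real)

# Synthesize new records by sampling from the fitted mixture.
synthetic, component_labels = gmm.sample(1000)
print(synthetic.shape)  # (1000, 2)
```

In practice you would select the number of components by a criterion such as BIC rather than fixing it by hand.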
If the interdependence structure among the features is essentially linear — in other words, well captured by the correlation matrix — you can decorrelate the features using a linear transform such as PCA to remove cross-correlations, then sample each feature separately using standard simulation techniques, and finally apply the inverse transform to add the correlations back. This is similar to what the copula method accomplishes. Each decorrelated feature can be modeled using a parametric metalog distribution, which is flexible enough to fit a wide variety of shapes, akin to Monte-Carlo simulations.
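The decorrelate-sample-recorrelate pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration on toy correlated data; for simplicity it bootstraps each decorrelated component from its empirical marginal instead of fitting a parametric metalog distribution, which is where a metalog fit would slot in.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correlated data: the second feature depends linearly on the first.
x = rng.normal(size=2000)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=2000)])

# Step 1: decorrelate via PCA (eigendecomposition of the covariance matrix).
mean = data.mean(axis=0)
centered = data - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
scores = centered @ eigvecs  # decorrelated components

# Step 2: sample each decorrelated component independently.
# Bootstrapping the empirical marginal here; a metalog fit could replace it.
n = 1000
samples = np.column_stack([
    rng.choice(scores[:, j], size=n, replace=True)
    for j in range(scores.shape[1])
])

# Step 3: invert the transform to restore the linear correlation structure.
synthetic = samples @ eigvecs.T + mean

c_real = np.corrcoef(data, rowvar=False)[0, 1]
c_syn = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(round(c_real, 2), round(c_syn, 2))  # the two correlations should be close
```

Note that this only reproduces the linear (second-order) dependence; nonlinear interdependencies are exactly what this method, like the copula method, can miss.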
Read the full article here, including my answers to the following questions:
Co-Founder, BondingAI.io
For the story, the picture here is the Pigne d'Arolla in Switzerland. I picked that easy mountain (the way around the big wall is easy), hoping I could get my girlfriend to the top, with a guide. I had an engagement ring in my pocket during that 2-day "hike". I delivered it at the summit! We've been married for 23 years now. Apparently, it's less easy these days because a lot of the snow has melted.
AI & Data Strategy Leader | AI-Powered Business Transformation | B2B Growth through AI & Automation | Helping Businesses Leverage AI for Scalable Revenue Growth
Great article! I have some thoughts, though. It seems to me that synthetic data, while it can capture nonlinear correlations across features and model observed data more closely, lacks context and is more likely to miss causal dependencies. A simulation, by contrast, may capture those dependencies, depending on how you simulate. If we're talking about Monte Carlo simulation, which it seems you are, then yes, even that does not adequately capture causal relationships. However, I believe agent-based simulations would be much more robust, though slower to generate data and computationally expensive. Fundamentally, it depends on how well you understand a process and whether you're modeling something that is sparse in terms of data. Rare events with limited data, or unobserved events, would benefit from an agent-based model, as such models can extrapolate better with causal dependencies. Data that is abundant but has privacy issues, as is often the case in healthcare, would benefit from data synthetization. Perhaps that's a broad generalization, though.