Why do we need synthetic tabular data?
Javier Marin
Since we at NextBrain AI released the first version of the open source library nbsynthetic, we have received some concerns about the practical side of using synthetic tabular data. The package is intended to generate larger synthetic datasets from small tabular datasets. Small tabular datasets are fairly common in all kinds of organizations and are often needed to support decision-making. And here's the issue: because of their limited sample size, many statistical approaches aren't appropriate for extracting insights from them. Furthermore, this data is sometimes labeled as "poor quality", meaning that it has, for example, some missing values.
Synthetic datasets are generated from original datasets using state-of-the-art algorithms. The question is, "How can we create a synthetic dataset and discover new patterns in it that we can't find in the original dataset?" This "new" data is generated from the original, and thus the information it provides is exactly what we were able to learn from the original. Following this reasoning, some argue that tabular synthetic data is useless.
We are all familiar with making a photo look better
We have lots of pictures of our lives in our pockets and in the cloud. This is why image processing is so widespread. From our smartphones to the infrared images of NASA's James Webb Space Telescope, processing technology is used to improve the quality of pictures. For example, image processing helps scientists increase the resolution of an image so they can identify details that are not clear in the original. This improved image can be considered a "synthetic dataset" generated from the "original dataset" (the original image).
When we improve an image (like the one at the top of this article), we are able to see details that we couldn't identify in the original picture. This does not mean we are "inventing" anything new: the information was already there, but we weren't able to see it. Something similar happens when we create a tabular synthetic dataset. Special algorithms are able to improve the "resolution" and "contrast" of tabular data, so we are able to reproduce it like a "restored image". Terms like "contrast" and "relative contrast" already exist in geometry and are successfully applied in data analysis. For example, content-based data retrieval systems have relied on these concepts.
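To make the term a little more concrete, here is a minimal sketch of one common definition of relative contrast from the nearest-neighbour literature: the gap between the farthest and nearest point from a query, divided by the nearest distance. The article does not say which definition is meant, so this choice is an assumption, and the function name `relative_contrast` is purely illustrative.

```python
import math

def relative_contrast(query, points):
    """(D_max - D_min) / D_min for Euclidean distances from query to points.

    Assumed definition; the article does not state which one it uses.
    """
    distances = [math.dist(query, p) for p in points]
    d_min, d_max = min(distances), max(distances)
    if d_min == 0:
        raise ValueError("nearest point coincides with the query")
    return (d_max - d_min) / d_min

# Three points on a line at distances 1, 2 and 4 from the origin.
example = relative_contrast((0.0, 0.0), [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)])
print(example)
```

A low value means the nearest and farthest points look almost equally distant from the query, which is the tabular counterpart of a low-contrast image.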
Synthetic data is not invented data
Synthetic tabular data is not invented data, just as a restored picture is not an invented picture. We can identify patterns in synthetic datasets that are not visible in the real dataset. Suppose we want to restore a low-resolution picture: if there isn't a dog sitting in the bottom-right corner of the original, there will be no dog in the restored image either. The only difference is that in the original picture we can see something in the corner, but we don't know what it is. If we restore it with a special tool, we may be able to identify this "thing" as a dog (which is not always possible). Something similar happens with tabular data: we can see in the original data that, for example, some relation exists between two variables, but we don't know more details. With synthetic data generation algorithms, it becomes possible to identify the nature of this relationship more clearly.
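The "no dog is invented" point can be illustrated with a small sketch. It is not the algorithm used by nbsynthetic; it swaps in a deliberately simple generator, a bivariate Gaussian fitted to a made-up 30-row dataset, and all names (`fit_gaussian`, `sample_synthetic`) and numbers are assumptions chosen for illustration. The synthetic sample is much larger, yet the relationship between the two variables is the one already present in the original.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation from the data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return mx, my, sx, sy, pearson(xs, ys)

def sample_synthetic(params, size, rng):
    """Draw synthetic (x, y) rows from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    syn_x, syn_y = [], []
    for _ in range(size):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        syn_x.append(mx + sx * z1)
        # Mix z1 and z2 so that y has correlation rho with x.
        syn_y.append(my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return syn_x, syn_y

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical small "original" dataset: 30 rows, y depends on x plus noise.
orig_x = [rng.uniform(0, 10) for _ in range(30)]
orig_y = [2.0 * x + rng.gauss(0, 2.0) for x in orig_x]

params = fit_gaussian(orig_x, orig_y)
syn_x, syn_y = sample_synthetic(params, 3000, rng)

print("original correlation: ", round(pearson(orig_x, orig_y), 3))
print("synthetic correlation:", round(pearson(syn_x, syn_y), 3))
```

The two correlations should come out close to each other: the generator can only reproduce structure it learned from the original rows. A generator this simple captures only linear relationships, which is one reason real tools use more flexible models.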
Conclusion
Advances in image processing are a useful analogy for how we can benefit from tabular synthetic data. When we work with numbers structured as a table, we need a high level of abstraction and a good knowledge of math and geometry. With images, everything is much easier. This is why the analogy is interesting (and necessary) if we want to explain the benefits of tabular synthetic data to a larger audience.