Why do we need synthetic tabular data?
Image by Pexels

Since we at NextBrain AI released the first version of the open source library nbsynthetic, we have received some concerns about the practical side of using synthetic tabular data. The package is intended to generate larger synthetic datasets from small tabular datasets. Small tabular data is fairly common in all kinds of organizations and is often needed to support decision-making processes. And here's the issue: because of its limited sample size, statistical approaches aren't appropriate for extracting insights from this data. Furthermore, this data is sometimes labeled as "poor quality", meaning that it has, for example, some missing values.

Synthetic data generation algorithm.

Synthetic datasets are generated from original datasets using state-of-the-art algorithms. The question is, "How can we create a synthetic dataset from scratch and discover new patterns that we can't find in the original dataset?" This "new" data is generated from the original, and thus the information it provides is exactly what we were able to learn from the original. Following this reasoning, some argue that tabular synthetic data is useless.

We are all familiar with making a photo look better

We have lots of pictures of our lives in our pockets and in the cloud. This is why image processing is so widespread. From our smartphones to the infrared images captured by NASA’s James Webb Space Telescope, processing technology is used to improve the quality of pictures and make them look better. For example, image processing helps scientists improve the resolution of an image so they can identify details that are not clear in the original. Well, this improved image can be considered a "synthetic dataset" generated from the "original dataset" (the original image).

When we improve an image (like the one at the top of this article), we are able to see some details that we couldn't identify in the original picture. This does not mean we are "inventing" anything new: the information was already there, but we weren't able to see it. Something similar happens when we create a tabular synthetic dataset. Special algorithms are able to improve the "resolution" and "contrast" of tabular data, so we are able to reproduce it again like a "restored image". Terms like "contrast" and "relative contrast" already exist in geometry and are successfully applied in data analysis. For example, content-based data retrieval systems rely on these concepts.

Synthetic data is not invented data

Synthetic tabular data is not invented data, just as a restored picture is not an invented picture. We can identify patterns in synthetic datasets that are not visible in the real dataset. Suppose we restore a low-resolution picture: if there is no dog sitting in the bottom-right corner of the original, there will be no dog in the restored image either. The only difference is that in the original picture we can see something in the corner, but we don't know what it is. If we restore it with a special tool, we may be able to identify this "thing" as a dog (which is not always possible). Something similar happens with tabular data: we can see in the original data that, for example, there is some relation between two variables, but we don't know more details. With synthetic data generation algorithms, there is the possibility of clearly identifying the nature of this relationship.
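
To make the "not invented" point concrete, here is a minimal sketch in Python. It is not the algorithm used by nbsynthetic: as a simple stand-in, it assumes the two variables can be described by a bivariate Gaussian, fits that model to a small made-up table, and draws a larger sample from it. The data values and the names `synthesize` and `pearson` are hypothetical and only serve the illustration. The larger dataset carries the same relation between the two variables as the original, because everything in it comes from the original.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def std(values):
    # Sample standard deviation (n - 1 in the denominator).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson(rows):
    # Pearson correlation between the two columns of a list of (x, y) rows.
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return cov / (std(xs) * std(ys))


def synthesize(rows, n_samples, seed=0):
    # Assumption: a bivariate Gaussian is a good enough description of the data.
    # Fit means, standard deviations and correlation, then sample new rows.
    rng = random.Random(seed)
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my, sx, sy = mean(xs), mean(ys), std(xs), std(ys)
    r = pearson(rows)
    residual = math.sqrt(max(0.0, 1.0 - r * r))
    synthetic = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (r * z1 + residual * z2)  # correlated with x by construction
        synthetic.append((x, y))
    return synthetic


# Hypothetical small "original" table with two related variables.
original = [(1, 2.5), (2, 3.1), (3, 5.9), (4, 6.2),
            (5, 9.8), (6, 10.1), (7, 13.0), (8, 12.9)]

synthetic = synthesize(original, n_samples=2000)

print("rows in original: ", len(original))
print("rows in synthetic:", len(synthetic))
print("correlation in original: ", round(pearson(original), 3))
print("correlation in synthetic:", round(pearson(synthetic), 3))
```

A Gaussian fit is far cruder than a dedicated generation algorithm, but it shows the same principle as the restored picture: if there is no dog in the corner of the original, the sketch has no way to put one in the synthetic data.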

Conclusion

It's interesting to use the advances in image processing technologies as an analogy for how we can benefit from tabular synthetic data. When we work with numbers structured as a table, we need a high level of abstraction and a suitable level of knowledge of math and geometry. But with images, everything is much easier. This is why using this analogy is interesting (and necessary) if we want to explain the benefits of tabular synthetic data to a larger audience.
