Why do we need synthetic tabular data?

Javier Marin

AI Innovation Leader & Business Catalyst | Turning Complex Tech into Market-Moving Solutions | 20+ Years Building Tomorrow's Digital Infrastructure

Published: September 27, 2022

Since we at NextBrain AI released the first version of the open source library nbsynthetic, we have received some questions about the practical side of using synthetic tabular data. The package is intended to produce synthetic data from small tabular datasets in order to generate larger ones. Small tabular datasets are common in all kinds of organizations and are often needed to support decision-making. And here's the issue: because of their limited sample size, statistical approaches aren't appropriate for extracting insights from them. Furthermore, this data is sometimes labeled "poor quality", meaning that it has, for example, some missing values.

Synthetic datasets are generated from original datasets using state-of-the-art algorithms. The question is, "How can we create a synthetic dataset from scratch and discover new patterns that we can't find in the original dataset?" This "new" data is generated from the original, and thus the information it provides is exactly what we were able to learn from the original. Following this reasoning, some argue that synthetic tabular data is useless.

We are all familiar with making a photo look better

We have lots of pictures of our lives in our pockets and in the cloud. This is why image processing is so widespread. From our smartphones to the infrared images captured by NASA's James Webb Space Telescope, processing technology is used to improve the quality of pictures. For example, image processing helps scientists increase the resolution of an image so they can identify details that are unclear in the original. This improved image can be considered a "synthetic dataset" generated from the "original dataset" (the original image).

When we improve an image (like the one at the top of this article), we are able to see details that we couldn't identify in the original picture. This does not mean we are "inventing" anything new: the information was there, but we weren't able to see it. Something similar happens when we create a synthetic tabular dataset. Special algorithms are able to improve the "resolution" and "contrast" of tabular data, so we are able to reproduce it like a "restored image". Terms like "contrast" and "relative contrast" already exist in geometry and are successfully applied in data analysis. For example, content-based data retrieval systems have relied on these concepts.

Synthetic data is not invented data

Synthetic tabular data is not invented data, just as a restored picture is not an invented picture. We can identify patterns in synthetic datasets that are not visible in the real dataset. Suppose we restore a low-resolution picture: if there is no dog sitting in the bottom-right corner of the original, there will be no dog in the restored image either. The only difference is that in the original picture we can see something in the corner, but we don't know what it is. If we restore it with a special tool, we may be able to identify this "thing" as a dog (although this is not always possible). Something similar happens with tabular data: we can see in the original data that, for example, some relationship exists between two variables, but we don't know the details. With synthetic data generation algorithms, there is the possibility of clearly identifying the nature of this relationship.
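To make the "no invented dog" point concrete, here is a minimal sketch in Python. It is not the nbsynthetic API or its algorithm: as a stand-in, it fits a simple multivariate Gaussian to a small, made-up two-column table and samples a larger table from it. The function name sample_synthetic and the toy data are assumptions for illustration only. The sketch only shows that the larger synthetic table carries the same relationship between the two variables that was already present in the original; it does not show a generator revealing extra detail.

```python
import numpy as np


def sample_synthetic(original: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw synthetic rows from a Gaussian fitted to the original table.

    Stand-in technique for illustration; not the nbsynthetic method.
    """
    rng = np.random.default_rng(seed)
    mean = original.mean(axis=0)          # per-column means
    cov = np.cov(original, rowvar=False)  # column covariance matrix
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Hypothetical small dataset: 30 rows, two related numeric columns.
rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.5, size=30)
original = np.column_stack([x, y])

# Generate a larger table from the small one.
synthetic = sample_synthetic(original, n_samples=1000)

# Compare the relationship between the two columns in both tables.
orig_corr = np.corrcoef(original, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"original rows: {len(original)}, correlation: {orig_corr:.2f}")
print(f"synthetic rows: {len(synthetic)}, correlation: {synth_corr:.2f}")
```

The two correlations should come out close to each other: the synthetic rows are new, but the relationship they express comes from the original data.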

Conclusion

Advances in image processing are a useful analogy for how we can benefit from synthetic tabular data. When we work with numbers structured as a table, we need a high level of abstraction and a solid knowledge of math and geometry. With images, everything is much easier. This is why the analogy is interesting (and necessary) if we want to explain the benefits of synthetic tabular data to a larger audience.
