Hashing, Synthetic Data, Enterprise Data Leakage, and the Reality of Privacy Risks
Yaw Joseph Etse
Head of Privacy, Open Banking Engineering @ Capital One | Angel Fellow, Investor
The FTC’s timely post, "No, Hashing Still Doesn't Make Your Data Anonymous," is a great reminder that, especially with the rise of large language models (LLMs) and generative AI, the way models are trained and fine-tuned creates opportunities for massive data leakage.
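To make the hashing point concrete, here is a minimal sketch (my own illustration, not from the FTC post) of why a hashed identifier is pseudonymous rather than anonymous: anyone who can enumerate candidate identifiers can rebuild the mapping with a simple lookup table.

```python
import hashlib

# Illustrative sketch: hashing an identifier is pseudonymization, not
# anonymization. If an attacker can enumerate candidate identifiers
# (emails, phone numbers), they can re-identify hashed records.

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# "Anonymized" dataset: identifiers replaced by their SHA-256 hashes.
hashed_records = {sha256_hex("alice@example.com"): {"purchases": 12}}

# Attacker's candidate list (scraped, leaked, or simply guessed).
candidates = ["bob@example.com", "alice@example.com"]
lookup = {sha256_hex(c): c for c in candidates}

for h, record in hashed_records.items():
    if h in lookup:
        print(f"Re-identified {lookup[h]} -> {record}")
```

Salting raises the attacker’s cost, but it does not change the fundamental problem when the salt is shared or the space of possible inputs is small enough to enumerate.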
Synthetic data is often presented as a convenient solution to the data privacy challenges associated with LLM training and fine-tuning. However, synthetic data is not equivalent to anonymous or de-identified data.
The Appeal and Enthusiasm for Synthetic Data
Synthetic data fills gaps where real data is hard to obtain. It’s great for simulating rare events or generating large training datasets for machine-learning models. Advances in generative AI tooling have made synthetic data more versatile and powerful.
Synthetic data accelerates innovation by enabling rapid development and testing of new algorithms and technologies. It provides a sandbox for experimentation without the constraints of real-world data limitations. As TDS Editors mention, “Using synthetic data isn’t exactly a new practice: it’s been a productive approach for several years now.”
Generative AI and Privacy Risks
Despite the enthusiasm, it’s critical to recognize that synthetic data is not inherently anonymous. Generative AI and LLMs carry their own privacy risks: synthetic data can still reflect patterns, and even memorized records, from the real data it was generated from, creating re-identification risk.
From a data loss prevention and data privacy point of view, there are several additional risks unique to LLMs.
Enterprise data leakage becomes more relevant when leveraging LLMs in settings such as Retrieval-Augmented Generation (RAG) or fine-tuning with enterprise data to create domain-specific models. To prevent leaks, the privacy of enterprise (training) data must be safeguarded.
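As a concrete illustration, here is a minimal, hypothetical sketch (the pipeline and pattern list are my assumptions, not a specific product’s API) of one common safeguard: scrubbing obvious identifiers from enterprise documents before they are embedded and indexed for RAG, so the retriever can never surface raw PII to the model or the end user.

```python
import re

# Hypothetical pre-indexing redaction step for a RAG pipeline: remove
# obvious PII *before* documents are chunked, embedded, and stored.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text

def index_document(doc: str, vector_store: list) -> None:
    # A real pipeline would chunk, embed, and upsert; the point here is
    # simply that redaction happens before anything reaches the index.
    vector_store.append(redact(doc))

store: list = []
index_document("Contact alice@example.com, SSN 123-45-6789, re: Q3 risk.", store)
print(store[0])  # "Contact [EMAIL], SSN [SSN], re: Q3 risk."
```

Pattern-based redaction only catches what you anticipate; in practice it is typically layered with named-entity recognition and access controls on the vector store itself.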
Synthetic Data: Appealing Yet Not Automatically Anonymous
Synthetic data addresses many issues related to privacy and enterprise data leakage. However, as previously stated, synthetic data is not automatically anonymous. Evaluating the quality of synthetic data and avoiding the pitfalls highlighted in studies such as “On the Inadequacy of Similarity-based Privacy Metrics: Reconstruction Attacks against ‘Truly Anonymous Synthetic Data’” is crucial.
It’s much better to rely on a mathematical definition of privacy, such as differential privacy. A mechanism M is ε-differentially private if, for any two datasets D and D′ that differ in a single record and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. In other words, the inclusion or exclusion of any one person’s data cannot significantly affect the outcome, which is what protects individual privacy.
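Here is a minimal sketch of the classic way to satisfy this guarantee for a counting query, the Laplace mechanism (the toy dataset and the choice of ε are illustrative assumptions):

```python
import random

# Laplace mechanism: add noise with scale sensitivity/epsilon to a
# query whose sensitivity (max change from one record) is 1, e.g. a count.

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

ages = [34, 29, 41, 52, 38]  # toy dataset (illustrative)
print(dp_count(ages, lambda a: a > 40, epsilon=1.0))
```

Smaller ε means stronger privacy and noisier answers; the same accounting underlies DP-SGD for model training and differentially private synthetic data generators.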
Evaluating Synthetic Data Quality
To ensure synthetic data is effective and safe, evaluate its quality rigorously.
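Below is a minimal sketch of two cheap sanity checks (my own illustration; the data are assumptions): a fidelity check comparing per-column marginal distributions, and a smoke test for synthetic rows that are exact copies of real ones. Per the reconstruction-attack paper cited above, passing similarity checks like these is not a privacy guarantee; they only catch gross failures.

```python
import random
from collections import Counter

def marginal_gap(real, synth) -> float:
    """Total variation distance between two empirical distributions."""
    r, s = Counter(real), Counter(synth)
    keys = set(r) | set(s)
    return 0.5 * sum(abs(r[k] / len(real) - s[k] / len(synth)) for k in keys)

def exact_copies(real_rows, synth_rows) -> set:
    """Synthetic rows that literally reproduce a real record."""
    return set(map(tuple, synth_rows)) & set(map(tuple, real_rows))

# Fidelity: do per-column marginals roughly match?
real_col = [random.choice("ABC") for _ in range(1000)]
synth_col = [random.choice("ABC") for _ in range(1000)]
print("TV distance on marginal:", marginal_gap(real_col, synth_col))

# Privacy smoke test: any verbatim leaks of real records?
real_rows = [("A", 34), ("B", 29)]
synth_rows = [("B", 29), ("C", 61)]
print("Leaked copies:", exact_copies(real_rows, synth_rows))
```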
Practical Techniques
The FTC article about the ineffectiveness of hashing as a privacy protection is a good reminder that robust privacy-preserving techniques and thorough quality evaluations are key to ensuring synthetic datasets are safe and functional.