Randomness to the Rescue

What’s the difference between you and Tom Cruise when you both crave street food? While you can simply step out and feast, Tom has to resort to weird disguises (a wig, sunglasses, a fake moustache) just to blend in and protect his identity in a public place.

In the digital world, this act of adding randomness to mask identities and behaviours is called Differential Privacy. By introducing just enough randomness, or noise, into the data, we can protect user privacy while still extracting meaningful insights.

How can noise ever be helpful?

If you are a product manager, an analyst, a data scientist, or an engineer, you might be wondering how adding noise can ever be a good idea, especially when most decision-making is driven by accurate numbers. Differential Privacy is the surprising answer. It works by adding just the right amount of randomness to data, masking individual details while still providing accurate aggregates. How, you ask?

Let's take the example of querying Google's HR database to find the average salary of an engineer. If this data is leaked, even without identifiers, it could uncover the identity of specific highly paid individuals, especially if the dataset is small or individual salaries are unique. For example, if one engineer earns significantly more than the others and their salary is included in the dataset, it becomes possible to infer their identity.

Here’s where Differential Privacy comes in. When you make the query, a differentially private algorithm first measures the sensitivity of the query to changes in the data. Sensitivity captures how much the result of a query would change if a single data point were added or removed. Based on this sensitivity, the algorithm then adds calibrated noise to the salary data.

This means that while you can still calculate the average salary to a (good enough) approximation, the exact salaries of individual engineers are obscured. This careful balancing act—protecting individual privacy while preserving the utility of the data—is the essence of differential privacy. It allows us to uncover valuable patterns and trends without compromising personal information.
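To make this concrete, here is a minimal sketch of such a noisy average-salary query in Python. The salary figures, bounds, and the simplified sensitivity formula are illustrative assumptions (real libraries handle contribution bounds and sensitivity more carefully), not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset of engineer salaries (made-up numbers), including one outlier
# that could otherwise give away a specific individual.
salaries = np.array([95_000, 102_000, 98_000, 110_000, 400_000])

def dp_average(values, lower, upper, epsilon):
    """Differentially private estimate of the mean via the Laplace mechanism."""
    clipped = np.clip(values, lower, upper)      # cap each person's contribution
    # Simplified sensitivity of the mean: one bounded contribution can move
    # the average by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Repeated runs give similar, but not identical, answers.
print(dp_average(salaries, 50_000, 250_000, epsilon=1.0))
print(dp_average(salaries, 50_000, 250_000, epsilon=1.0))
```

Smaller values of ε mean more noise and stronger privacy; larger values mean a more accurate average but a weaker guarantee.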

So, adding noise will make my algo differentially private?

Nope. Differential privacy (DP) focuses on ensuring that the presence or absence of a single individual’s data does not significantly change the output of the process, so that no attacker can deterministically figure out whether a particular record was present in the data.

Before we dive into the formal definition, some basics on what to expect from DP:

  1. Consistent but noisy: First things first. Since we add noise whenever we run a query, we should expect similar (but not exactly identical) results even if we run the same query on the same data again. E.g., the average salary could come out as $100,208 on the first attempt and $100,310 on the second.
  2. Testing on similar datasets: To test whether an algorithm is differentially private, we evaluate it on two datasets (D1 and D2) that are identical except for one individual's records (see the small example after this list). E.g., dataset D1 includes your salary, and D2 doesn’t.
  3. Probabilistic outputs: Since the answer to a query is not exact, it is best expressed as the probability (Pr) that an answer lies within a certain range of values (S). E.g., the probability that the average salary on D1 is between $100,000 and $110,000 is 0.1.
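Point 2 is worth pausing on, since neighbouring datasets are the lens through which DP is actually evaluated. Such a pair is trivial to construct (illustrative values only):

```python
# Two datasets that differ in exactly one individual's record.
d1 = [120_000, 180_000, 95_000, 210_000, 160_000]  # includes your salary (160,000)
d2 = [120_000, 180_000, 95_000, 210_000]           # identical, except your record is gone

# A differentially private algorithm must behave almost the same on d1 and d2,
# in the probabilistic sense made precise by the definition below.
```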

Now that we understand we are playing in probabilities, our privacy guarantees will also be expressed with the help of these probabilities.

The formal definition: an algorithm A is differentially private if, for any two neighbouring datasets D1 and D2 (identical except for one individual's records) and any set of outputs S, it satisfies:

Pr[A(D1) ∈ S] ≤ e^ε × Pr[A(D2) ∈ S] + δ

I agree, it's not very intuitive when you first read the above equation. But it's not far from our understanding of what DP achieves. Let's break it down.

  • Privacy Loss: The formal definition captures the idea of one individual's record not significantly changing the output by bounding how far apart the output probabilities of the algorithm on datasets D1 and D2 can be. This is done with the help of a privacy loss parameter ε. E.g., if the probability of some output range on D1 is 0.1 and ε is 0.2, the corresponding probability on D2 can be at most 0.1 × e^0.2 ≈ 0.122 (ignoring δ). The additional parameter δ allows for a small probability that this guarantee does not hold, quantifying the residual risk of privacy loss. A small simulation below shows the bound in action.
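To see the bound in action, we can estimate the two probabilities empirically: run a noisy query many times on neighbouring datasets and compare how often the output lands in some range S. A self-contained sketch (a sanity check by simulation, not a proof; the data and range are made up, and δ is taken as 0):

```python
import numpy as np

rng = np.random.default_rng(7)
epsilon = 0.2

# Counting query: "how many engineers earn above $150,000?"
# Sensitivity is 1, since adding or removing one person changes the count by at most 1.
d1 = [120_000, 180_000, 95_000, 210_000, 160_000]  # neighbouring datasets, as above
d2 = [120_000, 180_000, 95_000, 210_000]

def noisy_count(data):
    true_count = sum(s > 150_000 for s in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)  # Laplace noise: scale = sensitivity / epsilon

def prob_in_range(data, low, high, trials=200_000):
    return np.mean([low <= noisy_count(data) <= high for _ in range(trials)])

p1 = prob_in_range(d1, 2.5, 3.5)  # S: outputs that round to 3
p2 = prob_in_range(d2, 2.5, 3.5)
print(p1, np.exp(epsilon) * p2)   # expect p1 <= e^epsilon * p2, up to sampling error
```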

What makes DP so special?

Differential Privacy stands out among privacy-preserving techniques due to its unique and powerful qualities:

  • Privacy Guarantees: DP ensures that the presence or absence of a single individual’s data in a dataset does not significantly affect the outcome of any analysis. This means that whether an attacker wants to re-identify someone, determine if they are in the dataset, or deduce sensitive information at the row level, Differential Privacy provides a strong and comprehensive privacy guarantee.
  • Resilience: Unlike older methods, Differential Privacy does not rely on assumptions about the attacker’s prior knowledge or capabilities. It provides robust privacy protection regardless of what an attacker might already know about the dataset or its individuals, making it highly resilient to various attack vectors.
  • Quantifiable: One of the key strengths of Differential Privacy is its ability to quantify privacy loss using the parameter ε. This allows for formal and mathematically sound statements about the privacy guarantees. Organizations can adjust ε to balance privacy and data utility, ensuring controlled and measurable privacy risks.
  • Composability: Differential Privacy supports the composability of multiple processes. This means that when differentially private processes are combined, the overall system retains its privacy guarantees. This is particularly useful for complex data usage scenarios, such as publishing statistics, training machine learning models, and releasing anonymized datasets, as it allows for tracking and controlling privacy loss across multiple uses (a one-line illustration follows this list).
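For intuition on composability, the simplest version (basic sequential composition, a standard DP result) fits in one line: if one release is ε1-differentially private and another release on the same data is ε2-differentially private, the combination is at most (ε1 + ε2)-differentially private. E.g., publishing two statistics computed with ε = 0.5 each costs a total privacy budget of at most ε = 1.0. Tighter "advanced composition" bounds exist, but this is the idea behind privacy budget tracking.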

The Origin Story

The origins of DP trace back to groundbreaking work in 2006 by Cynthia Dwork, Frank McSherry, Kobi Nissim, and Adam Smith. Their research introduced the mathematical framework for Differential Privacy, setting the stage for a new era in data privacy. They defined the concept in formal terms, introducing the parameters ε and δ to quantify privacy loss and the rare cases where the guarantee may not hold. This formalism provided a robust theoretical basis that could be widely applied to various data analysis processes.

Along with the theoretical framework, the early researchers developed the first Differential Privacy algorithms. These initial algorithms demonstrated how noise could be systematically added to query results to protect individual privacy while maintaining data utility. The Laplace mechanism, one of the first and most well-known Differential Privacy algorithms, adds noise drawn from the Laplace distribution based on the sensitivity of the query.
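In symbols (a standard statement of the mechanism, consistent with the notation above): for a numeric query f with sensitivity Δf, the Laplace mechanism releases A(D) = f(D) + Lap(Δf/ε), where Lap(b) denotes noise drawn from the Laplace distribution with density p(x) = (1/2b) × e^(−|x|/b). The smaller the ε, the larger the noise scale and the stronger the privacy guarantee.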

Following the introduction of Differential Privacy, several repositories and tools were developed to make it practical to implement. Notable examples include PyDP and PySyft by OpenMined, the OpenDP community's libraries, and Google's open-source Differential Privacy library, all of which have been crucial in making these techniques accessible to a broader audience.
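As a rough illustration of how little code these libraries require, here is a hypothetical sketch using OpenMined's PyDP. The class and method names shown (BoundedMean, quick_result) follow the project's published examples, but signatures vary between versions, so treat this as pseudocode and check the library's documentation before relying on it.

```python
# Hypothetical sketch based on PyDP's published examples; verify the exact API
# (class names, argument names) against the version you install.
from pydp.algorithms.laplacian import BoundedMean

salaries = [95_000.0, 102_000.0, 98_000.0, 110_000.0, 400_000.0]  # illustrative values

# Epsilon plus contribution bounds, analogous to the hand-rolled sketch earlier.
dp_mean = BoundedMean(epsilon=1.0, lower_bound=50_000, upper_bound=250_000, dtype="float")
print(dp_mean.quick_result(salaries))  # differentially private average salary
```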

Have we seen this before?

DP is one of the most application-ready privacy technologies. A few use cases where DP adds a lot of value:

  1. Machine Learning - Exploratory data analysis and model training typically focus on aggregate trends and benefit from data sources being as diverse as possible. With every organisation becoming AI-enabled, DP could open up sensitive datasets that were out of reach until now.
  2. Publishing Statistics - The US Census Bureau employs DP to protect the confidentiality of responses. By adding noise to the data, it can publish aggregate statistics that are useful for policy-making and research without compromising the privacy of individual respondents.
  3. Consumer Insights - Often, high-level insights about your users are crucial to the efficient operations of your partners. With DP, you need not worry about identity leakage while making sure your marketing, supply chain, or inventory management partners operate at their best.

Who would have thought that a sprinkle of randomness could be so powerful? Yet here it is, rescuing us from the watchdogs of the digital age.

#Data #Privacy #DifferentialPrivacy #DP #ML
