About data protection, anonymization and aggregation
Tune Insight
Encrypted computing platform for data collaborations, valorization and machine learning
Introduction
At Tune Insight, our mission is to enable data usage while guaranteeing data protection, transforming the data economy into an insight economy that is more secure and fair, and that better protects sensitive data and the rights to privacy and confidentiality. The core of our solution is to enable collaboration between data holders through decentralized computations and homomorphic encryption. But this alone is not enough to achieve the strong levels of privacy protection we wish to provide. Our cryptographic primitives are therefore complemented by state-of-the-art privacy-enhancing technologies that help bring this vision to fruition.
In this article, we discuss how the findings of a recently published review “Anonymization: The imperfect science of using data while preserving privacy”, co-written by Tune Insight employee Florimond Houssiau, apply to our work and product at Tune Insight. Notably, the review focuses on the so-called centralized trust model, where a single trusted party holds the entire dataset. Yet, many of its insights apply in the decentralized setting that we focus on.
Data aggregation as a key to anonymization
The traditional approach to data protection is “de-identification”, where a sensitive dataset is modified so that individual records can no longer be identified, then shared with data users. However, as the review repeatedly emphasizes, this approach is not suited to modern datasets: de-identified data is either trivially re-identifiable by motivated attackers or modified to the point of uselessness.
“Traditional record-level de-identification techniques typically do not provide a good privacy-utility trade-off to anonymize data.”
We could not agree more. When using Tune Insight products, individual-level data is never shared with other parties. Instead, computations run with Tune Insight always output an aggregated result computed over the records of many individuals across the participating datasets. This output is computed under robust encryption, ensuring that nothing other than the final result can be inferred by any party. This practice is in line with the recommendations of the review.
“Aggregate data [...] can offer a better trade-off, but they do not inherently protect against privacy attacks.”
“It is important to emphasize that, in general, releasing only aggregate data substantially reduces the vulnerability to attacks compared to record-level data in practice.”
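To make the idea concrete, here is a minimal, self-contained sketch of decentralized aggregation using additive secret sharing. It only illustrates the principle that each party's input stays hidden while the collective result is revealed; Tune Insight's actual protocols rely on homomorphic encryption, and all party names and values below are made up.

```python
# Minimal illustration (not Tune Insight's actual protocol): additive secret
# sharing over a large prime field, where each data holder splits its local
# count into random-looking shares so that only the collective sum is revealed.
import secrets

PRIME = 2**61 - 1  # field modulus; any sufficiently large prime works for this sketch

def share(value: int, n_parties: int) -> list[int]:
    """Split `value` into n additive shares that individually look random."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Each (fictional) hospital holds a local count it never sends in the clear.
local_counts = {"hospital_A": 124, "hospital_B": 310, "hospital_C": 87}
parties = list(local_counts)

# Every party distributes shares of its input; each party only ever sees
# one random-looking share per input.
received = {p: [] for p in parties}
for value in local_counts.values():
    for p, s in zip(parties, share(value, len(parties))):
        received[p].append(s)

# Each party publishes the sum of the shares it received; combining these
# partial sums reveals the aggregate and nothing about individual inputs.
partial_sums = [sum(shares) % PRIME for shares in received.values()]
print(sum(partial_sums) % PRIME)  # -> 521, the collective count
```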
Protecting aggregate data
As the review emphasizes, aggregating data is not enough in itself to guarantee the protection of individual-level data. Indeed, there is extensive recent research showing that it is possible for a motivated attacker to extract sensitive information about individuals from aggregates such as counting queries or machine learning models. This is especially true in the interactive setting, where analysts can query the data dynamically:
“In the interactive setting, the adversary can freely define the queries that are answered, and has therefore a lot of flexibility in defining which aggregate information is disclosed. This may allow the adversary to actively exploit vulnerabilities in the system”
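To see why unrestricted interactive querying is risky, consider a classic differencing attack. The toy dataset, names, and queries below are entirely fictional; the point is only that two innocuous-looking aggregate answers can be subtracted to reveal one person's sensitive attribute.

```python
# Illustrative differencing attack on unrestricted counting queries: two
# perfectly "aggregate" answers, differing only in who matches the filter,
# isolate a single individual's diagnosis. Data and names are fictional.
records = [
    {"name": "Alice", "age": 34, "zip": "8004", "diagnosis": "diabetes"},
    {"name": "Bob",   "age": 51, "zip": "8004", "diagnosis": "none"},
    {"name": "Carol", "age": 34, "zip": "8003", "diagnosis": "none"},
]

def count(predicate):
    """An interactive query interface that only ever returns aggregate counts."""
    return sum(1 for r in records if predicate(r))

# Query 1: how many people in zip 8004 have diabetes?
q1 = count(lambda r: r["zip"] == "8004" and r["diagnosis"] == "diabetes")

# Query 2: the same question, excluding 34-year-olds. The attacker knows
# Alice is the only 34-year-old living in zip 8004.
q2 = count(lambda r: r["zip"] == "8004" and r["diagnosis"] == "diabetes"
                     and r["age"] != 34)

# The difference isolates Alice: q1 - q2 == 1 means she has diabetes.
print(q1 - q2)  # -> 1
```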
Most Tune Insight projects fall within this interactive setting. Indeed, collaborations can often involve many researchers and practitioners who wish to extract different insights from the same dataset. We are keenly aware of this, and we design projects carefully to ensure that data privacy is protected. Fortunately, the review points out that the interactive setting also has strong security benefits:
“the interactive nature of data query systems allows data curators to implement additional measures that might mitigate the risk that an adversary can successfully execute attacks. These include, for instance, mandatory authentication for the analysts and keeping a log of all queries issued by any analyst to detect possible attack attempts.”
Risk mitigation tools are embedded at the core of the Tune Insight solution, including the two examples highlighted in the review (we use strong authentication, and all our instances include a tamper-proof log). Any computation run on a Tune Insight instance must have been reviewed and authorized by the relevant data protection officers before any data is accessed. This ensures that everything that runs on an instance is carefully vetted, accessible only to trusted people, and logged for later auditing.
At the project level, participants can also define additional security and privacy policies that can help mitigate potential data leakage. These include query set size restriction (rejecting computations if the local or collective dataset is too small), execution quotas (limiting the number of computations that can be performed on a project), and query limitations (only allowing a specific set of pre-approved queries). Taken together, these policies enable users to tailor projects to satisfy privacy and security expectations.
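As a rough illustration of how such policies combine, here is a hypothetical sketch of a project-level policy check. The class, field names, and thresholds are invented for this example and are not Tune Insight's actual configuration API.

```python
# Hypothetical sketch of project-level privacy policies (not Tune Insight's
# actual configuration API): query set size restriction, execution quotas,
# and a whitelist of pre-approved query types.
from dataclasses import dataclass, field

@dataclass
class ProjectPolicy:
    min_query_set_size: int = 20          # reject results computed on too few records
    max_executions: int = 100             # quota of computations for the project
    allowed_queries: set = field(default_factory=lambda: {"count", "mean"})
    executions_used: int = 0

    def authorize(self, query_type: str, matching_records: int) -> bool:
        """Check a requested computation against every policy before it runs."""
        if query_type not in self.allowed_queries:
            return False
        if matching_records < self.min_query_set_size:
            return False
        if self.executions_used >= self.max_executions:
            return False
        self.executions_used += 1
        return True

policy = ProjectPolicy()
print(policy.authorize("mean", matching_records=250))    # True: passes all checks
print(policy.authorize("mean", matching_records=5))      # False: query set too small
print(policy.authorize("median", matching_records=250))  # False: not pre-approved
```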
Finally, there are situations where such heuristic approaches might not be sufficient, e.g., because the data or the application is too sensitive. In such cases, projects can be configured to enable differential privacy, a robust definition of privacy extensively covered in the review. While differential privacy usually decreases the accuracy of results, having access to this stringent privacy guarantee can help unlock collaborations that might not otherwise happen!
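For readers unfamiliar with the mechanics, the snippet below shows the textbook Laplace mechanism applied to a count query, the simplest example of a differentially private computation. The epsilon values are purely illustrative and say nothing about how specific projects are configured.

```python
# Standard Laplace mechanism for a counting query: adding or removing any one
# record changes the count by at most 1 (sensitivity 1), so noise drawn from
# Laplace(1/epsilon) yields an epsilon-differentially-private answer.
import numpy as np

def dp_count(true_count: int, epsilon: float, rng=np.random.default_rng()) -> float:
    sensitivity = 1.0  # one record changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_count = 521
for epsilon in (0.1, 1.0, 10.0):  # smaller epsilon = stronger privacy, more noise
    print(epsilon, round(dp_count(true_count, epsilon), 1))
```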
Synthetic data at Tune Insight
Access to data is usually the most time-consuming step of project setup, which can be frustrating for analysts (and frankly, everyone involved). Furthermore, developing an analysis pipeline on sensitive data may itself be problematic, and outright impossible in some cases. Tune Insight instances allow you to bypass these hurdles with synthetic data: automatically generated data that resembles the real dataset.
While synthetic data is sometimes presented as a useful substitute for real data, recent research suggests that it cannot achieve a satisfactory privacy-utility trade-off. In line with the review, we focus on generating synthetic data with strong privacy guarantees, and leave the real analysis to the real data.
“Overall, we thus see synthetic data as a very useful tool for testing new systems and for exploratory analysis, but its accuracy strongly depends on the use case and any findings may need to be validated on the real data.”
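As a toy illustration of the idea of privacy-preserving synthetic data, the following sketch fits a differentially private histogram of a single categorical attribute and samples synthetic records from it. It is deliberately simplistic and is not how Tune Insight's synthetic data generator works; the data, categories, and epsilon value are invented.

```python
# Toy illustration of differentially private synthetic data: fit a noisy
# histogram of a categorical attribute, then sample synthetic records from it.
# The real data is only touched to build the private histogram; analysts
# develop their pipelines on the samples.
import numpy as np

rng = np.random.default_rng(0)
real = ["A"] * 60 + ["B"] * 30 + ["C"] * 10   # fictional categorical column
categories = ["A", "B", "C"]

epsilon = 1.0
counts = np.array([real.count(c) for c in categories], dtype=float)
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))  # sensitivity 1
probs = np.clip(noisy, 0, None)
probs = probs / probs.sum()

synthetic = rng.choice(categories, size=100, p=probs)
print({c: int((synthetic == c).sum()) for c in categories})
```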
Private Machine Learning
Machine learning is a tremendously valuable data application. As the review points out, however, even though the parameters of a model may look like an opaque black box, that does not make them inherently privacy-preserving. We take these risks very seriously, as we mentioned in our AI manifesto. Our hybrid federated learning approach helps mitigate some of the risks associated with federated learning. In cases where privacy is paramount, our approach also supports the full training of models using differential privacy, allowing for fully privacy-preserving collaborative AI without your data ever leaving your server.
“Training models with differential privacy guarantees is regarded as the most robust and principled way to prevent attacks, although current techniques often lead to a high cost in utility (or at the cost of weak formal privacy guarantees).”
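The sketch below illustrates one federated training round in the spirit of differentially private federated averaging: each site's update is clipped, and the aggregate is perturbed with noise before updating the shared model. It is a simplified stand-in, not Tune Insight's actual training stack; all values, dimensions, and the noise calibration are illustrative.

```python
# Illustrative sketch of one federated round with differential privacy:
# each site clips its local model update, and the aggregate is perturbed
# with Gaussian noise before being applied to the shared model.
import numpy as np

rng = np.random.default_rng(0)

def clip(update: np.ndarray, max_norm: float) -> np.ndarray:
    """Bound each site's influence by clipping the L2 norm of its update."""
    norm = np.linalg.norm(update)
    return update * min(1.0, max_norm / max(norm, 1e-12))

global_model = np.zeros(4)
local_updates = [rng.normal(size=4) for _ in range(5)]  # stand-ins for local gradients

max_norm = 1.0
noise_multiplier = 1.0  # in practice calibrated from the (epsilon, delta) target
clipped = [clip(u, max_norm) for u in local_updates]
noise = rng.normal(scale=noise_multiplier * max_norm, size=4)
aggregate = (np.sum(clipped, axis=0) + noise) / len(clipped)

global_model += aggregate
print(global_model)
```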
The power of collaboration
A key message of the review is that truly privacy-preserving data analysis is difficult! But we believe that Tune Insight is uniquely suited to deliver data solutions that achieve an optimal privacy-utility trade-off, by unlocking the power of collaboration. Beyond the inherent power of friendship, collaboration has a real down-to-earth advantage: analysts get access to more data. As the review repeatedly emphasizes, increasing the dataset size helps ease the tension between privacy and utility. With more records, the contribution of any single data point to the output decreases, which tends to reduce the risk of successful attacks.
“The accuracy of an [attack against a machine learning model] is generally affected by the size of the training dataset,”
“a higher number of users [...] negatively affect the attack [against aggregate statistics].”
In the case of differential privacy, it is well known that increasing the dataset size usually yields better results for the same (stringent) privacy guarantees – which is part of the reason it has so far mostly been applied to very large datasets.
“A general solution to improve the utility of differential privacy is to use more data.”
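A small back-of-the-envelope calculation makes this concrete: the Laplace noise added to a differentially private count has a fixed scale of 1/ε, independent of the dataset size, so the relative error of the answer shrinks roughly in proportion to 1/(n·ε). The numbers below are purely illustrative.

```python
# Why more data helps under differential privacy: the Laplace noise added to a
# count has a fixed scale of 1/epsilon regardless of dataset size, so the
# *relative* error of the answer shrinks as the number of records grows.
epsilon = 1.0
noise_std = (2 ** 0.5) / epsilon             # standard deviation of Laplace(1/epsilon)

for n_records in (100, 10_000, 1_000_000):
    relative_error = noise_std / n_records   # expected error as a fraction of the count
    print(f"{n_records:>9} records -> ~{relative_error:.4%} relative error")
```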
The key challenge of collaboration in practice is trust. Even if all the hospitals in Switzerland wanted to pool their data for an important research question, it would be completely out of the question to centralize all that sensitive data within any one data center. Only decentralized collaboration can achieve this scale while preserving data security and privacy.