Do machine learning and so-called artificial intelligence go hand in hand with privacy? Is pseudonymization the right way to go? How about a ban?
Michał Nowakowski, PhD
AI & Data | Partner #AI #Cyber #FinTech @ ZP Legal | GovernedAI.com | Polska Organizacja Niebankowych Instytucji Płatności | PTI
Privacy. Everyone has heard of it, everyone knows it, and everyone wants it respected. At the same time, we want personalized products, better health care, and more scientific discoveries, but ones that do not intrude excessively on our intimate world. Algorithms and artificial intelligence models are based on data and require data not only in large quantities but also of high quality. Quality can be defined in many ways, including in terms of representativeness or orderliness, but also in terms of "what" we find in that data. A wider range of information that identifies individuals [not directly] by certain characteristics allows for the creation of more advanced, effective, and "good for society" solutions, e.g. in health care - see disease detection based on images or predicting the risk of disease, e.g. cancer.
Such data can be, for example, personal data, and not only of a "basic" nature but also sensitive data as referred to in Article 9(1) of Regulation 2016/679 (GDPR), such as racial origin, biometric data used to uniquely identify an individual, or health data. Of course, Article 9(2) establishes some exceptions [also in the context of health protection], but the predominant basis is usually the user's consent or the necessity of the processing - to put it with some simplification. Suffice it to say that the legal framework for the protection of personal data and privacy in the European Union tends to be a barrier to the development of algorithms and artificial intelligence models, and it will become even more difficult, despite the European Commission's efforts to create differentiated data pools. This does not eliminate certain limitations and risks for us, which I will write about below.
[It's also worth remembering that ENISA puts the principle of "Privacy and Data Protection by Default and Design" at every stage of software development, and after all, what we're talking about is nothing but software].
[If you are not interested in legal considerations, you can skip this paragraph] So the question arises: what is privacy, what is the purpose of data protection, and how does it affect the (im)possibilities left to the creators of algorithms and models or artificial intelligence systems [let's not argue about definitions here, because that's not the point]. The first issue is relatively easy to resolve, provided we do not insist on excessive flexibility. Article 7 of the (EU) Charter of Fundamental Rights provides that "[e]veryone has the right to respect for private and family life, home and communications." Similarly, Article 47 of the Constitution of the Republic of Poland: "Everyone has the right to the legal protection of his private life, family life, honor, and good name, and to decide about his personal life". Article 8 of the Charter ensures that every citizen of the Union has the right to the protection of personal data concerning him or her.
In other words, we have the right to demand that no one enters our lives unless there is a solid legal ground to do so. It also means that our data and our privacy are subject to special protection, which finds expression in the data protection and privacy regulations [soon to be even more interesting], but also, interestingly, in the proposed regulations for artificial intelligence systems [here is a link to the original version, which has already undergone several changes]. It is the role of the controller and the processor to ensure that appropriate mechanisms (external and internal), such as data protection impact assessments, information obligations, or legal safeguards, are in place throughout the design, development, implementation, and use of various solutions that involve data processing and the possibility of privacy breaches.
How does this relate to algorithms and models? I will assume here a certain simplification, of course, because algorithms and models may differ and pursue different goals, as well as be based on diverse methods and techniques, although they have one thing in common - they are based on data. When looking at self-learning models, such as machine learning or deep learning, we need to have data to produce a specific result [goal], which can be a prediction, recommendation, content, or decision. There are many options, but a lot of space is taken up by predictive models, which allow us, for example, to predict - with fairly high accuracy, though not certainty - the materialization of a particular risk or event. The better the data [not necessarily more, though there is an exception], the better the results. The worse the data, the worse the results - Garbage In, Garbage Out.
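To make the "Garbage In, Garbage Out" point concrete, here is a minimal sketch on purely synthetic, hypothetical data (scikit-learn and NumPy assumed): the same simple classifier is trained once on clean records and once on records in which one key column has been scrambled, as might happen after a faulty data-integration step, and both are scored on the same clean test set.

```python
# Minimal sketch of Garbage In, Garbage Out on hypothetical synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] > X[:, 1]).astype(int)        # label depends on the first two columns

X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

# "Garbage" training set: one column no longer matches its rows.
X_garbage = X_train.copy()
X_garbage[:, 0] = rng.permutation(X_garbage[:, 0])

clean_model = LogisticRegression().fit(X_train, y_train)
garbage_model = LogisticRegression().fit(X_garbage, y_train)

print("trained on clean data:  ", clean_model.score(X_test, y_test))   # close to 1.0
print("trained on garbage data:", garbage_model.score(X_test, y_test)) # noticeably lower
```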
We use the data primarily to train the model, that is, to bring it to a state of readiness for operational use. This is done with training data, which the AI Act I mentioned earlier defines as "data used to train an artificial intelligence system by adjusting its learnable parameters, including neural network weights." There are, of course, also test and validation data [relevant in the context of so-called overfitting], which are an important part of the whole process of implementing the solution and using it.
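As a rough illustration of how these datasets work together, the sketch below (again hypothetical synthetic data, scikit-learn assumed) splits the data into training, validation, and test sets and shows the classic symptom of overfitting: a near-perfect training score combined with a clearly weaker validation score, while the test set is kept aside for the final evaluation only.

```python
# Minimal sketch of a train / validation / test split and of spotting overfitting.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # intentionally noisy labels

# 60% training, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # unconstrained tree

print("training score:  ", model.score(X_train, y_train))  # ~1.0 - the tree memorizes the noise
print("validation score:", model.score(X_val, y_val))      # clearly lower -> overfitting
print("test score:      ", model.score(X_test, y_test))    # touched only once, at the very end
```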
Data may be of various natures, and usually, for our safety, we decide to use sets of [seemingly] anonymous data - also aggregated or subjected to pseudonymization, which is a reversible process, i.e. one that still allows a user to be identified, but only with the appropriate instruments, e.g. a cryptographic key. The GDPR itself defines this technique as "the processing of personal data in such a way that they can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is stored separately and is covered by technical and organizational measures to prevent its attribution to an identified or identifiable natural person." Quite recently, in March this year, ENISA - the EU agency responsible for cybersecurity - published a very interesting document, "Deploying pseudonymization techniques. The case of the Health Sector", which describes interesting techniques in this area.
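To show what such a reversible technique can look like in practice, here is a minimal sketch of keyed pseudonymization using only the Python standard library; the key handling, the mapping table, and the field names are illustrative assumptions, not a recommendation of any particular scheme. The secret key and the re-identification table are exactly the "additional information" that the GDPR requires to be stored separately and protected.

```python
# Minimal sketch of keyed pseudonymization (HMAC-based), with the re-identification
# material kept apart from the pseudonymized dataset. Purely illustrative.
import hmac
import hashlib
import secrets

SECRET_KEY = secrets.token_bytes(32)   # in practice: held in a separate, secured key store
reidentification_table = {}            # pseudonym -> original identifier, also stored separately

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed pseudonym."""
    token = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()
    reidentification_table[token] = identifier
    return token

def reidentify(token: str) -> str:
    """Reversal is possible only for whoever holds the separately stored table/key."""
    return reidentification_table[token]

record = {"patient_id": "85010112345", "diagnosis": "C50"}   # hypothetical record
record["patient_id"] = pseudonymize(record["patient_id"])
print(record)                              # the dataset used for training carries only the token
print(reidentify(record["patient_id"]))    # the controller can still link back when needed
```

Which technique is appropriate (counters, random values, keyed hashing, encryption) depends on whether pseudonyms must stay consistent across datasets and who is allowed to reverse them; the ENISA document mentioned above walks through these trade-offs.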
[It is also worth taking a look at the document "Data Protection Engineering. From Theory to Practice", which is a guide for "data protection by design and default"]
The problem that many people involved in preparing data for algorithms and models, as well as the developers themselves, "fall into" is the belief that the simple application of both anonymization [e.g., introducing individually anonymized elements into a set of information that collectively constitutes an "identifiable" whole] and pseudonymization techniques is a guarantee that no personal data will be processed and, God forbid, that no one will be able to reconstruct [identify] a person based on specific datasets. However, the truth is that, firstly, there are more and more methods of "extracting" data from training data, and secondly, the trace that we leave online and offline can easily lead to such identification. There have been many examples throughout history.
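A classic way this identification happens is a linkage attack: joining a "de-identified" dataset with an auxiliary source on shared quasi-identifiers. The sketch below uses entirely made-up records and pandas just to show the mechanics.

```python
# Minimal sketch of a linkage (re-identification) attack on hypothetical data.
import pandas as pd

# A "de-identified" extract with no names, only quasi-identifiers and a sensitive attribute.
deidentified = pd.DataFrame({
    "postal_code": ["00-950", "31-042"],
    "birth_year":  [1985, 1972],
    "sex":         ["F", "M"],
    "diagnosis":   ["C50", "E11"],
})

# An auxiliary, publicly obtainable register (invented here for illustration).
public_register = pd.DataFrame({
    "name":        ["Anna K.", "Jan W."],
    "postal_code": ["00-950", "31-042"],
    "birth_year":  [1985, 1972],
    "sex":         ["F", "M"],
})

# Joining on the shared quasi-identifiers re-attaches names to diagnoses.
reidentified = deidentified.merge(public_register, on=["postal_code", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```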
In the case of pseudonymization, we are essentially talking about stripping the data of a certain layer that identifies specific individuals, but this does not mean that the data lose the quality of personal data. They can still be recovered, although how complex that recovery is depends on the method used [e.g., plain hashing is considered relatively vulnerable to brute-force attacks]. Pseudonymization can also degrade the effectiveness of the algorithm and the model, although much depends on the application we are talking about.
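Why is plain, unkeyed hashing considered weak? Because identifiers usually come from a small, enumerable space, so an attacker can simply hash every plausible value until one matches. A minimal sketch, with an invented phone number and SHA-256 assumed as the naive pseudonymization function:

```python
# Minimal sketch of brute-forcing an unkeyed hash of a low-entropy identifier.
import hashlib

def naive_pseudonym(phone: str) -> str:
    return hashlib.sha256(phone.encode()).hexdigest()

# A token found in a supposedly "pseudonymized" dataset (hypothetical number).
leaked_token = naive_pseudonym("+48 600 123 456")

# The attacker enumerates the identifier space; a full 9-digit space is feasible offline.
for n in range(600_000_000, 700_000_000):
    candidate = f"+48 {n // 1_000_000} {(n // 1_000) % 1_000:03d} {n % 1_000:03d}"
    if hashlib.sha256(candidate.encode()).hexdigest() == leaked_token:
        print("re-identified:", candidate)
        break
```

A keyed variant, like the HMAC sketch earlier, makes this enumeration pointless for anyone who does not hold the key.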
For datasets that identify a person - not even directly, but through various traces - the issue is even more complicated. If we want to avoid entering the data protection regime, we should try to anonymize the data, which in many cases will significantly reduce the effectiveness and accuracy of our model. We may, for example, use the k-anonymity method, which, roughly speaking, hides an individual's record within a group of at least k records sharing the same quasi-identifiers, or differential privacy, although here we are closer to solutions based on Article 9(2) of the GDPR.
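For intuition, here is a minimal sketch of a k-anonymity check on made-up records (pandas assumed): every combination of quasi-identifiers should be shared by at least k rows, and generalizing the quasi-identifiers is one way to get there, at the price of less precise features for the model.

```python
# Minimal sketch of a k-anonymity check and a simple generalization step (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "postal_code": ["00-950", "00-950", "00-951", "00-951", "00-952"],
    "birth_year":  [1985, 1986, 1985, 1984, 1988],
    "diagnosis":   ["C50", "E11", "I10", "C50", "E11"],
})

QUASI_IDENTIFIERS = ["postal_code", "birth_year"]

def smallest_group(frame: pd.DataFrame) -> int:
    """k = size of the smallest group sharing the same quasi-identifier values."""
    return int(frame.groupby(QUASI_IDENTIFIERS).size().min())

print("k before generalization:", smallest_group(df))   # 1 -> every individual stands out

# Generalize: coarser postal code, birth decade instead of birth year.
df["postal_code"] = df["postal_code"].str[:3] + "9xx"
df["birth_year"] = (df["birth_year"] // 10) * 10

print("k after generalization: ", smallest_group(df))   # 5 -> records hide in one group
```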
Article 10(5) of the proposed regulation on artificial intelligence, which refers to sensitive data, is also worth quoting here:
To the extent strictly necessary for ensuring the monitoring, detection, and correction of bias in high-risk artificial intelligence systems [that's such a special category of systems], providers of such systems may process special categories of personal data referred to in Article 9(1) of Regulation (EU) 2016/679, Article 10 of Directive (EU) 2016/680 and Article 10(1) of Regulation (EU) 2018/1725, subject to the application of appropriate safeguards guaranteeing the protection of the fundamental rights and freedoms of natural persons, including technical measures limiting the re-use of such data and state-of-the-art measures to ensure security and privacy, such as pseudonymization or, where anonymization may significantly affect the ability to fulfill the intended purpose, encryption.
However, quite apart from the loss of model effectiveness, none of this guarantees that someone will not be able to reach a specific person based on a specific set of data and, for example, reveal some harmful facts about his or her life. This may raise issues of compensation as well as legal and regulatory risks. Can we recreate "anyone" based on any dataset combined with data available online and offline? Probably not, but the line is always thin, so the stage of preparing data for training should be subject to strict control and oversight.
In practice, however, an important question arises: is it even possible to train models on any data at all, since any data can theoretically be "inverted" using reverse engineering? That conclusion would of course be absurd, but on the other hand, this question is raised quite often in the context of the growing legal and regulatory obligations in the area of data protection and privacy. There is no clear answer here, and even the draft legal solutions for artificial intelligence systems are not very helpful, although we will find some hints there.
We should rely on what the GDPR "gives" us, i.e. it places the risk on the entity that "touches" the data: the risk-based approach, the proportionality principle, the obligation to carry out a DPIA (data protection impact assessment), and technological neutrality. However, I emphasize that the data preparation stage, and the preceding identification of needs and opportunities in the organization, are important. This should always be the starting point.