Hallucinations of Data Protection
Jon Neiditz
Insightful Ideation by Hybrid Intelligences for Everybody, + Voices for the Strategically Silent!
(In case you missed it, here's the FPF session.)
Looking forward to talking about regulation of LLMs with Amber Nicole Ezzell, Bertram Lee Jr. and Jules Polonetsky on one of Jules' lively LinkedIn Lives tomorrow at 1:00 ET, and even more to demonstrating there and here why neither side in the battle royale between data protection regulators and LLMs should be immovable, and why flexibility and creativity will serve the goals of each combatant.
I will raise this big issue in the very early days of this battle by means of the example of "hallucination," in which LLMs generate content that seems coherent and meaningful but is not accurate or grounded in real events or objects. We can already see substantial movement by the Italian Garante on this issue, although I will argue that it is movement in a direction that makes no sense and does nothing for data subjects.
The Mirage of Rectification
In its order of March 30th, the Garante stated among its bases that ChatGPT processes personal data inaccurately insofar as the output provided may not correspond to real facts. Even knowing that there is no current way to stop hallucination in LLMs, I nonetheless hoped the Garante would hold firm, because it might next protect Italians against a certain large news organization after the Dominion defamation trial, but that is another issue. The fact is that the Garante did not hold firm: it dropped the idea of prohibiting hallucination and, in its list of demands that OpenAI must satisfy to reopen in Italy, focused instead on correcting hallucinatory outputs after the fact. OpenAI must "enable data subjects, including non-users, to obtain rectification of their personal data as generated incorrectly by the service, or else to have those data erased if rectification were found to be technically unfeasible." Correct the information, or at least get rid of it.
The problem with both the initial order and this week's demands is that they understandably assume, as does the regulatory structure they enforce, that an LLM's output replicates a profile in a database that can be corrected or deleted to protect the data subject. Such databases of profiles, whether dossiers maintained by the secret police or indicators of purchasing predilections, are of course the reason data subject rights exist under data protection laws, which were enacted to give individuals control over their personal data and to ensure that such data are accurate, up to date, and not misused by organizations.
Unlike the traditional systems that data protection laws were designed to regulate, LLMs do not store or maintain a database of personal profiles. Instead, they generate outputs based on the oceans of information on which they have been trained, and those outputs are ever-changing plausible assertions. The assertions are not static pieces of information that can be corrected, but dynamically generated possibilities shaped by all the input the LLM receives.
Since LLMs do not maintain a database of personal profiles, rectifying or erasing personal data accomplishes little for the data subject. LLMs are designed to provide contextually relevant and plausible responses, which means that even if a piece of incorrect information is "corrected" today, there is no assurance, or even likelihood, that the same or a different incorrect assertion won't be generated again tomorrow.
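To make that concrete, here is a minimal sketch, assuming only the Hugging Face transformers library and the small gpt2 model as a stand-in for a production LLM, and a hypothetical data subject ("Maria Rossi"): sampling the same prompt three times yields three different assertions, because there is no stored record for a rectification order to reach.

```python
# Minimal sketch: the "assertion" about a data subject is re-generated on each
# call, not retrieved from a profile. gpt2 and the name below are stand-ins.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
prompt = "Maria Rossi is"  # hypothetical person, not a lookup key

for seed in range(3):
    set_seed(seed)  # different sampling seed on each run
    out = generator(prompt, max_new_tokens=20, do_sample=True, temperature=0.9)
    print(out[0]["generated_text"])  # three runs, three different "facts"
```

Deleting any one of those outputs changes nothing about what the model will say next time, which is precisely the problem with rectification as a remedy here.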
Hallucination in LLMs will take a while to fix, but progress is visible. GPT-4 is better aligned than ChatGPT in this regard, and Bing and Bard have frequent encounters with reality thanks to their internet connections, but hallucination persists. I particularly love the idea of LLMs disclosing uncertainty estimates as a kind of FDA label on their assertions (a crude version is sketched below), which probably betrays a broader appreciation of the value of uncertainty in AI goals for the most important AI safety issues, as argued by Stuart Russell in Human Compatible.
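The sketch below, again using gpt2 via transformers purely as a stand-in, surfaces the model's own token-level probabilities next to its output as a rough uncertainty estimate; a real label would need far better calibration than an average of token probabilities.

```python
# Rough sketch of an uncertainty "label": report the probability the model
# assigned to each token it emitted, and their mean, alongside the output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of Italy is", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=False,
    output_scores=True,
    return_dict_in_generate=True,
)

# Probability assigned to each generated token, step by step.
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
probs = [
    torch.softmax(step_scores[0], dim=-1)[token_id].item()
    for step_scores, token_id in zip(out.scores, gen_tokens)
]

text = tokenizer.decode(gen_tokens, skip_special_tokens=True)
print(f"{text!r}  (mean token probability: {sum(probs) / len(probs):.2f})")
```

Which brings us to AI regulation.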
Data Protection vs. AI Regulation
To be sure, even if DSRs designed for databases may not make sense in the context of LLMs, many of the Garante's demands, like age gates, make every bit as much sense for LLMs as in traditional data protection contexts. Data protection principles have been an important foundation of AI regulation; they are necessary for AI regulation but not sufficient. Thus, for example, the OECD's Privacy Principles of 1980 needed to be supplemented by its 2019 AI Principles. A comparison of the two makes clear the different imperatives that arise when dealing with intelligent agents, including human-centered values, AI safety, and traceability/contestability. Yet there are so many close relationships that the migration of many of us from data protection regulation to AI regulation in the age of generative AI will be a smooth one.
As the Council of Europe continues to consider AI regulation, it might draw some comparisons to, and contrasts with, Recital 16 of the GDPR, which carves national and common security out of the GDPR. Unlike surveillance by nation states, LLMs represent a fundamentally different type of technology from those contemplated by the GDPR and other data protection regimes, so some aspects of the GDPR apply well to LLMs and others do not. The EU's proposed AI Act works very well for LLMs in a number of important ways, most importantly for me in applying different levels of regulation based on different risk levels, as long as all LLMs are not classified as high-risk. When an LLM is being used in employment-related decisions or criminal sentencing, it needs to exhibit high degrees of explainability, traceability, auditability, provability and contestability; but when it is being used to find solutions for climate change, the sky is the limit on speed and numbers of variables.
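As a purely illustrative thought experiment (not the AI Act's actual mechanics, and with entirely hypothetical names), the risk-based approach amounts to attaching obligations to the use case rather than to the LLM itself:

```python
# Hypothetical sketch: obligations keyed to the use case's risk tier,
# not to the model. Names, tiers, and controls are illustrative only.
from dataclasses import dataclass, field

@dataclass
class UseCaseProfile:
    name: str
    risk_tier: str                        # e.g. "high", "limited", "minimal"
    required_controls: list[str] = field(default_factory=list)

PROFILES = [
    UseCaseProfile(
        name="employment screening",
        risk_tier="high",
        required_controls=[
            "explainability", "traceability", "auditability",
            "provability", "contestability",
        ],
    ),
    UseCaseProfile(
        name="climate-modeling research assistant",
        risk_tier="minimal",
        required_controls=["transparency notice"],
    ),
]

for profile in PROFILES:
    print(f"{profile.name}: tier={profile.risk_tier}, "
          f"controls={', '.join(profile.required_controls)}")
```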
I would end this post by noting to all of my fellow data protection and privacy people that in moving from data protection to AI regulation of LLMs, the U.S. is well positioned to provide leadership this time, thanks to the complaint by Marc Rotenberg and the Center for AI and Digital Policy to the FTC the day before the Garante's order, focusing on AI rather than data protection regulatory standards. Probably unlike most DPAs, the FTC has broad Section 5 authority that would include regulation of AI under AI regulatory frameworks like the OECD's. Legislation is also under consideration by the Administration and the Senate. So just as three years ago I annoyingly pushed privacy to stop focusing on contact tracing apps and team up with civil rights, I now nudge my friends toward AI regulation.
Epilogue
The Garante accepted OpenAI's responses and remedial plan, of course, and your summer vacation in Italy need not be marred by geoblocking of the most advanced LLM. Moreover, the Garante at least accepted OpenAI's position that rectification is impossible, noting that OpenAI:
introduced mechanisms to enable data subjects to obtain erasure of information that is considered inaccurate, whilst stating that it is technically impossible, as of now, to rectify inaccuracies.
Have any of you seen, however, any intelligent consideration of the value of rectification in an LLM? The surveillance economy of Web 2.0 was based on the monetization of personal information, so the DSRs have been very important in that context. If LLMs were completely uninterested in personal information, what would be the point of DSRs? I look forward to DPAs or other authorities figuring out purposes for rectification that are not hallucinations.
Comments

Attorney, AI Whisperer, Open to work as independent Board member of for-profit corps. Business, Emp. & Lit. experience, all industries. Losey.ai - CEO ** e-DiscoveryTeam.com (1 yr):
Hard to understand how the Italian regulators can be so clueless. Appreciate the explanation re profiles.

Attorney & Management Consultant Focused on Artificial Intelligence Risk Management (1 yr):
Agreed Jon - there is an attempt to mash up inconsistent models with which to handle data security and privacy. As we've seen over the last few decades, this doesn't work very well. Another tradition in the legislative and regulatory arena is that decision-makers take significant action only after major disasters, which is also a rather weak and ill-advised approach. What we need to get back to is truth-telling, and ways to make sure that people are telling the truth (independent auditing, for example; continuous automated pen testing is another). That's how we get out of the hallucination. Once we know the truth, and others know the truth, then we can have a real conversation about how to go forward and properly control this very fast-advancing new technology.