Building Responsible Generative AI Models with Privacy-Enhancing Technologies
Brendan Quinn
Cybersecurity & Privacy Consultant @ PwC | (ISC)2 Certified in Cybersecurity
Abstract
This paper analyzes new developments in the use of generative artificial intelligence (AI) and the privacy concerns that have emerged alongside them. Specifically, the scope of this paper treats generative AI as information-based tools like chatbots, rather than image-based generative tools such as DALL-E or Adobe’s Sensei GenAI. In response to these new risks, this report identifies privacy-enhancing technologies (PETs) that can be applied to generative AI models to protect sensitive information and keep it confidential. The purpose is to assess the effectiveness of three key techniques: federated learning, homomorphic encryption, and differential privacy. This research intends to bring attention to emerging privacy issues triggered by an explosion of new generative AI tools and to emphasize the benefits of PETs in bolstering privacy protections, regulatory compliance, and consumer trust.
1 Introduction
OpenAI’s release of the generative AI chatbot ChatGPT in late November 2022 marked a historic breakthrough in artificial intelligence and consumer technologies. The platform gained over 100 million users in its first two months after launch, making ChatGPT the fastest-growing consumer application in history according to a UBS report. By comparison, the same feat took TikTok nine months and Instagram two and a half years (Rutgers, 2023). The booming popularity of ChatGPT has sent companies around the world into a race to develop their own generative AI applications. A 2023 study by McKinsey & Company estimates that the sixty-three generative AI use cases it identified could add as much as $4.4 trillion to the global economy. While many of the possibilities of generative AI are still being uncovered, and there is plenty of market space left to fill, one thing is certain: generative AI is here to stay and will continue to grow.

Alongside the growth of generative AI tools, there are emerging concerns about bias, copyright infringement, data confidentiality, and data privacy. This paper focuses on data privacy and confidentiality. For context, data privacy commonly refers to the principle that users should control how their data is processed, stored, transferred, and maintained. Data confidentiality is a complementary ideal that aims to keep data secure and accessible only to authorized parties. Regulatory developments in the last decade have made gigantic leaps to stay relevant to modern technology practices, with the adoption of laws such as the European Union’s General Data Protection Regulation (GDPR) in 2016 and the California Consumer Privacy Act (CCPA) in 2018. These laws treat data privacy and confidentiality as basic human rights while providing individuals (data subjects) with enumerated rights, including the right to be forgotten and the right to be informed. However, most of these data privacy regulations predate the recent mass public exposure to generative AI tools. Regulatory bodies are still playing catch-up to create new generative AI privacy frameworks and regulations. Notably, the National Institute of Standards and Technology (NIST) released the voluntary Artificial Intelligence Risk Management Framework 1.0 in late January 2023, and the European Union AI Act is expected to be passed into law in early 2024. Still, consumers remain exposed to privacy vulnerabilities and to the exposure of sensitive or personal information. Regulation cannot keep pace with new technologies, and even where it exists, it is not bulletproof.

This is where the idea of Responsible AI comes in. Adhering to a Responsible AI framework is not necessarily legally required; instead, it is a practice that emphasizes building privacy, among other qualities, into AI systems. Implementing this framework also supports the practice of Privacy by Design (PbD), an approach to systems engineering that makes privacy a primary concern at every step of the engineering process. Privacy by Design is a requirement incorporated into some privacy regulations, such as the GDPR. One way to implement both Privacy by Design and a Responsible AI model is to implement PETs. This paper explores three key PETs in depth, along with several others, and how they could be used in generative AI models to create more privacy-focused and responsible generative AI tools in the future.
2 Privacy Risks with Generative AI
The $4.4 trillion void of economic opportunity is leading companies to push issues such as data privacy to the side in order to gain a competitive advantage by being among the first in the generative AI space. Data privacy is not meant to be red tape; it is an essential function of modern business that not only protects consumers but also shields companies from public and regulatory scrutiny. Yet instead of adopting Responsible AI practices with proper privacy considerations, companies like OpenAI are allowing privacy, rather than business competition, to become the greatest threat to their lasting success. In the United States, the Federal Trade Commission (FTC) opened an investigation into OpenAI, leaked to the Washington Post in July 2023, to determine whether the company has engaged in unfair or deceptive trade practices concerning its privacy and data security practices. OpenAI and Meta also face civil suits from individuals who claim their intellectual property was used to train the companies’ AI models. In other countries, generative AI faces growing scrutiny as well, such as in Italy, where ChatGPT was temporarily banned over privacy concerns (Associated Press, 2023).

This scrutiny and public fear stem from the processes necessary to build a high-functioning generative AI tool. While the techniques OpenAI used to obtain the training data for ChatGPT’s model have not been publicly disclosed, it is a widespread belief among professionals and regulators that the company uses internet data scraping. Data scraping, or web scraping, is generally allowed in the United States based on decades-old technology law. With the accessibility and capability of new AI technology to process much larger data sets in recent years, data scraping poses privacy and confidentiality threats to individuals’ personal or sensitive information. As of July 2023, OpenAI and Microsoft face a slew of lawsuits, including a class action that alleges, among other claims, privacy violations as a direct effect of data scraping (Bloomberg, 2023). Data scraping concerns extend not only to individuals but also to the third-party websites responsible for their consumers’ privacy rights, which could face high fines for violations under the GDPR or similar regulations. Assuming ChatGPT relies on data scraping, users’ personal information and intellectual property may be vulnerable to re-identification attacks, data leakage, profiling, tracking, and phishing attacks. Companies such as OpenAI, Microsoft, and Meta must go beyond the threshold of regulatory compliance to truly ensure the privacy and confidentiality of information. Laws, regulations, and frameworks should be treated as a minimum standard; investing in Responsible AI practices beyond that baseline can lead to more secure information practices, consumer trust, and competitive advantage. PETs are one of many ways these companies can buy into Responsible AI and minimize privacy concerns.
3 Privacy-Enhancing Technologies
Privacy-enhancing technologies refer to any tool, method, or technique that leverages technological processes, systems, and applications to enhance data security by minimizing the risk of unauthorized access, data breaches, or privacy violations. According to the UK Information Commissioner’s Office, there are two types of PETs. The first type provides “input privacy,” reducing access to personal or sensitive information by non-privileged parties. The second type provides “output privacy,” reducing the risk of unauthorized parties obtaining or interfering with information as a result of a processing activity. While not every PET falls neatly into one of the two categories, PETs should be used in combination so that both input and output privacy are covered. If implemented properly, an effective suite of PETs can give companies a more privacy-focused and secure data environment that also addresses compliance concerns. Unlike regulation, PETs are evolving at a rate similar to the privacy-threatening technologies they counter. Maximizing the use of PETs allows companies to bolster their privacy compliance and even push past minimum requirements to help ensure data confidentiality. Implementing PETs can help companies enhance data minimization, demonstrate a data protection by design and by default approach to processing, and apply appropriate levels of security. PETs serve as an asset to companies and, once again, are not meant to be seen as regulatory red tape that threatens their ability to process data. In many cases, PETs do just the opposite while also maintaining the anonymity of personal or sensitive data. By reducing the likelihood of privacy risks, such as re-identification attacks, through anonymizing practices, companies gain the ability to work more freely with the data they obtain (ICO UK, 2023). That is not to say PETs come without their own set of risks. PETs benefit a company only if they are properly executed; otherwise, they may create a false sense of security. PETs are not meant to serve as an impenetrable shield against privacy and compliance failures, and they should be used in addition to other privacy-leading practices. Next, this paper analyzes the use and possibilities of three key PETs for generative AI: federated learning, homomorphic encryption, and differential privacy.
4 Federated Learning
Since generative AI models need to be trained on extremely large data sets to accurately and effectively emulate human intelligence, companies must often rely on third-party data. This third-party data can come from other companies or from publicly accessible information. If any of that data is personal or sensitive, it must be handled carefully or it could expose the processing company, as well as the third party that stores the data, to privacy scrutiny. Federated learning is a technique that allows for a more secure AI training process. It lets different companies train the same AI model locally on their own data. Instead of pooling data, federated learning extracts patterns, known as gradients, from each local training run and combines those gradients into a single global model. This technique gives companies the ability to train generative AI tools on a large effective data set without widespread data sharing or selling between companies. For that reason, federated learning is best classified as an input privacy PET. There are two key approaches to federated learning: centralized and decentralized. In a centralized approach, a single trusted third-party server builds an AI model that is duplicated and sent to each participating local data source to be trained confidentially on local data. Only the analysis of the data, not any real data points, is sent back to the main server to be synthesized and integrated into the central model. This process repeats to continually refine and enhance the AI model; a minimal sketch of one such round appears below. In a decentralized approach, instead of coordinating through a trusted third party, participating entities communicate directly with each other and update the global model individually. A decentralized approach is favored in certain instances because it eliminates the security risks that come with relying on a single centralized server. While federated learning protects data and reduces regulatory concerns about sharing consumer or sensitive company information, it does come with challenges. Because the global model is still built from real data points, inferences can sometimes be made to identify information that would otherwise appear anonymous, for example through membership inference attacks or model inversion. Additionally, because the training process is exposed to multiple parties, the risk of data leakage and model manipulation is higher. This is why combining federated learning into a suite of PETs is so essential. When complemented with another PET such as homomorphic encryption, secure multi-party computation, or differential privacy, the inherent risk can be reduced.
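The sketch below is one minimal illustration of the centralized approach described above, assuming a simple linear model and three hypothetical participants; the function names, data, and training setup are illustrative rather than drawn from any particular federated learning framework. Each participant computes an update on its own records, and only those updates, never the raw data, are averaged by the server.

```python
# Minimal federated-averaging sketch (illustrative only; model, data, and
# function names are hypothetical, not taken from a specific framework).
import numpy as np

def local_update(global_weights, X, y, lr=0.1):
    """One gradient-descent step on a participant's private data.

    Only the updated weights (the "gradient" information discussed above)
    leave the participant; the raw data X, y never do.
    """
    preds = X @ global_weights
    grad = X.T @ (preds - y) / len(y)        # mean-squared-error gradient
    return global_weights - lr * grad

def federated_round(global_weights, participants):
    """Centralized aggregation: average each participant's local update."""
    updates = [local_update(global_weights, X, y) for X, y in participants]
    return np.mean(updates, axis=0)

# Three organizations with private datasets for the same 3-feature task.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
participants = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    participants.append((X, y))

weights = np.zeros(3)
for round_num in range(50):                  # repeated refinement rounds
    weights = federated_round(weights, participants)
print("learned weights:", weights.round(2))  # approaches true_w without pooling data
```

In practice, production frameworks also handle secure communication, participant selection, and weighting by data volume, but the core loop mirrors this averaging step.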
5 Homomorphic Encryption
Homomorphic encryption is a type of input privacy PET that enables computations on encrypted data without needing to decrypt it first; the intermediate results of those computations remain encrypted as well. The output can then be decrypted to produce a result identical to what would have been obtained had the computations been performed on plaintext data. Homomorphic encryption helps mitigate security risks to information during the AI model training and inference phases in which federated learning is vulnerable. The method can be implemented as fully, somewhat, or partially homomorphic encryption. Since one of the limiting features of homomorphic encryption is utility, a necessary trade-off for privacy in many instances, these three approaches serve as ways to balance the technique against the nature and scale of a particular computational purpose. Fully homomorphic encryption allows any function to be computed and does not limit the type or complexity of the operation; however, the more complex an operation is, the more resources and time it requires. Somewhat homomorphic encryption allows only a fixed number of additions and multiplications on encrypted information, which limits the types of functions it supports. This form emphasizes utility while maintaining some security protections for data computations. Lastly, partially homomorphic encryption strikes a strong balance between privacy and performance: it limits the type of function to either addition or multiplication, but sets no limit on the number of operations that can be performed. Implementing homomorphic encryption in a generative AI model can assist privacy compliance by enhancing the security and confidentiality of information, which minimizes the risk of data breaches. In the event of a breach, homomorphic encryption helps render information unintelligible to an attacker by keeping personal and sensitive information encrypted at rest, in transit, and during computation. Additionally, homomorphic encryption preserves data accuracy because it delivers information security without adding data noise, which alters data in the hope of preventing the identification of sensitive information. Unfortunately, this PET is still not all-encompassing and should be combined into a comprehensive suite of PETs. One shortfall of homomorphic encryption, like federated learning, is that because it is an input privacy PET, the output of a generative AI model could still be left vulnerable to privacy threats. In contrast, another PET, differential privacy, does provide output privacy capabilities.
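To make the "partially homomorphic" variant concrete, the toy sketch below implements textbook Paillier encryption, an additively homomorphic scheme: multiplying two ciphertexts adds the underlying plaintexts, without ever decrypting them. The key size, primality check, and helper names are illustrative only; a real deployment would rely on a vetted cryptographic library and far larger keys.

```python
# Toy textbook Paillier sketch (illustration only; not production cryptography).
import math, random

def keygen(bits=32):
    """Generate a toy Paillier key pair: public key n, private key (n, lam, mu)."""
    def prime(b):
        # Simplistic probable-prime search, good enough for a toy example.
        while True:
            candidate = random.getrandbits(b) | (1 << (b - 1)) | 1
            if all(pow(a, candidate - 1, candidate) == 1 for a in (2, 3, 5, 7)):
                return candidate
    p, q = prime(bits), prime(bits)
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                      # valid because the generator g = n + 1
    return n, (n, lam, mu)

def encrypt(n, m):
    """Encrypt integer m (< n) under public key n: c = g^m * r^n mod n^2."""
    n2 = n * n
    r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(priv, c):
    """Decrypt: m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    n, lam, mu = priv
    n2 = n * n
    return (pow(c, lam, n2) - 1) // n * mu % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts corresponds to adding the hidden plaintexts."""
    return c1 * c2 % (n * n)

pub, priv = keygen()
c1, c2 = encrypt(pub, 1200), encrypt(pub, 345)
total = add_encrypted(pub, c1, c2)            # computed without any decryption
print(decrypt(priv, total))                   # prints 1545
```

Here the party computing `add_encrypted` never sees the values 1200 or 345, only their ciphertexts, yet the key holder can recover the correct sum.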
6 Differential Privacy
As an output privacy PET, differential privacy can help reduce data security risks for information resulting from data processing activities. It does this by providing a mathematical framework for randomly injecting noise into a dataset or model output. This noise allows AI models to train with personal or sensitive information by making it infeasible to infer, with any confidence, the identity or provenance of any individual record from the resulting data. Additionally, when AI generates synthetic data, differential privacy can be used to ensure that sensitive information from the training data is not precisely reproduced in the synthetic data. The level of noise is controlled by the metric epsilon (ε), commonly referred to as the "privacy budget" or "privacy parameter." Noise provides plausible deniability that a particular data point can be traced back to its original owner; increasing the level of noise (that is, shrinking the privacy budget) increases that plausible deniability. The two types of differential privacy differ in who controls the aggregation process. Global, or centralized, differential privacy involves a trusted third party that has access to the real data of multiple data-sharing entities. This aggregator randomly injects noise into the output data, which is then sent to the local users for purposes such as training a generative AI model on a large data set. The risk of this type is that the central aggregator could act maliciously with the real data provided by the local users; to mitigate this risk, companies should perform the appropriate third-party risk and privacy assessments. Local differential privacy differs in that users apply noise before sharing any data with a centralized aggregator. The total noise in this method is much greater than in the global model; as a result, obtaining useful results requires many more participants. If properly implemented, differential privacy can be a useful tool for enhancing a generative AI model. Both the global and local models can supply a generative AI system with anonymized information as output, helping mitigate privacy risks such as re-identification attacks and information leakage. If the appropriate amount of noise is applied, statistical analysis and broad trends can still be determined accurately; too much noise, however, makes detailed patterns within the data difficult to discern.
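A minimal sketch of the global (centralized) model follows, using the Laplace mechanism: a trusted aggregator answers a counting query, and the noise scale is the query's sensitivity divided by the privacy budget ε. The dataset, query, and ε values are hypothetical choices made only for illustration.

```python
# Minimal Laplace-mechanism sketch of global differential privacy (illustrative).
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise scaled to sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(42)

# A trusted aggregator holds 10,000 individuals' ages and answers a count query.
ages = rng.integers(18, 90, size=10_000)
true_count = int(np.sum(ages >= 65))          # how many people are 65 or older?

# Adding or removing any single person changes a count by at most 1,
# so the sensitivity is 1. Smaller epsilon => more noise => stronger deniability.
for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon:>4}: true={true_count}, released={noisy:.1f}")
```

With ε = 10 the released count tracks the true count closely; with ε = 0.1 the noise is large enough that any one individual's presence is plausibly deniable, mirroring the utility-for-privacy trade-off described above.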
7 Privacy-Enhancing Technology Suite
Altogether, federated learning, homomorphic encryption, and differential privacy form a complementary suite of PETs. With input and output privacy, data security practices, and information anonymization methods covered between these techniques, implementing them appropriately serves as a strong introduction to privacy-enhancing technologies. Other innovative PETs on the market include synthetic data, confidential computing, secure multi-party computation, zero-knowledge proofs, and trusted execution environments. Researching what is most suitable for a generative AI model is imperative and highly variable based on the purpose and processes of a particular AI tool. Fortunately, many PETs, like the ones discussed in this paper, can be individually customized and combined to create a generative AI PET suite that produces confidential, private, and informative data results; one small illustration of such a combination appears below. Due to the complicated nature of PETs, a dedicated team of cybersecurity, legal, IT, and privacy professionals should be engaged to implement these techniques. If implemented improperly, PETs could introduce additional risks that go overlooked because they are wrongly considered mitigated. Privacy and legal professionals should be part of the implementation process because these tools can also be used to bolster regulatory and privacy framework compliance.
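As one small example of how these techniques compose, the sketch below builds on the earlier federated learning and differential privacy snippets: each participant clips and perturbs its local model update before sharing it, so the aggregator only ever receives noised updates. The clipping bound and ε are hypothetical choices for illustration, not recommendations.

```python
# Illustrative combination of federated learning with differential privacy:
# participants add calibrated noise to their updates before aggregation.
import numpy as np

def private_update(update, clip_norm, epsilon, rng):
    """Clip the update to bound its sensitivity, then add Laplace noise."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.laplace(scale=clip_norm / epsilon, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(1)
client_updates = [rng.normal(size=3) for _ in range(3)]   # stand-ins for local model updates
noisy_updates = [private_update(u, clip_norm=1.0, epsilon=2.0, rng=rng)
                 for u in client_updates]
global_update = np.mean(noisy_updates, axis=0)            # server only sees noised updates
print(global_update.round(3))
```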
8 Future Directions
The possibilities of generative AI are still being imagined, and so are the privacy risks associated with AI data models. Regulatory strictness and comprehensiveness need to improve, especially in the United States, which does not have a single comprehensive data privacy law or regulation. Instead, the United States contains a spiderweb of state regulations and industry-specific federal laws. In many cases, these laws compete or overlap with one another, creating confusion and complexity for companies trying to maintain privacy compliance. Other parts of the world, such as the European Union and Brazil, do have comprehensive regulations, but companies can only lean on regulatory compliance so much. Regulatory compliance protects companies from punitive fines, but arguably a greater risk is the loss of reputation and consumer trust. Buying into concepts such as Responsible AI, adhering to privacy frameworks, and implementing innovative privacy-enhancing technologies can give companies an additional layer of protection, something consumers have a growing appetite for alongside their ever-rising expectations of what AI can generate. Proactively investing in privacy can be a competitive advantage. In the boom of generative AI, privacy controls can make or break a company: ChatGPT is a dominant brand in this space, yet OpenAI is at the forefront of FTC scrutiny because it may have failed to fully consider privacy best practices.

Companies need to consider privacy through every available lens. Privacy functions should not exist solely within the legal departments of large companies; they require consideration from a variety of perspectives. There are also technical, business, consumer, and cultural considerations to be made in creating an appropriate privacy program for a generative AI tool. In recent years, new C-suite roles have sprung up to help tackle privacy issues. What we have yet to see are AI-specific data privacy roles and teams. If AI technology continues to innovate as expected, especially with developments in quantum computing and its possible combination with generative AI, consumer trust will continue its downtrend. The next major reorganization of corporations will be to create robust AI-specific data privacy departments.
9 References
Federated learning: Supporting data minimization in AI. (n.d.). Retrieved August 4, 2023, from https://iapp.org/news/a/federated-learning-supporting-data-minimization-in-ai/
FTC investigating ChatGPT creator OpenAI over consumer protection issues. (2023, July 13). AP News. https://apnews.com/article/openai-chatgpt-investigation-federal-ftc-76c6218c506996942282d7f5d608088e
Generative AI: A “new frontier.” (n.d.). Retrieved August 4, 2023, from https://iapp.org/news/a/generative-ai-a-new-frontier/
Generative AI: Privacy and tech perspectives. (n.d.). https://iapp.org/news/a/generative-ai-privacy-and-tech-perspectives/
Hu, K. (2023, February 2). ChatGPT sets record for fastest-growing user base - analyst note. Reuters. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
Komiya, K., & Mukherjee, S. (2023, May 20). G7 calls for developing global technical standards for AI. Reuters. https://www.reuters.com/world/g7-calls-developing-global-technical-standards-ai-2023-05-20/
OpenAI’s Legal Woes Driven by Unclear Mesh of Web-Scraping Laws. (n.d.). News.bloomberglaw.com. Retrieved August 4, 2023, from https://news.bloomberglaw.com/ip-law/openais-legal-woes-driven-by-unclear-mesh-of-web-scraping-laws
Poireault, K. (2023, January 27). #DataPrivacyWeek: ChatGPT’s Data-Scraping Model Under Scrutiny From Privacy Experts. Infosecurity Magazine. https://www.infosecurity-magazine.com/news-features/chatgpts-datascraping-scrutiny/
Poireault, K. (2023, January 28). #DataPrivacyWeek: Addressing ChatGPT’s Shortfalls in Data Protection Law Compliance. Infosecurity Magazine. https://www.infosecurity-magazine.com/news-features/chatgpt-shortfalls-data-protection/
Privacy-enhancing technologies (PETs). (2023, June 19). Ico.org.uk. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-sharing/privacy-enhancing-technologies/
The economic potential of generative AI: The next productivity frontier | McKinsey Live | McKinsey & Company. (n.d.). Www.mckinsey.com. Retrieved August 4, 2023, from https://www.mckinsey.com/featured-insights/mckinsey-live/webinars/the-economic-potential-of-generative-ai-the-next-productivity-frontier
The latest in homomorphic encryption: A game-changer shaping up. (n.d.). Retrieved August 4, 2023, from https://iapp.org/news/a/the-latest-in-homomorphic-encryption-a-game-changer-shaping-up/
Volpicelli, G. (2023, March 3). ChatGPT broke the EU plan to regulate AI. POLITICO. https://www.politico.eu/article/eu-plan-regulate-chatgpt-openai-artificial-intelligence-act/
What is Differential Privacy? – MIT Ethical Technology Initiative. (n.d.). https://eti.mit.edu/what-is-differential-privacy/