The Dangers of Data Egress and Ingress for LLM Usage
Anthony Walsh
The AI PMM | Product Marketing Manager at Atlassian, Meta & venture-backed startups | Accelerating AI development securely & efficiently
Generative AI is quickly becoming a pervasive utility in the workplace. According to research by Microsoft, 78% of corporate employees who use AI bring their own preferred tools to work, irrespective of corporate policy. This presents a concern for employers that have not vetted those models or applied the privacy, security, and compliance controls needed to govern their usage. Last year, an anonymized Cisco-sponsored survey of 2,600 security and privacy professionals across 12 countries reported that 27% of organizations had banned generative AI. Yet while many corporations remain vigilant about the repercussions of data egress, data ingress is nearly twice as likely to occur and exposes companies to regulatory risk.
Corporate prohibitions on LLMs
Given the widespread adoption of ChatGPT since 2022, notable firms in regulated industries – from banking to defense and telecom – have followed suit. After an employee at Samsung pasted confidential source code into ChatGPT, the company banned all AI tools internally. Other companies imposed usage limits or temporary restrictions, or permitted access only for certain users until they had developed secure, in-house alternatives.
JPMorgan Chase, which restricted access to ChatGPT in February 2023, has since equipped 140,000 employees with its own internal LLM Suite to improve productivity and cost savings, forecast to deliver $2 billion in value over the next three to five years. President and COO Daniel Pinto claimed that the bank is exploring how to apply AI to “every single process,” with 400 use cases underway. America’s largest financial institution has also mandated a prompt engineering curriculum for new hires in its asset and wealth management division, having raised required AI training hours by 500% between 2019 and 2023. The bank plans to migrate at least 75% of its data to the cloud by the end of the year to extract further utility from LLMs and generative AI applications.
Proprietary data egress
Not only is ChatGPT the most popular LLM among consumers, but it’s also highly favored by professionals. In a study of 1.6 million employees, data protection software vendor Cyberhaven found that 10.8% tried ChatGPT between its November 30, 2022, release and June 1, 2023. During the observed period, 4.7% of the workforce leaked sensitive data into OpenAI’s LLM; sensitive material accounted for 11% of everything pasted into its prompts. This outbound transfer is known as data egress.
Corporate professionals entered a vast amount of confidential information between April 9 and 15, 2023. For every 100,000 employees, 319 shared internal-only data, 278 entered source code, 260 sent client details, 171 submitted personally identifiable information, 146 egressed patient health data, and 84 copied content from project planning files.
Proprietary data ingress
While data egress describes data leaving a network for an external destination, data ingress refers to data entering an organization’s network from third-party sources. In June 2023, Cyberhaven observed an all-time high of 7,999 egress incidents per 100,000 employees in a single day, in which ChatGPT users were recorded pasting company data into a prompt. Nearly twice as many ingress incidents – 13,188 per 100,000 employees – occurred that same day as users copied data directly out of ChatGPT.
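To make the distinction concrete, here is a minimal sketch, in Python, of how a data loss prevention tool might label the two flows and flag sensitive content before it leaves the network. The pattern names, patterns, and function are hypothetical illustrations, not any vendor’s actual implementation:

```python
import re

# Illustrative patterns a DLP tool might scan for before a prompt
# leaves the corporate network (egress). Purely hypothetical examples.
SENSITIVE_PATTERNS = {
    "api_key": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "internal_marker": re.compile(r"(?i)\b(confidential|internal[- ]only)\b"),
}

def classify_event(direction: str, text: str) -> dict:
    """Label an event as egress (data leaving the network toward an LLM)
    or ingress (LLM output entering it), and flag sensitive matches
    on the way out."""
    hits = [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]
    return {
        "direction": direction,  # "egress" or "ingress"
        "blocked": direction == "egress" and bool(hits),
        "matched_patterns": hits,
    }

# Pasting into a ChatGPT prompt is egress; copying its answer back is ingress.
print(classify_event("egress", "Here is our internal-only Q3 roadmap ..."))
print(classify_event("ingress", "Sure! Here's a summary of your roadmap ..."))
```

Note that only the egress direction is blocked in this sketch; ingress carries different risks – chiefly the regulatory and copyright exposure discussed below – that pattern matching alone cannot catch.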
Copyright infringement & content licensing
Ongoing debates continue to emphasize the importance of sourcing and representing information ethically. While LLM developers seek cost-effective ways to source content for model training, artists and creators demand proper attribution. Unless explicitly indemnified, companies that publish material generated by LLMs may be exposed to copyright infringement claims if the training datasets include unlicensed media or if the outputs reproduce the original works verbatim.
Class action copyright infringement claims
LLM developers have been the target of class action lawsuits in recent years. A group of authors, including Brian Keene, Abdi Nazemian, and Stewart O’Nan, filed a lawsuit against Nvidia, asserting that the company used unauthorized content from a compilation of 196,640 books to train LLMs to replicate conventional literature. That compilation, the Books3 dataset – a digital library ingested by ChatGPT, Claude, Llama, and other LLMs – originates from media pirated from the torrent tracker Bibliotik.
In November 2022, the Joseph Saveri Law Firm of San Francisco sued Microsoft, its subsidiary GitHub, and its business partner OpenAI on similar grounds. The plaintiffs allege that Copilot — GitHub’s AI-enabled coding assistant trained on public repositories — unlawfully reproduced software without crediting the original authors. The presiding judge dismissed the copyright infringement claims; however, allegations of open-source license violations and breach of contract remain pending.
The following year, the same law firm represented authors Sarah Silverman, Christopher Golden, Richard Kadrey, Paul Tremblay, and Mona Awad in their pursuit of damages and restitution from Meta and OpenAI for copyright infringement. Once again, the plaintiffs alleged that their work was illegally copied to train models without consent.
Initially, the U.S. District Court for the Northern District of California dismissed all claims in both cases except the alleged direct copyright infringement by OpenAI. However, testimony from Meta employees has since revealed that engineers torrented another online library, deliberately trained Llama on the protected text with Mark Zuckerberg’s approval, and stripped out attributions to the original authors. For now, the case remains in litigation.
Corporate copyright infringement claims
Longstanding, well-established media brands have also alleged that their work is being plagiarized. The New York Times, the Daily News, India’s Asian News International, and other regional news outlets have accused Microsoft and OpenAI of training Copilot and ChatGPT on millions of copyrighted publications and resharing written work without citation. Another copyright infringement case was filed against Perplexity by News Corp in October 2024. In both instances, the applications have arguably duplicated content, summarized text in a similar expressive style, and falsely attributed output to the publishers. Consequently, the parent company of Dow Jones, the Wall Street Journal, and the New York Post demanded that the LLM-powered search engine stop training AI models on its proprietary content and purge its database of copyrighted work.
A collective of record labels comprising Concord Music Group, Universal Music, and ABKCO Music sued Anthropic for direct and secondary copyright infringement and for violating the Digital Millennium Copyright Act (DMCA). The plaintiffs claim that the defendant altered and used their content to train Claude without proper licensing, then copied, distributed, and displayed song lyrics in its responses while omitting copyright management information. Separately, Getty Images alleges that Stability AI infringed on more than 12 million photos to produce Stable Diffusion, its open-source AI image generation model, and DreamStudio, its web interface. The multimodal AI developer also allegedly violated Getty’s trademarks by replicating the plaintiff’s watermark on generated images rather than applying provenance markings to distinguish them.
Although these corporate lawsuits are currently pending, lawyers from the Joseph Saveri Law Firm have compared the wave of copyright infringement claims against LLM developers to the litigation surrounding the music-sharing app Napster, “which everybody loved but was completely illegal.” As a result, Google, OpenAI, and Anthropic have explicitly indemnified their users against copyright infringement claims brought by third parties. However, the latter two LLM developers have specified that this protection is reserved for enterprise customers.
Individual & collective content licensing
To address these claims by News Corp and complaints raised by the New York Times, Forbes, and Wired, Perplexity launched a revenue-sharing partnership with the press. TIME, Der Spiegel, Fortune, Entrepreneur, The Texas Tribune, and WordPress.com are the first publishers to participate. The multi-year agreement between the LLM developer and the media calls for a “double-digit” revenue split alongside other mutual benefits.
OpenAI has resorted to licensing content from individual publishers as a hedge against legal repercussions, offering anywhere between one and five million dollars. The LLM developer has entered into agreements with domestic and international publishers spanning dozens of titles owned by News Corp, Dotdash Meredith, Vox Media, The Atlantic, FT Group, Le Monde, Prisa Media, Axel Springer, TIME, Hearst, and Condé Nast.
Such one-off agreements are not conducive to training and developing models at scale, which has led to the rise of content licensing collectives. Organizations like the Dataset Providers Alliance issue blanket licenses permitting vendors to use and distribute protected material. In other cases, LLM developers have procured private datasets to offset the limited availability of high-quality training data.
A promising future
Companies incorporating generative and agentic AI into products and operational workflows increasingly rely on hardened infrastructure and stringent security protocols to prevent data egress. LLM developers, meanwhile, remain answerable for training their models on copyrighted work. OpenAI and Google indemnify users against inadvertent infringement caused by republishing content sourced from their LLMs. As part of its settlement with the record labels, Anthropic agreed to enforce and expand guardrails that prevent Claude from reproducing lyrics owned by the rights holders. This commitment underscores the LLM developer’s mission to usher in an era of safe and ethical AI.
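Anthropic has not published how those guardrails work. As a purely illustrative sketch, one naive approach to such an output filter compares each candidate response against a corpus of protected text and refuses anything that reproduces it nearly verbatim. Every name and threshold below is a hypothetical assumption, not Anthropic’s method:

```python
def ngram_overlap(candidate: str, protected: str, n: int = 8) -> float:
    """Fraction of the candidate's word n-grams that appear verbatim
    in a protected text -- a crude proxy for verbatim reproduction."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    cand = ngrams(candidate)
    if not cand:
        return 0.0
    return len(cand & ngrams(protected)) / len(cand)

def apply_guardrail(response: str, protected_corpus: list[str],
                    threshold: float = 0.2) -> str:
    # Refuse output that substantially reproduces any protected work;
    # the 20% overlap threshold is an arbitrary illustrative choice.
    if any(ngram_overlap(response, work) >= threshold for work in protected_corpus):
        return "I can't reproduce those lyrics, but I can describe the song instead."
    return response
```

Production guardrails would likely be far more sophisticated – handling paraphrase, partial quotation, and formatting tricks – but the principle of screening output against rights-holder content before it reaches the user is the same.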