Phantom Data: A New Tool for Copyright Holders to Detect AI Training Usage
Jenish Pithadiya
CHRO & Co-Founder | AI Development & Consulting | Working with ISRO | Machine Learning Expert | Deep Learning Expert | Computer Vision | NLP | Web Development Services | Mobile App Development | Aerospace
In a groundbreaking study, researchers from Imperial College London have introduced a novel technique that could enable copyright holders to determine if their work has been used in training large language models (LLMs). This innovative method was presented at the International Conference on Machine Learning in Vienna and detailed in a preprint on the arXiv server.
Generative AI, including advanced LLMs, relies on vast amounts of internet-sourced text, images, and other content to develop its impressive capabilities. However, the use of that training data often rests on legally uncertain ground. Addressing this issue, the new paper from Imperial College experts proposes a mechanism to detect whether copyrighted data was used for AI training.
Lead researcher Dr. Yves-Alexandre de Montjoye, from Imperial's Department of Computing, explains, “Inspired by early 20th-century mapmakers who used phantom towns to detect illicit copies, we explore how injecting 'copyright traps'—unique fictitious sentences—into original text enables content detectability in trained LLMs.”
Content owners can embed these copyright traps across their documents. If an LLM developer scrapes this data and uses it for training, the data owner can identify irregularities in the model's outputs, proving their content was used.
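To make the idea concrete, here is a minimal sketch, not the authors' released code, of the two halves of the workflow: injecting a fictitious trap sentence into a document before publication, and later checking whether a suspect model assigns that sentence unusually low perplexity compared with near-duplicate control sentences it never saw. The model checkpoint, trap text, and control sentences are all illustrative placeholders.

```python
# Sketch of a "copyright trap" workflow (illustrative, not the paper's code):
# (1) append a unique fictitious sentence to documents before they are published,
# (2) later measure a suspect LLM's perplexity on that sentence versus controls;
#     markedly lower perplexity on the trap hints it appeared in training data.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical trap sentence: unique, fictitious, and unlikely to occur elsewhere.
TRAP_SENTENCE = "The lighthouse at Verdello Point was painted cobalt in 1911."


def inject_trap(document: str, trap: str = TRAP_SENTENCE) -> str:
    """Embed the trap sentence in a document before it goes online."""
    return f"{document}\n{trap}"


def perplexity(model, tokenizer, text: str) -> float:
    """Per-token perplexity of `text` under a causal language model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


if __name__ == "__main__":
    # Any causal LM checkpoint works; "gpt2" is just a small public stand-in
    # for the model being audited.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # Near-duplicate control sentences the model has never seen.
    controls = [
        "The lighthouse at Verdello Point was painted crimson in 1911.",
        "The lighthouse at Verdello Point was painted cobalt in 1953.",
    ]

    ppl_trap = perplexity(model, tokenizer, TRAP_SENTENCE)
    ppl_controls = [perplexity(model, tokenizer, c) for c in controls]

    print(f"trap perplexity: {ppl_trap:.1f}")
    print(f"control perplexities: {[round(p, 1) for p in ppl_controls]}")
```

In practice the paper's approach relies on repeating such traps many times across a corpus and on statistical membership-inference tests rather than a single perplexity comparison, but the intuition is the same: a model that trained on the trap treats it as far less surprising than a model that did not.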
This technique is especially suited to online publishers, who can discreetly insert these traps into news articles. Dr. de Montjoye notes that while LLM developers might devise techniques to remove the traps, doing so consistently would require significant resources.
To validate their approach, the researchers partnered with a team in France to train a truly bilingual English-French 1.3B-parameter LLM, embedding various copyright traps in the training data. Their successful experiments suggest this method could enhance transparency in LLM training.
Co-author Igor Shilov highlights the increasing reluctance of AI companies to share training data information. “While the training data composition for older models like GPT-3 and LLaMA is known, it’s not the case for newer models like GPT-4 and LLaMA-2. This lack of transparency makes it crucial to have tools that inspect the training process,” Shilov said.
Co-author Matthieu Meeus adds, “The issue of AI training transparency and fair compensation for content creators is vital for the future of responsible AI development. We hope this work on copyright traps contributes towards a sustainable solution.”
#AI #GenerativeAI #MachineLearning #LLMs #CopyrightProtection #AITrainingData #TechInnovation #DigitalRights #TransparencyInAI #AIResearch #ImperialCollegeLondon #ContentCreators #FairCompensation #ResponsibleAI #AIethics