Meta's Fair Use Argument: Part 3
David Atkinson
AI Legal Counsel | A.I. Ethics and Law | University Lecturer | Veteran
If you missed Part 2, you can find it here: Part 2.
First, some notes:
I hadn't mentioned this before because it's implied, but because Llama is not meaningfully different from other large language models (LLMs), a finding of fair use for Meta would mean virtually anyone could do the same, undermining any potential licensing market. And because there is no set definition of LLMs, a lot of people and companies could hide behind fair use. This doesn't even consider spinoffs, such as large reasoning models (LRMs), which are merely a variation on LLMs. The breadth of a fair use finding would be immensely consequential.
To give an idea of the scope, Hugging Face hosts over 900,000 AI models. Not all of them are LLMs, but a significant chunk are. Each LLM requires the use of a trove of copyrighted content without authorization. If things swing Meta's way, virtually all of those LLMs will be protected by fair use.
Regarding the number of outputs that are likely infringing, here are the results from one study, which strongly indicate that Llama is not merely learning statistical facts about words, separate from the protected expression:
The Association for the Advancement of Artificial Intelligence's workshop on copyright has a slide deck with several other studies affirming that Llama is often the worst offender when it comes to regurgitating copyrighted content verbatim or with only slight modifications.
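For readers curious what these regurgitation tests look like in practice, here is a minimal sketch of the general approach, assuming access to an open-weights checkpoint through the Hugging Face transformers library. The model name and excerpt are placeholders, not the actual setups used in the cited studies.

```python
# Minimal sketch of a verbatim-regurgitation probe; the checkpoint name and
# excerpt are placeholders, not the cited studies' actual setups.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder open-weights checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Split a known passage into a prompt (prefix) and its true continuation.
passage = "..."  # an excerpt from a book suspected to be in the training data
prefix, truth = passage[:200], passage[200:400]

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy decoding
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# A long exact match between the model's continuation and the original text
# suggests memorization of protected expression, not just word statistics.
overlap = len(os.path.commonprefix([completion, truth]))
print(f"Verbatim overlap with the original continuation: {overlap} characters")
```

Published studies typically run this kind of probe over many excerpts and with more careful overlap metrics, but the basic prefix-and-continuation idea is the same.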
OK, back to our regularly scheduled programming, picking up on page 33 of Meta's filing.
1. Plaintiffs' Motion Should Be Denied in Full
Dr. Choffnes opines only that there is a greater than 99.99999% chance that Meta uploaded at least one piece of Plaintiffs' works. Id. at 25. That opinion is based on faulty reasoning, as Meta's expert, Ms. Frederickson-Cross, attests in her declaration, raising at the very least a genuine issue of disputed fact on distribution ... Plaintiffs cannot rely on disputed allegations that Meta distributed their works--a determination on which they do not seek summary judgment and implicitly concede they cannot establish as a matter of law--to overcome Meta's defense that its copying of Plaintiffs' works to train Llama was fair. Meta looks forward to addressing the facts and law undercutting the viability of any distribution claim when the newly ordered discovery and expert work is completed.
I am adding this here because the court will likely have to revisit it. I have a feeling Meta will have to eat these words, just as they had to eat their claim from 2023 that they didn't copy any works of the plaintiffs (the quote was included in Part 1 of this series). It's my understanding that Meta allowed distribution from torrenting when they gathered data for Llama 3 and only blocked reuploading when torrenting for Llama 4 (the model that will train on 30 to 60 trillion tokens...meaning a heckuvalot of copyrighted material was pirated!).
2. Plaintiffs' Motion Should Be Denied in Full
As explained above in the discussion of the first fair use factor, any evidence of bad faith in copying works for a transformative purpose is of little, or no, consequence to the fair use analysis. Further, Meta downloaded copies of datasets that included Plaintiffs' books for the fair use purpose of training Llama, which does not contain, replicate, reproduce, or distribute those works or let anyone see or read them.
Again, if the idea is that any transformative use is dispositive of whether something can be taken in bad faith, the judge will have to explain why the same logic would not apply to any given human the way Meta wants it to apply to Meta.
As touched on in Part 2 (and more completely here), humans are more likely than LLMs to make transformative uses of copyrighted works. It's not at all clear why society should grant greater legal privileges to a for-profit mega-company than to an ordinary human being when the reasoning applies equally to both.
Additionally, as discussed earlier in this series, if the court permits entities to extract copyrighted works from known pirate sites to create products they deem to be fair use, the ruling will likely lead to a massive increase in torrenting. It will also create a strong incentive for people to take copyrighted works and post them to pirate sites, where they can be laundered under the guise of fair use. This would extend beyond books to movies, songs, images, and everything in between. The shortsightedness of Meta's argument and the accompanying hubris blow my mind.
3. Plaintiffs' Motion Should Be Denied in Full
[T]here is no allegation or evidence that the copies Meta made were used for reading Plaintiffs' books--by Meta employees or anyone else.
LLMs were designed to mimic the human brain (the analogy isn't perfect, but that's why we use terms like neural net). The goal is to train a model that acts like a brain, but on a larger scale. When Meta says nobody reads the works, they're eliding the fact that the works were "read" by the LLM, which is just a way of externalizing a brain, or offshoring the work (if that metaphor works better for you). If it's not lawful for a human to read the works, it can't be lawful just because a model designed to make billions of dollars for a tech company is fed the same content.
But also, humans do read portions of the datasets. It's a best practice, as noted by AI researchers at Princeton, Ai2, and elsewhere.
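To make concrete what "read by the LLM" means, here is a minimal sketch of standard next-token-prediction training, assuming the Hugging Face transformers library. The checkpoint name and text are placeholders, and this illustrates causal-language-model training generally, not Meta's actual pipeline.

```python
# Minimal sketch of how a work is "read" during pre-training; the checkpoint
# and text below are placeholders, not Meta's actual training setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder open-weights checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

book_text = "..."  # the full text of a work included in the training corpus
ids = tokenizer(book_text, return_tensors="pt")["input_ids"]

# Next-token prediction: the model is pushed to reproduce each token of the
# work given the tokens before it, and its weights are updated from that loss.
outputs = model(input_ids=ids, labels=ids)
outputs.loss.backward()  # gradients from the work's own text flow into the weights
```

In other words, every token of the work passes through the model as both input and prediction target; that is the sense in which the model, rather than a human, does the "reading."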
1. The Court Should Grant Summary Judgment to Meta on the DMCA Claim
Plaintiffs rely on innuendo, not evidence, that Meta removed CMI with culpable scienter. The record, however, shows that CMI removal had nothing to do with concealing infringement. The Meta engineer whose team wrote the script to remove certain text from Libgen testified that he chose the sequences of text that were removed because they commonly occurred in the books and do not bring any value to training.
While not suited for summary judgment, I do have some questions about this:
2. The Court Should Grant Summary Judgment to Meta on the DMCA Claim
Other Meta witnesses testified that removal of duplicative text in training data is standard to avoid overfitting (i.e., memorization) and improve model performance.
Removing duplicative text is indeed standard practice. But what Meta is implying here is that, across millions of torrented documents, the author names, publishers, publication dates, and so on were exact duplicates; after all, exact matches are what "duplicative" means. And, of course, the content was not identical across all of the documents, or even most of them.
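To make the distinction concrete, here is a minimal sketch in Python, with hypothetical data and function names (this is not Meta's actual script), contrasting exact-duplicate removal with stripping copyright-management-style text:

```python
# Minimal sketch contrasting two different cleaning steps; the data and
# function names are hypothetical, not Meta's actual pipeline.

def deduplicate(documents: list[str]) -> list[str]:
    """Drop documents whose full text is an exact duplicate of one already seen."""
    seen, unique = set(), []
    for doc in documents:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def strip_cmi(document: str, cmi_markers: list[str]) -> str:
    """Remove lines containing copyright-management-style text (e.g., 'Copyright ©',
    'All rights reserved', publisher imprints), whether or not the line is
    duplicated anywhere else in the corpus."""
    return "\n".join(
        line for line in document.splitlines()
        if not any(marker.lower() in line.lower() for marker in cmi_markers)
    )

# Deduplication keys on identical *content*; stripping CMI removes notices that
# differ from book to book (author, publisher, date), which is the distinction
# the commentary above is drawing.
books = [
    "Title A\nCopyright © 2001 Author A\nChapter 1 ...",
    "Title A\nCopyright © 2001 Author A\nChapter 1 ...",  # exact duplicate
]
print(len(deduplicate(books)))  # -> 1
print(strip_cmi(books[0], ["Copyright ©", "All rights reserved"]))
```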
3. The Court Should Grant Summary Judgment to Meta on the DMCA Claim
Any purported intent to conceal is also belied by Meta's public disclosure of its use of Books3 upon release of Llama 1, alleged in Plaintiffs' initial complaint, and the fact that any datasets used to train more recent models were disclosed to Plaintiffs in discovery. (explaining the timeline of Meta's production of books-related datasets to Plaintiffs).
I guess Meta is saying here that it expects authors, publishers, and others to learn what data was used for training by reading Llama's documentation, not from the model itself, which Meta heavily promotes and advertises.
In other words, if Meta doesn't reveal it in the model itself (the thing Meta trumpets has been downloaded over a billion times), that's OK because it's in the documentation (which has probably been downloaded tens of thousands of times at most, almost entirely by researchers).
That's all the time we have today, friends.
Founder at Axate, consultant on interaction between media, AI and copyright
4 days ago: The continued undermining of legal protection for copyright will lead to content being withdrawn from general access and ever-greater protections. If you cannot publish in public without the protection of the law, then you can't publish in public.
Intellectual Property Attorney - Legal, Managerial, Technical - USPTO Reg. 60652
5 days ago: "Under the guise of" seems like rather an overuse of categorizing an otherwise legitimate action as a sham. Methinks that you had a preconceived notion and simply applied confirmation bias in your statements to arrive at the conclusion that you wanted to reach.