Do Microsoft and OpenAI Have a Fair Use Defence against the NYT Copyright Complaint?
Aleksandr Tiulkanov
Upskilling people in the EU AI Act - link in profile | LL.M., CIPP/E, AI Governance Advisor, implementing ISO 42001, promoting AI Literacy
To follow up on yesterday's post about The New York Times vs Microsoft and OpenAI: people are asking me whether the fair use reasoning from Authors Guild v Google (the Google Books decision) applies to a situation where chatbots are allegedly providing users with near-wholesale copies of NYT articles.
What is Fair Use?
To answer this question, we need some background. US copyright law includes the doctrine of fair use, codified in Section 107 of the Copyright Act of 1976, which permits an otherwise copyright-infringing use, provided the court decides that in a particular case, all things considered, that use is "fair".
This is assessed against the following four factors, weighed on balance (that is, none of the four is decisive alone):
1. The purpose and character of the otherwise infringing use, including whether it is commercial;
2. The nature of the copyrighted work, i.e. whether it leans towards purely factual description of the world or towards highly subjective and original creative expression;
3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4. The effect of the use upon the potential market for, or value of, the copyrighted work.
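To make the "assessed on balance" idea concrete, here is a deliberately crude sketch in Python. This is an illustrative abstraction only: courts weigh the four factors holistically, not arithmetically, and the class name, the three-value scoring scale, and the example assessment below are all my own hypothetical constructs, not anything taken from the statute or the complaint.

```python
from dataclasses import dataclass

@dataclass
class FairUseAssessment:
    """Hypothetical model of the four Section 107 factors.

    Each factor takes -1 (weighs against fair use), 0 (neutral),
    or +1 (weighs in favour of fair use). Real analysis is a
    holistic judicial judgment, not a numeric score.
    """
    purpose_and_character: int
    nature_of_work: int
    amount_and_substantiality: int
    market_effect: int

    def leaning(self) -> str:
        # "On balance" here is modelled as a simple sum;
        # no single factor decides the outcome by itself.
        total = (self.purpose_and_character + self.nature_of_work
                 + self.amount_and_substantiality + self.market_effect)
        if total > 0:
            return "leans towards fair use"
        if total < 0:
            return "leans against fair use"
        return "balanced"

# A hypothetical reading of the positions discussed in this article
# (the -1/+1 values are my illustrative guesses, two of them disputed):
nyt_case = FairUseAssessment(
    purpose_and_character=-1,      # commercial use
    nature_of_work=+1,             # largely factual reporting
    amount_and_substantiality=-1,  # alleged near-wholesale copying (disputed)
    market_effect=-1,              # alleged market substitution (disputed)
)
print(nyt_case.leaning())  # -> leans against fair use
```

The point of the sketch is only the structure: each factor contributes a direction, and the conclusion emerges from their combination rather than from any one factor alone.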
Can the Alleged Use of the NYT Materials in Chatbots Be Deemed Fair?
1. Purpose and character of use: If the complaint is to be trusted, this factor very likely weighs against the defendants, as OpenAI and Microsoft's use of NYT article content for training LLMs is commercial. They do not open-source their models and weights as, for example, Mistral currently does.
2. Nature of copyrighted works: This second criterion, in my estimation, slightly favours OpenAI and Microsoft, as regular newspaper material normally leans towards factuality and descriptiveness. The NYT articles, one would suppose, are not generally fiction, and creativity is only a secondary objective.
3. Amount and substantiality of the used portions: The degree to which this third fair use criterion is met is still to be determined. The NYT has produced a voluminous exhibit full of almost wholesale-copied NYT articles, but many commentators have questioned the NYT's prompting tactics.
That is, an average chatbot user is allegedly very unlikely to obtain the same outputs full of wholesale-copied NYT material. Further evidence and independent assessment may be necessary to establish whether the NYT has a case on this third factor, and OpenAI has both the time and the means to alter its LLM outputs so that no further wholesale copying occurs, if it ever did.
4. Effect on market and value of copyrighted works: Whether this fourth fair use factor is met also remains to be determined. The NYT argues that the defendants' use indeed falls afoul of it.
It is on this fourth (market effect) and also third (amount of the used portions) criterion that this case differs from Google Books, where Google prevailed on fair use grounds.
In Authors Guild v Google, the United States Court of Appeals for the Second Circuit said: "Google's unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals."
Google Books limits the viewable portion of each book to such an extent that using that service cannot substitute for actually buying the book. In the case of newspaper articles, you cannot deliver pages worth of content without infringing on the copyright — if that is in fact what happens.
The defendants' chatbot outputs are also arguably very different from traditional web search results. In the case at hand, the NYT alleges that, unlike traditional search engine-delivered snippets, the outputs of ChatGPT, Bing Chat and the like extensively reproduce the NYT articles and do not provide prominent hyperlinks to them.
This way, the defendants arguably disincentivise users from visiting NYT resources, as chatbot outputs may in fact serve as an adequate substitute for reading the articles themselves. OpenAI and Microsoft may therefore in fact be competing in the very market in which the NYT itself operates.
And by obviating the need for the users to visit the NYT resources, defendants are arguably preventing the NYT from obtaining the profits they would otherwise get from letting the users through the paywall and from ad and referral revenue.
This holds, at least, if we take the evidence in the exhibit to the NYT complaint to be a true representation of how the defendants' chatbots actually operate, which, as I have noted, is already disputed by some commentators.
What's Next?
The fate of this case remains to be seen. In fact, it may never reach the stage at which the court renders a final decision: some commentators suggest that the complaint is merely a negotiating tactic by the NYT to push the defendants towards a royalty scheme acceptable to the claimant.
If the parties settle, that would not be an unexpected outcome. However, from a societal and business perspective, much more clarity would be gained if the case proceeds to trial and a precedent-setting decision is made (and affirmed on appeal) on whether the use of copyrighted material for training AI systems similar to those operated by OpenAI and Microsoft constitutes fair use.
To that end, it would be helpful to have robust research on the extent to which operating generative AI systems at scale affects the market for and value of the original copyrighted works on which they are trained, and on the extent to which consumers consider these systems' outputs adequate substitutes for those original works. Chances are, some researchers will be helping us on this front soon.