Foundational models will make or break news media, pending formal agreements.

In the space of a few days, OpenAI struck a deal with the Financial Times to train models on its news content, and was sued by eight other newspapers alleging copyright infringement over the use of theirs.

What’s happening? A backlash. The list of news media organizations alleging that their content has been unlawfully used to train artificial intelligence is growing. So is pressure on foundational model companies to be more transparent about the provenance and legality of their training data.

Inference: OpenAI’s deal with the FT is a sign that the news business is beginning to hedge its bets. Striking licensing deals for content used to train foundational models (and to create innovative news products) could be the only path forward for newsrooms if legal cases fall short of proving any wrongdoing by model creators.

This is an edition of Inferences, by Minerva Technology Policy Advisors.

Why does it matter? The tussle over news archives is part of a bigger fight over who controls access to the data needed to train models. Newsrooms say that using their content without permission or payment violates their intellectual property rights. OpenAI has made its view clear in the past, saying that it would be “impossible” to train cutting-edge foundational models without using copyrighted materials and arguing that doing so constitutes “fair use.” The resulting legal contests could have massive implications for the future of news media and of foundation model companies.

The background is that earlier this year, The Intercept, Raw Story and AlterNet launched similar suits. High-profile authors, including George R. R. Martin, Sarah Silverman, Ta-Nehisi Coates and Mike Huckabee, have put their names to claims that foundational models were trained on their work and that the models’ outputs therefore constitute infringement. Now the New York Daily News, Chicago Tribune, Denver Post and several others are accusing OpenAI of “purloining millions” of pieces of content without permission. These cases join arguably the highest-profile claim: a dispute between Microsoft, OpenAI and The New York Times, in which an estimated $450 billion in damages is at stake.

The disputes center on fair use of information gathered from the internet. Model companies contend that harvesting data for training is legitimate under fair use protections in US jurisdictions, but international copyright law is much less clear. Speculation already abounds that LLM companies have run out of data with which to train their models, having scraped and ingested most of the text-based internet. Proprietary data, or private data held under copyright, is an immensely valuable new frontier into which they could expand.

If the news companies are to win out, they will have to prove that the outputs from foundational models are sufficiently similar to their original articles. Some legal challenges have already been partially thrown out for lack of evidence that model outputs closely resemble the original stories and books, and OpenAI is confident enough in its position that it has promised to defend its enterprise customers and pay any costs incurred through legal claims around copyright infringement.

The compromise might be to make formal licensing agreements that give model companies access to articles as training data, and bring in a dividend for the news organizations. One news executive at a major UK-based publication told Inferences that they are considering licensing agreements with foundational model companies that would help them to generate new, composite articles for their readers that would feature personalized content, themes of interest, or even emphasize the views of a particular journalist or commentator. In return, model companies could gain access to a wealth of rich text data to train on.

The upshot is that these agreements could make or break many news platforms that lack the financial and legal firepower to challenge model companies over copyright claims, yet still hope to remain relevant in a media landscape increasingly dominated by personalized, automatically generated content, where news becomes more conversational.

Beyond the news business, similar copyright and intellectual property disputes extend to other forms of media, including video content, as foundational model video-generation capabilities improve and become more commonplace. Audio, and voice-generation in particular, are already becoming more contentious. Code generation is another frontier set to become more hotly disputed, with Microsoft and its subsidiary GitHub already under pressure over whether their coding co-pilots are expropriating ideas that belong to other people without giving them due credit.

Worth watching: The copyright fight around foundational model training data could have wider geopolitical implications, too. Governments are likely to step up scrutiny of who has access to domestic companies’ or citizens’ proprietary data as the scramble for new data to feed into foundation models intensifies. That could further stoke debates about digital protectionism and so-called “data sovereignty”. If foundation models trained on a corpus of data from relatively more liberal Western media perform better and have wider applications than models trained in countries with stricter censorship, it could give the US and other liberal democracies advantages over countries like China in the development of groundbreaking LLM applications.
