What’s your GAI plan if copyrighted material is disallowed by SCOTUS?
Mark Montgomery
Founder & CEO of KYield. Pioneer in Artificial Intelligence, Data Physics and Knowledge Engineering.
While much has been written about the indemnity offered by LLM chatbot and generative AI vendors, I argue that indemnity is really a mirage intended to attract the masses and large enterprises. When indemnity is denied, it will be done quietly, on an individual basis, and blamed on the user.
Copyright liability for LLM companies is an existential risk, and courts would likely require them to pay damages on behalf of customers anyway, so offering indemnity costs little if anything. For most businesses using vendor apps, the financial risk of copyright liability appears to me to be minor.
However, behind the mirage of indemnity is a much greater risk for organizations: the tens of billions of dollars (perhaps soon to be hundreds of billions) invested in technology run on copyrighted data. The more important risk management question for enterprise decision makers is therefore: what is their plan if SCOTUS orders chatbot providers to retrain their LLMs on data stores free of copyrighted material? Is that a risk business leaders should be taking? Many aren’t taking it; they are using a combination of ‘legal’ data of their own, augmented with licensed data and synthetic data, which is what we offer in our KOS.
From a technical perspective, the functionality of LLM chatbots would be greatly degraded if copyrighted material were removed, for both general AI and narrow AI. This is well understood by experts, which is presumably why OpenAI decided to run LLMs on large data stores known to contain copyrighted material, including content from news and book publishers.
The surprise to me and others is why a company like Microsoft would apparently knowingly enable the vast scaling of an LLM chatbot run on copyrighted material (I have known Microsoft since the early 1980s). Since Microsoft was the first big tech company to invest in such LLM bots and to offer indemnity, let’s use their own words for guidance. From Microsoft’s blog announcing the indemnity:
“As customers ask whether they can use Microsoft’s Copilot services and the output they generate without worrying about copyright claims, we are providing a straightforward answer: yes, you can, and if you are challenged on copyright grounds, we will assume responsibility for the potential legal risks involved.”
That may sound reassuring, but customers won’t be covered by the indemnity unless they use the technology correctly to mitigate the use of copyrighted content, including “guardrails such as classifiers, metaprompts, content filtering, and operational monitoring and abuse detection”. From an AI system architect’s perspective, if one wanted to protect against the use of copyrighted material, one wouldn’t run the bots on copyrighted data in the first place (obviously), and one would automate the filters rather than leave them up to the customer and/or interpretation after the fact.
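As a thought experiment, here is a minimal sketch in Python of what an automated, server-side guardrail could look like: a check that runs on every generated response and blocks output with heavy verbatim overlap against a corpus of protected works. Everything here is a hypothetical illustration (the function names, the 8-gram comparison, the threshold), not Microsoft’s actual implementation, and production classifiers are far more sophisticated. The point is architectural: the filter sits in the serving path, so it cannot be skipped or misconfigured by the customer.

```python
# Hypothetical sketch: an automated copyright guardrail in the serving path.
# All names and thresholds are illustrative assumptions, not a real product API.
import re
from dataclasses import dataclass


@dataclass
class FilterResult:
    allowed: bool
    reason: str


def ngram_overlap(text: str, reference: str, n: int = 8) -> float:
    """Fraction of n-word sequences in `text` that also appear in `reference`."""
    def ngrams(s: str) -> set:
        words = re.findall(r"[a-z']+", s.lower())  # normalize case and punctuation
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    text_grams = ngrams(text)
    if not text_grams:
        return 0.0
    return len(text_grams & ngrams(reference)) / len(text_grams)


def copyright_guardrail(output: str, protected_corpus: list[str],
                        threshold: float = 0.15) -> FilterResult:
    """Block model output that reproduces long verbatim spans of protected text."""
    for work in protected_corpus:
        overlap = ngram_overlap(output, work)
        if overlap >= threshold:
            return FilterResult(False, f"{overlap:.0%} 8-gram overlap with a protected work")
    return FilterResult(True, "no significant verbatim overlap detected")


# The guardrail runs on every response, server-side, before the user sees it.
protected = [
    "it was the best of times it was the worst of times it was the age "
    "of wisdom it was the age of foolishness",
]
draft = "It was the best of times, it was the worst of times, it was the age of wisdom, truly."
print(copyright_guardrail(draft, protected))  # blocked: overlap far above threshold
```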
“We believe the world needs AI to advance the spread of knowledge and help solve major societal challenges. Yet it is critical for authors to retain control of their rights under copyright law and earn a healthy return on their creations. And we should ensure that the content needed to train and ground AI models is not locked up in the hands of one or a few companies in ways that would stifle competition and innovation.”
I completely agree with Microsoft’s statement here, as do the large numbers of authors and artists suing for copyright infringement, but of course that is not what is occurring with chatbots, including the largest of them, ChatGPT, made possible by Microsoft’s investment and cloud infrastructure. Microsoft appears to be claiming one thing and doing just the opposite, albeit through its chatbot partner OpenAI.
The perspective from the Fourth Estate
Let’s take a look at the situation from the perspective of the news and media industry. The News/Media Alliance (N/MA) recently released a research report that argues:
“the pervasive copying of expressive works to train and fuel generative artificial intelligence systems is copyright infringement and not a fair use”.
The N/MA’s 2,200 members range from international newspaper and media companies to small-town newspapers.
The report is quite extensive and interesting (I highly recommend reading it carefully). Among many other points, it states:
“The GAI copying for “training” does not serve a purpose different from the original works because LLMs typically ingest (i.e., copy) valuable news, magazine, and digital media web content for their written expression, so that they can mimic that very form of expression.”
The quote above speaks to one of the two most important factors in "fair use" of copyrighted material: the purpose and character of the use. The other important factor is the impact on the market for the original copyrighted material, or competition, and it’s quite clear the intent of LLM chatbots is to compete directly with the authors, artists, and publishers of the original works the LLMs are trained on:
“The use of these models to provide complete narrative answers to prompts and search queries goes far beyond the purpose of helping users to navigate to original sources (i.e., search) that has been found in the past to justify the wholesale copying of online content to build search engines. Indeed, GAI developers boast that users no longer need to access or review such sources.”
One of the most relevant paragraphs in the report commissioned by the N/MA, which I think should serve to guide decision makers, is the discussion of the recent SCOTUS ruling in Andy Warhol Foundation for the Visual Arts v. Goldsmith, which held that the fact that a product or service “add[s] something new ... does not render such uses fair,” and that fair use must not undermine the “economic incentive to create original works, which is the goal of copyright.”
My position has been that scraping data owned by others for monetization was so obviously wrong and repugnant that we wouldn’t consider it (and never have). Likewise, the catastrophic risks inherent in consumer LLM chatbots are so severe that they should have been stopped immediately by emergency action until they could be thoroughly tested by independent experts. Here we are a year later and nothing has been done. Even the executive order on AI from POTUS did nothing to address current catastrophic risks.
As if the risk of accelerating bioweapons (and similar risk profiles) were not sufficient, transferring the world’s publicly accessible knowledge base to a small group of companies for monetization would, by my estimation, rapidly collapse the knowledge economy, which represents a significant portion of the economies of the most developed countries, including the U.S., Europe, and other democracies.
Democracy is not threatened by economics alone, however; the misuse of AI could also destroy the Fourth Estate (journalism). The N/MA report cites another case, Associated Press v. Meltwater U.S. Holdings, Inc., which involved the direct scraping of AP news content. Judge Denise Cote wrote:
“Permitting Meltwater to take the fruit of AP’s labor for its own profit, without compensating AP, injures AP’s ability to perform [its] essential function of democracy.”
Why control over our own data is essential
One of the most basic rights within the U.S. system (and most other democratic republics) is the right to control one’s property. Data now represents nearly everything of value, and by extension essentially all human rights, simply because everything is increasingly digitized. If we do not control our own data, then we effectively lose our rights, increasingly including even our own thoughts.
It’s not surprising that we are seeing a backlash against aggressive moves like scraping the world’s knowledge base and repurposing it for monetization, which, let’s be clear, ultimately provides control over markets and the economy. Creative artists and authors have had to take matters into their own hands due to the failure of their government to protect them, including ‘poisoning’ their work so that it distorts any bot that scrapes their data.
It should surprise no one that these people are fighting back to protect their property, craft, and ability to make a living. Organizations and entire industries are in the same situation, though they have more resources and better options to protect themselves.
My advice to enterprise decision makers has therefore been consistent since long before LLM bots were unleashed on the public:
1) Maintain ownership and control of your data at all times. Otherwise, sovereignty is increasingly impossible and displacement is likely. While the generative AI function takes longer to train using legal data, when done properly it’s far more accurate and much safer for all concerned.
2) Either build or adopt AI systems that have strong embedded governance over the entire system (called the CKO app in our KOS) and the ability to execute your organizational principles (see our EAI principles, which are executable with the KOS).
3) Extend your EAI system to partners and customers. Our digital assistant DANA is provided to all employees and can be extended to partners and customers in an easy-to-use and disciplined manner.
4) Design or adopt AI systems with integrity (including protection of property rights) that are trained on high-quality data, with accurate provenance and lineage to the source; a sketch follows this list. This greatly increases accuracy, safety, and security.
5) Adopt efficient systems rather than wasteful ones. LLMs can be used effectively on a limited basis when trained on high-quality data, but at web scale they are among the most inefficient and costly methods known, including environmental, financial, and human costs. We avoid most of the associated problems with precision data management, and have long engaged in research to greatly compress language, which dramatically reduces waste and cost.
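As a rough illustration of point 4, below is a minimal sketch of what provenance and lineage attached to training data can look like, with an audit step that clears only records whose license is on file. The schema and names here are illustrative assumptions for this article, not a description of the KOS internals.

```python
# Hypothetical sketch: training records that carry provenance and lineage,
# audited against licenses on file before any training run.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Provenance:
    source_url: str      # where the record originated
    rights_holder: str   # who owns the copyright
    license_id: str      # internal ID of the license agreement, or "OWNED"
    acquired: date       # when the data or license was obtained


@dataclass
class TrainingRecord:
    text: str
    provenance: Provenance
    lineage: list[str] = field(default_factory=list)  # transformation history


def audit(records: list[TrainingRecord], licensed_ids: set[str]) -> list[TrainingRecord]:
    """Return only records whose license is on file; flag the rest for review."""
    cleared = []
    for r in records:
        if r.provenance.license_id in licensed_ids:
            cleared.append(r)
        else:
            print(f"FLAGGED: {r.provenance.source_url} "
                  f"(license {r.provenance.license_id!r} not on file)")
    return cleared


records = [
    TrainingRecord(
        "Q3 incident report ...",
        Provenance("internal://reports/q3", "Acme Corp", "OWNED", date(2023, 10, 1)),
        lineage=["ingested", "PII-redacted"],
    ),
    TrainingRecord(
        "Scraped article text ...",
        Provenance("https://example.com/news", "Unknown", "NONE", date(2023, 10, 2)),
    ),
]
cleared = audit(records, licensed_ids={"OWNED", "LIC-2023-017"})
print(f"{len(cleared)} record(s) cleared for training")
```

The design choice the sketch captures is that licensing status is checked at the data layer, before training, rather than filtered out of model outputs after the fact.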
Conclusion
Compensation alone for our data isn’t enough. In the not-too-distant future (a few short years), AI bots will be able to perform at higher levels than today without requiring vast amounts of external data. Research in this area has been ongoing for years and is evolving rapidly. It’s imperative that all entities control and manage their own data, which, when designed correctly and combined with good software engineering practices, enables the control and effective management of AI systems. AI strategies should be future-proof.