Unsettled Copyright Questions
David Atkinson
In-House Legal Counsel for AI Institute | A.I. Ethics and Law | University Lecturer | Veteran
There are still many, many unsettled questions swirling around the intersection of copyright and AI. Here, we'll explore them in point-counterpoint fashion.
Licensing Materials
Argument
If training on copyrighted materials without consent or compensation to the copyright owner is not fair use, it's an existential problem for GenAI. Either advanced language models cannot exist without the fair use exemption, because gathering the data through traditional licensing is too burdensome and expensive, or the only entities that can train high-performing LLMs will be those that either already have the data (Adobe, Google, Facebook) or have enough money to acquire it via licensing (Microsoft, Amazon, Apple). Nonprofits and startups will never be able to meaningfully participate or compete.
Counterargument
Why shouldn't AI companies have to work out a licensing agreement with copyright holders, or at least make a good-faith attempt to? It is premature to claim, without effort, that such an approach is impossible or impracticable. The right of a company to create GenAI should not trump the right of a copyright owner to profit from their labor. There shouldn't be favoritism toward tech companies over writers, artists, and musicians.
And, once a workable licensing scheme is created at the expense of the largest companies, the same scheme can apply to smaller entities. That is, it may well be expensive to create a scheme to properly license copyrighted material, but it’s an upfront cost that can be borne by the entities best positioned to create and fund its development.
Alternatively, even if fewer companies can build high-quality GenAI, is that necessarily bad? We must also consider whether it's better to have fewer AI companies but better compensated creators, or more AI companies but virtually no compensation for creators. Which outcome produces a more interesting, enjoyable, inspiring, and useful world?
Expressive vs. Underlying Ideas
Argument
The material models train on is collected and analyzed only for its underlying factual basis. A model may use some of your ideas, but as long as it doesn't create a substantially similar output, it's not infringing. The same is true for humans, who can be inspired by the essays of others and may even write a similar essay on the same topic. Copyright protects expression, not ideas.
Counterargument
Why should the use of training material only be considered informational and not also expressive? Data curation doesn't separate expressive elements from the "ideas"; the model trains on all of it. The sentences LLMs produce are not merely functional, and they must gain their expressive qualities from something other than the non-copyright-protected ideas and facts of the material the LLMs are trained on. And even if the LLM doesn't care about the expressive quality of a sentence as such (i.e., it doesn't care for a particular turn of phrase), the AI creators (i.e., the humans) have curated its training dataset based partly on the quality and quantity of expressive writing. Better writing (as judged by a human) leads to better GenAI. A model trained only on 4chan and YouTube comments would not be good. So it can be argued that the expressiveness of training material is an essential element AI creators want the LLMs to internalize (i.e., be more like a novelist or nonfiction author and less like an angry, overly political uncle). Finally, if expressiveness weren't valuable, AI companies wouldn't give greater weight to certain sources, as they allegedly do with the New York Times, even when the facts expressed by the Times can be found in Reuters, Axios, and other sites.
Pirated Materials
Argument
We need not argue that all copyrighted works should be considered fair use as inputs. If that argument fails, it should surely be the case that all works that were created and shared with the intention of being free, like most content on the internet, should be fair game for training language models. Humans are allowed to learn from such materials, and AI should not be treated any differently. If people choose to display their content in public spaces or forums, then they have no reasonable expectation that others won't use it for tasks like training GenAI. This is similar to a filmmaker shooting a short film and capturing a famous statue in a public setting in the background. The statue is in a public space, and its presence preserves the context of the film's setting. The filmmaker does not have to pay the statue's creator royalties despite the creation being incorporated into another work of art for profit.
Counterargument
Often, perhaps in the vast majority of situations where the content is not merely a comment or social media post, the reason the content is shared freely is that the creator receives compensation from advertisements. This is the case with food blogs, for example. The content isn't truly free; the expense is meant to be recouped by other means. In other cases, as with open-source code and perhaps some blogs, the author is sharing their content freely in exchange for the opportunity to build an audience, receive recognition, and/or build a reputation for the quality of their creations. Allowing language models to simply take the content and then provide no attribution to the creators at any time undermines the entire structure and purpose behind how the content was shared.
LLMs as Humans
Argument
LLMs should receive the same benefits as humans. Humans are allowed to read anything on the public web and learn from what they read, so why can't LLMs? Throughout our lives, humans learn something new every day, and we retain some part of what we learn in short- and long-term memory. We retain many kinds of information: words from a book, pictures from an article, or even how we felt when watching a movie. What LLMs do is virtually identical.
Our brains retain vast amounts of information from our environment, including copyrighted materials, so that whenever we have to come up with an "original" idea, we can draw inspiration from what we already know. Models like ChatGPT are doing the same thing when we ask them to generate an answer to our inquiries. Whatever machine learning does with training data, it has no more impact on the original authors and artists than our learning process does when we absorb information from the internet, or anywhere else.
Counterargument
LLMs do not function the same as humans, and we should not treat them as the same or significantly similar for the purposes of legal analogies. For example, a human may retain some ideas from some articles, but we forget most of what we read, watch, and see. Very few people can even recall all the articles and books they read in the past week. And while the pirated dataset Books3 held 196,000 books, even a 40-year-old who has read an average of one book a week for their entire adult life will have read no more than about 1,200 books (roughly 22 adult years × 52 books a year ≈ 1,150). This is not an apples-to-apples comparison.
GenAI is different. It can remember, forever, the contents of books, songs, movies, articles, essays, and so forth in remarkable detail. To be clear, the hardware of the brain is not the same as the hardware of GenAI servers.[1] While GenAI can summarize a work in a matter of seconds, a human takes at least several minutes to complete the same task, assuming they even recall enough information about the material offhand to make a useful summary.
Moreover, and perhaps more importantly, if the argument is that GenAI should receive free access to all copyrighted content because the outputs are transformed, wouldn't that mean humans could use the same argument to take and consume copyrighted works as long as they don't use them to create substantially similar items? Because humans don't have the ability to memorize billions of chunks of text, audio, and images at a level that would count as plagiarism in academia, humans are less likely to produce substantially similar outputs at scale. That is, the output from a human will almost always be more transformative than an output from an LLM.
As AI Snake Oil puts it:
Note that people could always do these kinds of repurposing, and it was never a problem from a copyright perspective. We have a problem now because those things are being done (1) in an automated way (2) at a billionfold greater scale (3) by companies that have vastly more power in the market than artists, writers, publishers, etc. Incidentally, these three reasons are also why AI apologists are wrong when claiming that training image generators on art is just like artists taking inspiration from prior works.
LLMs as Non-Human
Argument
LLMs should not be treated like humans for some aspects of copyright law. Unlike humans, who have subjective experiences when they consume copyrighted material of any sort, an LLM has no such experience. We should draw the line at whether the consumer of information (LLM or human) derives "enjoyment" from the information. If there is enjoyment of the work, then copyright infringement for reproduction may have occurred; if there is no enjoyment, then no infringement occurred. Therefore, an LLM does not infringe when it trains on the data, because LLMs have no subjective experience and cannot enjoy the copyrighted works. Oren Bracha, a law professor at UT-Austin, makes the argument this way:
[N]on-expressive copies do not infringe for a more fundamental reason and with no need to reach the fair use exemption: they do not fall within the subject matter domain of copyright. Non-expressive copies do not involve any use or access to the protected expression as expression and therefore no copyrightable subject matter is taken or enjoyed by the user. The non-expressive reproduction is a mere physical fact that has nothing to do with copyright whose proper and only domain is expression.[2]
A second argument is that the use of content to train LLMs radically transforms the content in the process. For example, when a human consumes a book or a piece of art, the human consumes it in its original form: the book is written with words and the human reads those very same words; the art is created with colors and the human consumes those colors. In contrast, machines consume words, colors, sounds, and all other inputs as numbers. In other words, the original content is necessarily transformed at the beginning of the training process, and to such an extent that someone viewing the list of numbers associated with the words will have no idea what they are looking at.
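To make the words-into-numbers point concrete, here is a minimal sketch in Python using a toy vocabulary invented for illustration (real tokenizers use schemes like byte-pair encoding, but the principle is the same):

```python
# A toy illustration of tokenization: text becomes integers before a
# model ever sees it. The vocabulary here is invented for illustration;
# production tokenizers are more sophisticated, but the principle holds.

text = "The quick brown fox"

# Assign each distinct word an arbitrary integer ID.
vocab = {word: i for i, word in enumerate(sorted(set(text.split())))}
inverse_vocab = {i: word for word, i in vocab.items()}

token_ids = [vocab[word] for word in text.split()]
print(token_ids)  # [0, 3, 1, 2] -- numbers, not words

# The mapping is mechanical and reversible: the IDs decode back to the
# exact original text.
decoded = " ".join(inverse_vocab[i] for i in token_ids)
print(decoded == text)  # True
```

A person shown only [0, 3, 1, 2] would have no idea what the underlying text was, which is the transformation this argument relies on; note, though, that the mapping is reversible, a point the counterargument below picks up.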
A final argument is that, unlike humans, LLMs are interested only in the metadata of the content they are trained on. They care only about the associations between words and concepts. These, notably, are not protected by copyright law. To say an LLM understands content enough to have any appreciation of its expressiveness is misleading because GenAI does not possess any meaningful understanding of what it's doing. It is merely making probability-based guesses at how words or pixels should relate to one another based on how they've related to each other during pretraining.
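To make the "probability-based guesses" point concrete, here is a minimal sketch in Python, with invented numbers, showing how a model turns scores for candidate next words into a probability distribution and then samples from it:

```python
import math
import random

# Hypothetical scores ("logits") a model might assign to candidate next
# words after a prompt like "The cat sat on the". These numbers are
# invented for illustration; a real model scores every token in a
# vocabulary of tens of thousands.
logits = {"mat": 4.0, "floor": 2.5, "roof": 1.0, "theorem": -3.0}

# Softmax: convert raw scores into probabilities that sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {word: math.exp(v) / total for word, v in logits.items()}
for word, p in probs.items():
    print(f"{word}: {p:.3f}")  # "mat" dominates but is not certain

# The next word is sampled, not looked up, which is why the same
# prompt can yield different outputs on different runs.
next_word = random.choices(list(probs), weights=list(probs.values()))[0]
print("sampled:", next_word)
```

Nothing in this sketch consults the training text at generation time; the learned associations are all that remain, which is the sense in which the model "cares only about the association of words and concepts."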
Counterargument
The first argument assumes that enjoyment is defined as getting joy from consuming the work. However, joy is a subjective standard. For example, if I read a textbook, I am probably not doing it for joy; I read it for the underlying facts and ideas, and the words and illustrations are merely a means to an end.
If enjoyment is instead defined as having a subjective experience of what it’s like to consume a work, regardless of whether it’s joyful, this implies the cutoff for who is infringing and who is not is based on consciousness and intelligence (being self-aware and intelligent enough to consume the content in the manner it was intended to be consumed). This provides clarity for now (see the section in LEAI on AGI), but why should this be the line? What’s special about a subjective experience regarding whether something should be considered copyright infringement or not, and how does that distinction serve the purpose of copyright law to promote science and the arts?
If enjoyment is defined as deriving value from the work, how is the LLM not deriving value just as a human derives value? If there were no value in the work, the model would not train on it. Further, why shouldn't the ability to derive value at all (whether it's to find joy or to create a helpful AI system) take precedence over the type of value (joy versus helpful AI)?
Another way to think of the difference between a human enjoying a work and the LLM consuming the work is to consider that the LLM is merely an extension of the human brains that control the LLM company. The humans at the company get the benefit of the LLM consuming the works without having to consume them themselves. In essence, rather than learning everything themselves, which would make them incredibly valuable to society, they outsource that task to the machine. They then profit from others paying to use the LLM. That is, the LLM is barely one step removed from the humans taking and consuming the works themselves.
Finally, for the initial argument, what if enjoyment and the value of a work's underlying ideas can't be separated? Can't humans both enjoy reading a work and find great practical value in the ideas expressed in it? If separation isn't possible, isn't the argument really just that LLMs should receive an exemption from copyright laws because…reasons?
For the second argument, about transformation of the content, if the transformation of words into numbers is thought to be meaningful for a fair use argument, note that the transformation that occurs in the brain is even more radical than what happens in LLMs. Whereas LLMs turn words into numbers (i.e., one symbol into another symbol), and those numbers can be re-converted to words, the brain turns words into electrochemical signals, and the process is irreversible. More notably, merely transforming a work from one medium to another is not important under copyright law. Converting a song sung at a concert into an mp3 file does not magically make the mp3 non-infringing. The same is true for turning words into numbers that are later, at the output stage, made into words again.
To dig a little deeper into how LLMs and humans learn ideas, what is the difference between how a model consumes information and how a brain consumes it? In both instances, the information is necessarily copied (to the weights and to the neurons in the brain, respectively). If anything, the copy in the brain seems more transitory and impermanent, and is therefore less likely to cause someone to make an infringing output. Paradoxically, perhaps, even with this lower retention in the brain, humans are still the only entity capable of creating truly novel insights and outputs (calculus, evolution, cubism, etc.) because of our ability to reason, be curious, and experiment. So, if we want to promote science and the arts, wouldn't looser permissions make more sense for humans than for AI models?
In short, if the argument hinges on the fact that a machine is consuming the text rather than a human, why is that distinction important regarding the impact on promoting science and the arts? How does allowing machines to consume the data promote science and the arts more than allowing humans to do the same thing? Or, in the inverse, why would allowing humans to consume it hinder science and the arts when allowing machines to do so would not?
For the final argument, how can one disentangle the expressive elements of works from the metadata and underlying facts and ideas? When a human reads a book, how are they not primarily consuming metadata (after the text is processed by their brain) rather than the expressive elements? It's generally not the expressive elements that people take away from books. Most people, at best, can paraphrase concepts from memory even when asked to restate only the expressive elements, and paraphrased concepts are precisely the portion of a work that copyright does not protect.
Only the Output Matters
Argument
The output (i.e., the generation) of the LLM is the only factor that should matter for copyright infringement analysis. This builds on the idea that LLMs and humans should be treated similarly. Just as humans aren't committing copyright infringement when they read a book and gather ideas from it, LLMs should also not be considered to infringe. While LLMs do make a reproduction of copyrighted material for their datasets, and then another copy within the model during pretraining, that copy is a necessary step to a non-infringing purpose.
Moreover, humans essentially make the same reproduction when reading: the words are transcribed into electrochemical signals that pass through the brain. So, if making a copy is not infringement, and it shouldn't be, then only the output matters. Only if the output is substantially similar to a copyrighted work's expression should an LLM be considered to be infringing.
Finally, it makes little sense to punish the copying of works to datasets and the model (i.e., the inputs). Consider a case where a model is never used to generate anything. How was anyone harmed by the copying of the material? Clearly the only thing that matters is whether the output infringes.
Counterargument
Why should output be the only concern for evaluating infringement by GenAI but not also the only concern for humans? As mentioned above, humans are far more likely to make non-infringing uses of inputs than LLMs, so if the underlying goal of copyright law is to promote the sciences and the useful arts, that goal is achieved at least as much by providing all copyrighted works to humans as by providing them to LLMs. Yet we'd think it absurd if all humans were allowed to take any books, movies, and songs from anywhere and consume them without permission or payment to the creators. Furthermore, we'd feel this way even if the human never made an infringing output.
So if a human is not allowed to consume works in such fashion, despite very likely making better use of them for society than AI, why should AI be allowed to do so?
Quantity of Infringing Output
Argument
Nobody can know with certainty what any given GenAI will produce as an output to any given prompt because the LLM works probabilistically, not deterministically. As such, the GenAI may, on rare occasions, produce verbatim or near-verbatim copies of data it was trained on.
However, this is a bug, not a feature, and the percentage of outputs that are substantially similar to the training data is vanishingly small. It makes no sense to find a GenAI developer or deployer liable for such outputs when it has no control over the outputs or how the GenAI is prompted by the user.
Moreover, if all it takes is a single substantially similar output for a successful copyright infringement claim, it will discourage anyone from working in the GenAI field for fear of legal assault. If a model routinely produces infringing content, that is one matter. But when a model rarely does so, or does so only when given an extravagant prompt deliberately intended to produce an infringing output, and it is capable of substantial non-infringing uses, then the GenAI developers and deployers should not be held liable for what would otherwise be considered infringement if performed by a human.
Counterargument
If a human writes one infringing work, they may be held liable even if it constitutes just a tiny fraction of a fraction of the human’s lifetime output (text messages, other writings, sketches, etc.). Why should LLMs be exempt but not humans? What’s the legal basis?
Some may argue that humans are different because when a human infringes by making a substantially similar work, that production usually requires intent, whereas an LLM has no such intent. Machines don't intend to do anything; they are not conscious. But copyright infringement does not require intent, so intent is irrelevant to the argument. How can AI developers say their GenAI deserves similar treatment to humans for some purposes (i.e., that if humans are allowed to learn from reading articles, then so should AI), but then say it should be exempt from legal liability for others (i.e., that because it can produce more than just infringing works, it shouldn't be liable for copyright infringement even when a human making the same infringement would be held liable)?
Finally, why should society allow multi-billion-dollar tech companies to have their cake and eat it too when it comes to copyright enforcement, while the rights of the actual copyright holders are overlooked or outright dismissed?
Enabling Greater Creativity
Argument
GenAI helps promote the arts by enabling less creative people to be more creative, and therefore its proliferation should be encouraged. This is no different from how cars may be used for good (transportation) and bad (running over objects), but we recognize that the good outweighs the bad. Therefore, society supports the proliferation and use of vehicles with roads, parking spaces, and more. Society should similarly support the positive benefits of GenAI as a tool to aid and improve the ideas of humans.
Counterargument
Even if the premise is true, does the fact that others may be more creative mean GenAI developers should be permitted to take materials from the copyright owners without compensation or permission? Following this logic, would this mean nobody should have to pay for anything that might make them more creative?
[1] Any argument that alludes to photographic memory is also a dead end, because photographic memory doesn't exist.
[2] See Oren Bracha, Generative AI's Two Information Goods (unpublished).
The following students from the University of Texas at Austin contributed to the editing and writing of the content of LEAI: Carter E. Moxley, Brian Villamar, Ananya Venkataramaiah, Parth Mehta, Lou Kahn, Vishal Rachpaudi, Chibudom Okereke, Isaac Lerma, Colton Clements, Catalina Mollai, Thaddeus Kvietok, Maria Carmona, Mikayla Francisco, Aaliyah Mcfarlin