GenAI's Copyright Adjacent Issues

This is where the letter of the law and the spirit of the law appear to collide.

While the unsettled issues around copyright are significant, several non-copyright issues may prove at least as important.

Memorization

Memorization may be a legal issue all by itself. When a model “memorizes” tokens, it is making a copy of them in its weights. A copy, you’ll recall, is a reproduction, and a reproduction without authorization can be copyright infringement.

Here’s an example from AI Snake Oil:

To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

Copyright infringement isn’t great, but as Delip Rao explains, there are at least three additional reasons AI companies typically want to avoid memorization:

  1. Memorization can create training data leaks and allow adversaries to infer the composition of the training data.
  2. Frequently repeated texts, which are the most susceptible to memorization, tend to be boilerplate and are less interesting or valuable to users.
  3. Memorization reduces the diversity of generations from LLMs, making the models less widely applicable.

Prompting’s Influence

Just as you are better at remembering verses from Will Smith’s 1997 banger “Gettin’ Jiggy Wit It” if someone first gives you a line or two, LLMs are more capable of producing verbatim outputs if you provide a constraining prompt. The world of possible outputs is infinite if you simply type “Tell me a story.” It shrinks dramatically if you say “Finish this quote: Four score and seven.” The model will very likely go on to complete the Gettysburg Address for you. This shows that (1) constraining the world of possible outputs directs model behavior and (2) models do sometimes memorize tokens in precise order.
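You can see the same dynamic in a toy bigram model. This is a deliberately crude sketch, nothing like a real transformer, but it illustrates both points above: an unconstrained prefix leaves many continuations open, while a memorized prefix pins the model to a single path.

```python
from collections import defaultdict

# Toy "training data": filler text followed by a famous memorized sequence.
corpus = (
    "tell me a story about anything you like "
    "tell me a story about a dragon "
    "four score and seven years ago our fathers brought forth"
).split()

# Bigram table: for each word, the set of words that ever follow it.
followers = defaultdict(set)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev].add(nxt)

# An unconstrained word leaves multiple continuations open...
print(sorted(followers["a"]))  # → ['dragon', 'story']

# ...but the memorized prefix makes every next step deterministic,
# so greedy generation reproduces the training text verbatim.
word = "four"
completion = [word]
while word in followers and len(followers[word]) == 1:
    word = next(iter(followers[word]))
    completion.append(word)
print(" ".join(completion))
```

Starting from “four,” the walk regurgitates the entire memorized span, word for word, which is exactly the autocomplete behavior a constraining prompt exploits.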

The trick for AI companies is to prevent their models from falling for this autocomplete hack when it comes to copyrighted material. Often that is as simple as slapping a filter on the prompt input and the model output. But little hacks like “put it in your own words” may still get the models to produce nearly verbatim outputs. What’s another phrase for “nearly verbatim” in this context? Perhaps it’s “substantially similar”...
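A minimal sketch of what such an output filter might look like, assuming a simple word n-gram overlap heuristic (the function names and threshold here are hypothetical illustrations, not any vendor’s actual implementation):

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercased word n-grams: a crude fingerprint of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_near_verbatim(output: str, protected: str,
                        n: int = 5, threshold: float = 0.5) -> bool:
    """Flag the output if too many of its n-grams appear in the protected text."""
    out = ngrams(output, n)
    if not out:
        return False
    overlap = len(out & ngrams(protected, n)) / len(out)
    return overlap >= threshold

protected = ("four score and seven years ago our fathers brought forth "
             "on this continent a new nation conceived in liberty")

# A "put it in your own words" paraphrase still shares most 5-grams...
paraphrase = ("four score and seven years ago our fathers brought forth "
              "on this continent a new nation dedicated to liberty")
print(looks_near_verbatim(paraphrase, protected))  # True

# ...while a genuinely independent summary shares almost none.
summary = "Lincoln opened by recalling the nation's founding 87 years earlier"
print(looks_near_verbatim(summary, protected))  # False
```

Note that the light paraphrase sails close enough to trip the filter while the true summary does not, which mirrors the legal intuition: the first is arguably “substantially similar,” the second is not.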

Publishers and AI

If models can use works produced by publishers, digital news platforms, and content creators that rely on clicks and search traffic for revenue, it is important to think about how publishers’ income streams may suffer.

If individuals can rely on AI to bypass paywalls and serve up summaries drawn from many sources, the original sites, which likely depend on audience traffic for income and visibility, may see fewer clicks and visits, compounding the existing problem of shrinking newsrooms and publishers.

Key findings from the 2023 Publishing Trends Report state that “63% of respondents cite that increasing traffic is a top priority for publishers, while a 47% decline in website traffic is already a major challenge. In the 2022 survey declining website traffic was a major concern for only 25% of publishers, in 2023 it is now a concern for 50% of publishers.”

Ironically, publishers struggling with website traffic, SEO, shrinking newsrooms, and a consequent decrease in content creation may end up hurting AI in the future: less content, and lower-quality content, means fewer training materials for future LLMs.[1]

Production and Distribution

Sites are just one example of warping incentives. As an aspiring author, why would you dedicate years of your life to a project, thinking about a topic, researching, writing, and editing, if, in the end, you would have no control over the work and how it’s reproduced or distributed?

Consider that it took 2.5 years to write The Great Gatsby and To Kill a Mockingbird. It took 5 for Lord of the Flies, 6 for Harry Potter and the Philosopher’s Stone, and 10 for The Catcher in the Rye.

Summarization

AI Snake Oil (https://www.aisnakeoil.com/p/generative-ais-end-run-around-copyright) notes that “A typical user of ChatGPT who asked this question would probably have no idea that ChatGPT’s answer comes from a groundbreaking 2020 investigation by Kashmir Hill at the New York Times (which also led to the recently published book Your Face Belongs To Us).” Her in-depth reporting was taken and regurgitated as if it were common knowledge anyone could have easily discovered online without her first conducting the research.

Or, to think of it another way, if I created SummarizeGPT, a GenAI that just summarized other people’s work (news articles, magazine articles, essays, books, speeches, etc.) in the model’s own words, would that be OK? What if the summaries were comprehensive and detailed enough that you’d have no need to go to the source? What if the original source took months or years of research to discover and piece together, and significant expense from traveling, interviewing, requesting various documents, and interpreting and synthesizing?

Summarizing works is not uncommon on Amazon. People are getting copies of books, then using AI to summarize them. This sidesteps the issue of copyright while undermining the work of the authors. Amazon makes a weak effort at stopping such activities by limiting people to “only” publishing three books a day. It’s also been noted that Google and Perplexity.AI similarly summarize and plagiarize. To some, this may feel like an old-fashioned low-quality content farm, where knockoff websites do poor summaries of high-quality news. Except a lone website will probably never reach the scale of Google or Perplexity, and that scale may be a significant factor in legal disputes.

Importantly, this type of summarizing is not copyright infringement. But if GenAI undermines that type of deep, rich research, or any information only discovered by people digging deeply and connecting previously unconnected dots, is that still promoting science and the useful arts? LLMs are only useful and economically viable when they can take knowledge extracted from existing works without compensating those sources, but they give little back: LLMs generally don’t increase the amount of knowledge in the world.

Also, unlike coders working on open-source code, who can gain acclaim, code for fun, and parlay that into a better job and a promotion, authors usually have no better job to move into. If their book doesn’t sell well (something hard to track when it’s available for free everywhere), they can’t negotiate a bigger contract. There are no tiers to being an author, no promotion from Author Level 1 to Author Level 2, the way a coder can go from engineer to senior engineer, vice president, senior vice president, and CTO.

Dilution

Consider a final example: Suppose there is a well-known columnist, and part of their appeal is the voice of their writing. But suppose an LLM allows anyone to put any blog, social media post, news article, or anything else in that famous voice. That dilutes a voice the famous person probably spent years or decades developing and cultivating, a voice that allowed readers to know when a piece was written by that particular person. Now, with GenAI, that voice may no longer give off the same signal as to its source, and its specialness may erode altogether because it is no longer unique. Note that, if anything, this would be a trademark claim, not a copyright claim, but even trademark law might struggle to protect a person’s style in this manner.

What These Have in Common

One thing that mass plagiarism, summarization, dilution, and other such activities have in common is that they probably don’t violate the letter of the law; that is what makes them non-infringing. But some argue that, even so, they do violate the spirit of the law. And violating the spirit may be no different from violating the purpose of the law. If the law’s purpose is to fulfill a constitutional obligation, then perhaps the letter of the law, as written today, is not constitutionally sufficient.


[1] The most cynical take on this process would be something like: it's all fun and games until the models stop improving, creators lose their professions, and the tech employees walk away with millions before their stock collapses.

The following students from the University of Texas at Austin contributed to the editing and writing of the content of LEAI: Carter E. Moxley, Brian Villamar, Ananya Venkataramaiah, Parth Mehta, Lou Kahn, Vishal Rachpaudi, Chibudom Okereke, Isaac Lerma, Colton Clements, Catalina Mollai, Thaddeus Kvietok, Maria Carmona, Mikayla Francisco, Aaliyah Mcfarlin

