Training data
"Fields of books" by OpenAI's DALL-E 2

How we got here (recent past)

A debate, along with a series of legal challenges, over how AI foundation models can be trained reaches far beyond AI: it is challenging us to rethink copyright law and how search engines and a variety of specialized websites can function. The latest salvo is from The New York Times, which reportedly will sue OpenAI in an effort to force a complete retraining of GPT-4 that excludes the paper's news content from the model's training data.

First, a little bit about why we have copyright law at all and how it has changed over the years. The earliest precursors of copyright law of which we have any record are often traced to ancient Greece. Then, as now, the objective was to protect the commercial interests of content creators by preventing others from reproducing their work without compensating them. While copying in that time (and well into the Middle Ages) meant a person manually rewriting the content, copyright challenges became more acute in the age of Gutenberg. Printing presses allowed multiple exact copies of a text to be produced, beginning a long-running conflict between content owners and technologies for reproduction. The church and government, recognizing this threat to their control of information, pursued a number of options for controlling this reproduction. Given the potential for dissenting ideas to be rapidly disseminated, the ideas of liberty and free speech developed alongside the printing press and in opposition to copyright laws (and control of the press).

While this conflict, combining commercial interests with censorship, formed the early history of copyright law, by the time of the American Revolution the balance had firmly shifted toward protecting commercial interests. In passing the first US copyright law in 1790, James Madison and others argued that protection for an individual's creation was essential "to promote the progress of science and useful arts." This first US copyright law granted the creator protection for an initial 14 years, renewable for an additional 14 years. The law was largely a copy and paste of a 1710 British law, the Statute of Anne, which marked the first time in European history that a copyright law protected authors rather than publishers.

One debate, especially as the length of copyright protection has increased (now the author's lifetime plus 70 years), has focused on the impact that such protection has on the further progress of science and useful arts, given that creation is so often (perhaps always) the result of the influence of works that the creator has previously seen. In some ways this connects back to the entanglement of censorship and free speech, albeit with the author rather than a government in the role of censor. In order to protect "freedom of expression," a large number of legal cases have been argued, attempting to define what counts as "fair use" of copyrighted works. For example, in a 2013 decision, Google won a case brought against it by authors over a project in which the full texts of books were scanned, although only portions would be displayed to viewers based on searches. Many other cases, though, have found in favor of the copyright owner. In a closely watched 2023 Supreme Court case, the Andy Warhol Foundation lost its fair use argument over a Warhol image based on a photograph of the musician Prince taken by Lynn Goldsmith.

In this case the surprising adversaries were Justice Kagan and Justice Sotomayor. In finding against the Warhol side, Sotomayor wrote:

"The use of a copyrighted work may nevertheless be fair if, among other things, the use has a purpose and character that is sufficiently distinct from the original."

But the majority found that Warhol's work did not meet this standard. Kagan, in her dissent, disagreed with Sotomayor's opinion (which rested largely on the fact that the Warhol image had been licensed commercially), saying that the majority ruling

"...will stifle creativity of every sort... It will impede new art and music and literature. It will thwart the expression of new ideas and the attainment of new knowledge. It will make our world poorer."

As we consider the implications for training foundation models, we must weigh both sides of this conflict: protecting the commercial interests of creators, and promoting the continued creative efforts that benefit all of us. Since its earliest precursors, copyright law has been an important way to encourage creation by ensuring that creators will be compensated for their work. But our continued capacity to create also depends on how we absorb and reuse all that has been created before us. Which will win out as we think about AI foundation models and their entirely new capacity to comprehend knowledge? As we litigate and legislate, what might the unintended consequences be for the rest of how we use information on the Internet? Does the way in which foundation models process information and deliver it to users infringe copyright, or is it fair use? Will we inadvertently make the world poorer if we take away this capability? And can we justify the compensation for copyright holders against the loss for humanity?

In related news, 2024 is the year that we finally #freethemouse
