The Risks of Alternative Language Models

There is an "enemy of my enemy is my friend" dynamic in the AI space: people who despise OpenAI, Meta, and Google cheer on research labs, DeepSeek, and Hugging Face's second-tier models, which rely heavily on distillation, allow commercial use, and exist only because their creators scraped enormous amounts of data from the web to train the underlying language models.

However, copyright infringement and breach of contract are still copyright infringement and breach of contract. A highly successful second-tier model may be even more harmful to content creators: these models tend to be cheaper to use, so more people use them, and because that use is dispersed rather than concentrated in deep-pocketed tech giants, it will be much more difficult to bring litigation to right the wrongs.

Be careful what you wish for.


A quick note: my social media presence is minimal. If you like this newsletter, please share it with others, because they probably won't hear about it otherwise.

Another note: I typically post explainers written in more neutral language. Today's post is decidedly not neutral. I don't know why.


Key Takeaway

If content creators are serious about protecting their works, they should pay at least as much attention to the second-tier model creators as they do to the well-known entities.

The PR Angle

So far, private nonprofit research institutions, universities (Berkeley, Stanford, University of Washington), and for-profit platforms like Hugging Face have done a masterful job of obfuscating how little different they are from the tech giants when it comes to generative AI (GenAI). In article after article after article, the press swoons over how this or that entity was able to replicate some model. Tech journalists seem allergic to questioning how the underlying models were created and instead focus on the outputs, raving that a model performs as well as Llama or DeepSeek or holds up in some other (ultimately inconsequential) comparison.

But these second-tier models share a common problem with the tech giants they attempt to emulate: they all depend on taking copyrighted works without authorization by scraping the web, and some of that scraping violates terms of service, meaning that, in all likelihood, millions of contracts were breached in the name of producing another unneeded chatbot.

A Research Exemption?

There is no blanket exemption from the law for research. Look no further than how universities have lost copyright cases even when the copying and distributing were for nonprofit academic purposes. Breaching contracts is even more suspect, as there is no "fair use" defense for breach of contract.

That said, nonprofit scientific research can receive great leniency under some circumstances. For GenAI, that would probably mean releasing datasets and models under a noncommercial license, limiting use to scientific research only, and gating access (much like the requirements of the text data mining research exemption in the US). Notably, this is not what research institutions, universities, or Hugging Face do. Rather, they release the models under an open source license like Apache (which was designed for code, not models, but that's another story), and the datasets under ODC-BY (which isn't actually open source; you must still adhere to the license terms of all the underlying content in the dataset, but, again, that's another story).

Some models, including prominent ones created by research institutions, use datasets of content scraped from the web, which the scrapers then release under a CC-BY license. The DCLM dataset is a prime example. CC-BY, unlike ODC-BY, is open source, but what the researchers who created DCLM did is also likely illegal. You can't take something from someone and simply proclaim it can be used for any purpose. Doing so encourages people to treat the underlying data as genuinely open source, which is another way of saying it implicitly encourages and facilitates copyright infringement.

It's also worth noting that Hugging Face does not seem to have an issue with facilitating this data laundering. It's how they make money. Anyone can create a dataset, put it under whatever license they want, and Hugging Face will host it, allowing others to easily access and use the content. Hugging Face does this even though they know, or could easily confirm, that the data is licensed inappropriately. This is especially true for well-known datasets like DCLM.
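To make concrete how frictionless that access is, here is a minimal sketch using Hugging Face's datasets library; the repository ID is my assumption for DCLM's baseline release, and any public dataset on the Hub works the same way:

    # Minimal sketch: pulling a web-scraped dataset hosted on Hugging Face.
    # Assumes `pip install datasets`; the repository ID below is an assumption,
    # used for illustration only. Substitute any public dataset on the Hub.
    from datasets import load_dataset

    # Streaming avoids downloading the (often multi-terabyte) corpus up front;
    # records arrive lazily over HTTP as you iterate.
    ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

    # Print the start of the first three documents of scraped web text.
    for i, record in enumerate(ds):
        print(record.get("text", "")[:200])  # each record carries raw page text
        if i >= 2:
            break

Nothing in that flow checks the license of the underlying pages or whether the scrape that produced them honored anyone's terms of service; the license attached to the repository is simply whatever the uploader declared.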

Hugging Face has a much-ballyhooed "Society and Ethics" group, and some of its work is interesting and helpful, but the group doesn't seem to have any appetite for tackling its company's greatest ethical challenge: widespread copyright infringement. Instead, they'd prefer to talk about hypotheticals, like fully autonomous agents.

The Outcomes

One effect of hyping these alternative models is that more of them may be adopted for widespread use. The hype also encourages these entities to keep pumping out second-tier models.

This could lead to at least two noteworthy outcomes:

  1. The environmental impact will continue to accelerate and worsen because the harms are 100% externalized. The research institutions, universities, and Hugging Face can simply wash their hands of the issue, blaming systemic failures (hello, Hugging Face) or insisting that the burden is on users to deal with the harms of using their models. Whoever is at fault, it's certainly not the people creating the models and encouraging others to use them! (Talk about cognitive dissonance.)
  2. The ability to enforce laws will weaken. Right now, if copyright owners want redress, they can sue Meta, Microsoft, Perplexity, OpenAI, Anthropic, xAI, and the like. Each of those entities is valued in the billions or trillions and is easy to find. But if the smaller models are more widely used and hosted on private servers, it will be much more difficult to learn who is using the models, and therefore very difficult to bring a meaningful lawsuit. Even if you can bring one, it's very possible the defendants won't have enough money to make the litigation worthwhile.

For open source advocates, #2 above is a myopic victory. None of them seem to think about the downstream effects of undermining the law and the way the internet economy functions. This is probably because the incentives are misaligned: most of the creators of these models and datasets will be handsomely paid, gain prestige from papers based on their work, and vest high-valuation shares in the meantime. It's too easy to champion ethics while promoting ethical conundrums when your bank account grows every two weeks.

