The Risks of Alternative Language Models
David Atkinson
AI Legal Counsel | A.I. Ethics and Law | University Lecturer | Veteran
Something like "the enemy of my enemy is my friend" is playing out in the AI space: people who despise OpenAI, Meta, and Google cheer on research labs, DeepSeek, and Hugging Face's second-tier models, even though those models rely heavily on distillation, allow commercial use, and exist only because someone scraped tons of data from the web to train the underlying language model.
However, copyright infringement and breach of contract are still copyright infringement and breach of contract. A highly successful second-rate model may be even more harmful to content creators: those models tend to be cheaper to use, so more people may use them, and because the use is dispersed rather than concentrated in deep-pocketed tech giants, litigation to right the wrongs becomes much harder to bring.
Be careful what you wish for.
A quick note: my social media presence is minimal. If you like this newsletter, please share it with others, because they probably won't hear about it otherwise.
Another note: I typically post explainers written in more neutral language. Today's post is decidedly not neutral. I don't know why.
Key Takeaway
If content creators are serious about protecting their works, they should pay at least as much attention to the second-tier model creators as they do to the well-known entities.
The PR Angle
So far, private nonprofit research institutions, universities (Berkeley, Stanford, University of Washington), and for-profit platforms like Hugging Face have done a masterful job of obscuring how little they differ from the tech giants when it comes to generative AI (GenAI). In article after article after article, the press swoons over how this or that entity was able to replicate some model. Tech journalists seem allergic to questioning how the underlying models were created and instead focus on the outputs, raving that a model performs as well as Llama or DeepSeek or some other (ultimately inconsequential) comparison.
But these second-tier models share a common problem with the tech giants they attempt to emulate: they all depend on taking copyrighted works without authorization by scraping the web, and some of that scraping violates terms of service, meaning probably millions of contracts, at a minimum, were breached in the name of producing another unneeded chatbot.
A Research Exemption?
There is no blanket exemption from the law for research. Look no further than how universities have lost copyright cases even when the copying and distribution were for nonprofit academic purposes. Breaching contracts is even more suspect, as there is no "fair use" defense to breach of contract.
That said, nonprofit scientific research can receive great leniency under some circumstances. For GenAI, that would probably mean releasing datasets and models under a noncommercial license, limiting use to scientific research only, and gating access (much like the requirements of the text data mining research exemption in the US). Notably, this is not what research institutions, universities, or Hugging Face do. Rather, they release the models under an open source license like Apache (which was designed for code, not models, but that's another story), and datasets under ODC-BY (which isn't actually open source, because you must still adhere to the license terms of all the underlying content in the dataset; but, again, that's another story).
Some models, including prominent ones created by research institutions, use datasets of content scraped from the web that the scrapers then release under a CC-BY license. The DCLM dataset is a prime example. CC-BY, unlike ODC-BY, is open source, but what the researchers who created DCLM did is also likely illegal. You can't take something from someone and simply proclaim it can be used for any purpose. Doing so encourages people to actually treat the underlying data as open source, which is another way of saying it implicitly encourages and facilitates copyright infringement.
It's also worth noting that Hugging Face does not seem to have an issue with facilitating this data laundering. It's how they make money. Anyone can create a dataset, put it under whatever license they want, and Hugging Face will host it, allowing others to easily access and use the content. Hugging Face does this even though they know, or could easily confirm, that the data is licensed inappropriately. This is especially true for well-known datasets like DCLM.
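To make the point concrete, here is a minimal sketch, assuming the public huggingface_hub Python client (the dataset id is illustrative). The license the Hub reports is nothing more than a tag the uploader self-declared in the repo's metadata, and the gated flag shows whether access is restricted at all:

from huggingface_hub import HfApi

api = HfApi()
# The repo id below is illustrative; any public dataset works the same way.
info = api.dataset_info("mlfoundations/dclm-baseline-1.0")

# The "license" is just a self-declared tag in the repo metadata; nothing
# checks it against the licenses of the underlying scraped content.
print([t for t in info.tags if t.startswith("license:")])

# Gating is optional and off by default; False means anyone can download.
print(info.gated)

Nothing in that flow verifies that the declared license is one the uploader had any right to grant in the first place.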
Hugging Face has a much-ballyhooed "Society and Ethics" group, and some of the work they do is interesting and helpful, but they don't seem to have any appetite for tackling their company's greatest ethical challenge: widespread copyright infringement. Instead, they'd prefer to talk about hypotheticals, like fully autonomous agents.
The Outcomes
The effect of hyping these alternative models might be that more of them are adopted for widespread use. The hype also encourages those entities to keep pumping out second-tier models.
This could lead to at least two noteworthy outcomes:
For open source advocates, #2 above is a myopic victory. None of them seem to think about the downstream effects of undermining the law and how the internet economy functions. This is probably because the incentives are misaligned. Most of the creators of these models and datasets will be handsomely paid, gain prestige from papers based on their work, and vest high-valuation shares in the meantime. It's too easy to champion ethics while promoting ethical conundrums when your bank account grows every two weeks.