Why ChatGPT Can't Plagiarize
Mike McAulay
AI Integrator & Enthusiast | Senior Subject Matter Expert - .NET, Azure, Cloud | Coding Consultant
It's an easy misunderstanding to have. But it is a misunderstanding, nonetheless. The scope of this article isn't to address the larger questions regarding these technologies and their impact on society. It's merely to show through an example and some technical aspects of Large Language Models (LLMs) that they aren't "plagiarism machines," nor are they capable of being so.
To see why that is, we need to consider what is primarily accomplished during training: generalization. The ChatGPT model doesn't retain any memory of the training material. It draws generalizations from the material by identifying patterns at the word and subword level. It then uses these patterns to provide unique responses.
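To make "patterns at the word and subword level" a little more concrete, here is a minimal sketch using OpenAI's tiktoken library (this assumes you have it installed via pip; "cl100k_base" is the encoding associated with the ChatGPT-era models, and any subword tokenizer would make the same point). Notice that the model's unit of work is a small fragment of text, not a stored sentence or document:

```python
# Minimal subword tokenization demo (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Plagiarism is sometimes unintentional."
token_ids = enc.encode(text)

# Each ID maps back to a small subword fragment, not to any source document.
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))
```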
Let's jump into a scenario to help us define what plagiarism is.
A young person is attending college. Their assignment is to write an essay on the nature of reality. This young person looks online and finds a few great sources. Reviewing the final draft, they reread one phrase they really like and even feel a little proud they came up with it. Unfortunately, they don't remember that they picked up that phrase verbatim from one of the articles they'd read.
A few days after turning in the essay, the student is eager to see their score and maybe a positive note from the professor. Sadly, when the paper is returned it's marked with a 0, along with a note above the phrase they were so thrilled with: "This was plagiarized. Come see me during office hours." Our young student is distressed and confused. They knew they hadn't hand-copied anything from the material they'd read.
Is this plagiarism? I know some would debate that because it wasn't intentional it doesn't really "count." While it wasn't intentional, this would run afoul of most rules regarding plagiarism. The professor may show leniency, but as to the question itself: it is a copy of someone else's work.
When it comes to ChatGPT and other LLMs, there are some who believe that it is intentionally plagiarizing; others, having heard experts declare that it doesn't plagiarize, might conclude the experts don't grasp the subtler forms of plagiarism, such as the scenario I presented above: that it doesn't "mean to," but it's still plagiarism.
But that view rests on a misunderstanding of what the model does with its training material. It took the material and generalized from it about the language. Remember, this is fundamentally about language, not about the facts and figures the material might contain. This is also part of the reason LLMs are said to "hallucinate," which simply means they sometimes make things up that sound right.
Let's reexamine the scenario, this time with ChatGPT standing in for our student.
The teacher grading the essay noted that it seems to contain a number of popular ideas about the nature of existence, and sometimes even "sounds like" some eminent voices on the subject. But it never exhibits traceable connections to the source materials. In fact, if ChatGPT were asked to rewrite the essay, each version would have the same general characteristics but unique text every time.
Is this plagiarism? Certainly not by any definition that's commonly used today. If it were, virtually everything written by anyone would fall into that category.
There is one last aspect of LLMs that can help us put this question to rest. To do so, we have to peek under the hood of ChatGPT a bit. These LLMs operate in a way that could be summed up as: "Based on the context of what's been written so far, select the next 'best' word to write." It's important to note that the model isn't comprehending what it's writing. The algorithm uses probabilities based on the discovered patterns on a per-word (and subword) basis. As you might imagine, it has a lot of words to work with when finding the right one. It uses a kind of mathematical representation of the words and patterns it built up during training to calculate the probabilities for the next word to write.
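If it helps to see that loop spelled out, here's a toy sketch in Python. The vocabulary and the scores are entirely made up for illustration; a real model computes scores like these over tens of thousands of subword tokens using its learned representation of the context:

```python
import math

# Made-up "next word" scores (logits) for a tiny vocabulary, given some
# context such as "The nature of reality is". Purely illustrative numbers.
vocab = ["subjective", "unknown", "fixed", "banana"]
logits = [2.1, 1.3, 0.4, -3.0]

def softmax(xs):
    # Convert raw scores into probabilities that sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word:>10}: {p:.3f}")

# Greedy decoding: always pick the single most probable next word.
print("next word:", vocab[probs.index(max(probs))])
```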
Interestingly, researchers found that if they always chose the word with the highest probability, the output tended to sound "flat" and less human-like. They then tried occasionally using words with lower probability scores, and suddenly it sounded more "creative." How often the model uses these less likely words is controlled by a parameter called "temperature" that is set when producing the AI's output. I've written a more detailed description of temperature here.
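Here's how that might look in code, continuing the toy example above. This is my own sketch of the standard technique, not OpenAI's implementation: dividing the scores by the temperature before converting them to probabilities sharpens or flattens the distribution.

```python
import math
import random

def sample_with_temperature(vocab, logits, temperature=1.0):
    # T < 1 sharpens the distribution toward greedy decoding;
    # T > 1 flattens it, letting less likely words through more often.
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(vocab, weights=probs, k=1)[0]

vocab = ["subjective", "unknown", "fixed", "banana"]
logits = [2.1, 1.3, 0.4, -3.0]

for t in (0.2, 1.0, 1.5):
    picks = [sample_with_temperature(vocab, logits, t) for _ in range(8)]
    print(f"temperature {t}: {picks}")
```

At temperature 0.2 you'll see "subjective" almost every time; at 1.5 the less likely words start to appear, which is exactly the "creative" effect described above.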
The method used to build up these strings of words simply doesn't support taking phrases, or really even ideas, and copying them wholesale.
Said another way, even if one had all the generalizations that were produced by its training, it would be impossible to rebuild its training material from those generalizations. You can't squeeze an actual instance of plagiarism from it any more than you could provide a stock's price at a given time on a given day based solely on patterns you identified in the entire market.
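A crude way to see why the inversion fails: word frequencies are a far simpler kind of "generalization" than anything an LLM learns, yet even they can't be traced back to their source, because different texts can produce identical statistics. The two tiny "training sets" below are contrived for illustration:

```python
from collections import Counter

# Two different "training sets" with identical word-frequency statistics.
corpus_a = ["the cat sat", "the dog ran"]
corpus_b = ["the cat ran", "the dog sat"]

def word_stats(corpus):
    return Counter(word for sentence in corpus for word in sentence.split())

# The statistics match, so they can't tell you which corpus produced them.
print(word_stats(corpus_a) == word_stats(corpus_b))  # True
```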
I understand the implications of these technologies are far-reaching and raise real concerns over the fairness of the current distribution of proceeds related to our work. I want to emphasize that the issues being raised are far more fundamental than whether this technology is plagiarizing or not. It comes down to the bigger question of how we try to resolve the disconnect between the full impact of the value we create and the compensation models we've worked under for centuries.
Janet, I'm very curious to hear your take on whether you find my evidence and explanation persuasive. As I mentioned in the article, I do believe there are issues regarding the value we bring versus the proceeds we receive, but I want to ensure people understand specifically what LLMs do, so that they don't rely on arguments that feel right but ultimately miss the mark due to assumptions about the technology.
Staff Product Manager at Twilio
Interesting, Mike McAulay! After reading through this, I'd tend to agree that the challenge here isn't actually plagiarism. While I'm certainly not the expert here, the concern I see most often isn't plagiarism, but rather a concern about the data the model was trained on itself. What are your thoughts on how we can provide proper reference or attribution when it is due? In my small amount of time using Bard and Google's new SGE search, they seem to do a decent job of providing references to the sites that helped generate the answer. I wonder if we will start to see the same for other types of "non-search" queries across these platforms.