Did an AI Write This?
OpenAI has started its new campaign by opening the ChatGPT model to the world. This deep learning neural network is a model developed with conversational interaction in mind. Due to the nature of the format, ChatGPT can respond to follow-up inquiries, apologise for errors, dispute faulty premises, and decline unsuitable requests.
We can say with confidence that it has broken the internet.
Human-ChatGPT Interaction
GPT-3 itself is fascinating, but what is even more fascinating is how people are using this sophisticated model. Some of the more interesting ways I have seen people use ChatGPT in the past week are:
Using it instead of Google search
Using it to explain technical concepts in the tone of a 1960s gangster
Using it to solve homework assignments
Using it to write short stories
Using it to write short text about specific topics (e.g., South African sea turtles)
And many more…
The Copyright Question
After playing around with ChatGPT for hours, I started becoming suspicious of everything I read. I caught myself asking, "Was that post written by AI?" and "How much of the content I am reading was actually written by a person?"
Solipsism aside, there’s a very important question about plagiarism and AI. Who owns the content produced by AI—the creators of the AI or its users? Is the model just a more sophisticated version of a tool, such as pen and paper, or does the model itself carry the creative rights of its creators?
Aside from the moral question, there is also the technical question: how do you catch this sort of plagiarism in the first place? Normal plagiarism checkers will not be able to tell which texts are original and which ones were plagiarised by an AI such as ChatGPT.
To illustrate this, I ran the South African sea turtles text through QuillBot's plagiarism checker and got the following results:
The text is, for all intents and purposes, plagiarism-free according to standard plagiarism checkers. So how can you tell if someone has plagiarised AI-generated content, specifically ChatGPT?
Plagiarism Check Solution
As is customary on LinkedIn, the first step is to ask the smart ChatGPT bot. Alas, the AI was too smart to reveal its secrets…
GPT-3 would be hard to catch with general, open-ended questions because of the large amounts of data available for generic topics: small differences in the query will result in massive differences in the result. Catching more niche topics, however, would be easier, since for a specific niche topic the model could only have been trained on a limited set of data.
An idea I had was that plagiarism checkers could incorporate AI plagiarism checks into their scoring by feeding the same queries back into the most commonly used AI text generators. For example, we could query ChatGPT and compare the results with the original text. The solution can be broken down into three separate problems:
Extracting keywords from the text
Feeding this query to ChatGPT
Comparing the results to the original text to benchmark similarity
We are lucky that ChatGPT is smart enough to construct text given some keywords, so we can simplify the first task to just finding the keywords in a text, which we can then feed back into GPT-3 to try to reconstruct the original text.
For my own curiosity, I’ve implemented a simple version of this idea in Python. You can find the full code in my Medium article. Let’s use one of our examples above to walk through an implementation of each step:
TIP: You can use Google Colab to follow along with the code examples.
Extracting keywords from the text
In a nutshell, how does keyword extraction work? This is a common problem that can be solved in three steps: lemmatization, stopword removal, and ranking.
It doesn’t make sense to list every word in the text’s vocabulary separately, especially words like “write,” “written,” and “wrote,” which all mean the same thing: “write.” So, we first lemmatize the text, which means we reduce each word to its root form.
There are a lot of words in text passages, but not all of them are important. Most of them are common words like “a,” “that,” “then,” etc. These words, called “stopwords,” must be removed from the output or they will skew the results. Words that are close in meaning should also be grouped together.
Once you have a list of candidate phrases, you need to rank them to see which ones are the most important.
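The RAKE library we use below handles all three of these steps internally. Just to make the idea concrete, here is a minimal sketch of lemmatization and stopword removal using NLTK; this snippet is illustrative only and is not part of the original pipeline:
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
nltk.download("stopwords", quiet=True)

import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

sentence = "Sea turtles have written nothing, but many texts were written about them."
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

# Lowercase and pull out alphabetic words with a simple regex (no tokenizer needed)
tokens = re.findall(r"[a-z]+", sentence.lower())

# Drop stopwords, then reduce the remaining words to a root form.
# pos="v" tells WordNet to treat tokens as verbs, so "written"/"wrote" -> "write".
candidates = [lemmatizer.lemmatize(tok, pos="v") for tok in tokens if tok not in stop_words]
print(candidates)
A ranking step (scoring candidates by frequency and co-occurrence, as RAKE does) would then pick the most important phrases out of this list.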
All this can be easily achieved with Python’s RAKE library (RAKE stands for Rapid Automatic Keyword Extraction). We’ll just cover the implementation here; if you want to know what goes on behind the scenes, check out this book and read the documentation.
# We first install our packages:
!pip install multi-rake
We then set up a variable with our text for checking:
text_in = """
South Africa is home to five species of sea turtles: the leatherback, the loggerhead, the green turtle, the hawksbill, and the olive ridley. These turtles are found along the country's coast and in the waters surrounding its offshore islands. Sea turtles play a crucial role in the marine ecosystem, as they help to maintain the health of seagrass beds and coral reefs. They are also an important food source for many other marine animals. However, sea turtles are threatened by a variety of human activities, including pollution, habitat destruction, and overfishing. It is important for us to protect these animals and their habitats in order to ensure their survival.
"""
Now let’s add our code for extracting keywords (check out https://pypi.org/project/multi-rake/ for RAKE usage details). We want to extract two sets of keywords, a shorter set and a longer one, to feed back to the chatbot.
from multi_rake import Rake

rake = Rake()
raw_keywords = rake.apply(text_in)

# Build the shorter keyword set from the top five ranked phrases
keywords = []
for word in raw_keywords[:5]:
    keywords.append(word[0])

print("go to https://chat.openai.com/chat and search for the following:")
prompt = ', '.join(keywords)
print("Search 1: Write a text using these keywords: " + prompt)

# Append the top two phrases again to form the longer prompt
for word in raw_keywords[:2]:
    keywords.append(word[0])
prompt = ', '.join(keywords)
print("Search 2: Write a text using these keywords: " + prompt)
We get the following results:
go to https://chat.openai.com/chat and search for the following
Search 1: Write a text using these keywords: important food source, sea turtles play, sea turtles, south africa, green turtle
Search 2: Write a text using these keywords: important food source, sea turtles play, sea turtles, south africa, green turtle, important food source, sea turtles play
Voilà! We’ve extracted keywords and can now query the bot.
Feeding this query to ChatGPT
Since ChatGPT does not yet have an open API that we can query, and it would be too much hassle (and maybe somewhat illegal) to scrape and automate the web interface, this step is going to be manual. Any future plagiarism checker could, of course, do this in the backend via API calls.
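For illustration only, such a backend call could look roughly like the sketch below. This is not part of the original walkthrough: it assumes the openai Python package and uses the GPT-3 completion endpoint as a stand-in, since ChatGPT itself cannot be called programmatically at the time of writing.
# Hypothetical automation sketch: query a GPT-3 completion endpoint with the
# extracted keywords instead of pasting them into the ChatGPT web UI.
# Assumption: the `openai` package and an API key; "text-davinci-003" is a
# stand-in model, since ChatGPT has no public API at the time of writing.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def generate_from_keywords(keyword_prompt: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Write a text using these keywords: " + keyword_prompt,
        max_tokens=300,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()

# Example usage with the prompt built earlier:
# result1 = generate_from_keywords(prompt)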
The only thing we need to do is ask ChatGPT to generate a text given the keywords above. After we do so, we get the following results:
Result 1:
Result 2:
We can already see the similarities between the original text and the results.
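If you want to run the comparison code below yourself, paste whatever ChatGPT returned into two variables. The strings here are only placeholders, not the actual outputs shown above:
# Paste the text ChatGPT generated for Search 1 and Search 2 here.
# These placeholder strings are stand-ins for the results shown above.
result1 = """
<paste ChatGPT's answer to Search 1 here>
"""
result2 = """
<paste ChatGPT's answer to Search 2 here>
"""
# If you only ran one search, leave result2 as an empty string:
# result2 = ""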
Comparing the results to the original text to benchmark similarity
We now need to run a similarity check between the results and the original text. There’s a potential problem here: similar queries to ChatGPT can return results of very different lengths. In other words, it may give you one paragraph in one case and five in another. So how do we compare the results for similarity? A cosine similarity measure can be used to determine how many words the texts share.
In summary, cosine similarity is a measure of similarity between two vectors that calculates the cosine of the angle between them. It is often used in information retrieval and text mining to compare documents or search queries based on the vector space model. In this model, each document or query is represented as a vector of weights, with each term's weight depending on how often it appears in the document or query. The cosine similarity between two vectors is calculated by taking the dot product of the two vectors and dividing it by the product of their magnitudes. This results in a value between -1 and 1, where 1 indicates that the two vectors point in the same direction and -1 that they point in opposite directions (with non-negative term weights, as in TF-IDF, the score stays between 0 and 1). It works well for comparing texts that share vocabulary, and it normalises away differences in length.
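As a quick illustration of the formula (not part of the article's pipeline), here is the cosine similarity of two small word-count vectors computed directly with NumPy. The vocabulary and counts are made up for the example:
# Toy illustration of cosine similarity on two tiny word-count vectors.
# Imagine the vocabulary is ["sea", "turtle", "africa", "pollution"].
import numpy as np

doc_a = np.array([3, 2, 1, 0])  # word counts in document A
doc_b = np.array([2, 3, 0, 1])  # word counts in document B

# Dot product divided by the product of the magnitudes
cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(round(cosine, 2))  # prints 0.86: the documents share most of their words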
To implement this in Python, we’ll use the scikit-learn library. We first start by installing the package:
!pip install scikit-learn
We then use the following code to assess the similarity of both results with the original text and output the higher of the two scores. Arbitrarily, if the score is higher than 0.6, we will conclude that the text is suspicious.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# result1 and result2 hold the ChatGPT outputs pasted in earlier
compare1 = [text_in, result1]
tfidf1 = TfidfVectorizer()
tfidf_vector1 = tfidf1.fit_transform(compare1)
cosine_sim1 = cosine_similarity(tfidf_vector1, tfidf_vector1)[0][1]

if result2 != "":
    compare2 = [text_in, result2]
    tfidf2 = TfidfVectorizer()
    tfidf_vector2 = tfidf2.fit_transform(compare2)
    cosine_sim2 = cosine_similarity(tfidf_vector2, tfidf_vector2)[0][1]
    # Keep whichever of the two scores is higher
    if cosine_sim2 > cosine_sim1:
        cosine_sim1 = cosine_sim2

print("Similarity score: " + str(round(cosine_sim1, 2)))
if cosine_sim1 > 0.6:
    print("Result is Sus - probably AI plagiarism")
else:
    print("Result is OK... - probably not AI plagiarism")
Running this results in the following score:
Similarity score: 0.69
Result is Sus - probably AI plagiarism
Copyright and AI
There is a very interesting copyright question about the nature of AI-created content.
Plagiarism is the act of using someone else’s work without giving them proper credit. This can apply to content created by AI as well, although the concept of plagiarism in the context of AI can be a bit more complex.
In the most basic sense, if an AI creates a piece of content and someone else uses it without giving proper credit to the AI or its creators, that could be considered plagiarism. However, it’s important to note that AI is often used as a tool by human creators, so determining who should be given credit for the work can be tricky.
For example, if a human writes an article and uses AI to help them generate certain parts of the text, it might not be clear who should be given credit for those parts of the text. In cases like this, it’s important for the human creator to clearly state the role that the AI played in the creation of the content and to give the AI and its creators appropriate credit.
Overall, the key to avoiding plagiarism when using AI to create content is to be transparent about the role that the AI played in the creation process and to give proper credit to all parties involved.
This whole article, for example, was written with the help of https://chat.openai.com/. You can even run these last 4 paragraphs through the code and get a similarity score of 0.74.
Next time you read something, ask the question, “Was this written by a person?” Run it past this code and get a score!
You can grab the full code here: https://github.com/sunnydean/ChatGPTPlagiarismChecker