Some painful, untold facts about AI tools: an explanation with a forensic linguistics outlook
Akos Bardoczi
Open-Source Intelligence Expert | cybersecurity | legal | Google Cloud Platform | threat hunting | growth hacker | Python | writer and lector
DISCLAIMER: the following article contains rude, disturbing, creepy, and scary observations, based on my best knowledge. If you don't like such content, please don't read this article! Really! And sorry again: nope, I basically wrote this article for professionals, NOT for average readers.
We often hear about the power of AI-generated content, and related criticism as well, without any valuable explanation. I mean evidence-based elaborations built on empirical observations.
The question is more difficult than "are AI tools your friends or foes?"
In my own LinkedIn timeline I often see text posts which follow a very similar structure, with hidden, encapsulated meanings from a semantic point of view. In almost every case, most of these posts seem:
0x100. Grammatically too perfect, with similar nuances
Yep, search engines and suggestion algorithms prefer good grammar and style, and down-rank texts which show poor grammar. But the world is continuously changing: grammar that is too perfect is more than suspicious, and a big red flag for me.
The most affected languages, not surprisingly: English, Spanish, and most probably Russian and Chinese, which I haven't examined.
Bear in mind the following: perfect grammar simply doesn't exist; grammar is considered a descriptive discipline within linguistics. And no one can write grammatically perfect texts, including native speakers. The hidden logic under the hood can be detected through a lesser-known property of a text: the idiolect. In other words: your sentences and communicative acts are your unique fingerprint, which is more than useful in forensic linguistic applications.
A given person's idiolectal markers appear at all levels of the linguistic "layers": syntax, morphology, semantics, and pragmatics. While cutting-edge spell checkers and similar applied language technology tools can easily fix typos and grammar mistakes, the typical logical structure of a sentence originating from a person stays essentially the same, and is a representation of the individual's logical and decision schemes. These attributes are almost persistent; we use the same structures during our entire lives.
The distribution of punctuation marks in a text is also individual, but authorship attribution needs a bigger text sample for successful author identification.
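To make this concrete, here is a minimal Python sketch (my own toy illustration, not a forensic-grade tool; the two sample strings are made up): it compares the punctuation-mark distributions of two text samples with cosine similarity. A real authorship attribution workflow uses much larger samples and many more idiolectal features.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?-()\"'"

def punctuation_profile(text: str) -> dict:
    """Relative frequency of each punctuation mark in the text."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    total = sum(counts.values()) or 1
    return {mark: counts[mark] / total for mark in PUNCTUATION}

def cosine_similarity(p: dict, q: dict) -> float:
    """Cosine similarity of two punctuation profiles (1.0 = identical)."""
    dot = sum(p[m] * q[m] for m in PUNCTUATION)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Hypothetical samples; a real comparison needs much longer texts.
known_sample = "Well, I think - as usual - that this is fine; really, it is."
questioned_sample = "Honestly, this is fine. It is, as always, acceptable."
print(cosine_similarity(punctuation_profile(known_sample),
                        punctuation_profile(questioned_sample)))
```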
Nope, you cannot be smarter than forensic linguistics experts, even if you are familiar with language technology or work as a forensic linguist yourself.
First example: if somebody writes content anonymously in their own style and for some reason this person - who might be a criminal or an activist - tries anti-forensic techniques, the most common attempts are:
Breaking long sentences into shorter ones.
Putting grammar mistakes into the text intentionally. Just think about it: if you read a text which, based on the phrases used and the logical structure of the sentences, clearly originates from a well-educated person, a grammar mistake is more than suspicious at first sight.
I'll try to create a similar example in my native language [Hungarian]:
Original:
"Komolyan falra mászok a félm?velt, álintellektuális kontentgyárosoktól, akik egy maroknyi k?nyvet nem ovlastak életükben és ?k osztják az eszet arról, hogy milyen a jó tartalom, holott tudjuk, hogy ami a jó tartalom készítésének egyik feltétele a megfelel? olvasási kultúra és a témában való jártasság egy általános intellektus mellett."
Manipulated #1, without proper punctuation:
"Komolyan falra mászok a félm?velt álintellektuális kontentgyárosoktól akik egy maroknyi k?nyvet nem ovlastak életükben és ?k osztják az eszet arról hogy milyen a jó tartalom holott tudjuk, hogy ami a jó tartalom készítésének egyik feltétele a megfelel? olvasási kultúra és a témában való jártasság egy általános intellektus mellett."
The most common Hungarian conjunctions are:

de, mert, hogy, mint, ha ["but", "because", "that", "than/as", "if"]

These words establish the logical connection between parts of a sentence, and they are almost always set off by a comma.
The anomaly is trivial: if someone writes (1) on a semi-scientific topic, (2) educational content, (3) mixing terminus technicus with informal words, it is very unlikely that the author doesn't follow elementary grammar rules.
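A toy version of this comma check in Python (my own sketch, assuming a heavily simplified reading of the Hungarian punctuation rules; the example sentences are made up):

```python
import re

# Common Hungarian conjunctions that are normally preceded by a comma
# when they join clauses (sentence-initial occurrences are skipped).
CONJUNCTIONS = {"de", "mert", "hogy", "mint", "ha"}

def missing_comma_ratio(text: str) -> float:
    """Share of conjunction occurrences that lack a preceding comma."""
    tokens = re.findall(r"\w+|[.,;:!?]", text)
    hits, misses = 0, 0
    for i, tok in enumerate(tokens):
        if tok.lower() in CONJUNCTIONS and i > 0:
            hits += 1
            if tokens[i - 1] != ",":
                misses += 1
    return misses / hits if hits else 0.0

with_commas = "Tudjuk, hogy a jó tartalomhoz olvasási kultúra kell, de ez ritka."
without_commas = "Tudjuk hogy a jó tartalomhoz olvasási kultúra kell de ez ritka."
print(missing_comma_ratio(with_commas))     # 0.0 - commas where expected
print(missing_comma_ratio(without_commas))  # 1.0 - every comma is missing
```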
Manipulated #2, fragmented into short sentences:
"Komolyan falra mászok a félm?velt álintellektuális kontentgyárosoktól. Akik egy maroknyi k?nyvet nem ovlastak életükben és ?k osztják az eszet arról hogy milyen a jó tartalom. Holott tudjuk, hogy ami a jó tartalom készítésének egyik feltétele a megfelel? olvasási kultúra. és a témában való jártasság egy általános intellektus mellett."
The long sentence is fragmented into shorter ones, which is at least unusual and unfamiliar. If this appears in a longer written text where many other sentences are long and not fragmented, it is in fact serious evidence that the text was manipulated somehow.
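This fragmentation anomaly is easy to approximate in code. A quick-and-dirty Python sketch (illustrative only; the sentence splitter is naive and the threshold is an arbitrary assumption of mine): it flags sentences that are unusually short for the given text.

```python
import re
import statistics

def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, using a naive sentence splitter."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def fragmentation_outliers(text: str, z_threshold: float = -0.8) -> list[int]:
    """Indices of sentences that are unusually short for this text."""
    lengths = sentence_lengths(text)
    if len(lengths) < 3:
        return []
    mean = statistics.mean(lengths)
    stdev = statistics.stdev(lengths) or 1.0
    return [i for i, n in enumerate(lengths) if (n - mean) / stdev < z_threshold]

text = ("This is a long sentence that keeps going with many clauses and words. "
        "Another long sentence follows with a comparable number of words in it. "
        "Short. Very short. And one more long sentence appears here at the end "
        "with plenty of words again.")
print(fragmentation_outliers(text))  # flags the two suspiciously short sentences
```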
Another example
Just imagine the following sentence; I will write the example in Hungarian again:
Original:
"Ezek annyira kóklerek, hogy a mérésnél nem jegyezték pontosan sem az átlagot, sem a szórásnégyzetet, sem sem a módusz, sem a mediánt, r?viden szólva dilettáns hülyék."
Manipulated:
"Ezek annyira kóklerek, hogy a mérésnél nem jegyezték pontosan sem az átlagot, sem a szórásnégyzetet, sem sem a módusz, sem a mediánt, r?viden szólva dilettáns hüjék."
If somebody pays attention to allegedly serious research mistakes and uses multiple terminus technicus, a hair-raising, serious spelling mistake at the end of the sentence is more than unlikely in this context: "hüjék" instead of "hülyék". Moreover, the author of the text used a word which is strongly tied to literacy: "dilettáns" is sometimes used by educated people as a synonym of "hülye", but only by educated people.
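The same register-mismatch heuristic can be sketched in a few lines of Python (my illustration; both word lists are hypothetical stand-ins, and a real tool would need proper lexicons):

```python
# Hypothetical stand-in lists: technical vocabulary versus crude,
# "uneducated" misspellings that clash with that vocabulary.
TECHNICAL_TERMS = {"szórásnégyzet", "módusz", "medián", "átlag"}
SEVERE_MISSPELLINGS = {"hüjék"}  # the expected form is "hülyék"

def register_mismatch(text: str) -> bool:
    """True if the text mixes technical terms with severe misspellings."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    uses_terms = any(t in w for t in TECHNICAL_TERMS for w in words)
    has_errors = any(e in words for e in SEVERE_MISSPELLINGS)
    return uses_terms and has_errors

sample = ("Ezek annyira kóklerek, hogy nem jegyezték sem az átlagot, "
          "sem a móduszt, sem a mediánt, röviden szólva dilettáns hüjék.")
print(register_mismatch(sample))  # True: technical terms plus a crude misspelling
```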
In summary: grammar is never perfect, even if the text originates from a native speaker. The cheap generative AI solutions can only generate suspiciously perfect texts, without any mistakes, AND they often create weird, awkward, false logical connections between facts within a sentence. Creepy or not, the too-perfect grammar hijacks your cognition, and you probably won't detect the meaningless, dumb content.
If you have read many books and then read a generative-AI-written article, you feel that something is unfamiliar, but you don't evaluate the text with systematic methods. If linguistics is a crucial part of your daily job, that is a different situation.
Many times I read suspiciously popular English posts, especially on LinkedIn; I know these are AI-generated posts, but I'm not able to prove it, because I'm not a linguistics expert. But I have some observations about the common properties and attributes of these posts. Some of these:
1. The posts begin with a sentence which seems an insightful, grandiose, deep thought, and which offers a solution to a common issue/problem that affects almost everybody. The perfect clickbait!
2. The post continues with some bullet points: structured, therefore easy to skim quickly, and not dependent on the previous knowledge, proficiency, literacy, or intellectual capacity of the readers. In other words: the message targets average people, which is - by the definition of average - the widest audience.
Nobody will feel too dumb, because the text is easy for everybody to understand. In addition, the post isn't too dumb, so educated people will also read it. One of the causes is the "emotional hack": while visitors read the content, they feel that the values conveyed in the post harmonize with theirs. The content generators can create almost fully unbiased, politically correct content. Language philosophy and intercultural communication are deep waters. Don't get me wrong: politically correct and unbiased communication matters. But! A lot of evidence-based research explains how politically correct and unbiased communication can be counterproductive or dangerous in many cases.
Let's take a break at this point. As a former volunteer, I talked to many people from different parts of the world: for example, citizens who live in the most dangerous parts of the globe, e.g. in Palestine, Israel, Ukraine, Russia, and I'm very proud that I was empathic in every case. In other cases, I talked to others who live in deep poverty or in horrifically unequal cultures. My suggestion might seem passive-aggressive or arrogant to many readers: we are humans. Please try to help directly, for free, women who live in states where they have no rights, or help veterans who literally lost everything, before you say anything about global issues.
Back to the original topic: an article, a post, or even a strictly peer-reviewed scientific publication which is fully accepted by the audience without any criticism is completely unrealistic. Fun fact: peer-reviewed publications in the field of higher mathematics aren't an exception!
3. Another marker of generative-AI-created posts: these posts frequently end with a sophisticatedly written call-to-action and offer a quick and easy solution to a general or specific issue.
4. Awkward or not, these posts receive a lot of reactions and comments, and I have some hypotheses about why. The psychologically primed reader will at least react to or comment on a post which might be overall fully dumb and 100% bullshit, or which contains a few trivial but basically useless things. If the readers somehow feel addressed, then after the first comment others will feel involved in commenting on the same article. Some readers will overthink the article and write comments, or continue the discourse started by the originally contentless article with another one, which in some cases contains unique, valuable, and meaningful thoughts. But frequently the new article will be similarly meaningless.
Let's recognize another risk: while cheap AI tools are available to everybody and most people use these tools without proper knowledge, this silently generates a hard bubble effect which is reorganizing current communities; they will be more separated than ever, and inequality will be bigger than ever before in the history of mankind. People will not recognize the unchained singularity before it happens. Many people and companies will get stuck with the dumb AI they unintentionally got for free, and many businesses will be ruined. As I mentioned earlier, it's a bit similar to the dotcom bubble of 1999, just bigger, regarding the elevated impact of ICT on the entire economy.
Nope, a cheap artificial intelligence tool basically can't write better articles than a human. A discipline-specialized artificial intelligence can probably write high-quality content, but never alone: this way of content creation still needs machine-human interaction. For example, an expert must know how to configure the system to write an academic paper in a great style, regarding the audience, etc., but that is not the final version. The professionals must review their almost-final paper to spot and fix the mistakes.
One more interesting historical outlook for zoomers: the "machine professors" are older than you think. A notable example is the SCIgen article generator from the early 2000s. In 2005 SCIgen authored the scientific paper
"Rooter: A Methodology for the Typical Unification of Access Points and Redundancy"
which doesn't contain any meaningful information; but the World Multiconference on Systemics, Cybernetics and Informatics program committee accepted the paper and invited the authors to the conference. IMHO the main causes of SCIgen's success: the generated scientific paper perfectly followed the required academic format and partially the terminology, the topic seemed too difficult, and the members of the program committee somehow didn't spot the cheat. In addition, publication pressure interfered with this.
One more time: basically, don't try to be smarter than others, and don't try to be smarter than the machine brains. The useful platforms, not just the social media platforms, will spot AI-generated content and silently down-rank it after assigning some additional risk score to the content. Obviously, the penalty will affect the user who published the AI-generated trash.
If you hear about a not-too-expensive tool which can stylize and boost your text before you publish it somewhere, it definitely sounds good. Again: but! In the worst case it might also be dangerous. My native language is Hungarian; my grammar grades in my native language were just average in high school [an average grade on a 1-5 scale]. I have published thousands of different writings in the past two decades and was a co-author of a book many years ago - I wrote these in Hungarian, and my readers never found my articles less valuable. Now I study at Eötvös Loránd University Faculty of Law in Budapest; I have learnt a lot about legal terminology [Hungarian and English], but I think my grammar skills didn't improve. In my interpretation this is strong evidence that authenticity, logic, individuality, and added value are valued more than perfect grammar.
My suggestion: if you write something in one of your learned foreign languages, check the text with a dumb spell checker to spot the accidental serious mistakes, but don't try to polish the original text. If the AI-generation detection on the platform where you publish your material makes a mistake, the penalty is more painful than an article with a somewhat dumb style but meaningful content.
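A minimal sketch of this "dumb spell checker" workflow, assuming the pyspellchecker package (pip install pyspellchecker): it only reports likely misspellings with suggestions, and it leaves the original text untouched instead of "stylizing" it.

```python
import re
from spellchecker import SpellChecker

def report_typos(text: str, language: str = "en") -> None:
    """Print likely misspellings and suggestions; never rewrite the text."""
    spell = SpellChecker(language=language)
    words = re.findall(r"[A-Za-z']+", text)
    for word in sorted(spell.unknown(words)):
        # Suggest a correction, but let the author decide what to change.
        print(f"possible typo: {word!r} -> suggestion: {spell.correction(word)!r}")

report_typos("Thiss sentence contains a seriuos mistake.")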