NLP: AI Text Detection Techniques

DetectGPT

  • DetectGPT’s method leverages the log-probability of the text. “If an LLM produces text, each token has a conditional probability of appearing based on the previous tokens. Multiply all these conditional probabilities to obtain the (joint) probability for the text.” (source)
  • DetectGPT perturbs the text and then compares the log-probability of the original text with that of the perturbed versions. If the perturbed versions score significantly lower than the original, the text is likely AI-generated; otherwise, it is likely human-written.
  • The image below (source) displays the perturbation, scoring, and comparison process.
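
The perturb, score, and compare steps can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the real DetectGPT scores text with a large language model and creates perturbations with a mask-filling model, whereas here a toy add-one-smoothed bigram model and an adjacent-word swap stand in for both. The names make_bigram_log_prob, swap_perturb, and perturbation_discrepancy are hypothetical and not part of any library.

```python
import math
import random
from collections import Counter
from statistics import mean


def make_bigram_log_prob(corpus):
    """Build a toy add-one-smoothed bigram scorer (stand-in for an LLM)."""
    tokens = corpus.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens)) + 1  # one extra slot for unseen words

    def log_prob(text):
        words = text.lower().split()
        total = 0.0
        # Sum of conditional log-probabilities = log of the joint probability.
        for prev, cur in zip(words, words[1:]):
            total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        return total

    return log_prob


def swap_perturb(text, rng):
    """Swap two adjacent words (assumption: a crude stand-in for mask filling)."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def perturbation_discrepancy(text, log_prob, perturb, n=20, seed=0):
    """Original log-probability minus the mean log-probability of n perturbations."""
    rng = random.Random(seed)
    perturbed_scores = [log_prob(perturb(text, rng)) for _ in range(n)]
    return log_prob(text) - mean(perturbed_scores)


if __name__ == "__main__":
    scorer = make_bigram_log_prob("the cat sat on the mat")
    score = perturbation_discrepancy("the cat sat on the mat", scorer, swap_perturb)
    # A clearly positive value means the perturbed versions scored lower than the original.
    print(f"perturbation discrepancy: {score:.3f}")
```

A large positive discrepancy means the original text is noticeably more probable under the scoring model than its perturbed versions, which is the signal described above for AI-generated text. Any decision threshold would have to be chosen empirically.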

Stylometry

  • Stylometry is the study of linguistic style, and it’s often used to attribute authorship to anonymous or disputed documents.
  • In AI detection, stylometry can be used to identify the distinct ‘style’ of a particular AI model, based on certain features like word usage, sentence structure, and other linguistic patterns.
  • Stylometry is a field of study within computational linguistics and digital humanities that involves the quantification and analysis of literary style through various statistical and machine learning methods. The core premise of stylometry is that authors have a distinct and quantifiable literary “fingerprint” that can be analyzed and compared.
  • Stylometric analysis is performed using features such as word length, sentence length, vocabulary richness, frequency of function words, and usage of certain phrases or structures. These features are then subjected to various statistical analyses to identify patterns.
  • Here are a few applications and techniques used in stylometry:
      • Authorship Attribution: Stylometry can help identify or confirm the author of a text based on stylistic features. This can be useful in literary studies, forensics, and even in cases of disputed authorship or anonymous texts.
      • Author Profiling: By analyzing the stylistic features of a text, it’s possible to make predictions about the author’s demographics, including age, gender, native language, or even psychological traits.
      • Text Categorization: Stylometry can also be used to classify texts into different genres, types (fiction vs non-fiction), or time periods based on stylistic features.
      • Machine Learning Techniques in Stylometry: In recent years, machine learning techniques have been increasingly applied to stylometry. Methods such as Support Vector Machines (SVM), Random Forests, and Neural Networks are used to classify texts based on stylistic features.
      • N-gram Analysis: This is a common technique in stylometry that involves counting sequences of ‘n’ words or characters. N-gram analysis can help capture patterns of language use that are distinctive to a particular author.
      • Function Words Analysis: Function words are words that have little meaning on their own but play a crucial role in grammar (like ‘and’, ‘the’, ‘is’). Authors tend to use function words unconsciously and consistently, making them a valuable tool for stylometric analysis.
  • It should be noted that while stylometry can be powerful, it also has its limitations. The results can be influenced by factors such as the genre of the text, the author’s conscious changes in style, and the influence of co-authors or editors. Furthermore, while stylometry can suggest patterns and correlations, it can’t definitively prove authorship or intent.
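
As a rough illustration of the features listed above (word length, sentence length, vocabulary richness, function-word frequency, and n-grams), the following sketch extracts them with the Python standard library. The function names, the tiny FUNCTION_WORDS list, and the regex-based tokenization are assumptions made for this example; a real stylometric study would use a far larger function-word list and a proper tokenizer.

```python
import re
from collections import Counter

# Small illustrative list (assumption); real studies use hundreds of function words.
FUNCTION_WORDS = ("and", "the", "is", "of", "to", "a", "in")


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens (naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def stylometric_features(text):
    """Return a dictionary of simple stylometric features for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    counts = Counter(words)
    n_words = len(words)
    features = {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": n_words / len(sentences),  # words per sentence
        "type_token_ratio": len(counts) / n_words,  # vocabulary richness
    }
    for fw in FUNCTION_WORDS:
        features[f"freq_{fw}"] = counts[fw] / n_words
    return features


def word_ngrams(text, n=2):
    """Count word n-grams; note that they may span sentence boundaries."""
    words = tokenize(text)
    return Counter(zip(*(words[i:] for i in range(n))))


if __name__ == "__main__":
    sample = "The cat sat. The dog sat and the cat ran."
    print(stylometric_features(sample))
    print(word_ngrams(sample).most_common(3))
```

Feature dictionaries like this could then be turned into vectors and passed to a classifier such as an SVM or a random forest, as mentioned in the list above.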

GPTZero

  • “GPTZero computes perplexity values. The perplexity is related to the log-probability of the text mentioned for DetectGPT above. The perplexity is the exponent of the negative log-probability. So, the lower the perplexity, the less random the text. Large language models learn to maximize the text probability, which means minimizing the negative log-probability, which in turn means minimizing the perplexity. GPTZero then assumes the lower perplexity are more likely generated by an AI. Limitations: see DetectGPT above. Furthermore, GPTZero only approximates the perplexity values by using a linear model.” (Sebastian Raschka)
  • In addition, as we can see in the image below (source), GPTZero now also takes burstiness into consideration to check how similar the text is to AI patterns of writing. Human-written text has changes in style and tone, whereas AI content remains more consistent throughout.
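
To make the two signals concrete, here is a minimal sketch that computes perplexity from per-token log-probabilities, using the common definition of perplexity as the exponential of the average negative log-probability per token. GPTZero’s actual burstiness measure is not described in the text above, so the burstiness function below is an assumption: it simply measures how much perplexity varies from sentence to sentence. The input log-probabilities are made-up numbers, not output from a real model.

```python
import math
from statistics import pstdev


def perplexity(token_log_probs):
    """Exponential of the average negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def burstiness(sentence_token_log_probs):
    """Assumed proxy: standard deviation of per-sentence perplexities."""
    perplexities = [perplexity(lp) for lp in sentence_token_log_probs]
    return pstdev(perplexities)


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities for two texts of two sentences each.
    uniform_text = [[math.log(0.5)] * 4, [math.log(0.5)] * 3]
    varied_text = [[math.log(0.5)] * 4, [math.log(0.125)] * 3]
    print("uniform burstiness:", burstiness(uniform_text))
    print("varied burstiness:", burstiness(varied_text))
```

Under this proxy, a text whose sentences all have similar perplexity gets a burstiness near zero, which matches the description of AI content as more consistent throughout, while a text that alternates between predictable and surprising sentences gets a higher value.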

