Comparing AI-Generated Text with Human Language
Illustration of the Self-Attention Mechanism from Vaswani et al. (2017) by Janosh Riebesell


Large Language Models (LLMs)

Recent advances in artificial intelligence (AI) have led to the development of large language models (LLMs) with the transformative ability to generate fluent, grammatical text by predicting the likelihood of word sequences learned from enormous amounts of training data (Brown et al., 2020; Radford et al., 2019). The rapid progress in natural language generation (NLG) over a relatively short time makes it increasingly difficult to distinguish whether content across various tasks was produced by an AI algorithm or a human (Guo et al., 2023). OpenAI's ChatGPT is remarkably good at mimicking human language in response to user prompts, but it can also confidently produce inaccurate information about people, places, or facts. An important open question is how to evaluate such complex AI models (Celikyilmaz et al., 2021). The increase in AI-generated content has raised concerns about academic integrity, detection of artificial content, and the spread of inaccurate or misleading information (Uzun, 2023). Although LLMs can generate high-quality, grammatically correct text that resembles human language, a gap remains between the level of detail and overall quality of text produced by AI language models and that of human writing (Liao et al., 2023; Ma et al., 2023).
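The core mechanism mentioned above, predicting the likelihood of the next word, can be illustrated with a minimal sketch. The three-word vocabulary and the raw scores below are made up for demonstration; real models operate over tens of thousands of tokens with scores produced by a neural network.

```python
import math
import random

def softmax(scores):
    """Convert raw model scores (logits) into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates and scores for the next word after "The cat sat on the"
vocab = ["mat", "moon", "carburetor"]
logits = [3.2, 1.1, -2.0]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")

# Sampling from the distribution (rather than always taking the top word)
# is what makes generated text varied instead of deterministic.
next_word = random.choices(vocab, weights=probs, k=1)[0]
```

Repeating this predict-and-sample step token by token is, at a high level, how an LLM produces a passage of text.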

Detecting AI-Generated Content

A growing concern in educational settings is that students could misuse AI technologies to cheat on assignments and exams (Dou et al., 2022). Developing tools and strategies to detect artificially generated content is an active area of research with applications in education, journalism, and social media (e.g., GPTZero, Turnitin, metadata analysis, stylometric analysis). Detection methods are limited by the possibility of manipulated metadata and by their reliance on machine learning algorithms that require large amounts of training data. Moreover, detector systems depend on systematic differences between human and machine text, while the explicit goal of AI development is to make machine-generated text as close to human language as possible. LLMs have even been used to solve introductory-level programming assignments while bypassing plagiarism detection tools (Biderman & Raff, 2022). Methods for marking AI-generated content, such as embedded 'watermarks' or 'accents', have been proposed to facilitate the detection of artificial content and reduce the potential for misuse (Kirchenbauer et al., 2023).

How Accurate is AI-Generated Content?

It is easy to get caught up in the excitement about powerful new AI models, but how well do LLMs actually perform on challenging tasks? Researchers at Purdue University analyzed ChatGPT's answers to programming questions from Stack Overflow in terms of correctness, consistency, comprehensiveness, and conciseness (Kabir et al., 2023). Nearly half of the AI-generated responses were correct, yet almost forty percent of human reviewers were persuaded by the lengthy, detailed responses ChatGPT generated. Its comprehensive, articulate answers and polite, authoritative style made some completely wrong answers seem correct. Human reviewers were better at identifying errors in ChatGPT responses when the error was obvious; when it was not easily identified, users often failed to detect or underestimated the errors in AI-generated responses. The confident way ChatGPT conveys information gained users' trust, leading them to accept and even prefer answers that were incorrect. Jakesch and colleagues (2023) proposed that innate heuristics from human communication and self-presentation (e.g., first-person pronouns, contractions, family topics) can undermine judgments about AI-generated content that may be inaccurate or misleading.

Comparable versus Equivalent

LLMs are transformer models with a self-attention mechanism (Vaswani et al., 2017) whose outputs are generated by processing terabytes of data to produce the most probable sequence of words. Currently, AI-generated text is comparable to human language in grammar and fluency, but it does not appear to be equivalent in factual accuracy or overall quality. A recent study by Chen et al. (2023) examined how ChatGPT is changing over time and found that performance varied across tasks: gains on one task occurred alongside decreased performance on another. In part, this may be due to model tuning, that is, how learning on new tasks affects performance on previously learned tasks. Evaluation of AI models is limited because they are "black boxes": the complex computations of their deep learning architectures are not interpretable at a human level. Model predictions on novel inputs can also retain the biases of the data a model was trained on, and LLMs may be overfit to their training data. Furthermore, AI models can generate content that is incorrect or misleading without the knowledge required to understand why an error is wrong. AI applications in sensitive areas such as health care and medicine have prompted calls to reconsider the use of LLMs until they have been more thoroughly evaluated (Armitage, 2023; Liao et al., 2023). AI is a powerful change agent that will have a lasting effect on human communication, and it will remain important to understand how AI content is generated and evaluated as the distinction between human and AI content becomes less clear.
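The self-attention mechanism referred to above computes, for each token, a weighted mix of all other tokens' values: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, following Vaswani et al. (2017). A minimal pure-Python sketch, using tiny made-up 3-token, 2-dimensional matrices in place of learned projections:

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                       # query-key similarities
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]    # each row sums to 1
    return matmul(weights, V), weights

# Hypothetical queries, keys, and values for a 3-token sequence
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = self_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, which is why attention lets every position draw on context from the whole sequence.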

References

Armitage, H. (2023). Rethinking large language models in medicine. Stanford Scope Blog. https://scopeblog.stanford.edu/2023/08/07/rethinking-large-language-models-in-medicine/

Biderman, S., & Raff, E. (2022). Fooling MOSS detection with pretrained language models. arXiv:2201.07406. https://doi.org/10.48550/arXiv.2201.07406

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., … & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. https://doi.org/10.48550/arXiv.2005.14165

Celikyilmaz, A., Clark, E., & Gao, J. (2021). Evaluation of text generation: A survey. arXiv:2006.14799. https://doi.org/10.48550/arXiv.2006.14799

Chen, L., Zaharia, M., & Zou, J. (2023). How is ChatGPT's behavior changing over time? arXiv:2307.09009. https://doi.org/10.48550/arXiv.2307.09009

Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N., & Choi, Y. (2022). Is GPT-3 text indistinguishable from human text? Scarecrow: A framework for scrutinizing machine text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 7250–7274. https://doi.org/10.48550/arXiv.2107.01294

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., & Wu, Y. (2023). How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv:2301.07597. https://doi.org/10.48550/arXiv.2301.07597

Jakesch, M., Hancock, J., & Naaman, M. (2023). Human heuristics for AI-generated language are flawed. PNAS, 120(11), e2208839120. https://doi.org/10.1073/pnas.2208839120

Kabir, S., Udo-Imeh, D. N., Kou, B., & Zhang, T. (2023). Who answers it better? An in-depth analysis of ChatGPT and Stack Overflow answers to software engineering questions. arXiv:2308.02312. https://doi.org/10.48550/arXiv.2308.02312

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. arXiv:2301.10226. https://doi.org/10.48550/arXiv.2301.10226

Liao, W., Liu, Z., Dai, H., Xu, S., Wu, Z., Zhang, Y., Huang, X., Zhu, D., Cai, H., Liu, T., & Li, X. (2023). Differentiate ChatGPT-generated and human-written medical texts. arXiv:2304.11567. https://doi.org/10.48550/arXiv.2304.11567

Ma, Y., Liu, J., & Yi, F. (2023). Is this abstract generated by AI? A research for the gap between AI-generated scientific text and human-written scientific text. arXiv:2301.10416. https://doi.org/10.48550/arXiv.2301.10416

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1, 9.

Riebesell, J. Self-attention: Illustration of the transformer attention mechanism. TikZ. Accessed August 21, 2023. https://tikz.net/self-attention/

Uzun, L. (2023). ChatGPT and academic integrity concerns: Detecting artificial intelligence generated content. Language, Education, and Technology, 3(1). https://www.langedutech.com/letjournal/index.php/let/article/view/49/36

Sean Shiverick, MS, PhD