Ways in which AI can harm, and help, learning

This week's top paper (chosen partly for its clear headline and language, and partly for its significant results) uses a negative, almost click-baity title. If we'd had the internet in 1986 when calculators were first appearing, I guess we'd have seen the same headlines, with "calculators" swapped in for "generative AI". But the insights it gives are important - it's not just about dumping AI into education and expecting long-term benefits. It's also about making good teaching and learning decisions, to make sure it's not a 'sugar hit' of AI, followed by a brain fade!

Research Paper of the Week

Generative AI Can Harm Learning

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4895486

This study from researchers at the Wharton School of the University of Pennsylvania and Budapest British International School (Hamsa Bastani, Osbert Bastani, Alp Süngü, Haosen Ge, Özge Kabakcı and Rei Mariman) investigates how generative AI, specifically GPT-4, impacts student learning in high school maths classes. It involved nearly a thousand students using two AI tutors: GPT Base, which mimics ChatGPT, and GPT Tutor, designed with safeguards to promote learning. While both AI tutors significantly improved performance during practice (a 48% improvement for GPT Base and 127% for GPT Tutor), students who used GPT Base performed worse on subsequent exams taken without AI assistance. This suggests that over-reliance on AI can hinder long-term learning. The study attributes this negative impact to students using GPT Base as a crutch, merely copying answers rather than understanding the material. In contrast, GPT Tutor, which provided incremental hints and avoided giving complete solutions, mitigated these negative effects. Students using GPT Tutor showed no significant drop in performance compared to the control group, demonstrating the importance of well-designed AI tools that support rather than replace the learning process.
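To make the distinction concrete, here's a minimal sketch of how a 'GPT Tutor'-style safeguard might be wired up with the OpenAI chat API. The prompt wording is my own illustration of the paper's idea (incremental hints, never a complete solution), not the authors' actual prompt.

```python
# A hedged sketch: a "tutor mode" system prompt that withholds full
# solutions, in contrast to a vanilla "GPT Base" chatbot.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TUTOR_SYSTEM_PROMPT = (
    "You are a maths tutor. Never give the final answer or a complete "
    "worked solution. Offer one incremental hint at a time, ask the "
    "student to attempt the next step, and only confirm or correct "
    "their reasoning."
)

def tutor_reply(student_message: str) -> str:
    """Return one incremental hint for the student's message."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study used GPT-4; any chat model fits the sketch
        messages=[
            {"role": "system", "content": TUTOR_SYSTEM_PROMPT},
            {"role": "user", "content": student_message},
        ],
    )
    return response.choices[0].message.content

print(tutor_reply("Solve 3x + 5 = 20 for me."))
```

The design choice that matters is all in the system prompt: the safeguard lives in what the model is told not to do, which is why a 'base' deployment with no such instruction behaves like an answer machine.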

This is interesting research to read despite the dramatic title (full marks for a citeable paper name!), and it gives you pause for thought about the medium-term impacts of various ways to use AI in education - and what steps you need to take to make sure it's not just a short-term 'sugar hit' to learning. I can see the parallel to calculators in maths - you can see that offloading learning to a machine might reduce learning. But in today's world we have a calculator (and a spreadsheet) as a universal tool, and so it's no longer necessary to be able to do all of those things manually or mentally. Nobody mourns the demise of the slide rule or log tables.

The rest of the AI in Education Research

ChatGPT, Copilot, Gemini, SciSpace and Wolfram versus higher education assessments: an updated multi-institutional study of the academic integrity impacts of Generative Artificial Intelligence (GenAI) on assessment, teaching and learning in engineering

https://www.tandfonline.com/doi/epdf/10.1080/22054952.2024.2372154

This study examines the performance and academic integrity implications of Generative Artificial Intelligence (GenAI) tools, including ChatGPT-4, Copilot, Gemini, SciSpace, and Wolfram, in higher education engineering assessments. The researchers carried out the initial research a year ago, and have now updated it to review the latest results. In that year, these tools demonstrated substantial improvements, particularly in passing more diverse assessment types. On average, ChatGPT increased performance by 24% for online quizzes and 41% for numerical assessments - meaning it went from typically achieving a Pass grade to a Credit grade. ChatGPT-4 emerged as a well-rounded performer with the best overall performance across diverse assessment types, while other tools showed specific strengths. The study underscores the dual-edged nature of GenAI: heightened cheating concerns, but also potential for integrating these technologies to enrich learning experiences. The paper contains a GenAI Assessment Security and Opportunity Matrix to guide educators in balancing integrity risks with innovative teaching enhancements.

Here's a scary quote from their conclusions:

"In our benchmark study in 2023, we stated that we had 12–24 months before GenAI’s capability created a serious threat. Twelve months have passed, and this paper has shown that GenAI capability has leapt ahead. While the precise advancements of the next year remain uncertain, GenAI is constantly evolving. Therefore, organisations that have not yet undergone a security audit are strongly encouraged to consider initiating one now to ensure their cur-rent safeguards remain effective"

I'd nominate this paper for two bonus awards:

(1) Longest paper name

(2) Longest author and university list - Sasha Nikolic, Rezwanul Haque, Scott Daniel, Sarah Grundy, Marina Belkina, Sarah Lyden, Ghulam Mubashar Hassan, Jally and Peter Neal - collaborating from 7 Australian universities: University of Wollongong, University of the Sunshine Coast, University of Technology Sydney, UNSW, Western Sydney University, University of Tasmania and The University of Western Australia

How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence

https://arxiv.org/abs/2406.14769

A mix of Australian researchers from The University of Queensland (Luke Zaphir, Jason M. Lodge, Jacinta Lisec, Dominic McGrath and Hassan Khosravi) wrote this paper, which discusses the impact of generative AI, specifically ChatGPT, on educational assessments that measure critical thinking. With AI becoming prevalent, they point out it's crucial to ensure that assessments genuinely reflect a student's skills rather than the capabilities of AI. The paper introduces the MAGE framework, which provides a systematic approach for evaluating the vulnerabilities of assessment tasks to generative AI. MAGE stands for Mapping, AI vulnerability testing, Grading, and Evaluation. Each step helps educators understand how well their assessments can withstand AI intervention and maintain their integrity in measuring genuine student performance. To test the framework, the researchers applied it to different assessment tasks across various disciplines. They found that ChatGPT could often produce high-quality responses when provided with well-engineered prompts. However, these responses lacked the nuanced understanding and contextual relevance that are crucial for critical thinking. (For instance, AI could accurately describe the causes of the Titanic sinking, but struggled to provide insightful, reflective, and coherent analyses without substantial prompt engineering.) The idea is that by understanding these vulnerabilities, educators can design more effective assessments that emphasise critical thinking skills which AI cannot easily replicate. The study highlights the continuing need for innovative assessment strategies to maintain academic integrity.
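If you wanted to operationalise the four MAGE steps in your own review process, something like the sketch below might be a starting point. The step names come from the paper; the record structure and the pass-mark check are my own illustration, not the authors' implementation.

```python
# A hedged sketch of a MAGE-style review record: Mapping, AI vulnerability
# testing, Grading, Evaluation. Structure and threshold are illustrative.
from dataclasses import dataclass, field

@dataclass
class AssessmentReview:
    task_description: str
    skills_mapped: list[str] = field(default_factory=list)  # Mapping: which skills the task targets
    ai_responses: list[str] = field(default_factory=list)   # AI vulnerability testing: well-prompted AI attempts
    ai_grades: list[float] = field(default_factory=list)    # Grading: marks those attempts would earn (0-1)
    verdict: str = ""                                       # Evaluation: overall judgement

def evaluate(review: AssessmentReview, pass_mark: float = 0.5) -> str:
    # If a well-prompted AI attempt would earn a passing grade, the task
    # is vulnerable and probably needs redesigning around critical thinking.
    if review.ai_grades and max(review.ai_grades) >= pass_mark:
        review.verdict = "vulnerable: AI output can pass this task"
    else:
        review.verdict = "resilient: AI output falls short of the pass mark"
    return review.verdict
```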

Understanding Students' Acceptance of ChatGPT as a Translation Tool: A UTAUT Model Analysis

https://arxiv.org/abs/2406.06254

The paper, from researchers at The Hong Kong Polytechnic University (Simin Xu and Kanglong Liu), explores how university students in Hong Kong perceive and accept ChatGPT for translation tasks. The study involved 308 students, including both translation and non-translation majors. They found that performance expectancy (the belief that using ChatGPT will enhance translation performance) and facilitating conditions (the availability of resources and support) significantly influence students' intentions to use ChatGPT. Interestingly, social influence and effort expectancy were less impactful.
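For anyone unfamiliar with UTAUT, the analysis boils down to testing which of four constructs predict intention to use a tool. Below is a minimal sketch of that kind of test; the file and column names, and the use of a plain regression, are my assumptions for illustration (UTAUT studies often use structural equation modelling instead).

```python
# A hedged sketch of a UTAUT-style analysis: regress behavioural intention
# on the four UTAUT constructs. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# One row per student; each construct averaged from its Likert-scale items.
df = pd.read_csv("utaut_survey.csv")  # hypothetical survey export

model = smf.ols(
    "behavioural_intention ~ performance_expectancy + effort_expectancy"
    " + social_influence + facilitating_conditions",
    data=df,
).fit()

# Significant positive coefficients flag the constructs driving adoption -
# in this study, performance expectancy and facilitating conditions.
print(model.summary())
```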

Supporting Self-Reflection at Scale with Large Language Models: Insights from Randomized Field Experiments in Classrooms

https://arxiv.org/abs/2406.07571

This study explored the use of Large Language Models (LLMs) to support self-reflection in educational settings. Researchers conducted two experiments in undergraduate computer science courses to test LLM-guided reflection against traditional methods like questionnaires and lecture slide reviews. In the first experiment, students using LLMs for reflection reported higher self-confidence and performed better on subsequent exams than those who did not use LLMs. The second experiment showed that both LLM-guided and questionnaire-based reflections were more effective than just reviewing lecture slides. The results underscore the utility of LLM-guided reflection and questionnaire-based activities in improving learning outcomes, and the authors conclude that focusing solely on the accuracy of LLMs can overlook their potential to enhance metacognitive skills through practices such as self-reflection. (A long list of researchers from the University of Toronto, Carnegie Mellon University and Carleton College worked on this paper, including Harsh Kumar, Ruiwei Xiao, Benjamin Lawson, Ilya M., Jiakai Shi, Huayin Luo, Joseph Jay Williams, Anna Rafferty, John Stamper and Michael Liut.)

Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

https://arxiv.org/abs/2406.09671

This research by Nabor Mendonça examines how well ChatGPT-4 Vision performs on the ENADE 2021, Brazil's national exam for undergraduate computer science students. ChatGPT-4 Vision handled both open and multiple-choice questions, presented in their original visual format. The model scored within the top 10% of participants, outperforming most humans, especially on visually-based questions. Despite its strong performance, the model faced difficulties with complex reasoning and with interpreting some questions accurately.

This quote struck me: "The involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams" - basically, I think it says that ChatGPT couldn't get some of the answers because the questions weren't clear - which sounds like a problem that would affect students too. Maybe there's a use case for running every exam through ChatGPT before giving it to students, to make sure you've got good questions in there - see the sketch below!
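Here's a minimal sketch of that 'pre-flight check' idea, using the OpenAI chat API. The prompt wording and model choice are my own illustration, not something from the paper.

```python
# A hedged sketch: ask a model to attempt each exam question and flag
# vague or ambiguous wording before students ever see the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REVIEW_PROMPT = (
    "You review draft exam questions. First answer the question, then say "
    "whether its wording is vague or ambiguous, and suggest a clearer rewrite."
)

def flag_ambiguous_questions(questions: list[str]) -> list[str]:
    """Return one review report per question."""
    reports = []
    for q in questions:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": REVIEW_PROMPT},
                {"role": "user", "content": q},
            ],
        )
        reports.append(response.choices[0].message.content)
    return reports
```

A human still makes the final call, of course - the model is just a cheap way to surface questions that a confused reader might trip over.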

Analyzing Large Language Models for Classroom Discussion Assessment

https://arxiv.org/abs/2406.08680

The research, from University of Pittsburgh researchers including Benjamin Pierce and Lindsay Clare Matsumura, investigates the use of large language models (LLMs) for the automatic assessment of classroom discussion quality. The conclusion is that LLMs can be effectively used to assess classroom discussion quality, and the paper goes deeper into the methods used to improve performance and maintain consistency. By focusing on three key factors - task formulation, context length, and few-shot examples - the study evaluates the performance, computational efficiency, and consistency of two different LLMs in scoring discussions based on the Instructional Quality Assessment (IQA) framework. The findings recommend prompting strategies - like breaking long transcripts into shorter contexts and using few-shot examples - to improve performance.
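Those two strategies are easy to picture in code. Below is a hedged sketch of chunking a transcript and prepending few-shot scoring examples; the rubric text, the 1-4 scale and the example excerpts are placeholders, not the study's actual materials.

```python
# A hedged sketch of the paper's two highlighted strategies: shorter
# contexts (chunking) and few-shot examples. All prompt text is illustrative.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = (
    "Excerpt: [teacher asks an open question, several students build on each other]\n"
    "Score (1-4): 4\n"
    "Excerpt: [teacher lectures, students give one-word answers]\n"
    "Score (1-4): 1\n"
)

def chunk(transcript: str, max_chars: int = 4000) -> list[str]:
    # Naive fixed-size chunking; a real pipeline might split on speaker turns.
    return [transcript[i:i + max_chars] for i in range(0, len(transcript), max_chars)]

def score_discussion(transcript: str) -> list[str]:
    """Score each chunk of a classroom transcript on an IQA-style scale."""
    scores = []
    for part in chunk(transcript):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You score classroom discussion quality on a 1-4 scale."},
                {"role": "user", "content": FEW_SHOT_EXAMPLES + "Excerpt: " + part + "\nScore (1-4):"},
            ],
        )
        scores.append(response.choices[0].message.content)
    return scores
```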

Student Perspectives on Using a Large Language Model (LLM) for an Assignment on Professional Ethics

https://dl.acm.org/doi/abs/10.1145/3649217.3653624

Virginia Grande and Maria Andreina Francisco at Uppsala University in Sweden produced this study, which investigates the use of Large Language Models by master's students in computing for an assignment on professional ethics. Students participated in discussions and used an LLM to explore a case study about attending a conference with ethical implications. They found the LLM helpful in expanding their understanding and providing diverse perspectives. The LLM's inability to make decisions was seen as positive, ensuring students made the final choices. The downsides they saw were that the LLM offered unfeasible options, and that some students perceived the LLM as an authoritative source rather than a tool to challenge and analyse. The researchers concluded that while LLMs can enhance learning, students need more guidance to critically analyse the AI's responses. One of the recommendations is that if you're going to use LLMs with your students, you will need to provide some support for them, so that they understand how to get the best from the experience - and what LLMs can and can't do.

ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor

https://arxiv.org/abs/2406.14765

Aylin Kamelia Caliskan at the University of Washington, Suneragiri Liyanage and Mahzarin Banaji at Harvard University, and Steve Lehr from Cangrade produced this research paper on how ChatGPT can be used in research.

The paper investigates how effective ChatGPT is in four roles crucial to scientific research: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor. While ChatGPT shows promise in identifying ethical issues and generating data based on existing knowledge, it still struggles with creating accurate bibliographies and predicting novel data outside its training set. As a Research Librarian, ChatGPT-4 reduces but does not eliminate the generation of fictional references, a problem prevalent in ChatGPT-3.5. As a Research Ethicist, GPT-4 excels, identifying and correcting most ethical lapses in research protocols, unlike its predecessor. In the role of Data Generator, both versions replicate known biases in language but struggle to produce new, accurate predictions. When tasked as a Novel Data Predictor, neither model effectively predicts new empirical results, underscoring limitations in handling data beyond their training.

As they say: "Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation". Or perhaps I could summarise it as "It's not the perfect research scientist… yet!"

70B-parameter large language models in Japanese medical question-answering

https://arxiv.org/abs/2406.14882

Issey Sukeda at The University of Tokyo is the lead author of the paper. The study found that LLMs specifically tuned with Japanese data significantly outperformed their English-centric counterparts, achieving over 50% accuracy on the Japanese National Medical License Exam. Their approach was to train the models on a translated US medical exam set, and then see how that knowledge would transfer to the Japanese equivalent. The conclusion is that the models did 'okay' in the exams, and that a model optimised to work in Japanese rather than English did better.

We don't know the pass rate for the exam, but I'm guessing it's more than 50%, so this is different from previously shared research papers that have proudly trumpeted that LLMs can pass medical exams (US, Korean and some Japanese exams) - but perhaps we should add a 'Yet…' to this example.

I don't trust you (anymore)! -- The effect of students' LLM use on Lecturer-Student-Trust in Higher Education

https://arxiv.org/abs/2406.14871

Simon Kloker, Matthew Bazanya and Kateete Twaha wrote this study, which explores the impact of students' LLM use on the trust relationship between lecturers and students. Trust is crucial for collaborative learning and research. With the rise of LLMs, lecturers face challenges in distinguishing between students' original work and AI-generated content. The study, conducted through a survey at Ndejje University in Uganda, found that lecturers are less worried about the use of LLMs itself and more about whether students are transparent about their use. The research suggests that guidelines promoting transparency can help maintain trust and improve team performance in educational settings.

Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders

https://arxiv.org/abs/2406.16510

Magnus Lundgren at Göteborgs universitet investigated how well GPT-4 grades master's level political science essays compared to human graders. Using a sample of 60 essays, the study found that GPT-4's average grades align closely with human grades but tend to be more conservative, focusing on middle-range scores. However, GPT-4 shows low interrater reliability, indicating significant differences in how GPT-4 grades compared to humans. In simpler terms, this means that if you asked GPT-4 and a human teacher to grade the same essay, they would often come up with different grades. If you look at the paper, you'll see a chart showing that humans provide a wider spread of marks (they used a 1-7 scale), whereas ChatGPT's marks were clustered just above the midpoint. But, because it gave fewer very low or very high marks, the averages looked similar.

He experimented with different grading instructions, but that didn't significantly change performance, suggesting the AI assesses based on generic essay features rather than nuanced criteria. But, and this is a big but, the prompts used didn't provide a specific marking rubric, nuanced criteria or specific essay features to look for. Overall, I'd say this paper is useful for people starting the 'Can AI help me with grading?' journey, especially as Magnus included all the prompts he used, so that you can use them as a baseline to improve from.
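If you want to see how 'similar averages, low agreement' can coexist, here's a small sketch with invented grades on the paper's 1-7 scale; the numbers are mine, not the study's data.

```python
# A hedged sketch: similar means can hide low interrater reliability.
# Grades are invented to mimic the pattern the paper describes.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([2, 7, 5, 3, 6, 4, 1, 5, 6, 2])  # wide spread of marks
gpt4 = np.array([5, 4, 5, 5, 4, 5, 4, 4, 5, 5])   # clustered just above the midpoint

print("mean (human):", human.mean())  # 4.1 - the averages look close...
print("mean (GPT-4):", gpt4.mean())   # 4.6
print("weighted kappa:", cohen_kappa_score(human, gpt4, weights="quadratic"))
# ...but the per-essay agreement statistic comes out around zero.
```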

Comments

Alp Süngü (Assistant Professor at Wharton):

Here is the link of our original paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4895486

Gary Liang (Founder @ Bloom AI | AI in Education):

The "Generative AI can harm learning" paper is one of the best papers I've read this year. I did a short write-up of it with my thoughts for those who want to understand it in more detail: https://garyliang.substack.com/p/generative-ai-can-harm-learning

Maxime Gabella (CEO @ MAGMA Learning | Creating AI Mentors for Humans in Business, Education, and Life):

Ray Fleming, what are your thoughts on more fundamental uses of AI for learning, based on education sciences? https://rdcu.be/dNH9p
