Research Roundup - AI vs. Humans: The Future of Explaining Complex Ideas
Ray Fleming
Global AI and Education Industry Leader | Extensive sales & marketing experience | AI solution strategist | Customer centred thinker | Speaker | Media | PR
It's been 3 weeks since the last issue of the newsletter, and yet again, more great insights from researchers around the world into the application of AI in education!
Research of the week
"Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline
This research is all about seeing how well ChatGPT stacks up against human experts when it comes to explaining tricky concepts in a conversation. They used a dataset from WIRED's "5 Levels" series, where experts break down topics for people with different levels of understanding. They looked at how ChatGPT performed in two different ways, compared to the human responses: Firstly with a standard AI response, and secondly with an AI response that followed specific explanation strategies. What they found was pretty interesting—people generally liked ChatGPT’s standard responses better than the human ones, because they were clearer and got to the point quicker.
But when ChatGPT followed specific explanation strategies, it could dive even deeper into the conversation - although many people still preferred the standard ChatGPT response, as it was shorter. Given the three explanations side by side, the human testers preferred a ChatGPT response 5x more often than the human one - and in 60% of cases the human explanation came last, behind both ChatGPT responses.
The researchers - Grace Li, Milad Alshomary and Smaranda Muresan - are clear that AI doesn't completely replace the need for a human touch when explaining things, but they argue it has a role to play in making scientific communication better.
I use ChatGPT frequently to help me understand complex topics and dense academic writing, because it's a practical way to assess the value in the huge pile of research I review on AI in education - and it does a good job of helping me work out what the message is. It's not always 100% right, and I do double-check what it summarises for me, but it probably cuts my time in half. I think there's a huge opportunity for researchers to use AI to make their research more understandable to audiences beyond the specialist researchers they write for!
Best of the Rest
Generative AI in Real-World Workplaces
It's really important to say that this isn't a fully independent research paper, but instead comes from Microsoft Research. That means there might be some Microsoft influence in the results - and I'll bet that if it hadn't found productivity improvements, it wouldn't have been published. The paper names 8 editors (Sonia Jaffe, Neha Shah, Dr. Jenna Butler, Alex Farach, Alexia Cambon, Brent Hecht, Michael Schwarz and Jaime Teevan) and 46 contributing researchers, which is more than 5 per page of the paper, and the longest list of any of the nearly 150 papers I've reviewed! So, with that proviso covered, let's dive in:
This is the second report on AI and productivity that explores how generative AI tools - specifically Microsoft Copilot - are being used in real-world workplaces to boost productivity. Although it's not about AI in education, it's very revealing about how businesses are using generative AI, and useful for educators to stay on top of.
The report summarises the outcomes of a number of different research studies, including the largest randomised controlled trial on this topic involving 60 companies and 6,000 employees. The research shows that AI can help workers save time, improve accuracy, and enhance overall job performance. On average they found users read 11% fewer emails and spent 4% less time interacting with them. It also looked at the impact on meeting attendance. This was fascinating to read in the report - 1 in 5 organisations showed a 13% decrease in the number of meetings people attended, whilst 1 in 3 saw an 18% increase. Users with Copilot also created 10% more documents.
However, these benefits are not uniform - productivity gains vary significantly by role, function and organisation, as well as by the specific tasks people perform and how effectively they integrate AI into their workflows. So people with communication-focused work involving repetition and content creation saw the most benefit, whereas people with more variable or complex tasks, like legal and R&D, reported fewer benefits.
The report emphasises that while AI is already making a significant positive impact, its full potential will be realised as workplaces continue to adapt and optimise their use of AI technologies.
Can Large Language Models Make the Grade?
This study explored how well GPT-4 can grade short-answer questions for K-12 students in subjects like Science and History. Using a dataset from the Carousel quizzing platform containing 1,700 open-ended student responses, researchers compared the AI's performance to human teachers. They found that GPT-4, particularly with few-shot prompting (where you give the model a few examples of how things are done), achieved a Cohen's kappa score of 0.70, close to the human raters' score of 0.75 - Cohen's kappa is a statistic that measures how well two raters agree, after correcting for chance agreement. This indicates that GPT-4 can almost match human accuracy in assessing student responses. The study highlights that AI could significantly reduce the time and resources required for formative assessments in education while maintaining reliability.
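If you're wondering what a kappa of 0.70 actually means: it's raw agreement adjusted for the agreement two raters would reach by chance, κ = (p_o − p_e) / (1 − p_e). Here's a minimal sketch of how such a score is computed - the grades below are invented for illustration, not from the Carousel dataset:

```python
# A minimal sketch of computing an agreement score like the one reported
# in the study. The grades below are made up for illustration.
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades (0 = incorrect, 1 = partially correct, 2 = correct)
# for the same ten student answers: one set from a teacher, one from GPT-4.
teacher_grades = [2, 1, 0, 2, 2, 1, 0, 1, 2, 0]
gpt4_grades = [2, 1, 0, 2, 1, 1, 0, 2, 2, 0]

# Cohen's kappa corrects raw agreement for the agreement you'd expect by
# chance alone: 1.0 is perfect agreement, 0.0 is chance-level agreement.
kappa = cohen_kappa_score(teacher_grades, gpt4_grades)
print(f"Cohen's kappa: {kappa:.2f}")
```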
The researchers that worked on this were Owen Henkel from the University of Oxford, Libby Hills at the Jacobs Foundation, Adam Boxer at Carousel Learning and Zachary Levonian at the Digital Harbor Foundation.
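And to make the few-shot prompting idea from this study concrete, here's a toy version of what a few-shot grading prompt can look like - the rubric and examples are my own invention, not the team's actual prompts:

```python
# An illustration of few-shot prompting for grading: the model sees a few
# worked examples before the answer it has to grade. Everything below is
# invented for illustration, not taken from the Carousel dataset.
FEW_SHOT_GRADING_PROMPT = """\
Grade each answer as CORRECT, PARTIAL, or INCORRECT.

Question: Why do objects fall when dropped?
Answer: Because gravity pulls them towards the Earth.
Grade: CORRECT

Question: Why do objects fall when dropped?
Answer: Because they are heavy.
Grade: PARTIAL

Question: Why do objects fall when dropped?
Answer: Because of magnetism.
Grade: INCORRECT

Question: Why do objects fall when dropped?
Answer: {student_answer}
Grade:"""

# The filled-in prompt would then be sent to the grading model.
print(FEW_SHOT_GRADING_PROMPT.format(student_answer="Gravity acts on their mass."))
```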
PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection
This research paper - from Joo Young Lee, Toshini Agrawal, Uchendu Uchendu, Thai Le, Jinghui Chen & Dr Dongwon Lee - focuses on the dual role of large language models (LLMs) in generating and detecting plagiarism. They created PlagBench, a dataset containing over 46,000 instances of synthetic plagiarism, which includes verbatim, paraphrased and summarised content generated by various LLMs like GPT-3.5, GPT-4 and Llama 2. The study assesses how well these models can generate plagiarised content and how effectively they can detect it compared to traditional plagiarism checkers. Just to be clear, what they did was take original text (which would be direct plagiarism if you put it into another assignment or paper without credit) and ask the LLM to rewrite it by paraphrasing or summarising it. I guess students would hope that doing this would get them past a school or university's existing plagiarism checkers. What they found is that if you gave an LLM both the original text and the modified text, it could tell you whether they were related more accurately than existing plagiarism checkers could.
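In practice, that pairwise set-up looks something like the sketch below - the prompt wording and labels are my own illustration of the approach, not the exact prompts used in PlagBench:

```python
# A rough sketch of the pairwise detection set-up described above: the
# checker is given BOTH the original source and the suspect text, and the
# LLM is asked whether the second is derived from the first. The prompt
# wording is an invented illustration, not the one used in PlagBench.
def build_detection_prompt(original: str, suspect: str) -> str:
    return (
        "You are a plagiarism checker. Compare the two passages below.\n"
        "Answer with one of: VERBATIM, PARAPHRASE, SUMMARY, UNRELATED.\n\n"
        f"Original passage:\n{original}\n\n"
        f"Suspect passage:\n{suspect}\n"
    )

prompt = build_detection_prompt(
    "The mitochondrion is the site of aerobic respiration in the cell.",
    "Aerobic respiration takes place inside the cell's mitochondria.",
)
print(prompt)  # This string would then be sent to GPT-4, Llama 2, etc.
```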
I'm going to say this is interesting, but not earth-shattering research, for a couple of reasons. The first is that they talk about detection rates but don't mention the issue of false positives (ie accusing innocent students of cheating), and without that we can't fairly judge the results. If a detector is 100% accurate on plagiarised materials but also falsely labels 10% of other materials as plagiarised, then that's really, really important to know - see the quick calculation below. The second reason is that you have to give it the original text to be able to check anything. So you've got to have a very strong suspicion, and the source material in hand, to even start the checking process. For those looking for the golden answer to plagiarism checking and LLMs, this is a small step on the journey, not the destination! It also illustrates that we're probably chasing an arms race down a rathole - we've gone from worrying about students plagiarising, to students writing with ChatGPT, to students plagiarising with ChatGPT. Where next? And at what point will we create a situation where a student really can't write anything without being accused of cheating? It's whack-a-mole with an infinite number of moles, and what I've always learnt from playing whack-a-mole is that eventually you always lose…
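Here's that false-positive point as a back-of-envelope calculation, with invented figures (the paper doesn't report these numbers):

```python
# Why false positive rates matter: a back-of-envelope calculation using
# made-up numbers, not figures from the PlagBench paper.
submissions = 1000
plagiarism_rate = 0.05        # assume 5% of work is actually plagiarised
detection_rate = 1.00         # detector catches every real case...
false_positive_rate = 0.10    # ...but also flags 10% of honest work

true_positives = submissions * plagiarism_rate * detection_rate               # 50
false_positives = submissions * (1 - plagiarism_rate) * false_positive_rate   # 95

flagged = true_positives + false_positives
print(f"Flagged: {flagged:.0f}, of which {false_positives:.0f} "
      f"({false_positives / flagged:.0%}) are innocent students")
# Under these assumptions, nearly two-thirds of accused students did
# nothing wrong - which is why detection rates alone don't tell the story.
```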
The global landscape of academic guidelines for generative AI and Large Language Models
Junfeng Jiao, Saleh Afroogh, Kevin Chen, David Atkinson and Amit Dhurandhar produced this study, which provides an in-depth analysis of the academic guidelines surrounding the use of Generative AI (GAI) and Large Language Models (LLMs) in educational settings across the globe. It's a comprehensive survey of 80 university-level guidelines, emphasising the benefits of GAI and LLMs, such as enhancing creativity, improving access to education, and supporting personalised learning. However, it also discusses the ethical challenges, including biases, fairness, privacy concerns, and the risks of misinformation. What I found really useful is that they researched - and linked to - those 80 AI guidelines from universities and systems around the world, covering 24 countries. People might find that useful for their own work and policies, especially if you're involved in a debate on what other institutions are up to!
They also highlight that some universities have different policies on AI use across different faculty areas, but they suggest that isn't really needed: the variation tends to reflect students using AI differently in different subjects (eg in maths) rather than a genuine need for separate policies.
Perceived Impact of Generative AI on Assessments: Comparing Educator and Student Perspectives in Australia, Cyprus, and the United States
Researchers from The University of Sydney Business School (Elaine Huber, Andrew Cram, Sandris Zeivots & Corina Raduescu), Cornell University (Rene Kizilcec & Adele S.), the University of Nicosia (Elena Papanastasiou) and Arizona State University (Christos Makridis) produced this research, which investigates the perceived impact of generative AI on assessments by surveying 680 students and 87 educators across three universities in Australia, Cyprus, and the United States. It found that both groups recognise that AI tools like ChatGPT significantly influence essay and coding assessments, with educators favouring adapted assessments that integrate AI to promote critical thinking. Students, meanwhile, are concerned about the potential loss of creativity and authenticity in AI-adapted assessments. The study emphasises the need for assessment reforms that balance AI's benefits with the preservation of academic integrity and the encouragement of higher-order thinking. I thought it was really key that the findings highlighted the importance of involving both educators and students in developing new assessment practices that are resilient in the face of advancing AI technologies.
Jill Watson: Scaling and Deploying an AI Conversational Agent in Online Classrooms
The Georgia Institute of Technology team that wrote this paper (Sandeep Kakar, Pratyusha Maiti, Karan Taneja, Alekhya Nandula, Gina N., Aiden Zhao, Vrinda N. and Ashok Goel) have been working for a while on Jill Watson, a scalable AI-powered virtual teaching assistant. The aim is to improve the online learning experience by answering student questions and engaging in educational conversations. The tool is built on OpenAI's ChatGPT. After I wrote about a previous paper in this newsletter, they let me know about this second paper, from the 20th International Conference on Intelligent Tutoring Systems in April.
The challenge with tools like ChatGPT is that they can sometimes "hallucinate" - meaning they might generate answers that sound convincing but are actually incorrect or off-topic. To avoid this in a classroom, Jill Watson relies on something called dense passage retrieval and retrieval-augmented generation. This means that instead of answering from a vast, general pool of knowledge, Jill Watson pulls its answers directly from the specific course materials provided by the instructor. Whether it's a textbook, lecture transcript or class syllabus, Jill Watson only uses what the teacher has approved, which keeps its responses accurate and relevant. Deployed in several classes at Georgia Tech and two community colleges, Jill Watson has shown the ability to reduce the gap between online and in-person learning by providing timely, accurate (measured at 75% to 97% accuracy) and safe answers to student queries. Early results suggest that it helps deepen students' understanding and may even positively impact their academic performance - early data shows that students who interact more frequently with Jill Watson tend to perform better academically (correlation or causation?). So it sounds like the team will be doing further research to confirm these findings.
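Retrieval-augmented generation sounds complicated, but the core pattern is simple: find the most relevant approved passage, then make the model answer from that passage alone. Here's a minimal sketch of that retrieve-then-generate loop - it uses basic TF-IDF retrieval for illustration, whereas Jill Watson itself uses dense (learned-embedding) retrieval, and the course snippets are invented:

```python
# A minimal sketch of the retrieve-then-generate pattern described above,
# using TF-IDF retrieval for simplicity. Jill Watson itself uses dense
# passage retrieval with learned embeddings; this only approximates it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Instructor-approved course materials, e.g. syllabus and lecture chunks.
passages = [
    "Assignment 2 is due on Friday at 5pm and covers search algorithms.",
    "Office hours are held on Tuesdays from 2pm to 4pm in Room 301.",
    "The midterm exam covers lectures 1 through 8, including A* search.",
]

question = "When is the second assignment due?"

# Retrieve: rank the course passages by similarity to the student's question.
vectoriser = TfidfVectorizer().fit(passages + [question])
scores = cosine_similarity(
    vectoriser.transform([question]), vectoriser.transform(passages)
)[0]
best_passage = passages[scores.argmax()]

# Generate: the LLM is constrained to answer from the retrieved passage,
# which is what keeps it grounded in what the teacher actually approved.
prompt = (
    "Answer the student's question using ONLY the course material below.\n"
    f"Course material: {best_passage}\n"
    f"Question: {question}"
)
print(prompt)  # This prompt would then be sent to the underlying LLM.
```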