Lies, Damned Lies, and Generative AI
Federico Cesconi
Founder & CEO @sandsiv, the number one CXM solution powered by AI | Author | In love with NLP using transformers
I was attending the CXPA Apero webinar yesterday when my friend Gregorio Uglioni brought up a question: why do Generative AI projects fail? It wasn't the planned topic for the session, but it immediately captured everyone's attention.
As usual, he tried to catch me off guard, but I was well prepared because in recent weeks I had been immersed in extensive research for this article, reading numerous academic papers about the fundamental challenges facing Generative AI. This is certainly a hot topic, and its timing couldn't be more relevant - we're seeing a wave of Generative AI projects struggling or failing outright, despite the immense hype and investment surrounding this technology. The stark contrast between AI's promised potential and its practical limitations has become increasingly apparent. The reality of what Generative AI can and cannot do must be assessed precisely before launching any new venture, as the cost of misunderstanding these limitations can be substantial, both in terms of resources and organisational credibility.
Let me be clear from the start, especially to my friend Graham Hill (Dr G) - this article isn't meant to criticize the use of Large Language Models. In fact, using LLMs and experiencing failures is an essential part of the innovation process. However, what's crucial is understanding the real limitations and boundaries of this technology. Without this understanding, the risk of major project failures becomes incredibly high.
Think of it like learning to fly a new aircraft - you need to know both its capabilities and its limitations. Just because a plane has limitations in certain weather conditions doesn't mean we shouldn't fly at all. Instead, knowing these limitations helps us use the aircraft safely and effectively.
Similarly with LLMs, success depends on having a realistic understanding of what these models can and cannot do. This knowledge isn't meant to discourage innovation, but rather to guide it more effectively and avoid costly mistakes.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Scientists wanted to check how well AI language models (like ChatGPT) can actually solve math problems. They used a test called GSM8K, which contains math problems at a grade school level. While AI companies claim their models are getting better at solving these problems, the researchers weren't so sure these improvements were real.
To get to the bottom of this, they created a new and better testing method called GSM-Symbolic. Think of it like a template system - they could create many similar math problems by just changing the numbers or adding small details. This helped them understand what really confuses these AI models.
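To make the template idea concrete, here is a minimal sketch of my own (not the authors' code) showing how one symbolic template can generate many variants of the same grade-school problem by swapping names and numbers, while the underlying reasoning stays identical:

```python
import random

# A minimal sketch (not the paper's code) of a GSM-Symbolic-style template:
# one grade-school problem, many variants with different names and numbers.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with random surface details and return (question, answer)."""
    name = rng.choice(["Sofia", "Liam", "Mei", "Omar"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=name, x=x, y=y)
    return question, x + y  # the underlying reasoning never changes

rng = random.Random(0)
for question, answer in (make_variant(rng) for _ in range(5)):
    print(question, "->", answer)
```

If a model truly reasoned about the problem, its accuracy should be the same on every variant; the paper's point is that, in practice, it is not.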
They discovered some interesting things: simply swapping the names or numbers in a problem made the models' accuracy fluctuate, performance dropped as the questions grew longer, and adding a single irrelevant detail was often enough to throw the models off completely.
Their conclusion? AI models aren't actually "thinking through" math problems like humans do. Instead, they're trying to copy patterns they've seen before in their training data. When these patterns get slightly changed or become more complex, the AI gets confused.
It's like the difference between a student who really understands math versus one who just memorizes how to solve certain types of problems. When the questions look slightly different from what they've memorized, they struggle to adapt.
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
Imagine you're planning a trip - you need to think about booking flights, packing bags, arranging transportation, and more. This ability to plan steps to reach a goal is something we consider a key part of intelligence. Scientists have been trying to give computers this planning ability since the early days of AI.
When powerful AI language models (like ChatGPT) came along, researchers wanted to know: Can these AI models actually plan things properly? So in 2022, they created a testing tool called PlanBench to measure how well AI models can plan.
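To make "measuring planning" concrete, here is a toy sketch of my own (not the actual PlanBench harness) of the kind of check such a benchmark performs: take the plan a model proposes, simulate it step by step, and verify that it reaches the goal without breaking any rules.

```python
# A simplified sketch of plan validation (not the real PlanBench code):
# execute each proposed action on a symbolic state and check the goal holds at the end.
from typing import Callable, Optional

State = frozenset                             # the set of facts that currently hold
Action = Callable[[State], Optional[State]]   # returns the next state, or None if the action is illegal

def validate_plan(initial: State, goal: set, plan: list) -> bool:
    """Return True if executing `plan` from `initial` reaches a state satisfying `goal`."""
    state = initial
    for action in plan:
        nxt = action(state)
        if nxt is None:                # the action's preconditions were not met
            return False
        state = nxt
    return goal.issubset(state)        # every goal fact must hold in the final state

# Toy Blocksworld-style domain: stack block A on block B.
def stack_a_on_b(state: State) -> Optional[State]:
    if "clear(A)" in state and "clear(B)" in state:
        return (state - {"clear(B)", "on_table(A)"}) | {"on(A,B)"}
    return None

start = frozenset({"on_table(A)", "on_table(B)", "clear(A)", "clear(B)"})
print(validate_plan(start, {"on(A,B)"}, [stack_a_on_b]))  # True
```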
What's interesting is that even though we've seen many new AI models since then, they haven't gotten much better at planning. That is, until recently when OpenAI created a new model called 'o1' (nicknamed Strawberry). They claim it's different from regular AI models - instead of just predicting what words come next (like most AI models do), it's specifically designed to reason through problems. They're calling it a 'Large Reasoning Model' or LRM.
The researchers tested this new model against their PlanBench tool. The good news? It did much better than other AI models at planning tasks. The bad news? It still makes plenty of mistakes, and it is slower and far more expensive to run. This raises important questions about how much we can rely on it, and whether the extra accuracy justifies the extra cost.
It's like comparing different GPS navigation systems - while one might be better than others at finding routes, we still need to make sure it's reliable enough before using it to guide emergency vehicles.
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Scientists wanted to understand how well AI language models can actually solve problems that require step-by-step thinking. Think of it like following a recipe - you need to do each step in order, and doing one step wrong can mess up the whole dish.
They created a special test (called a benchmark) that's different from other AI tests in a clever way. Instead of letting the AI use its general knowledge or try different approaches, they give it very specific instructions - like a detailed recipe or instruction manual. This helps them focus on one specific thing: Can the AI follow multi-step instructions correctly?
Here's how their test works: the model is given an explicit, step-by-step procedure together with a question, and it has to carry out each step and write down the intermediate results. Because everything needed to solve the task is contained in the instructions, the test isolates one ability - following the procedure - and the researchers check every intermediate step, not just the final answer.
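As a rough illustration of what step-by-step checking could look like (a toy sketch of my own, not the ProcBench code), a scorer might compare the model's steps against the reference sequence and measure how far it gets before its first mistake:

```python
# A simplified sketch (not the ProcBench implementation) of step-level scoring:
# compare the model's intermediate steps against the reference sequence.
def step_scores(reference: list[str], predicted: list[str]) -> dict[str, float]:
    """Return exact-match and prefix-match scores for a multi-step answer."""
    exact = float(predicted == reference)
    correct_prefix = 0
    for ref, pred in zip(reference, predicted):
        if ref != pred:            # stop at the first wrong step
            break
        correct_prefix += 1
    prefix = correct_prefix / len(reference) if reference else 0.0
    return {"exact_match": exact, "prefix_accuracy": prefix}

reference = ["remove vowels", "reverse string", "uppercase result"]
predicted = ["remove vowels", "reverse string", "lowercase result"]
print(step_scores(reference, predicted))  # exact_match 0.0, prefix_accuracy ~0.67
```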
It's like testing a student's ability to follow directions by handing them a complete instruction manual and then marking their work line by line.
The researchers made all their testing materials freely available online for others to use. Their findings help show where current AI models are strong and weak at following step-by-step instructions, which is important for making better AI systems in the future.
Think of AI models like students who are good at memorizing but bad at understanding new concepts. Even when you give them clear, step-by-step instructions, they start making mistakes - and the longer the instructions get, the more mistakes they make. They struggle to 'think through' new problems, even when all the answers are in the instructions you gave them.
Large Language Models Can Be Zero-Shot Anomaly Detectors for Time Series?
Imagine you have a security camera that counts how many people enter a store each hour. Usually, it might see 20-30 people per hour, but suddenly one day it shows 500 people at 3 AM. That's clearly unusual - we call this an 'anomaly.' Finding these unusual patterns automatically is called 'anomaly detection.'
Scientists wanted to see if AI language models (like ChatGPT) could spot these weird patterns in data that changes over time. This was tricky for two reasons: language models read and write text rather than raw numbers, so the measurements first have to be converted into something the model can process, and long recordings quickly exceed the amount of text a model can handle in one go.
They created a system called SIGLLM that tries to solve this in two ways: the first approach simply asks the model, in a prompt, to point out the unusual parts of the series; the second asks the model to forecast what should come next and then flags the places where the real data and the forecast strongly disagree.
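The forecasting idea can be illustrated in a few lines. This is a deliberately simplified sketch (not the SIGLLM implementation), with a flat baseline standing in for the language model's forecast:

```python
import numpy as np

# A simplified sketch of the "forecast and compare" idea (not the SIGLLM code):
# flag the points where the real values and the forecast disagree unusually strongly.
def detect_anomalies(series: np.ndarray, forecast: np.ndarray, z_threshold: float = 2.0) -> np.ndarray:
    """Return the indices where the forecast error is an outlier."""
    errors = np.abs(series - forecast)
    z_scores = (errors - errors.mean()) / (errors.std() + 1e-9)
    return np.where(z_scores > z_threshold)[0]

# Toy example: hourly visitor counts with one obvious spike at 3 AM.
hourly_visitors = np.array([25, 28, 22, 27, 500, 26, 24, 23], dtype=float)
baseline_forecast = np.full_like(hourly_visitors, 25.0)  # stands in for the LLM's forecast
print(detect_anomalies(hourly_visitors, baseline_forecast))  # -> [4]
```

The design choice is the point: the model never needs to "know" what an anomaly is; it only needs to predict normal behaviour well enough that the surprises stand out.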
They tested these methods on 11 different datasets. What did they find? The forecasting-based approach clearly beat direct prompting, and while the language models could indeed find anomalies without any task-specific training, specialized deep-learning models still performed noticeably better - by roughly 30%.
It's like comparing a general family doctor to a specialist. While a family doctor (like the language model) can spot many problems, a specialist (like the specialized AI) who only focuses on one type of problem will usually do better at that specific task.
"A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners"
Imagine you have a student who gets good grades on math tests. But are they actually good at math, or are they just memorizing common patterns in test questions? This study tries to answer a similar question about AI language models.
The researchers wanted to know: Are these AI models actually reasoning through problems, or are they just spotting familiar patterns in the words (what they call 'token bias')?
Here's what they did: they took classic reasoning puzzles (such as the famous 'Linda problem') and systematically changed superficial details - names, numbers, wording - while leaving the underlying logic untouched, and then checked whether the models' answers changed along with those surface details.
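A toy version of this kind of perturbation test might look like the sketch below. It is my own illustration, not the paper's code; the ask_model argument is a placeholder for a real LLM call, and the fake_model stub is deliberately biased to show what inconsistency looks like.

```python
# A toy sketch of a token-perturbation test (not the paper's code).
from collections import Counter
from typing import Callable

QUESTION = (
    "{name} is 31, outspoken, and deeply concerned about social justice. "
    "Which is more probable: (a) {name} is a bank teller, or "
    "(b) {name} is a bank teller and is active in the feminist movement?"
)

def consistency(ask_model: Callable[[str], str], names: list[str]) -> float:
    """Fraction of name-perturbed variants that agree with the majority answer."""
    answers = [ask_model(QUESTION.format(name=n)) for n in names]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# A genuine reasoner should answer (a) no matter which name appears.
fake_model = lambda prompt: "a" if "Linda" in prompt else "b"   # a deliberately biased stub
print(consistency(fake_model, ["Linda", "Maria", "Akira", "Fatima"]))  # 0.75
```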
What did they find? Most AI models, even when they got the right answers, weren't really 'thinking' through the problems. Instead, they were mostly relying on recognizing patterns they'd seen before - like a student who memorizes that 'if you see this type of question, use this formula' without understanding why.
This is important because it shows that even when AI models give correct answers, they might not actually understand what they're doing. This means they might fail when faced with new problems that look different from what they've seen before.
My Conclusions: The Reality Check on Generative AI - Pattern Matching vs True Intelligence
Taken together, these studies reveal a consistent and important pattern: current AI models, including the most advanced ones, can appear remarkably capable on the surface, yet they show fundamental limitations in four critical areas:
Mathematical Reasoning: models imitate solution patterns from their training data rather than working through a problem, so small changes in names, numbers or wording are enough to derail them.
Planning Abilities: even the newest 'reasoning' models do better on planning benchmarks, but they still make too many mistakes to be relied on.
Following Instructions: accuracy degrades steadily as multi-step procedures get longer, even when every step is spelled out.
True Reasoning vs Pattern Matching: correct answers often come from recognizing familiar patterns of words, not from genuine understanding, which is why performance collapses on unfamiliar variations.
The Key Takeaway: These findings don't mean we should avoid using Generative AI. Instead, they highlight the importance of understanding its true nature - these are sophisticated pattern-matching systems, not true reasoning engines. Success in implementing AI solutions depends on recognizing these limitations and designing applications that work within them, rather than expecting human-like reasoning capabilities.
For businesses and organizations, this means setting realistic expectations, picking use cases where sophisticated pattern matching genuinely adds value, validating AI outputs rather than trusting them blindly, and keeping people in the loop wherever real reasoning, planning or judgement is required.
The path forward isn't about avoiding AI, but about using it wisely, with a clear understanding of both its capabilities and limitations.
Research & Engagement Manager | Building Strategic Partnerships to Drive Revenue & Client Success
This is a much-needed perspective on the real strengths and limitations of generative AI. The distinction between pattern-matching and true reasoning feels like the key to managing expectations - and investments. I’m curious, which area do you think will see breakthroughs first: AI’s ability to reason or its capacity for better planning? How close are we to overcoming these gaps in real-world applications?
Helping B2B businesses Scale without wasting ad spend on low-quality leads | Ex-Rocket Internet | Ex-CMO
Insightful perspective, Federico Cesconi!
CEO @ Cognopia | Are you ready to launch your new business idea, or do you need a fresh pair of eyes before committing to the launch? | Let's connect and talk
One aspect I dislike about the GenAI tools I'm using is that they seem to all be programmed to agree with me rather than challenge my thinking. You can try getting around this by giving it instructions to "use the black hat" but even then it feels like the tools are pulling punches. When you're trying to come up with a real strategy then it's always better to have a human in the loop that's happy to tell you when you're being an idiot.
30 Years Marketing | 25 Years Customer Experience | 20 Years Decisioning | Opinions my own
Federico Cesconi, a very thoughtful and welcome addition to the discussion on making Gen AI work. I am 100% aligned with you on the probabilistic nature of Gen AI and the opportunities and problems that brings. I am also 100% aligned with you that many, perhaps most, Gen AI projects have not been overwhelming successes. These are early days and we are still learning how Gen AI works, how to use it and what to best use it for. I think there is another challenge too, and that is the fact that like any general purpose technology, you need to develop a set of complementary capabilities - combinations of jobs, processes, data, other technologies, roles, collaboration, work systems, and governance - to get all the value from the enablement provided by Gen AI. We have hardly started doing this. Until we have, Gen AI will remain a Horizon 3, or at best a Horizon 2, technology. As part of a consulting project, I carried out an extensive literature review and thematic analysis to identify these complementary capabilities. Interestingly, but not unexpectedly, many of the capabilities were connected with socio-technical integration, i.e. balancing how we work with the tools we use to do work. Best regards, Graham