Lies, Damned Lies, and Generative AI
Federico Cesconi
Founder & CEO @sandsiv, the number one CXM solution powered by AI | Author | In love with NLP using transformers
I was attending the CXPA Apero webinar yesterday when my friend Gregorio Uglioni brought up a question: why do Generative AI projects fail? It wasn't the planned topic for the session, but it immediately captured everyone's attention.
As usual, he tried to catch me off guard, but I was well prepared because in recent weeks I had been immersed in extensive research for this article, reading numerous academic papers about the fundamental challenges facing Generative AI. This is certainly a hot topic, and its timing couldn't be more relevant - we're seeing a wave of Generative AI projects struggling or failing outright, despite the immense hype and investment surrounding this technology. The stark contrast between AI's promised potential and its practical limitations has become increasingly apparent. The reality of what Generative AI can and cannot do must be assessed precisely before launching any new venture, as the cost of misunderstanding these limitations can be substantial, both in terms of resources and organisational credibility.
Let me be clear from the start, especially to my friend Graham Hill (Dr G) - this article isn't meant to criticize the use of Large Language Models. In fact, using LLMs and experiencing failures is an essential part of the innovation process. However, what's crucial is understanding the real limitations and boundaries of this technology. Without this understanding, the risk of major project failures becomes incredibly high.
Think of it like learning to fly a new aircraft - you need to know both its capabilities and its limitations. Just because a plane has limitations in certain weather conditions doesn't mean we shouldn't fly at all. Instead, knowing these limitations helps us use the aircraft safely and effectively.
Similarly with LLMs, success depends on having a realistic understanding of what these models can and cannot do. This knowledge isn't meant to discourage innovation, but rather to guide it more effectively and avoid costly mistakes.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Scientists wanted to check how well AI language models (like ChatGPT) can actually solve math problems. They used a test called GSM8K, which contains math problems at a grade school level. While AI companies claim their models are getting better at solving these problems, the researchers weren't so sure these improvements were real.
To get to the bottom of this, they created a new and better testing method called GSM-Symbolic. Think of it like a template system - they could create many similar math problems by just changing the numbers or adding small details. This helped them understand what really confuses these AI models.
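To make the template idea concrete, here is a minimal sketch of my own (not the authors' code) showing how one symbolic template can generate many variants of the same grade-school problem by swapping names and numbers, while the underlying reasoning stays identical:

```python
import random

# A minimal sketch (not the paper's code) of a GSM-Symbolic-style template:
# one grade-school problem, many variants with different names and numbers.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with random surface details and return (question, answer)."""
    name = rng.choice(["Sofia", "Liam", "Mei", "Omar"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=name, x=x, y=y)
    return question, x + y  # the underlying reasoning never changes

rng = random.Random(0)
for question, answer in (make_variant(rng) for _ in range(5)):
    print(question, "->", answer)
```

If a model truly reasoned about the problem, its accuracy should be the same on every variant; the paper's point is that, in practice, it is not.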
They discovered some interesting things: simply swapping the names or numbers in a problem made the models' accuracy fluctuate, performance dropped as the questions grew longer, and adding a single irrelevant detail was often enough to throw the models off completely.
Their conclusion? AI models aren't actually "thinking through" math problems like humans do. Instead, they're trying to copy patterns they've seen before in their training data. When these patterns get slightly changed or become more complex, the AI gets confused.
It's like the difference between a student who really understands math versus one who just memorizes how to solve certain types of problems. When the questions look slightly different from what they've memorized, they struggle to adapt.
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
Imagine you're planning a trip - you need to think about booking flights, packing bags, arranging transportation, and more. This ability to plan steps to reach a goal is something we consider a key part of intelligence. Scientists have been trying to give computers this planning ability since the early days of AI.
When powerful AI language models (like ChatGPT) came along, researchers wanted to know: Can these AI models actually plan things properly? So in 2022, they created a testing tool called PlanBench to measure how well AI models can plan.
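To make "measuring planning" concrete, here is a toy sketch of my own (not the actual PlanBench harness) of the kind of check such a benchmark performs: take the plan a model proposes, simulate it step by step, and verify that it reaches the goal without breaking any rules.

```python
# A simplified sketch of plan validation (not the real PlanBench code):
# execute each proposed action on a symbolic state and check the goal holds at the end.
from typing import Callable, Optional

State = frozenset                             # the set of facts that currently hold
Action = Callable[[State], Optional[State]]   # returns the next state, or None if the action is illegal

def validate_plan(initial: State, goal: set, plan: list) -> bool:
    """Return True if executing `plan` from `initial` reaches a state satisfying `goal`."""
    state = initial
    for action in plan:
        nxt = action(state)
        if nxt is None:                # the action's preconditions were not met
            return False
        state = nxt
    return goal.issubset(state)        # every goal fact must hold in the final state

# Toy Blocksworld-style domain: stack block A on block B.
def stack_a_on_b(state: State) -> Optional[State]:
    if "clear(A)" in state and "clear(B)" in state:
        return (state - {"clear(B)", "on_table(A)"}) | {"on(A,B)"}
    return None

start = frozenset({"on_table(A)", "on_table(B)", "clear(A)", "clear(B)"})
print(validate_plan(start, {"on(A,B)"}, [stack_a_on_b]))  # True
```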
What's interesting is that even though we've seen many new AI models since then, they haven't gotten much better at planning. That is, until recently when OpenAI created a new model called 'o1' (nicknamed Strawberry). They claim it's different from regular AI models - instead of just predicting what words come next (like most AI models do), it's specifically designed to reason through problems. They're calling it a 'Large Reasoning Model' or LRM.
The researchers tested this new model against their PlanBench tool. The good news? It did much better than other AI models at planning tasks. The bad news? It still makes plenty of mistakes, and it is slower and far more expensive to run. This raises important questions about how much we can rely on it, and whether the extra accuracy justifies the extra cost.
It's like comparing different GPS navigation systems - while one might be better than others at finding routes, we still need to make sure it's reliable enough before using it to guide emergency vehicles.
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Scientists wanted to understand how well AI language models can actually solve problems that require step-by-step thinking. Think of it like following a recipe - you need to do each step in order, and doing one step wrong can mess up the whole dish.
They created a special test (called a benchmark) that's different from other AI tests in a clever way. Instead of letting the AI use its general knowledge or try different approaches, they give it very specific instructions - like a detailed recipe or instruction manual. This helps them focus on one specific thing: Can the AI follow multi-step instructions correctly?
Here's how their test works: the model is given an explicit, step-by-step procedure together with a question, and it has to carry out each step and write down the intermediate results. Because everything needed to solve the task is contained in the instructions, the test isolates one ability - following the procedure - and the researchers check every intermediate step, not just the final answer.
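As a rough illustration of what step-by-step checking could look like (a toy sketch of my own, not the ProcBench code), a scorer might compare the model's steps against the reference sequence and measure how far it gets before its first mistake:

```python
# A simplified sketch (not the ProcBench implementation) of step-level scoring:
# compare the model's intermediate steps against the reference sequence.
def step_scores(reference: list[str], predicted: list[str]) -> dict[str, float]:
    """Return exact-match and prefix-match scores for a multi-step answer."""
    exact = float(predicted == reference)
    correct_prefix = 0
    for ref, pred in zip(reference, predicted):
        if ref != pred:            # stop at the first wrong step
            break
        correct_prefix += 1
    prefix = correct_prefix / len(reference) if reference else 0.0
    return {"exact_match": exact, "prefix_accuracy": prefix}

reference = ["remove vowels", "reverse string", "uppercase result"]
predicted = ["remove vowels", "reverse string", "lowercase result"]
print(step_scores(reference, predicted))  # exact_match 0.0, prefix_accuracy ~0.67
```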
It's like testing a student's ability to follow directions by handing them a complete instruction manual and then marking their work line by line.
The researchers made all their testing materials freely available online for others to use. Their findings help show where current AI models are strong and weak at following step-by-step instructions, which is important for making better AI systems in the future.
Think of AI models like students who are good at memorizing but bad at understanding new concepts. Even when you give them clear, step-by-step instructions, they start making mistakes - and the longer the instructions get, the more mistakes they make. They struggle to 'think through' new problems, even when all the answers are in the instructions you gave them.
Large Language Models Can Be Zero-Shot Anomaly Detectors for Time Series?
Imagine you have a security camera that counts how many people enter a store each hour. Usually, it might see 20-30 people per hour, but suddenly one day it shows 500 people at 3 AM. That's clearly unusual - we call this an 'anomaly.' Finding these unusual patterns automatically is called 'anomaly detection.'
Scientists wanted to see if AI language models (like ChatGPT) could spot these weird patterns in data that changes over time. This was tricky for two reasons: language models read and write text rather than raw numbers, so the measurements first have to be converted into something the model can process, and long recordings quickly exceed the amount of text a model can handle in one go.
They created a system called SIGLLM that tries to solve this in two ways: the first approach simply asks the model, in a prompt, to point out the unusual parts of the series; the second asks the model to forecast what should come next and then flags the places where the real data and the forecast strongly disagree.
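The forecasting idea can be illustrated in a few lines. This is a deliberately simplified sketch (not the SIGLLM implementation), with a flat baseline standing in for the language model's forecast:

```python
import numpy as np

# A simplified sketch of the "forecast and compare" idea (not the SIGLLM code):
# flag the points where the real values and the forecast disagree unusually strongly.
def detect_anomalies(series: np.ndarray, forecast: np.ndarray, z_threshold: float = 2.0) -> np.ndarray:
    """Return the indices where the forecast error is an outlier."""
    errors = np.abs(series - forecast)
    z_scores = (errors - errors.mean()) / (errors.std() + 1e-9)
    return np.where(z_scores > z_threshold)[0]

# Toy example: hourly visitor counts with one obvious spike at 3 AM.
hourly_visitors = np.array([25, 28, 22, 27, 500, 26, 24, 23], dtype=float)
baseline_forecast = np.full_like(hourly_visitors, 25.0)  # stands in for the LLM's forecast
print(detect_anomalies(hourly_visitors, baseline_forecast))  # -> [4]
```

The design choice is the point: the model never needs to "know" what an anomaly is; it only needs to predict normal behaviour well enough that the surprises stand out.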
They tested these methods on 11 different datasets. What did they find? The forecasting-based approach clearly beat direct prompting, and while the language models could indeed find anomalies without any task-specific training, specialized deep-learning models still performed noticeably better - by roughly 30%.
It's like comparing a general family doctor to a specialist. While a family doctor (like the language model) can spot many problems, a specialist (like the specialized AI) who only focuses on one type of problem will usually do better at that specific task.
"A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners"
Imagine you have a student who gets good grades on math tests. But are they actually good at math, or are they just memorizing common patterns in test questions? This study tries to answer a similar question about AI language models.
The researchers wanted to know: Are these AI models actually reasoning through problems, or are they just spotting familiar patterns in the words (what they call 'token bias')?
Here's what they did: they took classic reasoning puzzles (such as the famous 'Linda problem') and systematically changed superficial details - names, numbers, wording - while leaving the underlying logic untouched, and then checked whether the models' answers changed along with those surface details.
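A toy version of this kind of perturbation test might look like the sketch below. It is my own illustration, not the paper's code; the ask_model argument is a placeholder for a real LLM call, and the fake_model stub is deliberately biased to show what inconsistency looks like.

```python
# A toy sketch of a token-perturbation test (not the paper's code).
from collections import Counter
from typing import Callable

QUESTION = (
    "{name} is 31, outspoken, and deeply concerned about social justice. "
    "Which is more probable: (a) {name} is a bank teller, or "
    "(b) {name} is a bank teller and is active in the feminist movement?"
)

def consistency(ask_model: Callable[[str], str], names: list[str]) -> float:
    """Fraction of name-perturbed variants that agree with the majority answer."""
    answers = [ask_model(QUESTION.format(name=n)) for n in names]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# A genuine reasoner should answer (a) no matter which name appears.
fake_model = lambda prompt: "a" if "Linda" in prompt else "b"   # a deliberately biased stub
print(consistency(fake_model, ["Linda", "Maria", "Akira", "Fatima"]))  # 0.75
```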
What did they find? Most AI models, even when they got the right answers, weren't really 'thinking' through the problems. Instead, they were mostly relying on recognizing patterns they'd seen before - like a student who memorizes that 'if you see this type of question, use this formula' without understanding why.
This is important because it shows that even when AI models give correct answers, they might not actually understand what they're doing. This means they might fail when faced with new problems that look different from what they've seen before.
My Conclusions: The Reality Check on Generative AI - Pattern Matching vs True Intelligence
Taken together, these studies reveal a consistent and important pattern: current AI models, including the most advanced ones, can appear remarkably capable on the surface, yet they show fundamental limitations in four critical areas:
Mathematical Reasoning: models imitate solution patterns from their training data rather than working through a problem, so small changes in names, numbers or wording are enough to derail them.
Planning Abilities: even the newest 'reasoning' models do better on planning benchmarks, but they still make too many mistakes to be relied on.
Following Instructions: accuracy degrades steadily as multi-step procedures get longer, even when every step is spelled out.
True Reasoning vs Pattern Matching: correct answers often come from recognizing familiar patterns of words, not from genuine understanding, which is why performance collapses on unfamiliar variations.
The Key Takeaway: These findings don't mean we should avoid using Generative AI. Instead, they highlight the importance of understanding its true nature - these are sophisticated pattern-matching systems, not true reasoning engines. Success in implementing AI solutions depends on recognizing these limitations and designing applications that work within them, rather than expecting human-like reasoning capabilities.
For businesses and organizations, this means setting realistic expectations, picking use cases where sophisticated pattern matching genuinely adds value, validating AI outputs rather than trusting them blindly, and keeping people in the loop wherever real reasoning, planning or judgement is required.
The path forward isn't about avoiding AI, but about using it wisely, with a clear understanding of both its capabilities and limitations.
Research & Engagement Manager | Building Strategic Partnerships to Drive Revenue & Client Success
This is a much-needed perspective on the real strengths and limitations of generative AI. The distinction between pattern-matching and true reasoning feels like the key to managing expectations - and investments. I’m curious, which area do you think will see breakthroughs first: AI’s ability to reason or its capacity for better planning? How close are we to overcoming these gaps in real-world applications?
Helping B2B businesses Scale without wasting ad spend on low-quality leads | Ex-Rocket Internet | Ex-CMO
Insightful perspective, Federico Cesconi!
CEO @ Cognopia | Are you ready to launch your new business idea, or do you need a fresh pair of eyes before committing to the launch? | Let's connect and talk
One aspect I dislike about the GenAI tools I'm using is that they seem to all be programmed to agree with me rather than challenge my thinking. You can try getting around this by giving it instructions to "use the black hat" but even then it feels like the tools are pulling punches. When you're trying to come up with a real strategy then it's always better to have a human in the loop that's happy to tell you when you're being an idiot.
30 Years Marketing | 25 Years Customer Experience | 20 Years Decisioning | Opinions my own
Federico Cesconi, a very thoughtful and welcome addition to the discussion on making Gen AI work. I am 100% aligned with you on the probabilistic nature of Gen AI and the opportunities and problems that brings. I am also 100% aligned with you that many, perhaps most, Gen AI projects have not been overwhelming successes. These are early days and we are still learning how Gen AI works, how to use it and what to best use it for. I think there is another challenge too, and that is the fact that like any general purpose technology, you need to develop a set of complementary capabilities - combinations of jobs, processes, data, other technologies, roles, collaboration, work systems, and governance - to get all the value from the enablement provided by Gen AI. We have hardly started doing this. Until we have, Gen AI will remain a Horizon 3, or at best a Horizon 2, technology. As part of a consulting project, I carried out an extensive literature review and thematic analysis to identify these complementary capabilities. Interestingly, but not unexpectedly, many of the capabilities were connected with socio-technical integration, i.e. balancing how we work with the tools we use to do work. Best regards, Graham