Can LLMs Truly Reason?
AIM Research
Apple seems intent on raining on the parade of AI companies that are spending all they have on making LLMs better at reasoning. A six-member research team at Apple recently published a paper titled ‘GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models’, which essentially argues that current LLMs cannot reason.
“…current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data,” read the paper. This includes LLMs like OpenAI’s GPT-4o and even the much-touted “thinking and reasoning” LLM, o1. The research was done on a series of other models as well, such as Llama, Phi, Gemma, and Mistral.
Mehrdad Farajtabar, a senior author of the paper, posted on X explaining how the team reached this conclusion. According to him, LLMs merely follow sophisticated patterns, and even models smaller than 3 billion parameters are now hitting scores on GSM8K, the grade-school math benchmark OpenAI released three years ago, that only much larger models could reach earlier.
The researchers introduced GSM-Symbolic, a new benchmark for testing mathematical reasoning in LLMs. Because GSM8K is a single fixed set of questions, it is no longer a reliable test of reasoning; GSM-Symbolic instead generates many variants of each question from symbolic templates.
Surprisingly, OpenAI’s o1 demonstrated “strong performance on various reasoning and knowledge-based benchmarks”, according to the researchers, but its accuracy dropped by 30% on the GSM-NoOp experiment, which adds seemingly relevant but ultimately irrelevant information to the questions.
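To make the setup concrete, here is a minimal sketch in Python of how a GSM-Symbolic-style benchmark could be generated. This is not the authors’ code; the template text, the names, and the NoOp clause are invented for illustration. Each GSM8K-style question becomes a template whose names and numbers are resampled, and a GSM-NoOp variant simply appends a clause that looks relevant but does not change the answer:

import random

# A minimal sketch of the GSM-Symbolic idea (not Apple's actual code):
# turn one fixed GSM8K-style question into many variants by resampling
# names and numbers, and optionally appending an irrelevant "NoOp" clause.

TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)
NOOP = " Five of the apples are slightly smaller than average."  # changes nothing

def make_variant(with_noop: bool = False) -> tuple[str, int]:
    name = random.choice(["Sophie", "Liam", "Mia", "Noah"])
    x, y = random.randint(2, 50), random.randint(2, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    if with_noop:
        question += NOOP  # the ground-truth answer stays x + y
    return question, x + y

question, answer = make_variant(with_noop=True)
print(question, "->", answer)

A model that genuinely reasons should score the same on every variant; the paper’s finding is that accuracy shifts when names or numbers change, and drops sharply when the NoOp clause appears.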
Still, o1’s strong baseline suggests the “reasoning” capabilities of OpenAI’s models are getting better, and maybe GPT-5 will be a step ahead. Or perhaps it’s just Apple’s own LLMs that don’t reason well, though the team didn’t put Apple’s models to the test.
Not everyone is happy with the research paper, either, as it never defines what “reasoning” actually means and simply introduces a new benchmark for evaluating LLMs.
The Tech World Goes Bonkers
“Overall, we found no evidence of formal reasoning in language models…their behaviour is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” Farajtabar further said that scaling these models would just result in ‘better pattern matchers’ but not ‘better reasoners’.
Some people have been claiming all along that LLMs cannot reason and are a dead end on the road to AGI. Maybe Apple has finally accepted this after trying out LLMs in its own products, and this is possibly also one of the reasons it backed out of its investment in OpenAI.
Many researchers have praised the paper and believe it is important for others to accept that LLMs cannot reason. Gary Marcus, a long-standing critic of LLMs, shared several examples of LLMs failing at reasoning tasks such as calculation and playing chess.
On the other hand, a problem with Apple’s paper is that it conflates reasoning with computation. “Reasoning is knowing an algorithm to solve a problem, not solving all of it in your head,” said Paras Chopra, an AI researcher, explaining that most LLMs know the approach to solving a problem even when they arrive at the wrong final answer. According to him, knowing the approach is enough to tell whether an LLM is reasoning, even if the answer is wrong.
Discussions on Hacker News suggest the Apple researchers were trying to pull a “gotcha!” on the LLMs by including irrelevant information in some of the questions, which the models cannot actively filter out. One commenter defined reasoning as the progressive, iterative reduction of informational entropy in a knowledge domain, arguing that OpenAI’s o1-preview does exactly that by introducing iteration. It’s not perfect, but it does it.
But Is This True? Do LLMs Not Reason?
Subbarao Kambhampati, a computer science and AI professor at ASU, agreed that some of the claims about LLMs being capable of reasoning are exaggerated. However, he said that LLMs need additional tools to handle System 2 (deliberate reasoning) tasks, for which techniques like fine-tuning or chain-of-thought prompting are not adequate.
When OpenAI released o1, claiming that the model thinks and reasons, Clem Delangue, the CEO of Hugging Face, was not impressed. “Once again, an AI system is not ‘thinking’, it’s ‘processing’, ‘running predictions’,… just like Google or computers do,” said Delangue, arguing that OpenAI was painting a false picture of what its newest model can achieve.
While some agreed, others argued that this is exactly how human brains work as well. “Once again, human minds aren’t ‘thinking’, they are just executing a complex series of bio-chemical/bio-electrical computing operations at massive scale,” Phillip Rhodes replied to Delangue.
To test reasoning, some people also ask LLMs how many Rs there are in the word ‘strawberry’, which is not a meaningful test. LLMs can’t count letters directly because they never see individual characters; they process text in chunks called “tokens”. Tests of reasoning have been problematic for LLMs ever since the models were created.
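A quick sketch with OpenAI’s tiktoken library shows what a model actually “sees”; the exact splits vary by tokenizer, and the ones in the comments are illustrative:

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models
tokens = enc.encode("strawberry")
print(tokens)                             # a few integer IDs, not ten letters
print([enc.decode([t]) for t in tokens])  # subword chunks, e.g. ['str', 'aw', 'berry']
# Counting the Rs means reasoning over chunk boundaries, not reading letters.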
Everyone seems to have strong opinions on LLMs. Some are grounded in the research of experts such as Yann LeCun or François Chollet, who argue that LLM research should be taken a bit more seriously; others just follow the hype, or criticise it. Some say LLMs are our ticket to AGI, while others think they’re just glorified text-producing algorithms with a fancy name.
Meanwhile, Andrej Karpathy recently said that next-token prediction, the technique these LLMs, or Transformers, are built on, might be able to solve many problems outside the realm where it is being used right now.
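At its core, that technique is a simple loop: score every token in the vocabulary, pick one, append it, repeat. Here is a minimal sketch using the Hugging Face transformers library, with GPT-2 standing in as a small example model:

# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The answer to 2 + 2 is", return_tensors="pt").input_ids
for _ in range(5):                      # generate five tokens, greedily
    with torch.no_grad():
        logits = model(ids).logits      # a score for every vocabulary token
    next_id = logits[0, -1].argmax()    # take the single most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))

Everything an LLM produces, “reasoning” included, comes out of this one loop; the debate is over whether that loop can ever amount to reasoning.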
While it seems true to some extent that LLMs can reason, when put to a rigorous test, they still end up failing it.
Enjoy the full story here.
Claude 3.5 Brushes Off Canvas with a Stroke of Code
Recently, OpenAI unveiled canvas, a new interface for working with ChatGPT on writing and coding projects. Many wonder if it’s better than Claude 3.5 Sonnet’s Artifacts. The answer is no.
Here’s why: canvas runs on GPT-4o, which is not better at coding than Claude 3.5 Sonnet. And while canvas has some good features for developers, like user collaboration and version control, it lacks critical ones such as code preview.
Read the full story here.
AI Bytes