The Artificial Investor - Issue 28: Can AI models reason like humans?

The most interesting story of the last few days was OpenAI’s launch of its new reasoning model series, o1. The two AI models (preview and mini) are designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and maths.

How strong are the model’s reasoning capabilities? What does this mean about how far we are from Superintelligence?


Inside the model’s mechanics

First things first: what is o1 and how does it work?

o1 is a large language model that specialises in answering complex questions that require reasoning capabilities. When tested by OpenAI, the model performed similarly to PhD students in natural science tests (e.g. GPQA diamond, an intelligence benchmark for expertise in chemistry, physics and biology), scored as high as humans in several AI model benchmarks (e.g. 78% on MMMU vs. 69% for GPT-4o), excelled in maths (83% on the American qualifying exam for the international maths olympiad vs. 13% for GPT-4o) and proved very strong in coding (reaching the 89th percentile in the human rankings of Codeforces’ coding competitions).

The model was trained with reinforcement learning to perform complex reasoning. It thinks before it answers and can produce a long internal Chain of Thought before responding to the user.

Reinforcement learning is a type of machine learning that enables a model to learn to make decisions to achieve a certain goal. The learning process is based on rewards and penalties that come from the model’s environment; in this setting, the model is also called an agent. The concept was introduced by Richard Bellman and others in the mid-1950s. A simple example is a dog in a backyard that learns to fetch a ball. Initially, the dog makes random movements: it explores. As the dog moves closer to the ball, the owner gives it a biscuit or praises it, i.e. she reinforces the dog’s learning. The process eventually culminates in the dog fetching the ball, i.e. reaching its goal. So, this OpenAI model was not simply trained on trillions of data points; rather, it was trained using challenges, alternative strategies and external feedback that rewarded it when it chose the right strategy.
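To make the idea concrete, here is a minimal, illustrative sketch of tabular Q-learning in Python, loosely modelled on the dog-and-ball example. The environment, reward and hyperparameters are our own assumptions for illustration and have nothing to do with how OpenAI actually trained o1.

```python
import random

# Toy reinforcement-learning sketch (tabular Q-learning): an agent starts at
# position 0 on a line of 10 cells and only receives a reward (the "biscuit")
# when it reaches the ball at cell 9.
N_STATES = 10
ACTIONS = [-1, +1]                 # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally; otherwise exploit the best-known action (ties broken randomly)
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: (Q[(state, a)], random.random()))
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update: nudge the estimate towards reward + discounted future value
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy should always step right, towards the ball
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```

The key point the sketch illustrates is that the agent is never shown the answer; it only receives feedback (the reward) when its chosen strategy works, which is the same principle the article describes for o1’s training.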

Chain of Thought is one of the main techniques used with LLMs to develop autonomous “thinking” with no or limited human feedback. The technique involves generating a series of intermediate steps before arriving at a final answer, mimicking the step-by-step thought process a human might use when working through a complex problem. A few months after ChatGPT was launched, we saw the first public implementations of the technique with AgentGPT and BabyAGI, on the back of a research paper that discussed the matter. A simple example of Chain of Thought is planning a day out with your family. First, you acknowledge the current situation: it’s sunny outside, so an outdoor activity might be nice. Then, you consider additional factors: it’s quite hot, so you should do something with shade or water to stay cool. Then, you think of the alternatives: going to the park or visiting a swimming pool. After thinking through all the factors (weather, temperature, activity) and comparing the alternatives, the final decision is to go to the swimming pool.
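For illustration, here is what explicitly prompting a model to produce a chain of thought can look like with the OpenAI Python SDK. The model name and prompt wording are placeholders; o1 itself performs this kind of reasoning internally without being asked.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# The "think step by step" instruction asks the model to write out intermediate
# reasoning steps (factors, alternatives) before committing to a final answer.
prompt = (
    "It is sunny but very hot today. We can either go to the park or to the "
    "swimming pool. Think step by step: list the relevant factors, weigh the "
    "two options against each other, and only then state your final choice."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```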

Now, everyone is talking about reasoning. What is reasoning anyway? How do we measure it?

The invisible engine

Reasoning is generally defined as the process of thinking logically about something to form a conclusion or judgement. There are several types of reasoning:

  • Deductive: All humans are mortal. Bill is a human. So, Bill is mortal.
  • Inductive: Roses have bloomed the last three springs. Roses will bloom next spring.
  • Abductive: This plate of food is half-eaten and hot. So, the person eating it will return soon.
  • Analogical: Learning to play the piano is like learning to type. At first, it feels awkward, but with practice, your fingers will know where to go without thinking.
  • Cause and effect: It rained last night, so the streets are wet this morning.

There are several ways to measure the reasoning ability of humans, including SHL’s deductive reasoning tests, Kenexa’s logical reasoning tests, the GMAT, certain sections of the various international maths and coding competitions, etc. With AI models, things are a bit different. Overall, LLM benchmarks are problematic. First, most of them focus on testing knowledge, either generalist (common sense) or specialist (physics, biology, etc.), which doesn’t correlate well with reasoning. A model that has been pre-trained on (i.e. has memorised) biology textbooks would score well in many LLM benchmarks, but this doesn’t mean that it can reason. An example of such a test is the MMMU benchmark that OpenAI references on its o1 launch page. Even the 2024 US maths olympiad test includes many questions where knowledge of mathematical theory plays a key role. Second, benchmark questions are typically public. It would be very tempting to train a model on similar and paraphrased questions, right? Third, it has been shown that it’s relatively easy to game the benchmarks by fooling their antifraud mechanisms. Nvidia’s Dr. Jim Fan argues that the only ways to trust a benchmark are if it is either 1) based on the votes of thousands of users, or 2) run privately by an independent third party in a well-curated and secure way.

So, how can we tell that a model can reason? Is o1 a good reasoning model?

Putting o1 to the test

Since we don’t fully trust other benchmarks, particularly when it comes to reasoning, we put OpenAI’s model to the test ourselves. We carried out a couple of tests: i) solving a challenging crossword puzzle, originally introduced by Ethan Mollick, and ii) answering two challenging US maths olympiad questions.

Solving a difficult crossword

To speed up the exercise, we took a section of a crossword puzzle from stanxwords.com. The challenging part of this crossword, in relation to reasoning, is that many of the clues have a double meaning. As o1 does not currently process images, we converted the image into text instructions using GPT-4’s image recognition capabilities.
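As a rough illustration of that conversion step (the model name, prompt and image URL below are placeholders, not our exact setup), an image can be passed to a vision-capable model like this:

```python
from openai import OpenAI

client = OpenAI()

# Ask a vision-capable model to transcribe the crossword grid and clues as text,
# so that a text-only model such as o1 can then work on them.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for a vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this crossword grid and its clues as plain text."},
            {"type": "image_url", "image_url": {"url": "https://example.com/crossword.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```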

The very first clue, 1 Down, is quite challenging for a model as it’s very metaphorical: it talks about galaxies, but the answer is “Apps”, a reference to the Samsung Galaxy phone. The model guessed that one incorrectly and went on to score a mere 25% (2/8). However, after helping it out a bit by giving it the answer to 1 Down, it tried again, revisiting all of its answers one by one and looking for alternatives. It took about 100 seconds, followed a very logical process, and the result was very good: 7/8, i.e. 87.5% (interestingly, the missing word was due to a hallucination). I was very impressed! Needless to say, GPT-4 did much worse, guessing only 25% of the words correctly, even after being given the 1 Down tip.

Below is the answer to the crossword puzzle in case you want to play with it:

Answering reasoning questions for the gold medal

I then looked up the 2024 AIME (the American Invitational Mathematics Examination, a qualifier on the path to the international maths olympiad) and focused only on the questions that require no prior maths knowledge and are 100% focused on logical thinking. This is to avoid the noise of the competitive advantage a model can gain by “memorising” hundreds of thousands of pages of maths textbooks. These were problems 3 and 6:

  1. Problem 3: Alice and Bob play the following game. A stack of n tokens lies before them. The players take turns with Alice going first. On each turn, the player removes either 1 token or 4 tokens from the stack. Whoever removes the last token wins. Find the number of positive integers n less than or equal to 2024 for which there exists a strategy for Bob that guarantees that Bob will win the game regardless of Alice's play.
  2. Problem 6: Consider the paths of length 16 that follow the lines from the lower left corner to the upper right corner on an 8x8 grid. Find the number of such paths that change direction exactly four times, as in the examples shown below.

In Problem 3, o1 started by understanding the game (positions, goal), then moved on to identifying an appropriate strategy, and then started making calculations. It provided the final answer, 809, within 20 seconds. In contrast, GPT-4 was much faster but gave a wrong answer. In Problem 6, o1 followed a similar process, took about 30 seconds and provided the correct answer, 294. GPT-4 was again much faster but gave a wrong answer. Once again, a very impressive outcome for o1.
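Both answers are easy to sanity-check with a small brute-force script (our own verification sketch, nothing to do with how o1 reasons):

```python
from itertools import combinations

# Problem 3: count n <= 2024 for which the second player (Bob) can force a win.
# win[n] is True if the player about to move with n tokens can force a win.
N = 2024
win = [False] * (N + 1)   # with 0 tokens left, the player to move has already lost
for n in range(1, N + 1):
    win[n] = any(not win[n - m] for m in (1, 4) if m <= n)
print(sum(1 for n in range(1, N + 1) if not win[n]))   # 809

# Problem 6: 16-step lattice paths (8 rights, 8 ups) with exactly 4 direction changes.
count = 0
for rights in combinations(range(16), 8):      # positions of the 8 "right" moves
    path = ['R' if i in rights else 'U' for i in range(16)]
    if sum(a != b for a, b in zip(path, path[1:])) == 4:
        count += 1
print(count)                                   # 294
```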

Our conclusion is that OpenAI’s o1 has strong reasoning skills and probably beats many humans on challenging problem-solving questions. Obviously, the second part of our test would be even more robust if we tested the model on a freshly released set of maths olympiad questions, to ensure it has not seen any relevant material before.

The path towards Superintelligence

When OpenAI released ChatGPT to the public in November 2022, everyone started talking about how close we are to AGI (Artificial General Intelligence) or even Superintelligence. As scientists and analysts started spending more time with LLMs, they realised that LLMs have the following shortcomings when compared to human intelligence:

  • Lack of forward planning: When we talk, we often have a goal in mind. We think about what we want to say next, sometimes planning several steps ahead. But LLMs don't work like this. They focus on the moment, choosing the next word based only on what's been said before.
  • Lack of risk taking: LLMs tend to play it safe: as they're built to predict the most likely next word based on their training, they usually go for the option that's been seen most often. Humans, on the other hand, can decide to take a creative leap or introduce a twist that no one sees coming.
  • Partial creativity: LLMs can generate text that is creative by putting words together in new ways, i.e. they are remixing bits and pieces of what they've been trained on. However, humans are capable of imagination, i.e. thinking of things that don't exist yet or that they've never experienced. This is the equivalent of a chef remixing a defined set of ingredients (LLMs) vs. a chef who is coming up with a completely new ingredient (humans).
  • Lack of true understanding of language: LLMs learn complex associations between words, but do not form the same structured, flexible, multimodal representations of word meaning that humans do. It has been demonstrated that LLMs don’t really understand the meaning of their prompts.
  • Limited ways to learn: LLMs do not form memory the way humans do. They generate text that sounds fine grammatically and semantically, but they don’t really have any objective other than satisfying statistical consistency with the prompt. Humans, by contrast, operate on a lot of knowledge that is never written down, such as customs, beliefs, or practices within a community that are acquired through observation or experience, or the tacit knowledge a skilled craftsperson has of their craft. As a result, LLMs hallucinate, which, by the way, was also the case in one of the tests we ran on o1.
  • Lack of emotions: Probably no explanation needed here.
  • Susceptible to adversarial attacks: LLMs are subject to adversarial attacks, techniques designed to manipulate or exploit model vulnerabilities in order to produce undesired or malicious outputs. Humans aren’t subject to these attacks (although they are indeed subject to conversation-based fraud).

o1 seems to take AI a step forward towards human intelligence by embedding planning to enable better reasoning. Nevertheless, it is obvious that we are still very far from human intelligence, let alone Superintelligence. In addition to the remaining shortcomings vs. human intelligence, OpenAI’s latest reasoning model also lacks some of the capabilities of GPT-4, such as image and voice recognition, and it’s much slower at answering simple questions.

We believe we are so far away that it becomes obvious that foundation models in their current form cannot be the basis for general or super intelligence; new, innovative approaches are needed. The fact that o1 is stronger than GPT-4 in some areas but weaker in others also makes us think that the “end game” will likely be a mix of models with an orchestrator model on top, choosing which model to use depending on the situation, similar to how the human brain chooses between short-term and long-term memory depending on the circumstances.
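A toy sketch of that orchestrator idea might look like the routing function below; the routing rule and model names are purely hypothetical.

```python
# Purely hypothetical orchestrator: route a query to a slower reasoning model
# or a faster general-purpose model based on simple cues in the question.
def route(question: str) -> str:
    reasoning_markers = ("prove", "puzzle", "step by step", "how many")
    if any(marker in question.lower() for marker in reasoning_markers):
        return "o1-preview"   # slower, but stronger at multi-step reasoning
    return "gpt-4o"           # faster, multimodal, fine for everyday queries

print(route("How many 16-step paths change direction exactly four times?"))  # o1-preview
print(route("Summarise this email in two sentences."))                        # gpt-4o
```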


Fun things to impress at the dinner table

Too early to call victory? Waymo’s robotaxis have been responsible for fewer car accidents than human drivers in San Francisco and Phoenix.

The Force never dies. James Earl Jones passed away at the age of 93 last week. But in 2022 he gave Lucasfilm permission to create an AI clone of Darth Vader’s voice.

See you next week for more AI insights.

Daniel Kempe

Co-Founder & CEO at Quuu.co

2 months ago

You need to give superprompt a try in Claude system instructions. https://github.com/NeoVertex1/SuperPrompt

Bastien Seignolles

Co-founder, COO I AI-run adaptive investing II Wine enthusiast, passionate about education

2 months ago

Great article Aristotelis Xenofontos. Super interesting and insightful. Any idea on the cost invested to deliver o1?

Nicolò Carpaneda

Founder & CTO || adaptive investing

2 months ago

Such an interesting read, thanks Aristotelis!! Really impressive to read about o1 advancements, and so much fun to see the results obtained from your direct testing with well-prepared and difficult questions to see how the system works. Loved it!! Best AI newsletter out there? *Breathe* AGI still not here.

Godwin Josh

Co-Founder of Altrosyn and Director at CDTECH | Inventor | Manufacturer

2 months ago

The emphasis on "chain of thought" in o1 suggests an attempt to mimic human reasoning processes. However, current AI models still struggle with complex, multi-step reasoning tasks that require common sense and world knowledge. Can o1's chain of thought approach be effectively integrated with external knowledge bases to enhance its reasoning capabilities in real-world scenarios like legal document analysis?
