The Artificial Investor - Issue 28: Can AI models reason like humans?
The most interesting story of the last few days was OpenAI’s launch of its new reasoning model series, o1. The two AI models (preview and mini) are designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and maths.
How strong are the model’s reasoning capabilities? And what does this tell us about how far we are from Superintelligence?
Inside the model’s mechanics
First things first: what is o1 and how does it work?
o1 is a large language model that specialises in answering complex questions that require reasoning capabilities. In OpenAI’s own tests, the model performed similarly to PhD students on natural science tests (e.g. GPQA Diamond, a benchmark of expert-level knowledge in chemistry, physics and biology), scored as high as humans on several AI model benchmarks (e.g. 78% on MMMU vs. 69% for GPT-4o), excelled in maths (83% on the American qualifying exam for the international maths olympiad vs. 13% for GPT-4o) and proved very strong in coding (reaching the 89th percentile in the human rankings of Codeforces’ coding competitions).
The model was trained with reinforcement learning to perform complex reasoning. It thinks before it answers and can produce a long internal Chain of Thought before responding to the user.
Reinforcement learning is a type of machine learning that enables a model to learn to make decisions to achieve a certain goal. The learning process is based on rewards and penalties that come from the model’s environment; in this setting, the model is also called an agent. The concept was introduced by Richard Bellman and others in the mid-1950s. A simple example is a dog in a backyard that learns to fetch a ball. Initially the dog makes random movements; it explores. As the dog moves closer to the ball, the dog’s owner gives it a biscuit or shouts encouragement, i.e. she reinforces the dog’s learning. The process eventually culminates in the dog fetching the ball, i.e. reaching its goal. So, this OpenAI model was not simply trained on trillions of data points; rather, it was trained using challenges, alternative strategies and external feedback that rewarded it when it chose the right strategy.
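To make the reward-and-penalty loop concrete, here is a minimal, self-contained Python sketch of the dog-and-ball analogy using a simple value-learning update. The 1-D backyard, the +/-1 rewards and the learning rate are purely illustrative assumptions and have nothing to do with OpenAI’s actual training setup.

```python
import random

# Illustrative sketch: an agent on a 1-D "backyard" learns to reach the ball at position 10.
BALL = 10          # position of the ball
ACTIONS = [-1, 1]  # step left or step right
q = {}             # learned value estimates for (position, action) pairs


def choose(pos, eps=0.2):
    """Epsilon-greedy: usually exploit the best-known action, sometimes explore."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((pos, a), 0.0))


for episode in range(200):
    pos = 0
    while pos != BALL:
        action = choose(pos)
        new_pos = min(BALL, max(0, pos + action))
        # the "biscuit": reward moves that bring the agent closer to the ball
        reward = 1.0 if abs(BALL - new_pos) < abs(BALL - pos) else -1.0
        best_next = max(q.get((new_pos, a), 0.0) for a in ACTIONS)
        old = q.get((pos, action), 0.0)
        q[(pos, action)] = old + 0.1 * (reward + 0.9 * best_next - old)
        pos = new_pos

# After training, the greedy policy should always step towards the ball (+1)
print([max(ACTIONS, key=lambda a: q.get((p, a), 0.0)) for p in range(BALL)])
```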
Chain of Thought is one of the main techniques used with LLMs to develop autonomous “thinking” with no or limited human feedback. The technique involves generating a series of intermediate steps before arriving at a final answer, mimicking the step-by-step thought process a human might use when working through a complex problem. A few months after ChatGPT was launched, we saw the first public implementations of the technique in AgentGPT and BabyAGI, on the back of a research paper that discussed the matter. A simple example of Chain of Thought is planning a day out with your family. First, you acknowledge the current situation: it’s sunny outside, so an outdoor activity might be nice. Then you consider additional factors: it’s quite hot, so you should do something with shade or water to stay cool. Then you think of the alternatives: going to the park or visiting a swimming pool. After thinking through all the factors (weather, temperature, activity) and comparing the alternatives, the final decision is to go to the swimming pool.
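To make the idea concrete, below is a minimal sketch of chain-of-thought prompting with the OpenAI Python SDK. The model name and prompt wording are our own illustrative assumptions and say nothing about how o1 reasons internally.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and an API key is configured

client = OpenAI()

question = "It is sunny and quite hot outside. Should the family go to the park or the swimming pool?"

# Instead of asking for the answer directly, ask the model to lay out intermediate steps first.
cot_prompt = (
    question
    + "\nThink step by step: first state the relevant facts (weather, temperature), "
      "then list the candidate activities, compare each one against the facts, "
      "and only then give the final decision on its own line."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```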
Now, everyone is talking about reasoning. What is reasoning anyway? How do we measure it?
The invisible engine
Reasoning is generally defined as the process of thinking logically about something to form a conclusion or judgement. There are several types of reasoning, such as deductive, inductive and abductive reasoning.
There are several ways to measure the reasoning ability of humans, including SHL’s deductive reasoning tests, Kenexa’s logical reasoning tests, the GMAT, certain sections of the various international maths and coding competitions, etc. With AI models, things are a bit different. Overall, LLM benchmarks are problematic. First, most of them focus on testing knowledge, either generalist (common sense) or specialist (physics, biology, etc.), which doesn’t correlate well with reasoning. A model that has been pre-trained on (i.e. has memorised) biology textbooks would score well on many LLM benchmarks, but that doesn’t mean it can reason. An example of such a test is the MMMU benchmark that OpenAI references on its o1 launch page. Even the 2024 US maths olympiad qualifier includes many questions where knowledge of mathematical theory plays a key role. Second, benchmark questions are typically public, so it is very tempting to train a model on similar and paraphrased questions. Third, it has proven relatively easy to game benchmarks by fooling their anti-fraud mechanisms. Nvidia’s Dr. Jim Fan argues that the only two ways to trust a benchmark are if it is either 1) based on the votes of thousands of users, or 2) run privately by an independent third party in a well-curated and secure way.
So, how can we tell that a model can reason? Is o1 a good reasoning model?
Putting o1 to the test
Since we don’t fully trust existing benchmarks, particularly when it comes to reasoning, we put OpenAI’s model to the test ourselves. We carried out two tests: i) solving a challenging crossword puzzle, an exercise originally introduced by Ethan Mollick, and ii) answering two challenging US maths olympiad qualifying questions.
Solving a difficult crossword
To speed up the exercise, we took a section of a crossword puzzle from stanxwords.com. What makes this crossword challenging from a reasoning perspective is that many of the clues have a double meaning. As o1 does not currently process images, we converted the image into text instructions using GPT-4’s image recognition capabilities.
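For reference, a conversion like this can be sketched with the OpenAI Python SDK as below; the model name, prompt and image URL are placeholders rather than the exact calls we used.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are set up

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this crossword section as plain text: list the grid size, "
                     "the numbered cells, and every Across and Down clue."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/crossword-section.png"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)  # plain-text instructions to pass to o1
```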
The very first clue, 1 Down, is quite challenging for a model because it is very metaphorical: it talks about galaxies, but the answer is “Apps”, a reference to the Samsung Galaxy phone. The model guessed it incorrectly and went on to score a mere 25% (2/8). However, after we helped it out a bit by giving it the answer to 1 Down, it tried again, revisiting all of its answers one by one and looking for alternatives. It took about 100 seconds, followed a very logical process, and the result was very good: 7/8, i.e. 87.5% (interestingly, the one missing word was due to a hallucination). I was very impressed. Needless to say, GPT-4 did much worse, guessing only 25% of the words correctly even after being given the first-clue tip.
Below is the answer to the crossword puzzle in case you want to play with it:
Answering reasoning questions for the gold medal
I then looked up the 2024 AIME (the American qualifying exam for the maths olympiad) and focused only on the questions that require zero prior maths knowledge and are 100% focused on logical thinking. This avoids the noise of the competitive advantage a model can gain by “memorising” hundreds of thousands of pages of maths textbooks. These were Problems 3 and 6:
In Problem 3, o1 started by understanding the game (positions, goal), then moved on to identifying an appropriate strategy, and then started making calculations. It provided the final answer, 809, within about 20 seconds. In contrast, GPT-4 was much faster but gave a wrong answer. In Problem 6, o1 followed a similar process, took about 30 seconds and provided the correct answer, 294. GPT-4 was again much faster but wrong. Once again, a very impressive outcome for o1.
Our conclusion is that OpenAI’s o1 has strong reasoning skills and probably beats many humans at challenging problem-solving questions. Obviously, the second part of our test would be even more robust if we had tested the model on a newly released set of olympiad questions, to ensure it had no prior exposure to them.
The path towards Superintelligence
When OpenAI released ChatGPT to the public in November 2022, everyone started talking about how close we are to AGI (Artificial General Intelligence) or even Superintelligence. As scientists and analysts spent more time with LLMs, they realised that these models have a number of shortcomings when compared to human intelligence.
o1 seems to help AI take a step forward towards human intelligence by embedding planning to enable better reasoning. Nevertheless, it is obvious that we are still very far from human intelligence, let alone Superintelligence. In addition to the remaining shortcomings versus human intelligence, OpenAI’s latest reasoning model also lacks some of GPT-4’s capabilities, such as image and voice recognition, and it is much slower at answering simple questions.
We believe the gap is so large that it becomes obvious that foundation models in their current form cannot be the basis for general or super intelligence; new, innovative approaches are needed. The fact that o1 is stronger than GPT-4 in some areas but weaker in others also makes us think that the “end game” will likely be a mix of models with an orchestrator model on top, choosing which model to use depending on the situation, similar to how the human brain chooses between short-term and long-term memory depending on the circumstances.
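As a rough illustration of what such an orchestrator could look like, here is a minimal routing sketch; the model names and the keyword-based rule are purely hypothetical assumptions, and a real orchestrator would itself likely be a model classifying each request.

```python
# Hypothetical sketch of an orchestrator that routes requests to the best-suited model.

def route(task: str) -> str:
    """Pick a model for a request based on a crude classification of the task."""
    text = task.lower()
    multimodal_markers = ("image", "photo", "diagram", "voice", "audio")
    reasoning_markers = ("prove", "solve", "plan", "step by step", "puzzle")
    if any(m in text for m in multimodal_markers):
        return "gpt-4o"        # handles vision and voice in this hypothetical setup
    if any(m in text for m in reasoning_markers):
        return "o1-preview"    # slower, but stronger at multi-step reasoning
    return "gpt-4o-mini"       # cheap, fast default for simple questions


print(route("Solve this crossword puzzle step by step"))  # -> o1-preview
print(route("Describe what is shown in this image"))      # -> gpt-4o
print(route("What is the capital of France?"))            # -> gpt-4o-mini
```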
Fun things to impress at the dinner table
Too early to call victory? Waymo’s robotaxis have been responsible for fewer car accidents than human drivers in San Francisco and Phoenix.
The Force never dies. James Earl Jones passed away last week at the age of 93, but in 2022 he gave Lucasfilm permission to create an AI clone of Darth Vader’s voice.
See you next week for more AI insights.