An Investigation into LLM Reasoning Capabilities (+ 'Zebra' puzzles & SMT Solvers)
With the (preview) release of GPT 4 Turbo, OpenAI has updated its Technical Report on GPT performance. The results are very impressive, with GPT scores on the LSAT (168/180) and the SAT (700/800) falling in the 90th percentile of human test-takers. These are but a few of the benchmarks mentioned in the report.
In my view, the human-level performance of LLMs is achieved through a combination of three factors: knowledge, a world model, and reasoning.
Each of these is described next.
Knowledge: The inner workings of Large Language Models (LLMs) are not yet fully understood. It is evident, however, that they encapsulate an extensive repository of knowledge and facts, owing to their training on a substantial corpus of existing written text.
World Model: LLMs are next-token (~word) prediction models. They produce the next token given an input sequence of tokens. The next token is then added to the input to produce the next-next token. This process is applied recursively to generate a sequence. (Note: LLMs actually output a distribution over all possible tokens, from which a probable token is selected as the next token).
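As an illustration, here is a schematic sketch of this autoregressive loop in Python. The model function is a hypothetical stand-in that returns a probability distribution over the vocabulary; it is not any particular implementation.

import random

def sample(distribution):
    # Pick a token according to its probability (distribution: token -> probability).
    tokens, probs = zip(*distribution.items())
    return random.choices(tokens, weights=probs, k=1)[0]

def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = model(tokens)       # distribution over all possible next tokens
        next_token = sample(dist)  # select a probable token as the next token
        tokens.append(next_token)  # feed it back in for the next step
    return tokens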
There is a debate as to whether such models are just statistical correlators ("stochastic parrots") or whether they contain a learned representation of a "world model". There is some evidence that LLMs may be learning a world model. By probing the weights of a simple LLM trained to make legal moves in the game Othello, Li et al. found evidence that the model learned identifiable concepts related to Othello. There are other studies to this effect. However, it is not really known how deep or authentic the learned world representation is.
My personal view is that it would be difficult for such LLMs to generate long, coherent sequences of text without some underlying 'understanding' of how the world works (i.e. they do contain a world model). However, the fact that LLMs are prone to hallucination suggests that these world models are far from complete.
Reasoning: High performance on standardized tests and other datasets implies that LLMs also have intrinsic reasoning abilities. Reasoning is a vast topic. Here the focus is on logical reasoning, which can be further categorized into deductive, inductive, and abductive reasoning. Please refer to this LLM logical reasoning survey paper by Luo et al. for details.
Context
In this post I explore the abstract logical reasoning capabilities of GPT 3.5 Turbo, GPT 4, and GPT 4 Turbo models on three abstract logical reasoning tasks of increasing complexity.
This is not a rigorous scientific experiment. My aim is to develop an expectation of the reasoning limits of SOTA LLMs. The background is that, in most organizational settings, these LLMs will be used in conjunction with internal company documents in a Retrieval Augmented Generation (RAG) type scenario. There are three aspects to consider:
The above aspects emphasize the capability for abstract logical reasoning: instead of relying solely on prior knowledge, the LLM must identify the facts / assertions / axioms in the provided content and reason from them to generate the answer.
Experiment
The three reasoning tasks selected for the experiment are listed below:
A) Easy - [source: Logi_glue, babi_task_15, id: 12]
1 Jessica is a wolf. 2 Gertrude is a sheep. 3 Sheep are afraid of wolves.
4 Wolves are afraid of mice. 5 Emily is a mouse. 6 Cats are afraid of
mice. 7 Mice are afraid of sheep. 8 Winona is a cat.
question: What is Winona afraid of?
B) Medium - [source: Logi_glue, natlang, id: AttNonegNatLang-CWA-85]
theory: Charlie is feeling cold -- he is also feeling quite rough, red,
nice and round. Fred seems to be round. A bit overweight, Gary is big and
round. He's a nice guy who is often cold. Folks who are rough and tumble
yet gentle and kind are blue. A rather big person is a red person. A kind
person will certainly be rough as well. Anyone having rough, cold and
green qualities will also have a big quality. People who are young and
blue are also red. People who are said to be big and nice are round. Cold
and red people are always kind to others.
statement: Charlie is kind.
question: Is the statement true given the theory?
C) Hard - [source: Wheels Around Town]
Four friends are cycling around the city on their bikes. Each one has a
different name and a different age, is riding a different bike, and has
brought a different sandwich.
Use the clues to find out their characteristics.
Clues:
The 12-year-old cyclist is in the third position.
The cyclist with the Yellow bike is immediately after Ivan.
The 14-year-old cyclist has a Cheese sandwich.
The boy with the Peanut Butter sandwich is directly before the boy with the Turkey sandwich.
Eric is the boy who has a Peanut Butter sandwich.
The 11-year-old boy is in the last position.
Cyclist Gregory is 14 years old.
The cyclist with the Red bike is next to the cyclist named David.
Ivan is the one riding the Blue bike.
The 14-year-old boy is in the second position.
The last one (C) is a variation of the famous Einstein 'Zebra' puzzle.
Each of the abstract reasoning tasks above was run against the three GPT models listed earlier. The system message (prompt) for all runs was:
You are a helpful AI assistant. Be brief but elaborate, if required. Let's
think step-by-step. Be sure about your answer, don't make things up.
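For reference, this is roughly how each task was submitted, sketched here with the OpenAI Python SDK; the model names and the snippet itself are illustrative rather than the exact script used.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a helpful AI assistant. Be brief but elaborate, if required. "
    "Let's think step-by-step. Be sure about your answer, don't make things up."
)

def ask(model, task_text):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": task_text},
        ],
    )
    return response.choices[0].message.content

# Example: run the easy task against each model.
for model_name in ["gpt-3.5-turbo", "gpt-4", "gpt-4-1106-preview"]:
    print(model_name, ask(model_name, "1 Jessica is a wolf. ... question: What is Winona afraid of?"))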
Experiment Results
The results for each of the tasks above are given below:
A) Easy
Winona is a cat, and according to statement 6, "Cats are afraid of mice."
Therefore, Winona is afraid of mice.
Winona, being a cat, is afraid of mice as stated in statement 6.
Winona is afraid of mice, according to statement 6.
B) Medium
Based on the given theory, it can be inferred that a kind person will
certainly be rough as well. However, there is no direct mention of Charlie
having rough qualities. Therefore, it cannot be concluded with certainty
whether Charlie is kind or not.
Yes, the statement is true given the theory. According to the theory,
Charlie is feeling cold, rough, red, nice, and round. The theory also
states that cold and red people are always kind to others. Since Charlie
is both cold and red, he must be kind.
According to the theory provided:
1. Charlie is feeling cold.
2. Charlie is also feeling quite rough, red, nice, and round.
From the theory, we can deduce the following:
- Cold and red people are always kind.
- A kind person will certainly be rough as well.
Since Charlie is described as both cold and red, and the theory states
that anyone who is cold and red is always kind, the statement "Charlie is
kind" is true based on the given theory.
C) Hard
Let's solve this step by step using the given clues:
1. The 12-year-old cyclist is in the third position.
2. The cyclist with the Yellow bike is immediately after Ivan.
3. The 14-year-old cyclist has a Cheese sandwich.
4. The boy with the Peanut Butter sandwich is directly before the boy with the Turkey sandwich.
5. Eric is the boy who has a Peanut Butter sandwich.
6. The 11-year-old boy is in the last position.
7. Cyclist Gregory is 14 years old.
8. The cyclist with the Red bike is next to the cyclist named David.
9. Ivan is the one riding the Blue bike.
10. The 14-year-old boy is in the second position.
From clues 7 and 10, we know Gregory is 14 years old and in the second
position with a Cheese sandwich (clue 3).
[...]
To summarize:
- David is 10, riding a Pink bike, and has a Roast Beef sandwich.
- Gregory is 14, riding a Red bike, and has a Cheese sandwich.
- Ivan is 12, riding a Blue bike, and has a Turkey sandwich.
- Eric is 11, riding a Yellow bike, and has a Peanut Butter sandwich.
Analysis and Summary
I attempted to solve abstract logical reasoning tasks of varying complexity with the currently available versions of GPT (3.5 Turbo, 4, and 4 Turbo). The aim was to understand the reasoning capabilities of these models in order to gauge their question-answering performance on out-of-sample and unfamiliar datasets.
GPT 4 (both versions) correctly solved the easy and medium tasks. GPT 3.5 Turbo solved only the easy task.
The hard zebra puzzle was first attempted with GPT 4 Turbo, the newest model. Because GPT 4 Turbo could not solve the puzzle, the task was not attempted with the other models.
For the hard task, GPT 4 Turbo generated a response that looked as if the model had solved the puzzle. On closer inspection, however, and after solving the puzzle with an SMT solver, it was discovered that the response was entirely incorrect. The output was a hallucination.
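For reference, a puzzle like this can be encoded as a set of constraints and handed to an SMT solver. The sketch below uses Z3's Python API and is an illustrative encoding rather than the exact one used: each attribute value gets an integer position from 1 to 4, values within a category must occupy distinct positions, and every clue becomes an arithmetic constraint. The fourth age, bike, and sandwich are not named in the clues above, so they appear as placeholders.

from z3 import Int, Solver, Distinct, Or, sat

names      = ["David", "Eric", "Gregory", "Ivan"]
ages       = ["11", "12", "14", "other_age"]             # fourth age not named in the clues
bikes      = ["Blue", "Red", "Yellow", "other_bike"]     # fourth bike not named in the clues
sandwiches = ["Cheese", "PeanutButter", "Turkey", "other_sandwich"]

s = Solver()
pos = {}
for category in (names, ages, bikes, sandwiches):
    for value in category:
        pos[value] = Int(value)
        s.add(pos[value] >= 1, pos[value] <= 4)
    s.add(Distinct(*[pos[value] for value in category]))  # one value per position in each category

s.add(pos["12"] == 3)                               # clue 1: 12-year-old is third
s.add(pos["Yellow"] == pos["Ivan"] + 1)             # clue 2: Yellow bike immediately after Ivan
s.add(pos["14"] == pos["Cheese"])                   # clue 3: 14-year-old has the Cheese sandwich
s.add(pos["PeanutButter"] + 1 == pos["Turkey"])     # clue 4: Peanut Butter directly before Turkey
s.add(pos["Eric"] == pos["PeanutButter"])           # clue 5: Eric has the Peanut Butter sandwich
s.add(pos["11"] == 4)                               # clue 6: 11-year-old is last
s.add(pos["Gregory"] == pos["14"])                  # clue 7: Gregory is 14
s.add(Or(pos["Red"] == pos["David"] + 1,            # clue 8: Red bike is next to David
         pos["Red"] == pos["David"] - 1))
s.add(pos["Ivan"] == pos["Blue"])                   # clue 9: Ivan rides the Blue bike
s.add(pos["14"] == 2)                               # clue 10: 14-year-old is second

if s.check() == sat:
    m = s.model()
    for p in range(1, 5):
        print(p, [v for v in pos if m[pos[v]].as_long() == p])

The solver either returns an assignment that satisfies every clue or reports that none exists, which makes it a convenient ground truth to check the model's answer against.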
This is a danger one has to be aware of as the models get better over time: they can produce wrong but increasingly convincing-looking output. This danger is also highlighted in the GPT Technical Report (referenced earlier).
When addressing question-answering tasks with unfamiliar datasets, particularly in a Retrieval-Augmented Generation (RAG) setting, models are generally able to handle questions of moderate complexity. However, they lack the ability to engage in complex reasoning. Users should exercise due diligence and not accept the answers provided by these models at face value, but rather verify their accuracy independently.