An Investigation into LLM Reasoning Capabilities (+ 'Zebra' puzzles & SMT Solvers)


With the (preview) release of GPT 4 Turbo, OpenAI has updated its Technical Report on GPT performance. The results are impressive: GPT's test scores on the LSAT (168/180) and SAT (700/800) place it around the 90th percentile of human test-takers. These are but a few of the benchmarks mentioned in the report.

Factors contributing to LLM performance

In my view, the human-level performance of LLMs arises from a combination of three factors:

  • Knowledge / facts
  • World model
  • Reasoning capabilities

Each of these is described next.

Knowledge: Large Language Models (LLMs) operate with a level of complexity that is not yet fully comprehended. However, it is evident that they encapsulate an extensive repository of knowledge and facts, attributed to their training on a substantial corpus of existing written texts.

World Model: LLMs are next-token (~word) prediction models. They produce the next token given an input sequence of tokens. That token is then appended to the input to produce the token after it, and the process is repeated to generate a sequence. (Note: LLMs actually output a distribution over all possible tokens, from which a probable token is selected as the next token.)
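
As a rough illustration (and not the actual GPT implementation), the generation loop looks something like the sketch below, where `model` is a placeholder that returns one score per vocabulary token for the current prefix:

# Illustrative sketch of autoregressive (next-token) generation.
# `model` is a placeholder: given a token prefix, it returns one
# score (logit) per vocabulary token. Not the actual GPT code.
import numpy as np

def generate(model, tokens, max_new_tokens=50, temperature=1.0):
    for _ in range(max_new_tokens):
        logits = model(tokens)                      # scores over the whole vocabulary
        z = np.exp((logits - logits.max()) / temperature)
        probs = z / z.sum()                         # softmax -> probability distribution
        next_token = int(np.random.choice(len(probs), p=probs))
        tokens = tokens + [next_token]              # the sampled token joins the context
    return tokens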

There is a debate as to whether such models are mere statistical correlators ("stochastic parrots") or whether they contain a learned representation of a "world model". There is some evidence that LLMs may be learning a world model. By probing the weights of a simple LLM trained to make legal moves in the game Othello, Li et al. found evidence that the model learned identifiable concepts related to Othello. There are other studies to this effect. However, it is not really known how deep or faithful the learned world representation is.

My personal view is that it would be difficult for such LLMs to generate long, coherent texts without some underlying 'understanding' of how the world works (i.e. they do contain a world model). However, the fact that LLMs are prone to hallucination suggests that these world models are far from complete.

Reasoning: High performance on standardized tests and other datasets implies that LLMs also have intrinsic reasoning abilities. Reasoning is a vast topic; here the focus is on logical reasoning, which can be further categorized into deductive, inductive and abductive reasoning. Please refer to this LLM logical reasoning survey paper by Luo et al. for details.

Context

In this post I explore the abstract logical reasoning capabilities of the GPT 3.5 Turbo, GPT 4, and GPT 4 Turbo models on three reasoning tasks of increasing complexity.

This is not a rigorous scientific experiment. My aim is to develop an expectation of the reasoning limits of SOTA LLMs. The background is that - in most organizational settings - these LLMs will be used in conjunction with internal company documents in a Retrieval Augmented Generation (RAG) type scenario. There are three aspects to consider:

  • This is likely data that the LLM has not seen before. The data a) may not align well with the LLM's world model and b) may contain cryptic terms that do not appear frequently enough in the LLM's training data for it to have developed a good grasp of their semantics.
  • Further, the users, though experts in their domain, may not possess the skills to extract the most out of the models through advanced prompting techniques such as few-shot learning. Reasoning ability should therefore be tested with a standard, generic prompt.
  • As domain experts, the users may ask complex questions, the answers to which may have to be assembled from the contents of multiple documents. The generation of correct answers may require the model to perform complex reasoning over the available content.

The above aspects emphasize the importance of abstract logical reasoning: instead of relying solely on prior knowledge, the LLM must identify the facts / assertions / axioms in the provided content and reason from them to generate the answer.

Experiment

The three reasoning tasks selected for the experiment are listed below:

A) Easy - [source: Logi_glue, babi_task_15, id: 12]

1 Jessica is a wolf. 2 Gertrude is a sheep. 3 Sheep are afraid of wolves. 
4 Wolves are afraid of mice. 5 Emily is a mouse. 6 Cats are afraid of 
mice. 7 Mice are afraid of sheep. 8 Winona is a cat. 

question: What is Winona afraid of?        

B) Medium - [source: Logi_glue, natlang, id: AttNonegNatLang-CWA-85]

theory: Charlie is feeling cold -- he is also feeling quite rough, red, 
nice and round. Fred seems to be round. A bit overweight, Gary is big and 
round. He's a nice guy who is often cold. Folks who are rough and tumble 
yet gentle and kind are blue. A rather big person is a red person. A kind 
person will certainly be rough as well. Anyone having rough, cold and 
green qualities will also have a big quality. People who are young and 
blue are also red. People who are said to be big and nice are round. Cold 
and red people are always kind to others. 

statement: Charlie is kind.

question: Is the statement true given the theory?        

C) Hard - [source: Wheels Around Town]

Wheels Around Town - Zebra Puzzle - Aha! Puzzles (ahapuzzles.com)

Four friends are cycling around the city on their bikes. Each one has a 
different name, a different age, is riding a different bike, and has 
brought a different sandwich. 

Use the clues to find out their characteristics.

Clues:
The 12-year-old cyclist is in the third position.
The cyclist with the Yellow bike is immediately after Ivan.
The 14-year-old cyclist has a Cheese sandwich.
The boy with the Peanut Butter sandwich is directly before the boy with the Turkey sandwich.
Eric is the boy who has a Peanut Butter sandwich.
The 11-year-old boy is in the last position.
Cyclist Gregory is 14 years old.
The cyclist with the Red bike is next to the cyclist named David.
Ivan is the one riding the Blue bike.
The 14-year-old boy is in the second position.        

The last one (C) is a variation of the famous Einstein 'Zebra' puzzle.

Each of the abstract reasoning tasks above was run against the three GPT models listed earlier. The system message (prompt) for all runs was:

You are a helpful AI assistant. Be brief but elaborate, if required. Let's 
think step-by-step. Be sure about your answer, don't make things up.        
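
For reference, each task was submitted as an ordinary chat completion with the system message above. The snippet below is a minimal sketch using the OpenAI Python client; the model name and surrounding plumbing are illustrative and may differ from the exact script used.

# Minimal sketch of submitting a task with the OpenAI Python client (openai >= 1.0).
# The model name and helper function are illustrative, not the exact experiment script.
from openai import OpenAI

SYSTEM_MESSAGE = (
    "You are a helpful AI assistant. Be brief but elaborate, if required. "
    "Let's think step-by-step. Be sure about your answer, don't make things up."
)

def ask(task_text: str, model: str = "gpt-4-1106-preview") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": task_text},
        ],
    )
    return response.choices[0].message.content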

Experiment Results

The results for each of the tasks above are given below:

A) Easy

  • Chat GPT 3.5 Turbo - CORRECT

Winona is a cat, and according to statement 6, "Cats are afraid of mice." 
Therefore, Winona is afraid of mice.        

  • Chat GPT 4 - CORRECT

Winona, being a cat, is afraid of mice as stated in statement 6.        

  • Chat GPT 4 Turbo - CORRECT

Winona is afraid of mice, according to statement 6.        
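
(As a trivial aside, the easy task is a one-hop lookup and can be reproduced in a couple of lines; the dictionaries below are just my own re-encoding of the eight facts, not the benchmark format.)

# One-hop deduction for task A, re-encoded as lookups (my own sketch).
species = {"Jessica": "wolf", "Gertrude": "sheep", "Emily": "mouse", "Winona": "cat"}
fears = {"sheep": "wolves", "wolf": "mice", "cat": "mice", "mouse": "sheep"}
print(fears[species["Winona"]])   # -> mice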

B) Medium

  • Chat GPT 3.5 Turbo - WRONG

Based on the given theory, it can be inferred that a kind person will 
certainly be rough as well. However, there is no direct mention of Charlie 
having rough qualities. Therefore, it cannot be concluded with certainty 
whether Charlie is kind or not.        

  • Chat GPT 4 - CORRECT

Yes, the statement is true given the theory. According to the theory, 
Charlie is feeling cold, rough, red, nice, and round. The theory also 
states that cold and red people are always kind to others. Since Charlie 
is both cold and red, he must be kind.        

  • Chat GPT 4 Turbo - CORRECT

According to the theory provided:

1. Charlie is feeling cold.
2. Charlie is also feeling quite rough, red, nice, and round.

From the theory, we can deduce the following:

- Cold and red people are always kind.
- A kind person will certainly be rough as well.

Since Charlie is described as both cold and red, and the theory states 
that anyone who is cold and red is always kind, the statement "Charlie is 
kind" is true based on the given theory.        

C) Hard

  • Chat GPT 3.5 Turbo - NOT ATTEMPTED
  • Chat GPT 4 - NOT ATTEMPTED
  • Chat GPT 4 Turbo - WRONG (HALLUCINATED)

Note: I used the Z3 SMT Solver to obtain the ground-truth solution for this task (to save time); a sketch of the encoding appears after the model's response below.

Let's solve this step by step using the given clues:

1. The 12-year-old cyclist is in the third position.
2. The cyclist with the Yellow bike is immediately after Ivan.
3. The 14-year-old cyclist has a Cheese sandwich.
4. The boy with the Peanut Butter sandwich is directly before the boy with the Turkey sandwich.
5. Eric is the boy who has a Peanut Butter sandwich.
6. The 11-year-old boy is in the last position.
7. Cyclist Gregory is 14 years old.
8. The cyclist with the Red bike is next to the cyclist named David.
9. Ivan is the one riding the Blue bike.
10. The 14-year-old boy is in the second position.

From clues 7 and 10, we know Gregory is 14 years old and in the second 
position with a Cheese sandwich (clue 3).

[...]

To summarize:

- David is 10, riding a Pink bike, and has a Roast Beef sandwich.
- Gregory is 14, riding a Red bike, and has a Cheese sandwich.
- Ivan is 12, riding a Blue bike, and has a Turkey sandwich.
- Eric is 11, riding a Yellow bike, and has a Peanut Butter sandwich.        

(full response here)
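
For reference, a sketch of the kind of Z3 encoding used is shown below (z3-solver Python package); the exact encoding may differ. The clues only name three of the four ages, bikes and sandwiches, so the fourth value of each attribute is modelled as a generic "other", and any age not fixed by a clue is left open. Z3 then finds an assignment of names, positions and clued attributes consistent with all ten clues.

# Sketch of a Z3 encoding of the "Wheels Around Town" puzzle.
# Only values named in the clues are modelled; the fourth bike/sandwich is a
# generic "other", and ages not pinned down by a clue are left open.
from z3 import Int, Solver, Distinct, Or, And, sat

names = ["Eric", "Gregory", "Ivan", "David"]
pos  = {n: Int(f"pos_{n}")  for n in names}   # riding position, 1..4
age  = {n: Int(f"age_{n}")  for n in names}   # age in years
bike = {n: Int(f"bike_{n}") for n in names}   # 0=Yellow 1=Red 2=Blue 3=other
sw   = {n: Int(f"sw_{n}")   for n in names}   # 0=Cheese 1=PeanutButter 2=Turkey 3=other
YELLOW, RED, BLUE = 0, 1, 2
CHEESE, PB, TURKEY = 0, 1, 2

s = Solver()
s.add(Distinct(*pos.values()), Distinct(*age.values()),
      Distinct(*bike.values()), Distinct(*sw.values()))
for n in names:
    s.add(pos[n] >= 1, pos[n] <= 4,
          bike[n] >= 0, bike[n] <= 3, sw[n] >= 0, sw[n] <= 3)

# The ten clues, in the order listed above.
s.add(Or(*[And(pos[n] == 3, age[n] == 12) for n in names]))                    # 1
s.add(Or(*[And(pos[n] == pos["Ivan"] + 1, bike[n] == YELLOW) for n in names])) # 2
s.add(Or(*[And(age[n] == 14, sw[n] == CHEESE) for n in names]))                # 3
s.add(Or(*[And(sw[a] == PB, sw[b] == TURKEY, pos[b] == pos[a] + 1)             # 4
           for a in names for b in names if a != b]))
s.add(sw["Eric"] == PB)                                                        # 5
s.add(Or(*[And(pos[n] == 4, age[n] == 11) for n in names]))                    # 6
s.add(age["Gregory"] == 14)                                                    # 7
s.add(Or(*[And(bike[n] == RED,                                                 # 8
               Or(pos[n] == pos["David"] + 1, pos[n] == pos["David"] - 1))
           for n in names]))
s.add(bike["Ivan"] == BLUE)                                                    # 9
s.add(Or(*[And(pos[n] == 2, age[n] == 14) for n in names]))                    # 10

if s.check() == sat:
    m = s.model()
    bikes = ["Yellow", "Red", "Blue", "other"]
    sws = ["Cheese", "Peanut Butter", "Turkey", "other"]
    for n in sorted(names, key=lambda x: m[pos[x]].as_long()):
        print(f"{m[pos[n]]}: {n}, age {m[age[n]]}, "
              f"{bikes[m[bike[n]].as_long()]} bike, {sws[m[sw[n]].as_long()]} sandwich")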

Analysis and Summary

Three abstract logical reasoning tasks of varying complexity were posed to the currently available versions of GPT (3.5 Turbo, 4, and 4 Turbo).

The aim was to understand the reasoning capabilities of these models in order to gauge their question-answering performance on out-of-sample and unfamiliar datasets.

GPT 4 (both versions) correctly solved the easy and medium puzzle tasks. GPT 3.5 turbo solved only the easy task.

The hard zebra puzzle was first attempted with GPT 4 Turbo, the newest model. Because GPT 4 Turbo could not solve the puzzle, the task was not attempted with the other models.

For the hard task, GPT 4 turbo generated a response that looked as if the model had solved the puzzle. However, on closer inspection, and after solving the puzzle with an SMT Solver, it was discovered that the response was entirely incorrect. The output was a hallucination.

This is a danger that one has to be aware of as the models get better over time: they can produce wrong but increasingly convincing-looking output. This danger is also highlighted in the GPT Technical Report (referenced earlier).

When addressing question-answering tasks with unfamiliar datasets, particularly in a Retrieval-Augmented Generation (RAG) setting, models are generally able to handle questions of moderate complexity. However, they lack the ability to engage in complex reasoning. Users should exercise due diligence and not accept the answers provided by these models at face value, but rather verify their accuracy independently.




