An Investigation into LLM Reasoning Capabilities (+ 'Zebra' puzzles & SMT Solvers)


With the (preview) release of GPT 4 Turbo, OpenAI has updated its Technical Report on GPT performance. The results are impressive: GPT's test scores on the LSAT (168/180) and SAT (700/800) place it around the 90th percentile of human test-takers. These are but a few of the benchmarks mentioned in the report.

Factors contributing to LLM performance

In my view, the human-level performance of LLMs arises from a combination of three factors:

  • Knowledge / facts
  • World model
  • Reasoning capabilities

Each of these is described next.

Knowledge: Large Language Models (LLMs) operate with a level of complexity that is not yet fully comprehended. However, it is evident that they encapsulate an extensive repository of knowledge and facts, attributed to their training on a substantial corpus of existing written texts.

World Model: LLMs are next-token (~word) prediction models. They produce the next token given an input sequence of tokens. That token is then appended to the input to produce the token after it, and the process is repeated to generate a sequence. (Note: LLMs actually output a distribution over all possible tokens, from which a probable token is selected as the next token.)
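
As a rough illustration (and not the actual GPT implementation), the generation loop looks something like the sketch below, where `model` is a placeholder that returns one score per vocabulary token for the current prefix:

# Illustrative sketch of autoregressive (next-token) generation.
# `model` is a placeholder: given a token prefix, it returns one
# score (logit) per vocabulary token. Not the actual GPT code.
import numpy as np

def generate(model, tokens, max_new_tokens=50, temperature=1.0):
    for _ in range(max_new_tokens):
        logits = model(tokens)                      # scores over the whole vocabulary
        z = np.exp((logits - logits.max()) / temperature)
        probs = z / z.sum()                         # softmax -> probability distribution
        next_token = int(np.random.choice(len(probs), p=probs))
        tokens = tokens + [next_token]              # the sampled token joins the context
    return tokens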

There is a debate as to whether such models are mere statistical correlators ("stochastic parrots") or whether they contain a learned representation of a "world model". There is some evidence that LLMs may be learning a world model. By probing the weights of a simple LLM trained to make legal moves in the game Othello, Li et al. found evidence that the model learned identifiable concepts related to Othello. There are other studies to this effect. However, it is not really known how deep or faithful the learned world representation is.

My personal view is that it would be difficult for such LLMs to generate long, coherent texts without some underlying 'understanding' of how the world works (i.e. they do contain a world model). However, the fact that LLMs are prone to hallucination suggests that these world models are far from complete.

Reasoning: High performance on standardized tests and other datasets implies that LLMs also have intrinsic reasoning abilities. Reasoning is a vast topic; here the focus is on logical reasoning, which can be further categorized into deductive, inductive and abductive reasoning. Please refer to this LLM logical reasoning survey paper by Luo et al. for details.

Context

In this post I explore the abstract logical reasoning capabilities of the GPT 3.5 Turbo, GPT 4, and GPT 4 Turbo models on three reasoning tasks of increasing complexity.

This is not a rigorous scientific experiment. My aim is to develop an expectation of the reasoning limits of SOTA LLMs. The background is that - in most organizational settings - these LLMs will be used in conjunction with internal company documents in a Retrieval Augmented Generation (RAG) type scenario. There are three aspects to consider:

  • This is likely data that the LLM has not seen before. The data a) may not align well with the LLM's world model and b) may contain cryptic terms that do not appear frequently enough in the LLM's training data for it to have developed a good grasp of their semantics.
  • Further, the users, though experts in their domain, may not possess the skills to extract the most out of the models through advanced prompting techniques such as few-shot learning. Reasoning ability should therefore be tested with a standard, generic prompt.
  • As domain experts, the users may ask complex questions, the answers to which may have to be assembled from the contents of multiple documents. The generation of correct answers may require the model to perform complex reasoning over the available content.

The above aspects emphasize the importance of abstract logical reasoning: instead of relying solely on prior knowledge, the LLM must identify the facts / assertions / axioms in the provided content and reason from them to generate the answer.

Experiment

The three reasoning tasks selected for the experiment are listed below:

A) Easy - [source: Logi_glue, babi_task_15, id: 12]

1 Jessica is a wolf. 2 Gertrude is a sheep. 3 Sheep are afraid of wolves. 
4 Wolves are afraid of mice. 5 Emily is a mouse. 6 Cats are afraid of 
mice. 7 Mice are afraid of sheep. 8 Winona is a cat. 

question: What is Winona afraid of?        

B) Medium - [source: Logi_glue, natlang, id: AttNonegNatLang-CWA-85]

theory: Charlie is feeling cold -- he is also feeling quite rough, red, 
nice and round. Fred seems to be round. A bit overweight, Gary is big and 
round. He's a nice guy who is often cold. Folks who are rough and tumble 
yet gentle and kind are blue. A rather big person is a red person. A kind 
person will certainly be rough as well. Anyone having rough, cold and 
green qualities will also have a big quality. People who are young and 
blue are also red. People who are said to be big and nice are round. Cold 
and red people are always kind to others. 

statement: Charlie is kind.

question: Is the statement true given the theory?        

C) Hard - [source: Wheels Around Town]

Wheels Around Town - Zebra Puzzle - Aha! Puzzles (ahapuzzles.com)

Four friends are cycling around the city on their bikes. Each one has a 
different name, a different age, is riding a different bike, and has 
brought a different sandwich. 

Use the clues to find out their characteristics.

Clues:
The 12-year-old cyclist is in the third position.
The cyclist with the Yellow bike is immediately after Ivan.
The 14-year-old cyclist has a Cheese sandwich.
The boy with the Peanut Butter sandwich is directly before the boy with the Turkey sandwich.
Eric is the boy who has a Peanut Butter sandwich.
The 11-year-old boy is in the last position.
Cyclist Gregory is 14 years old.
The cyclist with the Red bike is next to the cyclist named David.
Ivan is the one riding the Blue bike.
The 14-year-old boy is in the second position.        

The last one (C) is a variation of the famous Einstein 'Zebra' puzzle.

Each of the abstract reasoning tasks above was run against the three GPT models listed earlier. The system message (prompt) for all runs was:

You are a helpful AI assistant. Be brief but elaborate, if required. Let's 
think step-by-step. Be sure about your answer, don't make things up.        
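
For reference, each task was submitted as an ordinary chat completion with the system message above. The snippet below is a minimal sketch using the OpenAI Python client; the model name and surrounding plumbing are illustrative and may differ from the exact script used.

# Minimal sketch of submitting a task with the OpenAI Python client (openai >= 1.0).
# The model name and helper function are illustrative, not the exact experiment script.
from openai import OpenAI

SYSTEM_MESSAGE = (
    "You are a helpful AI assistant. Be brief but elaborate, if required. "
    "Let's think step-by-step. Be sure about your answer, don't make things up."
)

def ask(task_text: str, model: str = "gpt-4-1106-preview") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": task_text},
        ],
    )
    return response.choices[0].message.content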

Experiment Results

The results for each of the tasks above are given below:

A) Easy

  • Chat GPT 3.5 Turbo - CORRECT

Winona is a cat, and according to statement 6, "Cats are afraid of mice." 
Therefore, Winona is afraid of mice.        

  • Chat GPT 4 - CORRECT

Winona, being a cat, is afraid of mice as stated in statement 6.        

  • Chat GPT 4 Turbo - CORRECT

Winona is afraid of mice, according to statement 6.        
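
(As a trivial aside, the easy task is a one-hop lookup and can be reproduced in a couple of lines; the dictionaries below are just my own re-encoding of the eight facts, not the benchmark format.)

# One-hop deduction for task A, re-encoded as lookups (my own sketch).
species = {"Jessica": "wolf", "Gertrude": "sheep", "Emily": "mouse", "Winona": "cat"}
fears = {"sheep": "wolves", "wolf": "mice", "cat": "mice", "mouse": "sheep"}
print(fears[species["Winona"]])   # -> mice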

B) Medium

  • Chat GPT 3.5 Turbo - WRONG

Based on the given theory, it can be inferred that a kind person will 
certainly be rough as well. However, there is no direct mention of Charlie 
having rough qualities. Therefore, it cannot be concluded with certainty 
whether Charlie is kind or not.        

  • Chat GPT 4 - CORRECT

Yes, the statement is true given the theory. According to the theory, 
Charlie is feeling cold, rough, red, nice, and round. The theory also 
states that cold and red people are always kind to others. Since Charlie 
is both cold and red, he must be kind.        

  • Chat GPT 4 Turbo - CORRECT

According to the theory provided:

1. Charlie is feeling cold.
2. Charlie is also feeling quite rough, red, nice, and round.

From the theory, we can deduce the following:

- Cold and red people are always kind.
- A kind person will certainly be rough as well.

Since Charlie is described as both cold and red, and the theory states 
that anyone who is cold and red is always kind, the statement "Charlie is 
kind" is true based on the given theory.        

C) Hard

  • Chat GPT 3.5 Turbo - NOT ATTEMPTED
  • Chat GPT 4 - NOT ATTEMPTED
  • Chat GPT 4 Turbo - WRONG (HALLUCINATED)

Note: I used the Z3 SMT Solver to obtain the ground-truth solution for this task (to save time); a sketch of the encoding appears after the model's response below.

Let's solve this step by step using the given clues:

1. The 12-year-old cyclist is in the third position.
2. The cyclist with the Yellow bike is immediately after Ivan.
3. The 14-year-old cyclist has a Cheese sandwich.
4. The boy with the Peanut Butter sandwich is directly before the boy with the Turkey sandwich.
5. Eric is the boy who has a Peanut Butter sandwich.
6. The 11-year-old boy is in the last position.
7. Cyclist Gregory is 14 years old.
8. The cyclist with the Red bike is next to the cyclist named David.
9. Ivan is the one riding the Blue bike.
10. The 14-year-old boy is in the second position.

From clues 7 and 10, we know Gregory is 14 years old and in the second 
position with a Cheese sandwich (clue 3).

[...]

To summarize:

- David is 10, riding a Pink bike, and has a Roast Beef sandwich.
- Gregory is 14, riding a Red bike, and has a Cheese sandwich.
- Ivan is 12, riding a Blue bike, and has a Turkey sandwich.
- Eric is 11, riding a Yellow bike, and has a Peanut Butter sandwich.        

(full response here)
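
For reference, a sketch of the kind of Z3 encoding used is shown below (z3-solver Python package); the exact encoding may differ. The clues only name three of the four ages, bikes and sandwiches, so the fourth value of each attribute is modelled as a generic "other", and any age not fixed by a clue is left open. Z3 then finds an assignment of names, positions and clued attributes consistent with all ten clues.

# Sketch of a Z3 encoding of the "Wheels Around Town" puzzle.
# Only values named in the clues are modelled; the fourth bike/sandwich is a
# generic "other", and ages not pinned down by a clue are left open.
from z3 import Int, Solver, Distinct, Or, And, sat

names = ["Eric", "Gregory", "Ivan", "David"]
pos  = {n: Int(f"pos_{n}")  for n in names}   # riding position, 1..4
age  = {n: Int(f"age_{n}")  for n in names}   # age in years
bike = {n: Int(f"bike_{n}") for n in names}   # 0=Yellow 1=Red 2=Blue 3=other
sw   = {n: Int(f"sw_{n}")   for n in names}   # 0=Cheese 1=PeanutButter 2=Turkey 3=other
YELLOW, RED, BLUE = 0, 1, 2
CHEESE, PB, TURKEY = 0, 1, 2

s = Solver()
s.add(Distinct(*pos.values()), Distinct(*age.values()),
      Distinct(*bike.values()), Distinct(*sw.values()))
for n in names:
    s.add(pos[n] >= 1, pos[n] <= 4,
          bike[n] >= 0, bike[n] <= 3, sw[n] >= 0, sw[n] <= 3)

# The ten clues, in the order listed above.
s.add(Or(*[And(pos[n] == 3, age[n] == 12) for n in names]))                    # 1
s.add(Or(*[And(pos[n] == pos["Ivan"] + 1, bike[n] == YELLOW) for n in names])) # 2
s.add(Or(*[And(age[n] == 14, sw[n] == CHEESE) for n in names]))                # 3
s.add(Or(*[And(sw[a] == PB, sw[b] == TURKEY, pos[b] == pos[a] + 1)             # 4
           for a in names for b in names if a != b]))
s.add(sw["Eric"] == PB)                                                        # 5
s.add(Or(*[And(pos[n] == 4, age[n] == 11) for n in names]))                    # 6
s.add(age["Gregory"] == 14)                                                    # 7
s.add(Or(*[And(bike[n] == RED,                                                 # 8
               Or(pos[n] == pos["David"] + 1, pos[n] == pos["David"] - 1))
           for n in names]))
s.add(bike["Ivan"] == BLUE)                                                    # 9
s.add(Or(*[And(pos[n] == 2, age[n] == 14) for n in names]))                    # 10

if s.check() == sat:
    m = s.model()
    bikes = ["Yellow", "Red", "Blue", "other"]
    sws = ["Cheese", "Peanut Butter", "Turkey", "other"]
    for n in sorted(names, key=lambda x: m[pos[x]].as_long()):
        print(f"{m[pos[n]]}: {n}, age {m[age[n]]}, "
              f"{bikes[m[bike[n]].as_long()]} bike, {sws[m[sw[n]].as_long()]} sandwich")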

Analysis and Summary

Three abstract logical reasoning tasks of varying complexity were posed to the currently available versions of GPT (3.5 Turbo, 4, and 4 Turbo).

The aim was to understand the reasoning capabilities of these models in order to gauge their question-answering performance on out-of-sample and unfamiliar datasets.

GPT 4 (both versions) correctly solved the easy and medium puzzle tasks. GPT 3.5 turbo solved only the easy task.

The hard zebra puzzle was first attempted with GPT 4 Turbo, the newest model. Because GPT 4 Turbo could not solve the puzzle, the task was not attempted with the other models.

For the hard task, GPT 4 turbo generated a response that looked as if the model had solved the puzzle. However, on closer inspection, and after solving the puzzle with an SMT Solver, it was discovered that the response was entirely incorrect. The output was a hallucination.

This is a danger that one has to be aware of as the models get better over time: they can produce wrong but increasingly convincing-looking output. This danger is also highlighted in the GPT Technical Report (referenced earlier).

When addressing question-answering tasks with unfamiliar datasets, particularly in a Retrieval-Augmented Generation (RAG) setting, models are generally able to handle questions of moderate complexity. However, they lack the ability to engage in complex reasoning. Users should exercise due diligence and not accept the answers provided by these models at face value, but rather verify their accuracy independently.




