We Need to Rethink Chain-of-Thought (CoT) Prompting - AI&YOU #68
Should you use CoT Prompting?


Stat of the Week: Zero-shot CoT performance was only 5.55% for GPT-4-Turbo, 8.51% for Claude-3-Opus, and 4.44% for GPT-4. ("Chain of Thoughtlessness?" paper)


Chain-of-Thought (CoT) prompting has been hailed as a breakthrough in unlocking the reasoning capabilities of large language models (LLMs). However, recent research has challenged these claims and prompted us to revisit the technique.


In this week's edition of AI&YOU, we are exploring insights from three blogs we published on the topic:


  • What is Chain of Thought (CoT) Prompting?
  • AI Research Paper Breakdown: "Chain of Thoughtlessness?"
  • 10 Best Prompting Techniques for LLMs



What is Chain-of-Thought Prompting?



LLMs demonstrate remarkable capabilities in natural language processing (NLP) and generation. However, when faced with complex reasoning tasks, these models can struggle to produce accurate and reliable results. This is where Chain-of-Thought (CoT) prompting comes into play, a technique that aims to enhance the problem-solving abilities of LLMs.


CoT prompting is an advanced prompt engineering technique designed to guide LLMs through a step-by-step reasoning process. Unlike standard prompting methods that aim for direct answers, CoT prompting encourages the model to generate intermediate reasoning steps before arriving at a final answer.


At its core, CoT prompting involves structuring input prompts in a way that elicits a logical sequence of thoughts from the model. By breaking down complex problems into smaller, manageable steps, CoT attempts to enable LLMs to navigate through intricate reasoning paths more effectively.


Chain-of-Thought Prompting

How CoT Works

At its core, CoT prompting guides language models through a series of intermediate reasoning steps before arriving at a final answer. This process typically involves:


  1. Problem Decomposition: The complex task is broken down into smaller, manageable steps.
  2. Step-by-Step Reasoning: The model is prompted to think through each step explicitly.
  3. Logical Progression: Each step builds upon the previous one, creating a chain of thoughts.
  4. Conclusion Drawing: The final answer is derived from the accumulated reasoning steps (see the short worked example below).

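For instance, asked "A cafeteria had 23 apples, used 20 of them for lunch, and then bought 6 more - how many apples does it have now?", a CoT-style response decomposes the task (apples left after lunch, then apples after the purchase), reasons through each step explicitly (23 - 20 = 3, then 3 + 6 = 9), and only then states the final answer: 9.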

Types of CoT Prompting

Chain-of-Thought prompting can be implemented in various ways, with two primary types standing out:


  1. Zero-shot CoT: Zero-shot CoT doesn't require task-specific examples. Instead, it uses a simple prompt like "Let's approach this step by step" to encourage the model to break down its reasoning process.
  2. Few-shot CoT: Few-shot CoT involves providing the model with a small number of examples that demonstrate the desired reasoning process. These examples serve as a template for the model to follow when tackling new, unseen problems.


Zero-shot CoT


Zero Shot Chain-of-Thought Prompting
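As a concrete illustration, zero-shot CoT simply appends a reasoning trigger to the question. The Python sketch below uses a hypothetical call_llm() helper standing in for whatever model API you use; it is an illustrative sketch, not any specific SDK's interface.

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: wire this up to your model provider of choice.
    raise NotImplementedError

question = (
    "A cafeteria had 23 apples. It used 20 of them to make lunch "
    "and then bought 6 more. How many apples does it have now?"
)

# Zero-shot CoT: no worked examples, just a reasoning trigger appended to the question.
zero_shot_cot_prompt = question + "\n\nLet's think step by step."
answer = call_llm(zero_shot_cot_prompt)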

Few-shot CoT


Few Shot Chain-of-Thought Prompting
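Few-shot CoT instead prepends worked examples whose answers spell out the intermediate reasoning. A minimal sketch, reusing the hypothetical call_llm() helper from the zero-shot snippet above:

# Few-shot CoT: one or more worked examples demonstrate the reasoning format we want.
worked_example = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

new_question = (
    "Q: A cafeteria had 23 apples. It used 20 of them to make lunch "
    "and then bought 6 more. How many apples does it have now?\n"
    "A:"
)

few_shot_cot_prompt = worked_example + new_question
answer = call_llm(few_shot_cot_prompt)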



AI Research Paper Breakdown: "Chain of Thoughtlessness?"

Now that you know what CoT prompting is, we can dive into recent research that challenges some of its claimed benefits and offers insight into when it is actually useful.


The research paper, titled "Chain of Thoughtlessness? An Analysis of CoT in Planning," provides a critical examination of CoT prompting's effectiveness and generalizability. As AI practitioners, it's crucial to understand these findings and their implications for developing AI applications that require sophisticated reasoning capabilities.



The "Chain of Thoughtlessness?" paper

The researchers chose a classical planning domain called Blocksworld as their primary testing ground. In Blocksworld, the task is to rearrange a set of blocks from an initial configuration to a goal configuration using a series of move actions. This domain is ideal for testing reasoning and planning capabilities because:


  1. It allows for the generation of problems with varying complexity
  2. It has clear, algorithmically verifiable solutions (see the verifier sketch after this list)
  3. It's unlikely to be heavily represented in LLM training data

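Because the domain is so simple, a few lines of ordinary code can check any proposed plan. The sketch below is a minimal, simplified verifier in Python (it assumes every move is legal, i.e. the moved block and any destination block are clear); it is illustrative only and not the paper's implementation.

# A state is a list of stacks; each stack lists its blocks from bottom to top.

def apply_move(state, block, dest):
    # Move `block` from the top of its stack onto `dest` ("table" or another block).
    state = [list(stack) for stack in state]
    source = next(s for s in state if s and s[-1] == block)
    source.pop()
    if dest == "table":
        state.append([block])
    else:
        target = next(s for s in state if s and s[-1] == dest)
        target.append(block)
    return [s for s in state if s]  # drop emptied stacks

def plan_reaches_goal(initial, plan, goal):
    state = initial
    for block, dest in plan:
        state = apply_move(state, block, dest)
    return sorted(state) == sorted(goal)

initial = [["C", "A"], ["B"]]                    # A sits on C; B and C rest on the table
goal = [["A", "B", "C"]]                         # goal: A on the table, B on A, C on B
plan = [("A", "table"), ("B", "A"), ("C", "B")]  # a correct three-move plan
print(plan_reaches_goal(initial, plan, goal))    # True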


Prompting Strategies


The study examined three state-of-the-art LLMs: GPT-4, Claude-3-Opus, and GPT-4-Turbo. These models were tested using prompts of varying specificity:


  1. Zero-Shot Chain of Thought (Universal): Simply appending "let's think step by step" to the prompt.
  2. Progression Proof (Specific to PDDL): Providing a general explanation of plan correctness with examples.
  3. Blocksworld Universal Algorithm: Demonstrating a general algorithm for solving any Blocksworld problem.
  4. Stacking Prompt: Focusing on a specific subclass of Blocksworld problems (table-to-stack).
  5. Lexicographic Stacking: Further narrowing down to a particular syntactic form of the goal state.


By testing these prompts on problems of increasing complexity, the researchers aimed to evaluate how well LLMs could generalize the reasoning demonstrated in the examples.
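For a flavor of what these problems look like, here is a hypothetical zero-shot CoT rendering of a tiny Blocksworld instance (the wording is illustrative, not the paper's exact prompt):

"Block A is on block C, and blocks B and C are on the table. You may move one clear block at a time, either onto the table or onto another clear block. Your goal is a stack with C on B, B on A, and A on the table. What sequence of moves achieves this? Let's think step by step."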



Chain of Thought Imagined


Key Findings Unveiled


The results of this study challenge many prevailing assumptions about CoT prompting:


  1. Limited Effectiveness of CoT: Contrary to previous claims, CoT prompting only showed significant performance improvements when the examples provided were extremely similar to the query problem. As soon as the problems deviated from the exact format shown in the examples, performance dropped sharply.
  2. Rapid Performance Degradation: As the complexity of the problems increased (measured by the number of blocks involved), the accuracy of all models decreased dramatically, regardless of the CoT prompt used. This suggests that LLMs struggle to extend the reasoning demonstrated in simple examples to more complex scenarios.
  3. Ineffectiveness of General Prompts: Surprisingly, more general CoT prompts often performed worse than standard prompting without any reasoning examples. This contradicts the idea that CoT helps LLMs learn generalizable problem-solving strategies.
  4. Specificity Trade-off: The study found that highly specific prompts could achieve high accuracy, but only on a very narrow subset of problems. This highlights a sharp trade-off between performance gains and the applicability of the prompt.
  5. Lack of True Algorithmic Learning: The results strongly suggest that LLMs are not learning to apply general algorithmic procedures from the CoT examples. Instead, they seem to rely on pattern matching, which breaks down quickly when faced with novel or more complex problems.


These findings have significant implications for AI practitioners and enterprises looking to leverage CoT prompting in their applications. They suggest that while CoT can boost performance in certain narrow scenarios, it may not be the panacea for complex reasoning tasks that many had hoped for.


Benchmark results for Chain-of-Thought


Implications for AI Development

The findings of this study have significant implications for AI development, particularly for enterprises working on applications that require complex reasoning or planning capabilities:


  1. Reassessing CoT Effectiveness: AI developers should be cautious about relying on CoT for tasks that require true algorithmic thinking or generalization to novel scenarios.
  2. Limitations of Current LLMs: Alternative approaches may be necessary for applications requiring robust planning or multi-step problem-solving.
  3. The Cost of Prompt Engineering: While highly specific CoT prompts can yield good results for narrow problem sets, the human effort required to craft these prompts may outweigh the benefits, especially given their limited generalizability.
  4. Rethinking Evaluation Metrics: Relying solely on static test sets may overestimate a model's true reasoning capabilities.
  5. The Gap Between Perception and Reality: There's a significant discrepancy between the perceived reasoning abilities of LLMs (often anthropomorphized in popular discourse) and their actual capabilities as demonstrated in this study.


Recommendations for AI Practitioners:


  • Evaluation: Implement diverse testing frameworks to assess true generalization across problem complexities.
  • CoT Usage: Apply Chain-of-Thought prompting judiciously, recognizing its limitations in generalization.
  • Hybrid Solutions: Consider combining LLMs with traditional algorithms for complex reasoning tasks (a minimal sketch follows this list).
  • Transparency: Clearly communicate AI system limitations, especially for reasoning or planning tasks.
  • R&D Focus: Invest in research to enhance true reasoning capabilities of AI systems.
  • Fine-tuning: Consider domain-specific fine-tuning, but be aware of potential generalization limits.

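One pragmatic shape the hybrid approach can take is generate-then-verify: let the LLM propose a plan, but only accept it if a deterministic checker confirms it. The sketch below reuses the hypothetical call_llm() helper and the plan_reaches_goal() verifier from the earlier snippets, and assumes a parse_plan() function (also hypothetical) that extracts (block, destination) moves from the model's text.

def parse_plan(response_text):
    # Hypothetical: parse the model's answer into a list of (block, destination) moves.
    ...

def generate_verified_plan(problem_text, initial, goal, max_attempts=3):
    # Ask the model for a plan; keep only plans the deterministic checker accepts.
    for _ in range(max_attempts):
        response = call_llm(problem_text + "\n\nLet's think step by step.")
        plan = parse_plan(response)
        if plan and plan_reaches_goal(initial, plan, goal):
            return plan
    return None  # fall back to a classical planner or surface the failure to the user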

For AI practitioners and enterprises, these findings highlight the importance of combining LLM strengths with specialized reasoning approaches, investing in domain-specific solutions where necessary, and maintaining transparency about AI system limitations. As we move forward, the AI community must focus on developing new architectures and training methods that can bridge the gap between pattern matching and true algorithmic reasoning.



10 Best Prompting Techniques for LLMs

This week, we also explore ten of the most powerful and common prompting techniques, offering insights into their applications and best practices.


10 Best Prompting Techniques


Well-designed prompts can significantly enhance an LLM's performance, enabling more accurate, relevant, and creative outputs. Whether you're a seasoned AI developer or just starting with LLMs, these techniques will help you unlock the full potential of AI models.


Make sure to check out the full blog to learn more about each one.




Thank you for taking the time to read AI & YOU!


Thank You

For even more content on enterprise AI, including infographics, stats, how-to guides, articles, and videos, follow Skim AI on LinkedIn.


Looking to hire an AI Agent for a job to be done, or build a whole AI workforce? Schedule a demo of our no-code AI Agent Platform to make more money and tame your payroll costs forever!


We enable Venture Capital and Private Equity-backed companies in the following industries to automate work with AI: Medical Technology, News/Content Aggregation, Film & Photo Production, Educational Technology, Legal Technology, and Fintech & Cryptocurrency.
