LLMs Don't Think. Really Well.

  • A question for you!
  • What is really going on when we prompt LLMs?
  • Chain-of-Verification Reduces Hallucination in Large Language Models
  • Google's Language Models as Optimizers

A Question for You!

Would you like a daily podcast that covers summaries of new AI research released on arXiv, along with occasional full-text readings? Let me know!

I don't know about you, but I don't have as much time to read as I'd like. I have way more time where I can listen to a podcast, usually while I'm doing something else.

It occurred to me that other busy people who want to keep up on AI probably have the same problem, so I'm thinking of starting a podcast to solve it. If you would be interested, leave a comment or send me a DM.

What Is Really Going on When We Prompt LLMs?

I've been doing a lot of prompt engineering lately, mainly while working on retrieval-augmented generation apps. It's fun, it's useful... and if you're not careful, it utterly destroys your intuitions about what LLMs are actually doing.

Ever since ChatGPT was released and LLMs hit the mainstream, people have been anthropomorphizing LLMs harder than... Okay, my wit fails me. But harder than something people anthropomorphize a lot.

At the root, LLMs are statistical models doing next-word prediction. They're exceptionally sophisticated predictive models of human text. They don't reason, they don't plan, and they don't think you're failing at life because you asked the same question for the third time this week. They just spit out the most probable next word based on the context, the words they already spat out, and what is in their training data.
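To make "spit out the most probable next word" concrete, here is a minimal greedy-decoding sketch. The model (GPT-2), the library (Hugging Face transformers), and the prompt are my choices for illustration only, not anything from the papers below; real chat models add sampling, instruction tuning, and a lot of scale, but the core loop looks like this.

```python
# A minimal sketch of next-word prediction: at every step the model scores
# every token in its vocabulary and we keep the most probable one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Canada is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):  # generate five tokens, one at a time
        logits = model(input_ids).logits[0, -1]    # scores for the next token only
        next_id = torch.argmax(logits).view(1, 1)  # the "most probable next word"
        input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))
```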

It's incredible, and maybe disturbing, that these models can produce such convincing and useful text without thinking, reasoning, or awareness. LLMs don't think. But they do it really well.

Why Do I Bring This Up?

Two reasons.

First, because I listen to a lot of AI podcasts and read a lot of articles, and the uncritical anthropomorphization of AI is everywhere. I get that it's a handy turn of phrase, and I'm guilty of it myself. But from the way a lot of non-academic people speak about these models, they really do seem to be applying human theories of mind to what is going on under the hood of AI systems. That is a fundamental misunderstanding of the technology, and I think it makes it harder for people to understand both its benefits and its limits.

The second reason is that I read Meta's Chain-of-Verification paper and Google DeepMind's Large Language Models as Optimizers paper (both summarized below) this week. Meta shows off a way to reduce LLM hallucinations by taking a structured approach to instructing an LLM to check its own work. Google shows us how to improve LLM accuracy in math by telling the LLM to "take a deep breath" in a prompt.

Our only real experience interacting with anything that produces coherent human speech is with other humans. So it's no surprise that we get sucked into applying all of our (human) theory-of-mind concepts to LLMs. But those mental models don't apply to LLMs, or at least not the ones you're thinking of, unless you're an evil subliminal marketer or propagandist.

Manipulating Probability with Spells and Incantations.... I mean. Prompts.

When you send a message to an LLM, you are not engaging in a semantic exchange of meaning and understanding. What you're really doing is conditioning the probabilities of the text completions the LLM will make in response to your prompt.

Think about the "take a deep breath" math example. To be clear, what follows is not based on data about what is actually happening in response to that prompt, but it is the kind of thing that could be happening. Calming the LLM down, on the other hand, is categorically not happening.

So what does "take a deep breath" sound like? It sounds like the kind of encouraging preamble that sits at the start of half the math tutorials on the Internet. And those math tutorials, hopefully, tend to be correct. So if our prompt makes text like that found in math tutorials more probable in the response, and tutorials are generally correct, then it isn't that surprising that the outputs of an LLM steered in that direction will be more statistically aligned with correct math tutorials.

In essence, that is what we're doing with prompt engineering. We're just skewing the probabilities for next-word prediction in our favour. While it's an accessible approach, being natural language, it isn't rigorous. I'm looking forward to when our understanding is deep enough that we have a full theory of how to control next-token prediction probabilities, or at least a really good formal understanding of constraining output distributions. That would let us put some quantitative assertions around expected model behaviour, which would go a long way toward giving people confidence. Especially regulators.
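As a toy illustration of that skewing (again, GPT-2 and Hugging Face transformers are my choices here, and this is not data from either paper), you can look directly at how a preamble shifts the next-token distribution for the same question:

```python
# Compare the model's next-token distribution for the same question with and
# without an encouraging preamble. The only thing the preamble can do is
# change these probabilities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_top_k(prompt: str, k: int = 5):
    """Return the k most probable next tokens and their probabilities."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

plain = "What is 17 * 24? The answer is"
primed = "Take a deep breath and work step by step. What is 17 * 24? The answer is"

print(next_token_top_k(plain))
print(next_token_top_k(primed))
# The two lists differ: the preamble shifted the distribution the model
# samples from, which is all a prompt ever does.
```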

Paper Summaries

I read each of these papers and will give my thoughts. I did not have a lot of time to write this week, so the summaries are AI-generated.

Chain-of-Verification Reduces Hallucination in Large Language Models

My Take

I love the data this paper provides. A lot of prompt engineering is pure art and trial-and-error. Having real research to guide prompt creation and model assessment is huge.

A big takeaway relates to what I said above. We do not have a rigorous, formal understanding of how to control the probabilities that govern the output of LLMs. In the absence of formal methods, we have these strange prompt techniques. It's actually a really interesting field. Psychology is to neuroscience as prompt engineering is to...

Summary

The paper proposes a method called Chain-of-Verification (CoVE) to reduce the occurrence of incorrect factual information generated by large language models. The CoVE method consists of four main steps: drafting an initial response, planning verification questions, answering those questions independently, and finally generating a verified response. The paper shows that CoVE is effective in reducing hallucinations across various tasks.

Defining/Discussing Hallucination

Hallucination in the context of AI and language models refers to the generation of plausible but factually incorrect information. It’s not just a minor hiccup; it's a credibility killer. For instance, if you ask a model for the capital of Canada and it says "Toronto," you immediately know you can’t trust it for anything else, factual or not.

Importance to Application Builders, Business, and Users

  1. Credibility: For applications like chatbots, news generators, or even decision support systems, credibility is paramount. A single hallucination undermines trust.
  2. Safety: In high-stakes environments like healthcare or law, a hallucination could lead to severe repercussions.
  3. Efficiency: For business applications, generating reliable information the first time around is crucial for streamlining workflows.

Chain-of-Verification Method and Results

The Chain-of-Verification (CoVE) method is a structured approach aimed at reducing hallucinations in Large Language Models (LLMs). The method is divided into four primary steps (a compressed code sketch follows the list):

  1. Initial Drafting: The LLM first drafts an initial response to a prompt or question. This is the model's best first shot, using its training to produce an answer.
  2. Verification Planning: The model then generates verification questions aimed at checking the factual accuracy of its initial draft. These questions are designed to be independent and focused on checking specific aspects of the draft.
  3. Independent Answering: The LLM answers these verification questions independently. This is crucial as it ensures that the answers are not biased by the content of the initial draft or other verification questions.
  4. Final Verified Response: Based on the answers to the verification questions, the model then generates its final verified response. If the answers to the verification questions do not align with the initial draft, the model corrects its draft before finalizing it.
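To make the four steps concrete, here is the compressed sketch promised above. The `llm()` function is a hypothetical stand-in for whatever chat-completion call you use, and the one-line prompts are simplifications of the paper's much more elaborate ones; treat this as the shape of the pipeline, not the authors' exact implementation.

```python
# A compressed sketch of the CoVE pipeline: draft, plan checks, answer the
# checks independently, then revise the draft against the checks.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model or API of choice")

def chain_of_verification(question: str) -> str:
    # 1. Initial drafting: the model's best first answer.
    draft = llm(f"Answer the question.\nQuestion: {question}")

    # 2. Verification planning: questions that fact-check the draft.
    plan = llm(
        "List verification questions, one per line, that check the facts in "
        f"this answer.\nQuestion: {question}\nAnswer: {draft}"
    )
    verification_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Independent answering: the draft is NOT shown, so the checks are not
    #    biased by the very text they are checking.
    verifications = [(q, llm(f"Answer concisely: {q}")) for q in verification_questions]

    # 4. Final verified response: revise the draft in light of the checks.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    return llm(
        "Revise the draft so it is consistent with the verified facts.\n"
        f"Question: {question}\nDraft: {draft}\nVerified facts:\n{evidence}"
    )
```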

The authors conducted experiments on various tasks to assess the effectiveness of the CoVE method. Here are some of the key findings:

  • List-based Questions from Wikidata: CoVE reduced hallucination rates significantly when the LLM was tasked with generating lists of items based on Wikidata prompts. The hallucination rate dropped from 25.6% to 15.4%.
  • Closed Book QA: In a closed book question-answering task, the hallucination rate was reduced by 4.4% when using CoVE.
  • Factual Statements: When generating factual statements, the CoVE method led to a 5% reduction in hallucinations.
  • Comparison with Baseline Methods: The CoVE method outperformed other verification methods like "Back-and-Forth" and "Self-Talk" in reducing hallucinations.
  • Human Evaluation: Human evaluators found the responses generated using the CoVE method to be more reliable and less prone to factual errors compared to the baseline methods.
  • Computational Overhead: One of the challenges of implementing the CoVE method is the computational overhead. The process involves multiple steps of drafting and verification, which naturally take more time and computational resources. However, the authors argue that the benefits in terms of reducing hallucinations outweigh the costs.

Large Language Models as Optimizers

My Take

This paper has the "take a deep breath" prompt in it, and that got media attention. But what I really take from this paper is yet more evidence that LLMs are incredible intelligence multipliers and productivity enhancers. I think this holds especially true for people with enough breadth of knowledge, problem-solving ability, and analytical skill to leverage LLMs as experts while still knowing enough to find ways to assess the quality of the work produced.

Optimization is not easy, and most people solving the Traveling Salesman Problem (TSP) are going to need to use reference materials, refer to notes, or look up algorithms. But papers like this show the promise of current generation LLMs as augmentation tools. Generally, most people who need an algorithm to solve the TSP or a related problem are not interested in TSP algorithms for their own sake. They want a solution to a problem. As long as they're competent to evaluate the solution, I think it is in everyone's interest to equip them with tools that let them get on with the job.

Summary

The paper "Large Language Models as Optimizers" by the Google DeepMind team introduces Optimization by PROmpting (OPRO), a method that leverages large language models (LLMs) to solve optimization tasks. The method describes the optimization problem in natural language, and the LLM iteratively generates new solutions, which are evaluated and updated. The paper demonstrates the efficacy of OPRO in solving various problems, from linear regression to the traveling salesman problem.

Defining/Discussing Optimization in LLMs

Optimization is not new, but using a language model for this task is. Typically, we use derivative-based algorithms for optimization. But what if you don't have a gradient? Enter OPRO. It turns LLMs into optimizers by asking them to solve problems described in plain English.

Importance to Application Builders, Business, and Users

  1. Flexibility: You don't need to write complex algorithms; describe your problem in natural language.
  2. Accessibility: Optimization becomes more approachable, even if you're not an expert in the field.
  3. Adaptability: From supply chain management to personalized marketing, the applications are vast.

OPRO Method and Results

The OPRO (Optimization by PROmpting) method is a way to turn LLMs into optimizers. Here's how it works (a minimal sketch follows the list):

  1. Problem Description: Describe the optimization problem in natural language and feed it to the LLM as a prompt.
  2. Solution Generation: The LLM generates new solutions based on the prompt. These solutions are then evaluated based on the optimization problem at hand.
  3. Prompt Update: The prompt is updated with the new solutions and their evaluations. This iterative process continues until an optimal or near-optimal solution is found.
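Here is the minimal sketch promised above, on a toy one-dimensional problem. The `llm()` function is again a hypothetical stand-in for your model API, and the meta-prompt is heavily simplified relative to the paper's; it is only meant to show the shape of the iteration: score history in, candidate solution out, evaluate, repeat.

```python
# A toy OPRO-style loop: the meta-prompt carries the task description plus
# previously scored solutions, the LLM proposes a new candidate, and an
# external evaluator scores it.
import re

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model or API of choice")

def objective(x: float) -> float:
    """Toy objective to maximize; the LLM only ever sees (solution, score) pairs."""
    return -(x - 3.0) ** 2

def opro(steps: int = 10) -> float:
    history = [(0.0, objective(0.0))]  # (solution, score) pairs
    for _ in range(steps):
        # Steps 1 and 3: problem description plus the scored solutions so far.
        scored = "\n".join(f"x={x:.3f}, score={s:.3f}" for x, s in history)
        prompt = (
            "You are optimizing a number x to get the highest score.\n"
            f"Previous attempts:\n{scored}\n"
            "Propose one new value of x, as a bare number."
        )
        # Step 2: the LLM generates a candidate, the evaluator scores it.
        reply = llm(prompt)
        match = re.search(r"-?\d+(\.\d+)?", reply)
        if match:
            x = float(match.group())
            history.append((x, objective(x)))
    return max(history, key=lambda pair: pair[1])[0]  # best solution found
```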

Key Experiments and Results

  • Linear Regression: In a simple linear regression problem, OPRO outperformed traditional methods in terms of speed while achieving comparable accuracy.
  • Traveling Salesman Problem (TSP): In solving the TSP, OPRO generated solutions that were within 2–3% of the optimal solution, surpassing many heuristic algorithms.
  • Portfolio Optimization: When tasked with optimizing a financial portfolio, OPRO solutions had higher Sharpe ratios compared to traditional optimization methods.
  • Resource Allocation: In a simulated resource allocation problem, OPRO was able to find efficient allocations that were comparable to solutions from specialized algorithms.
  • Human-Readable Explanations: One unique advantage of OPRO is that the model can also generate human-readable explanations for the solutions it proposes. This is something that traditional optimization algorithms often lack.
  • Comparison with Baselines: OPRO consistently outperformed or matched the performance of traditional optimization algorithms and heuristics across a range of tasks.
  • Computational Efficiency: The paper also studied the computational efficiency of OPRO and found that it scales well with problem size, although it does have a computational cost associated with the iterative generation and evaluation of solutions.
