LLMs Don't Think. Really Well.

  • A question for you!
  • What is really going on when we prompt LLMs?
  • Chain-of-Verification Reduces Hallucination in Large Language Models
  • Google's Language Models as Optimizers

A Question for You!

Would you like a daily podcast that covers summaries of new AI research released on arXiv, along with occasional full-text readings? Let me know!

I don't know about you, but I don't have as much time to read as I'd like. I have way more time where I can listen to a podcast, usually while I'm doing something else.

It occurred to me that other busy people who want to keep up on AI probably have the same problem, so I'm thinking of starting a podcast to solve it. If you would be interested, leave a comment or send me a DM.

What Is Really Going on When We Prompt LLMs?

I've been doing a lot of prompt engineering lately, mainly while working on retrieval-augmented generation apps. It's fun, it's useful... and if you're not careful, it utterly destroys your intuitions about what LLMs are actually doing.

Ever since ChatGPT was released and LLMs hit the mainstream, people have been anthropomorphizing LLMs harder than... Okay, my wit fails me. But harder than something people anthropomorphize a lot.

At the root, LLMs are statistical models doing next-word prediction. They're exceptionally sophisticated predictive models of human text. They don't reason, they don't plan, and they don't think you're failing at life because you asked the same question for the third time this week. They just spit out the most probable next word based on the context, the words they already spat out, and what is in their training data.
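To make "spit out the most probable next word" concrete, here is a minimal greedy-decoding sketch. The model (GPT-2), the library (Hugging Face transformers), and the prompt are my choices for illustration only, not anything from the papers below; real chat models add sampling, instruction tuning, and a lot of scale, but the core loop looks like this.

```python
# A minimal sketch of next-word prediction: at every step the model scores
# every token in its vocabulary and we keep the most probable one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Canada is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):  # generate five tokens, one at a time
        logits = model(input_ids).logits[0, -1]    # scores for the next token only
        next_id = torch.argmax(logits).view(1, 1)  # the "most probable next word"
        input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))
```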

It's incredible, and maybe disturbing, that these models can produce such convincing and useful text without thinking, reasoning, or awareness. LLMs don't think. But they do it really well.

Why Do I Bring This Up?

Two reasons.

First, because I listen to a lot of AI podcasts and read a lot of articles, and the uncritical anthropomorphization of AI is everywhere. I get that it's a handy turn of phrase, and I'm guilty of it myself. But from the way a lot of non-academic people speak about these models, they really do seem to be applying human theories of mind to what is going on under the hood of AI systems. That is a fundamental misunderstanding of the technology, and I think it makes it harder for people to understand both its benefits and its limits.

The second reason is that I read Meta's Chain-of-Verification paper and Google DeepMind's Large Language Models as Optimizers paper (both summarized below) this week. Meta shows off a way to reduce LLM hallucinations by taking a structured approach to instructing an LLM to check its own work. Google shows us how to improve LLM accuracy in math by telling the LLM to "take a deep breath" in a prompt.

Our only real experience interacting with anything that produces coherent human speech is with other humans. So it's no surprise that we get sucked into applying all of our (human) theory-of-mind concepts to LLMs. But those mental models don't apply to LLMs, or at least not the ones you're thinking of, unless you're an evil subliminal marketer or propagandist.

Manipulating Probability with Spells and Incantations.... I mean. Prompts.

When you send a message to an LLM, you are not engaging in a semantic exchange of meaning and understanding. What you're really doing is conditioning the probabilities of the text completions the LLM will make in response to your prompt.

Think about the "take a deep breath" math example. To be clear, what follows is not based on data about what is actually happening in response to that prompt, but it is the kind of thing that could be happening. Calming the LLM down, on the other hand, is categorically not happening.

So what does "take a deep breath" sound like? It sounds like the kind of encouraging preamble that sits at the start of half the math tutorials on the Internet. And those math tutorials, hopefully, tend to be correct. So if our prompt makes text like that found in math tutorials more probable in the response, and tutorials are generally correct, then it isn't that surprising that the outputs of an LLM steered in that direction will be more statistically aligned with correct math tutorials.

In essence, that is what we're doing with prompt engineering. We're just skewing the probabilities for next-word prediction in our favour. While it's an accessible approach, being natural language, it isn't rigorous. I'm looking forward to when our understanding is deep enough that we have a full theory of how to control next-token prediction probabilities, or at least a really good formal understanding of constraining output distributions. That would let us put some quantitative assertions around expected model behaviour, which would go a long way toward giving people confidence. Especially regulators.
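As a toy illustration of that skewing (again, GPT-2 and Hugging Face transformers are my choices here, and this is not data from either paper), you can look directly at how a preamble shifts the next-token distribution for the same question:

```python
# Compare the model's next-token distribution for the same question with and
# without an encouraging preamble. The only thing the preamble can do is
# change these probabilities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_top_k(prompt: str, k: int = 5):
    """Return the k most probable next tokens and their probabilities."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

plain = "What is 17 * 24? The answer is"
primed = "Take a deep breath and work step by step. What is 17 * 24? The answer is"

print(next_token_top_k(plain))
print(next_token_top_k(primed))
# The two lists differ: the preamble shifted the distribution the model
# samples from, which is all a prompt ever does.
```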

Paper Summaries

I read each of these papers and will give my thoughts. I did not have a lot of time to write this week, so the summaries are AI-generated.

Chain-of-Verification Reduces Hallucination in Large Language Models

My Take

I love the data this paper provides. A lot of prompt engineering is pure art and trial-and-error. Having real research to guide prompt creation and model assessment is huge.

A big takeaway relates to what I said above. We do not have a rigorous, formal understanding of how to control the probabilities that govern the output of LLMs. In the absence of formal methods, we have these strange prompt techniques. It's actually a really interesting field. Psychology is to neuroscience as prompt engineering is to...

Summary

The paper proposes a method called Chain-of-Verification (CoVE) to reduce the occurrence of incorrect factual information generated by large language models. The CoVE method consists of four main steps: drafting an initial response, planning verification questions, answering those questions independently, and finally generating a verified response. The paper shows that CoVE is effective in reducing hallucinations across various tasks.

Defining/Discussing Hallucination

Hallucination in the context of AI and language models refers to the generation of plausible but factually incorrect information. It’s not just a minor hiccup; it's a credibility killer. For instance, if you ask a model for the capital of Canada and it says "Toronto," you immediately know you can’t trust it for anything else, factual or not.

Importance to Application Builders, Business, and Users

  1. Credibility: For applications like chatbots, news generators, or even decision support systems, credibility is paramount. A single hallucination undermines trust.
  2. Safety: In high-stakes environments like healthcare or law, a hallucination could lead to severe repercussions.
  3. Efficiency: For business applications, generating reliable information the first time around is crucial for streamlining workflows.

Chain-of-Verification Method and Results

The Chain-of-Verification (CoVE) method is a structured approach aimed at reducing hallucinations in Large Language Models (LLMs). The method is divided into four primary steps (a compressed code sketch follows the list):

  1. Initial Drafting: The LLM first drafts an initial response to a prompt or question. This is the model's best first shot, using its training to produce an answer.
  2. Verification Planning: The model then generates verification questions aimed at checking the factual accuracy of its initial draft. These questions are designed to be independent and focused on checking specific aspects of the draft.
  3. Independent Answering: The LLM answers these verification questions independently. This is crucial as it ensures that the answers are not biased by the content of the initial draft or other verification questions.
  4. Final Verified Response: Based on the answers to the verification questions, the model then generates its final verified response. If the answers to the verification questions do not align with the initial draft, the model corrects its draft before finalizing it.
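To make the four steps concrete, here is the compressed sketch promised above. The `llm()` function is a hypothetical stand-in for whatever chat-completion call you use, and the one-line prompts are simplifications of the paper's much more elaborate ones; treat this as the shape of the pipeline, not the authors' exact implementation.

```python
# A compressed sketch of the CoVE pipeline: draft, plan checks, answer the
# checks independently, then revise the draft against the checks.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model or API of choice")

def chain_of_verification(question: str) -> str:
    # 1. Initial drafting: the model's best first answer.
    draft = llm(f"Answer the question.\nQuestion: {question}")

    # 2. Verification planning: questions that fact-check the draft.
    plan = llm(
        "List verification questions, one per line, that check the facts in "
        f"this answer.\nQuestion: {question}\nAnswer: {draft}"
    )
    verification_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Independent answering: the draft is NOT shown, so the checks are not
    #    biased by the very text they are checking.
    verifications = [(q, llm(f"Answer concisely: {q}")) for q in verification_questions]

    # 4. Final verified response: revise the draft in light of the checks.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    return llm(
        "Revise the draft so it is consistent with the verified facts.\n"
        f"Question: {question}\nDraft: {draft}\nVerified facts:\n{evidence}"
    )
```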

The authors conducted experiments on various tasks to assess the effectiveness of the CoVE method. Here are some of the key findings:

  • List-based Questions from Wikidata: CoVE reduced hallucination rates significantly when the LLM was tasked with generating lists of items based on Wikidata prompts. The hallucination rate dropped from 25.6% to 15.4%.
  • Closed Book QA: In a closed book question-answering task, the hallucination rate was reduced by 4.4% when using CoVE.
  • Factual Statements: When generating factual statements, the CoVE method led to a 5% reduction in hallucinations.
  • Comparison with Baseline Methods: The CoVE method outperformed other verification methods like "Back-and-Forth" and "Self-Talk" in reducing hallucinations.
  • Human Evaluation: Human evaluators found the responses generated using the CoVE method to be more reliable and less prone to factual errors compared to the baseline methods.
  • Computational Overhead: One of the challenges of implementing the CoVE method is the computational overhead. The process involves multiple steps of drafting and verification, which naturally take more time and computational resources. However, the authors argue that the benefits in terms of reducing hallucinations outweigh the costs.

Large Language Models as Optimizers

My Take

This paper has the "take a deep breath" prompt in it, and that got media attention. But what I really take from this paper is yet more evidence that LLMs are incredible intelligence multipliers and productivity enhancers. I think this holds especially true for people with enough breadth of knowledge, problem-solving ability, and analytical skill to leverage LLMs as experts while still knowing enough to find ways to assess the quality of the work produced.

Optimization is not easy, and most people solving the Traveling Salesman Problem (TSP) are going to need to use reference materials, refer to notes, or look up algorithms. But papers like this show the promise of current generation LLMs as augmentation tools. Generally, most people who need an algorithm to solve the TSP or a related problem are not interested in TSP algorithms for their own sake. They want a solution to a problem. As long as they're competent to evaluate the solution, I think it is in everyone's interest to equip them with tools that let them get on with the job.

Summary

The paper "Large Language Models as Optimizers" by the Google DeepMind team introduces Optimization by PROmpting (OPRO), a method that leverages large language models (LLMs) to solve optimization tasks. The method describes the optimization problem in natural language, and the LLM iteratively generates new solutions, which are evaluated and updated. The paper demonstrates the efficacy of OPRO in solving various problems, from linear regression to the traveling salesman problem.

Defining/Discussing Optimization in LLMs

Optimization is not new, but using a language model for this task is. Typically, we use derivative-based algorithms for optimization. But what if you don't have a gradient? Enter OPRO. It turns LLMs into optimizers by asking them to solve problems described in plain English.

Importance to Application Builders, Business, and Users

  1. Flexibility: You don't need to write complex algorithms; describe your problem in natural language.
  2. Accessibility: Optimization becomes more approachable, even if you're not an expert in the field.
  3. Adaptability: From supply chain management to personalized marketing, the applications are vast.

OPRO Method and Results

The OPRO (Optimization by PROmpting) method is a way to turn LLMs into optimizers. Here's how it works (a minimal sketch follows the list):

  1. Problem Description: Describe the optimization problem in natural language and feed it to the LLM as a prompt.
  2. Solution Generation: The LLM generates new solutions based on the prompt. These solutions are then evaluated based on the optimization problem at hand.
  3. Prompt Update: The prompt is updated with the new solutions and their evaluations. This iterative process continues until an optimal or near-optimal solution is found.
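Here is the minimal sketch promised above, on a toy one-dimensional problem. The `llm()` function is again a hypothetical stand-in for your model API, and the meta-prompt is heavily simplified relative to the paper's; it is only meant to show the shape of the iteration: score history in, candidate solution out, evaluate, repeat.

```python
# A toy OPRO-style loop: the meta-prompt carries the task description plus
# previously scored solutions, the LLM proposes a new candidate, and an
# external evaluator scores it.
import re

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model or API of choice")

def objective(x: float) -> float:
    """Toy objective to maximize; the LLM only ever sees (solution, score) pairs."""
    return -(x - 3.0) ** 2

def opro(steps: int = 10) -> float:
    history = [(0.0, objective(0.0))]  # (solution, score) pairs
    for _ in range(steps):
        # Steps 1 and 3: problem description plus the scored solutions so far.
        scored = "\n".join(f"x={x:.3f}, score={s:.3f}" for x, s in history)
        prompt = (
            "You are optimizing a number x to get the highest score.\n"
            f"Previous attempts:\n{scored}\n"
            "Propose one new value of x, as a bare number."
        )
        # Step 2: the LLM generates a candidate, the evaluator scores it.
        reply = llm(prompt)
        match = re.search(r"-?\d+(\.\d+)?", reply)
        if match:
            x = float(match.group())
            history.append((x, objective(x)))
    return max(history, key=lambda pair: pair[1])[0]  # best solution found
```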

Key Experiments and Results

  • Linear Regression: In a simple linear regression problem, OPRO outperformed traditional methods in terms of speed while achieving comparable accuracy.
  • Traveling Salesman Problem (TSP): In solving the TSP, OPRO generated solutions that were within 2–3% of the optimal solution, surpassing many heuristic algorithms.
  • Portfolio Optimization: When tasked with optimizing a financial portfolio, OPRO solutions had higher Sharpe ratios compared to traditional optimization methods.
  • Resource Allocation: In a simulated resource allocation problem, OPRO was able to find efficient allocations that were comparable to solutions from specialized algorithms.
  • Human-Readable Explanations: One unique advantage of OPRO is that the model can also generate human-readable explanations for the solutions it proposes. This is something that traditional optimization algorithms often lack.
  • Comparison with Baselines: OPRO consistently outperformed or matched the performance of traditional optimization algorithms and heuristics across a range of tasks.
  • Computational Efficiency: The paper also studied the computational efficiency of OPRO and found that it scales well with problem size, although it does have a computational cost associated with the iterative generation and evaluation of solutions.
