Methodological Considerations re: Stanford HAI’s “Hallucination-Free?”
Leonard Park
Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs
Methodology Matters
Stanford HAI recently released a research paper that proposes to measure and compare generative AI research products from LexisNexis and Thomson Reuters Westlaw. While it is still early days in the gen AI legal revolution, it’s not too early to start taking empirical measurements of products if major players are ready to sell to the legal community. As is my usual practice with benchmark papers, I focused on the methodology, poring over the appendices, in order to understand what the researchers actually measured.
“Not everything that can be counted counts and not everything that counts can be counted” - Abraham Lincoln (not really)
A lot of data science boils down to very sophisticated counting. Really good data science involves counting what counts. In this instance, I believe there is a substantial divide between the researchers’ methodologies and either of these platforms typical–or even atypical–use cases, such that the rate of hallucinations they reported doesn’t reflect what a legal professional would encounter.
This is primarily a critique of the methodology as it compares to real-world use cases of legal professionals. It doesn’t cover the entire paper, the researchers’ conclusions, or even parts of their work that I found generally positive, such as §§ 6.2 and 6.3 (“Hallucinations Can Be Insidious” and “A Typology of Legal RAG Errors”), which I thought were both insightful and useful additions to the discourse.
The Hallucination Typology
The paper first adopts the definition of “hallucination” as “the tendency to produce outputs that are demonstrably false.” This definition is both over-inclusive and unhelpful from an evaluation standpoint; for example, an LLM deliberately prompted to generate statements that are demonstrably false would count as hallucinating under it.
The paper then tries to do better by defining hallucination in three primary ways: “unfaithfulness” to the prompt input, to the training data, or to “the true facts of the world.” These read like information classifications that are somewhat divorced from how an LLM functions. Roughly, transformer-based LLMs are composed of multiple processing layers, and within each layer are millions or billions of parameters dedicated to different functions. The input prompt is translated into a series of vector representations, which then activate parameters in each layer, ultimately generating the output sequence.
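To make that point concrete, here is a toy sketch of the token-in, token-out loop. It is not a real transformer, and the “parameters” are just random numbers, but it illustrates why the output is a function of nothing more than the input sequence and the frozen parameters, which is what makes splitting “unfaithfulness” across prompt, training data, and world knowledge hard to operationalize.

```python
import numpy as np

# Toy sketch only: a stand-in for the layers-and-parameters pipeline
# described above, not a real language model.
rng = np.random.default_rng(0)
VOCAB = ["the", "court", "held", "that", "arson", "requires", "intent", "."]
EMBED_DIM = 16

# "Parameters" that would normally be learned during training.
embedding = rng.normal(size=(len(VOCAB), EMBED_DIM))
output_proj = rng.normal(size=(EMBED_DIM, len(VOCAB)))

def next_token(prompt_ids: list[int]) -> int:
    # The prompt is turned into vectors, pushed through the "layers"
    # (collapsed here to a mean-pool), and projected back onto the vocabulary.
    hidden = embedding[prompt_ids].mean(axis=0)
    logits = hidden @ output_proj
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # The model emits the most probable next token, "true" or not.
    return int(np.argmax(probs))

prompt = [VOCAB.index(t) for t in ["the", "court", "held"]]
print(VOCAB[next_token(prompt)])
```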
When an LLM output differs from the desired output, it is because the input sequence doesn’t reach a parameter space that will generate the desired answer. Both prompt quality and model training affect how inference is generated. Outside of very specific circumstances (possibly time-sensitive data that is beyond the model’s knowledge cutoff), it’s difficult to lay blame on either particular source with much confidence.
As for defining hallucinations as answers lacking faithfulness to “the true facts of the world,” LLMs don’t know anything. They generate the most probable token sequence regardless of whether the output happens to be “true” (for some definition of true). Unfaithfulness to ground truth is a convenient way to define and count hallucinations, but “not everything that can be counted counts.” I don’t find this definition useful given the extent to which it ignores the integral role that prompt quality plays in generation quality.
Groundedness
The researchers go to great pains to typologize how grounding can influence generation (“Hallucinations are compounded by poor retrieval and erroneous generation.”), but without addressing that in RAG systems, poorly constructed queries result in poor retrieval.
The paper contains two different definitions of Misgrounded Groundedness. Table 1 states that an answer is misgrounded if “key factual propositions are cited, but the source does not support the claim,” which I interpret to mean that the LLM generates both a statement and a reference to a retrieved document, but that document does not contain text supporting the statement made by the LLM. In the article text on p. 8, Misgrounded is defined as “[when] the retrieval system provides documents that are inappropriate to the jurisdiction… and the model cites to them in its response.”
These are not the same. The former definition consists of a misattribution by the model, because the document cited as the reference does not have information presented in the model’s answer. In the latter definition, Misgrounded is characterized as a retrieval error, where the model has retrieved a reference document that is inappropriate given the original user question. If the LLM cites to an incorrect document, even if the assertion is correct based on the document, and regardless of whether the statement bears any relation to the user’s query, that is also apparently categorized as Misgrounded.
An example here: I ask a question about the statute of limitations for x in Tennessee, and the research platform incorrectly retrieves a non-relevant document about x in Arkansas. If the LLM then makes a correct statement based on the non-relevant document, citing to the non-relevant document, that is Misgrounded according to the second definition, but not the first.
Appendix C.3 clarifies that Misgrounded is defined as “The system supports a proposition with a source which does not support the proposition,” which would indicate that the definition in Table 1 is accurate, and the definition in the article text on p. 8 is incorrect.
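To show how differently the two definitions behave, here is a minimal sketch. The field names, the substring check, and the jurisdiction comparison are my own invented proxies for the paper’s human grading, not its actual coding rules. Applied to the Tennessee/Arkansas example above, the p. 8 rule flags the answer while the Table 1 rule does not.

```python
from dataclasses import dataclass

# Hypothetical fields, invented for illustration only.
@dataclass
class CitedProposition:
    proposition: str          # statement generated by the LLM
    source_text: str          # text of the retrieved document it cites
    source_jurisdiction: str  # e.g. "TN", "AR"

def misgrounded_table1(c: CitedProposition) -> bool:
    # Table 1 / Appendix C.3 reading: the cited source does not actually
    # support the proposition (a crude substring check stands in for a
    # human grader here).
    return c.proposition.lower() not in c.source_text.lower()

def misgrounded_p8(c: CitedProposition, query_jurisdiction: str) -> bool:
    # p. 8 reading: the cited source is inappropriate to the jurisdiction
    # of the question, regardless of whether it supports the statement.
    return c.source_jurisdiction != query_jurisdiction

# The Arkansas example: a correct statement, faithfully supported by the
# cited (but wrong-jurisdiction) document.
answer = CitedProposition(
    proposition="the limitations period is three years",
    source_text="In Arkansas, the limitations period is three years.",
    source_jurisdiction="AR",
)
print(misgrounded_table1(answer))    # False -- the cited source supports it
print(misgrounded_p8(answer, "TN"))  # True  -- wrong jurisdiction
```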
Query Construction
The researchers describe one of their categories of questions as constituting “General legal research,” which includes 20 “previously published bar exam questions,” stating that the dataset was designed to represent real-life legal research scenarios and that this category constitutes the “paradigmatic use case for these tools.” As an initial observation, given the nascency of this product category, I don’t believe there is yet an established paradigmatic use case for AI legal research tools, and KM and Ops teams are still figuring out the most reliable and accurate use cases.
“The LSAT doesn’t prepare you for law school, which doesn’t prepare you to take the Bar, which doesn’t prepare you for the practice of law.”
Early in my legal education, a professor or dean told us during orientation, “The LSAT doesn’t prepare you for law school, which doesn’t prepare you to take the Bar, which doesn’t prepare you for the practice of law.” LLM research treats MBE questions as universally applicable, self-contained, atomic units of legal work. It shouldn’t. MBE questions 1) bear little resemblance to real-world fact patterns, rarely presenting ambiguity, doctrinally undeveloped questions, conflicts of authority, etc.; 2) outside of the FRCP and FRE questions, they are simplified statements of majority rules, or made-up rules, and constitute the law of precisely nowhere; and 3) the call of most MBE questions is not research but issue spotting, which precedes legal research in the legal workflow.
Depending on how the respective marketing teams for Westlaw and Lexis have positioned their AI research tools, issue-spotting problems may be entirely fair game. But for the most part, MBE fact patterns are highly contrived, making them a poor proxy for a legal professional’s use case. For example, if a GenAI product attempted to look up the applicable law anywhere in order to answer an MBE question (a typical starting point for legal research, and the first thing that will happen in a RAG research platform), it is more likely to retrieve irrelevant and incorrect supporting documents, reducing the likelihood that the LLM generates the desired benchmark answer.
Take the example included in Appendix A.1.1:
This is not strictly a research question, because the answer does not depend on looking up the relevant statute or rule for Arson and then answering “Under … Does .. When?” by applying facts to law and reaching a conclusion. The answer here requires recognizing that Arnold cannot satisfy the elements of Arson because he is the owner of the warehouse, but that he acted with intent to commit arson because he misunderstood the definition of Arson, having specifically sought counsel on the matter. The reader must then consider how “mistake in fact vs. mistake in law” apply as defenses negating the elements of a cause of action, determine which mistake Arnold has made, recall that the general rule that mistake in law is not a defense has an exception for reliance on certain printed, judicial, or official interpretations, note that Arnold nonetheless thought he was committing arson, and then decide whether that satisfies the intent element for attempted arson. I think. I sucked at MBE questions.
From the perspective of a semantic retrieval engine with access to vast amounts of legal authority, most of this information is superfluous. The question is not asking about the fact pattern itself: the use of gasoline and the timing of the fuse, the nature of the structure, or the owing of back taxes that motivates its destruction. This is a noisy way to use a search-augmented generative research product. Those details harm retrieval performance by providing a wealth of semantic features to match that have no relevance to answering the question at hand, and a legal professional trained on a research platform would not include these details in their query.
“A.1.5 Question with Irrelevant Context” exhibits the same problem, because appending irrelevant Black’s Law Dictionary definitions to a question provides a similar payload of noise in terms of retrieval performance. Again, let’s be real: when are legal professionals going to construct irrelevant queries for a RAG system?
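To make the retrieval point concrete, here is a rough sketch that uses TF-IDF cosine similarity as a stand-in for whatever embedding and retrieval stack these platforms actually use; the passages and queries are invented for the example. The verbose fact-pattern query spends its similarity on the gasoline and back-taxes passages and never touches the mistake-of-law passage, while a focused research query matches only the legally relevant passages.

```python
# Illustration only: TF-IDF is a crude proxy for a real retrieval system,
# and the passages/queries below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Arson is the malicious burning of the dwelling or structure of another.",
    "A mistake of law is generally not a defense to criminal liability.",
    "Delinquent property taxes may result in a lien against the property.",
    "Gasoline is a flammable liquid regulated for safe storage.",
]

noisy_query = ("Arnold poured gasoline around his warehouse, lit a slow fuse, "
               "and owed back taxes on the building he owned; is he guilty of "
               "attempted arson?")
focused_query = "Is mistake of law a defense to a charge of attempted arson?"

vectorizer = TfidfVectorizer(stop_words="english").fit(passages)
passage_vecs = vectorizer.transform(passages)

for label, query in [("noisy", noisy_query), ("focused", focused_query)]:
    sims = cosine_similarity(vectorizer.transform([query]), passage_vecs)[0]
    print(label, [round(s, 2) for s in sims])

# The noisy fact pattern matches the tax-lien and gasoline passages and never
# touches the mistake-of-law passage; the focused query hits only the
# arson and mistake-of-law passages.
```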
The rest of the A.1 legal research question types (LegalBench rule_qa, Doctrinal Agreement, and Doctrine Test) seem like fair and useful tests for these AI products.
A.2. Jurisdiction or Time-specific
The researchers adopted LegalBench SCALR, a multiple-choice, in-context LLM reasoning benchmark set, and adapted it as an information retrieval + RAG reasoning benchmark set. I find this benchmark confusing (both in its native form in LegalBench and in its adapted form here). The task consists of providing the system a Question Presented from a recent Supreme Court case, then testing whether the model can select the correct holding, multiple-choice style, from a selection of judge-written holding statements drawn from subsequent parenthetical citations to the SCOTUS case (note: using the holding statements along with the original Opinion text would make an interesting legal-domain information retrieval task).
Many of these questions presented involve questions of significant federal laws that are the subject of frequent interpretation and litigation. Additionally, many of these issues have a substantial litigation history prior to their adjudication by the Supreme Court, given that it has very narrow original jurisdiction. Therefore, as an information retrieval task, there are many forms of incorrect source documents that might be retrieved if the query is not constructed to reference Supreme Court opinions, specifically. In other words, a legal professional, using a research system containing all US case law, statutes, rules, and other authorities, generally wouldn’t ask questions about Supreme Court holdings, while omitting the specific case they are researching.
I am less critical of LegalBench as an abstract domain-specific reasoning benchmark. But in terms of benchmarking a specific research platform, I want test methodologies that adhere more closely to real-world use cases, not ones that add arbitrary difficulty by constructing queries poorly.
A.3. False Premise
Frequently in RAG systems, intentionally providing false, misleading, or misdirecting queries creates a conflict of instructions: the LLM is attempting to provide the most accurate answer to the question given the retrieved information sources, with some guardrailing for the case where the retrieval step has failed to provide relevant information. Providing conflicting information is a known and easy way to get an LLM to produce low-fidelity answers (a.k.a. bad prompting produces bad answers, a.k.a. garbage in, garbage out). Because the law is complex and humans are fallible, AI research platforms with a chat interface will need to handle false-premise queries gracefully. But in terms of generating incorrect statements, outlandish questions are again going to diminish the retrieval quality of the RAG system, resulting in poor generation quality.
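For what it’s worth, the kind of guardrailing I have in mind looks roughly like the sketch below. The wording and message structure are my own invention, not either vendor’s actual system prompt; the point is only that a RAG assistant has to be told explicitly how to behave when the retrieved passages contradict, or fail to support, the premise of the question.

```python
# A hedged sketch of a false-premise guardrail for a RAG legal assistant.
# The instructions and message format are invented for illustration; they are
# not taken from Lexis, Westlaw, or the paper.
SYSTEM_PROMPT = """You are a legal research assistant.
Answer only from the retrieved passages below.
If the passages contradict or do not support the premise of the question,
say so explicitly and describe what the passages do support, instead of
answering as though the premise were true.
If no passage is relevant, say that you could not find supporting authority.

Retrieved passages:
{passages}
"""

def build_messages(passages: list[str], question: str) -> list[dict]:
    # Number the passages so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(passages=context)},
        {"role": "user", "content": question},
    ]
```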
Notes on Correctness
The researchers’ correctness coding distinguishes between a “flat refusal” and a “failure to find” refusal.
I don’t believe that the LLMs consistently exhibit discrete behaviors that meaningfully distinguish these two answers. I believe this categorization is drawing distinctions without a difference.
Coding Bluebook Citation Responses
I’m not certain what LLM behaviors this coding rule refers to, but if it refers to citations within generated text, it is likely too harsh. Legal professionals live with the reality that they will encounter incorrect citations as a daily matter. Demanding complete, Bluebook-compliant citations across all legal work product is far from the reality in our profession. If a research product has a feature specifically built to generate Bluebook citations, those should be formatted perfectly.
If this rule is referring to generated text containing citations, then all that matters is whether a legal professional can locate the correct authority using the conventions of that particular source (even with some blank-filling), as that is consistent with the state of the profession. Expecting LLMs to improve on the state of legal citations, given the current state of LLMs, is not realistic. This rule would penalize LLM answers that contain accurate-but-incomplete citation information, which may even be consistent with the source material the LLM was provided via RAG, but give full credit to a model that simply omits the citation entirely. That’s counterproductive from a legal research standpoint. A fabricated citation is problematic, but a partial citation is often enough to find the source material you need.
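As an illustration of the distinction I’m drawing, here is a minimal sketch of a more forgiving coding scheme. The categories, the token-overlap proxy, and the thresholds are invented stand-ins for human review, not the paper’s rubric.

```python
# A sketch of a citation-coding rule that separates partial-but-locatable
# citations from fabricated ones. The token-overlap proxy and thresholds are
# invented; in practice a human grader would make the call.
import re

def code_citation(generated_citation: str, true_citation: str) -> str:
    gen = set(re.findall(r"\w+", generated_citation.lower()))
    true = set(re.findall(r"\w+", true_citation.lower()))
    if not gen:
        return "omitted"      # no citation offered at all
    overlap = len(gen & true) / len(true)
    if overlap >= 0.9:
        return "complete"     # effectively full, compliant citation
    if overlap >= 0.4:
        return "partial"      # incomplete, but enough to locate the authority
    return "fabricated"       # bears little resemblance to the real citation

print(code_citation("Miranda v. Arizona, 384 U.S. 436",
                    "Miranda v. Arizona, 384 U.S. 436 (1966)"))
# -> "partial" (7 of the 8 real citation tokens present)
```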
Conclusion
“Measure What Matters.” - John Doerr (I’ve only read the book cover)
With the current state of LLMs, there will always be potential for fabricated or inaccurate text generation. This potential is amplified when the user input is confusing, disjointed, or conflicting. The researchers stressed that they wanted the questions to be difficult. However, that difficulty should come from the complexity of the legal analysis, or from uncertainty in the state of the law, not from poorly constructed queries.
This study is an important step towards evaluating legal research platforms, but it deviates too far from practical use cases. As a result, the guidance it offers with respect to avoiding hallucinations is minimal. By developing benchmarks more in line with the needs of legal professionals, we can work towards performance benchmarks that inform practitioners of the best practices, and the pitfalls, of using generative AI research products.