GPT-4 Turbo is here! Now what? Long Context analysis and some implications for Legal
That's a lot of documents.


A meta-analysis of Long-Context LLM Benchmarks and Legal Summarization Performance

Last DevDay marked the release of #OpenAI's latest version of gpt-4, gpt-4-1106-preview, the first #gpt4 model to carry the "turbo" designation. It boasts faster inference speeds (you get your response tokens sooner), roughly one-third the per-token cost of the previous gpt-4, and a very impressive 128k-token context window, which corresponds to approximately 200 pages of text, split between input and output. Previous versions of gpt-4 had 8,192- or 32k-token context lengths, roughly 12 and 48 pages of text. Additionally, access to the 32k context model was highly restricted (most API customers never received access), and it cost twice as much per token as the 8k model.
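As a practical aside, if you want to sanity-check whether a set of documents will actually fit, here is a minimal sketch using the tiktoken library (assuming the cl100k_base encoding used by gpt-4-class models and the advertised 128,000-token total window):

    # Minimal sketch: count tokens before stuffing documents into the window.
    # Assumes the cl100k_base encoding used by gpt-4-class models; 128,000 is
    # the advertised total window (input + output) for gpt-4-1106-preview.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def fits_in_context(documents, max_output_tokens=4096, window=128_000):
        # documents: list of plain-text strings you intend to place in the prompt
        total_input = sum(len(enc.encode(doc)) for doc in documents)
        return total_input, total_input + max_output_tokens <= window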

Shortly after the release, and before anyone would have had time to sufficiently test the new model, I had a number of developers and vendors exclaim confidently that this model was great for legal applications because you could now place all of your documents within the context window and have gpt-4 apply its wondrous capabilities to perform any imaginable task. Really, it was a complete game-changer and would make my life so much easier. These people obviously had no idea what they were talking about.

Faster, cheaper, and greater context length are all positive changes. However, that last one comes with its own challenges and nuances. Large context LLM queries behave very strangely. Previous research demonstrated that models exhibit “U-shaped” performance curves as context lengths increase, having better command of the context appearing at the beginning and end of their windows. Additionally, there are results from RAG pipeline testing that demonstrate that LLMs tend to generalize, “hallucinate”, and have a harder time following instructions when they are fed more extraneous text.

Needles in Haystacks. Ok, but why?

Recent experiments involving gpt-4-1106-preview have focused on search retrieval performance, such as matching a key or keyword to a specific entry within the context, along with some question-answering benchmarks. One study of question-answering sets showed markedly better retrieval over the first 8,000 tokens of the context window compared to vanilla gpt-4. I think this result is very interesting, and very encouraging.

Gregory Kamradt demonstrated that keyword retrieval performance dipped as the number of input tokens became very large and the desired text was located in the last 75% of the context window. Similarly, Jerry Liu of LlamaIndex experimented by injecting an irrelevant sentence ("Jerry likes hot Cheetoes") into the middle of SEC 10-K filings at various positions. While these studies are illuminating, and I love that they were performed, they are also frustrating in that none of them present use cases or scenarios anyone would actually use an LLM for. They demonstrate a behavior that an LLM can exhibit, but I do not believe they show how an LLM performs on any likely use case.

My intuition is that when you ask a model to perform a linguistically aberrant task, you could run into a problem where the model starts to fall out of alignment. Failure to provide the correct answer could be because the particular sentiment could not be found (whatever that means), but it could also be due to the LLM not cooperating with the nonsensical task as prompted. Even when grounded with text sources, LLMs can still rely on parametric learning to some degree in their answers. Therefore, an unrelated text injection or a very brief question/answer pairing may not actually measure recall performance, because the LLM doesn't think you asked it to return the injected sentence at all.

None of these studies simulated an end-to-end or complex performance task, such as multi-question summarization or text comparison. For example, Liu et al.'s "Lost in the Middle" paper had LLMs performing key/value matching based on hash keys, which are unique alphanumeric strings with essentially no semantic properties. It's fascinating that LLMs can match these values at all, and that when they do, they exhibit U-shaped in-context retrieval performance. However, LLMs were not designed to perform this task, and you would never use an LLM for this or a similar task in production.

In other words, you haven't determined whether the LLM can perform question answering by identifying semantic relationships within 100 pages of context. You've determined whether it will identify a synthetic sequence of unrelated tokens injected into the middle of a larger document, which I consider a linguistically aberrant task.

I believe my intuitions were corroborated recently with #Anthropic's release of #Claude 2.1 and its 200k context window. Gregory Kamradt performed an exhaustive survey (over $1,000 worth of API calls!) across varying lengths of Paul Graham essays, into which the following sentence about San Francisco was inserted:

“The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”        

Then, he prompted Claude 2.1 with:

"What is the most fun thing to do in San Francisco?"?        

Finally, he used a LangChain eval module to score the results. There are gradations of scoring success. Presumably, a good answer includes some sentiment that mentions eating a sandwich in Dolores Park on a sunny day. Maybe Claude only got partial credit if it left out the sandwich.

Again, Kamradt's test suite looks like a non-sequitur sentence-injection retrieval task. I think it's a reasonable cross-LLM benchmarking task for understanding how Claude 2.1 performs next to gpt-4-1106-preview, but it doesn't provide insight into real-world performance.

One observation is that the test injects a sentence about the best thing to do in San Francisco, but then asks Claude to identify the most fun thing. Claude 2.1 and GPT-4 seemed to succeed a fair percentage of the time, but these are not the same question! I assume Kamradt understood this close-but-not-exact relationship between the phrases and wanted the LLM to perform the semantic comparison. I also hope he took into account the content of Graham's essays: given Graham's connection to YC and various valley firms, it could confound the test if some of the essays mentioned activities that took place in San Francisco.

The Anthropic team appears to agree with my hunches in their response to Kamradt's results. They replicated the results by taking a sample of a Congressional appropriations bill, inserting the sentence "Declare November 21st 'National Needle Hunting Day'", and then asking Claude 2.1, "What day is 'National Needle Hunting Day'?"

Here, you can see that Claude declines to give the answer, not because it couldn't locate the injected sentence (in this test image from Anthropic, it appears "National Needle Hunting Day" was May 23rd, not Nov 21st), but because it doubted the veracity of the injected statement relative to its parametric data. Similarly, these Needle-in-a-Haystack tests could fail to produce meaningful results simply because the tasks are linguistically nonsensical, meaning that, to the LLM, retrieving the sentence isn't the most likely correct answer.

Better Prompting Yields Better Results

(On Anthropic Prompt Blobs: Below are two examples of Claude Prompt Blobs. If you're not familiar with their API call structure, these correspond to the "messages" component of an OpenAI API call. Instead of a list of dictionaries, the chat history is maintained as one giant string blob that passes the conversation back and forth between "\n\nHuman:" and "\n\nAssistant:" prompt delimiters. With Claude 2.1, Anthropic added that text appearing before the first "\n\nHuman:" delimiter is treated as the equivalent of OpenAI's system message.)
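For anyone who has never assembled one, here is a minimal sketch of such a blob in Python; the system text, document, and question are placeholder values of my own, and only the delimiters and their ordering reflect Anthropic's documented format:

    # Minimal sketch of a Claude 2.1 prompt blob (placeholder content).
    # Text before the first "\n\nHuman:" behaves like an OpenAI system message,
    # and the blob ends with the "\n\nAssistant:" turn for Claude to complete.
    system_text = "You are a helpful and experienced research attorney."
    document = "<long document text goes here>"
    question = "<your question goes here>"

    prompt = (
        f"{system_text}"
        f"\n\nHuman: Here are the Documents:\n{document}\n\nQuery: {question}"
        f"\n\nAssistant:"
    )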

Finally, Anthropic’s team demonstrated that they could improve Claude’s performance from 27% to 98% on the same test set when the prompt is modified to include one additional instruction. That instruction consisted of priming the “Assistant” prompt with:

Assistant: Here is the most relevant sentence in the context:

This directs Claude 2.1 to focus its effort on returning a best match from its context, as opposed to determining some canonical truth about what is "best" or "most fun" in San Francisco. This prompting technique looks more like a Text Completion prompt than the usual Chat Completion style we are used to. Anthropic refers to it as "Put words in Claude's mouth", and they recommend this method to further steer the model into conforming its output to a certain style or approach, such as avoiding "chatty" intros. It is also used in Anthropic's recommended prompting method for getting Claude to work step-by-step, by fabricating a request from Claude to you, and then granting it. I still find it weird.

"Yes, please do." It never hurts to be polite. But it will cost you a few more input tokens.

OpenAI models can be prompted in this fashion as well, either by appending an "Assistant: <text>" or "Answer: <text>" line at the end of a User or System chat entry, which has a strong influence on how the model will answer, or by "putting words in GPT's mouth": appending a {"role": "assistant", "content": <your message>} entry that conforms to your desired answer format to the messages portion of your API call. Neither method guarantees strict adherence to the provided text, but both are useful techniques for increasing the initial consistency of your model output.
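As a minimal sketch using the v1 Python SDK (the message contents are placeholders, and as noted, the model is not guaranteed to continue exactly from the seeded text):

    # Sketch: seeding an assistant turn to steer the answer format.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "You are a helpful research attorney."},
            {"role": "user", "content": "<documents and query go here>"},
            # "Put words in GPT's mouth": the model tends to continue in this style.
            {"role": "assistant", "content": "Here is the most relevant sentence in the context:"},
        ],
    )
    print(response.choices[0].message.content)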

RoPE and Positional Interpolation

I'll try not to spend too much time discussing this, because I really don't understand it well at all, and it may or may not be relevant to gpt-4-turbo, since we know nothing about its underlying architecture. However, I want to point out that there have been a number of recent advances in context window management, such as Positional Interpolation, that vastly increase the number of input tokens a model can process without retraining the model on increasingly large context windows (which is costly and difficult). Techniques like this may explain why we see very large increases in the total context window without corresponding increases in the maximum output tokens. For example, gpt-4-1106-preview has a max_tokens value of 4,096, as opposed to tens of thousands or more.
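For the curious, the core idea can be sketched in a few lines. This is a toy illustration of the published Positional Interpolation technique, not a claim about how gpt-4-turbo actually works:

    # Toy sketch of positional interpolation: squeeze positions in a long input
    # into the position range the model saw during training, rather than asking
    # it to extrapolate beyond that range. Not a claim about gpt-4-turbo internals.
    def interpolate_positions(seq_len, trained_ctx=4096):
        scale = min(1.0, trained_ctx / seq_len)
        return [m * scale for m in range(seq_len)]

    # With a 16,384-token input and a 4,096-token trained window,
    # position 8,000 is rescaled to 2,000.0 before the positional encoding is applied.
    positions = interpolate_positions(16_384)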

In general, you wouldn't want an LLM to generate tens of thousands of tokens of text to start with, and much of the output length is determined by things like RLHF and fine-tuning, so that models like Claude 2 and GPT-4 output 1-2k tokens even though their maximum output tokens would allow more.

Ok, back to something I more or less understand: evaluations.

LLM Evaluation is Simple

When given a human-like, domain-specific reasoning task, evaluating LLM output need not be complicated: you can evaluate it as you would a human being. Having said that, have you ever disagreed with a teacher's or manager's evaluation of your work? Or been on the other side of that equation? Oh, wait…

LLM Evaluation is Actually Pretty Hard

LLM performance evaluation for human-like tasks is difficult because human performance evaluation is also difficult. The subjectivity of language, the imprecision of delivering instructions, and the discretion as to what constitutes successful vs. unsuccessful completion all contribute to the challenge. If you have experience teaching or training, or are otherwise familiar with systematic grading, congrats! Those skills apply here. Let's break it down:

  • Define the task you have assigned the LLM, and know how you would perform it yourself. If this requires domain expertise, make sure you have it. Then, have a good understanding of how a human would perform the task, and what their work product would look like.
  • Set a benchmark for human performance, given an ordinary level of diligence, time, competency, and resources. Ideally, perform the task yourself, while constructing an answer key and rubric for measuring performance.
  • Include both qualitative measures (for things like style, tone), and quantitative measures (accuracy, completeness) of performance, where relevant.

This is how I define a grading methodology for legal summarization tasks.

Now, create your test methodology, run your experiment, then get to grading your outputs as if you just spawned a classroom full of alien-savant text-processing robots. Wait, what? That sounds hard and incredibly time-consuming? You're right, it is, or at least it certainly can be.

Test Methodology

I took some inspiration from the form of previous gpt-4 experiments and benchmarks by formulating a number of legal summarization and question-answering tasks over long legal documents. Due to time constraints, I've only managed to evaluate one set so far.

My tests were conducted by loading a long PDF document into my GPT-4 Turbo Long Context Analyzer, then asking the same question of the PDF by itself and with 20k, 40k, and 69-80k additional tokens added to the context window in one of three positions (a simplified construction sketch follows the list):

  • All at the beginning of the context window (after the system message and question, but before the target document);
  • Evenly distributed before and after the target document;
  • All after the target document.
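Here is the simplified construction sketch mentioned above. The function and variable names are my own, not the Analyzer's actual code; padding is extraneous text of the desired token length and target is the document under test:

    # My own simplified sketch of assembling the three padding positions.
    def build_context(target: str, padding: str, position: str) -> str:
        if position == "before":
            return padding + "\n\n" + target
        if position == "split":
            half = len(padding) // 2
            return padding[:half] + "\n\n" + target + "\n\n" + padding[half:]
        if position == "after":
            return target + "\n\n" + padding
        raise ValueError(f"unknown position: {position}")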

In each case, the target document was kept intact, and the questions were regarding information found within that document. This was meant to simulate a scenario where a user has placed several long documents into GPT-4’s context, then wished to complete some analytical task over them. My first evaluation utilized the Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith Opinion (just over 43,000 tokens). Like most Supreme Court Opinions, this document has multiple distinct sections: a Syllabus, Majority Opinion, Concurrence, and Dissent. I prompted GPT-4-turbo with the following system prompt:

You are a helpful and experienced research attorney. Analyze the Documents and answer the Query below using only information found in the Documents. Be thorough and careful, and answer as completely as you can. Do not make up facts for your answer.

If an answer is not provided within the Documents then respond that the answer could not be found.

Query:
What other copyright decisions did the majority cite in its opinion, and for what reasons? Answer with every supreme court opinion that the Majority relies on in the opinion.        

This prompting is purposefully not super-polished, but the goal here is that GPT does the following:

  • Returns results only from the Opinion, so it treats the Warhol decision as a multi-document source.
  • Finds citations to SCOTUS cases by the Majority.
  • Only returns Copyright cases (realistically GPT doesn’t have good ways to do this)
  • Identifies what support or conclusions the Majority draws from the cited case.

My full results are here.

Human-written sample.

I started by creating a human-written sample by reviewing the opinion. I identified seven SCOTUS copyright cases relied on by the Majority, each with a brief explanation of the role that case played in the opinion. My answer acted as a baseline for evaluation along four review categories (a minimal scoring-record sketch follows the list):

  • Completeness - Whether or not the cases were found. Completeness drops when the output is missing cases.
  • Accuracy - Whether assertions are correct under the original document. Accuracy drops when the summaries contain incorrect statements.
  • Relevance - Whether the assertions address the Prompt. Relevance drops when additional correct statements are included that do not answer the prompt.
  • Specificity - Whether the assertions contain distinctive details. Score drops when descriptions become overly generalized.

Soylent Green is People

Extra Tokens Placed Before the Target Document

The first surprise was that my initial "Target Document Only" generation had very poor Completeness. I didn't think my prompt was that bad. I actually tried a couple of prompt variations and didn't see much improvement. But I figured this was more of a "baseline" result, and that results would steadily degrade as additional tokens were added to the chat history. Very strangely, recall shot up after 20,000 tokens were added to the prompt. I'll return to this topic in the next section.

More understandable is that Accuracy, Relevance, and Specificity all decline as total tokens increase. This is to be expected, as the task becomes more taxing and the model's ability to generate precise outputs grounded in the provided text degrades.

A special note about Sony Corp. of America v. Universal City Studios: although GPT-4 consistently identified this as a Supreme Court opinion, due to its placement at the end of a page, followed by lengthy footnotes and some unfortunate PDF-to-text encoding, in virtually all cases GPT could not actually determine why this case was cited, so it usually made up some imprecise answer.

Tokens Placed Before and After the Target Document

At 82k, things get Lost in the Middle

Having tokens on either side of the target document appears to be even more taxing on the LLM. There seems to be some "disinhibitory effect" to making the environment more challenging, causing the model to discriminate less and produce more answers. At 62k, completeness shoots up, but relevance goes down. This is due to the model returning a ton of circuit court citations from the footnotes and other parts of the opinion. It is also pulling facts from the footnotes and the concurrence, which is not part of the Majority opinion.

It is possible that the 42,000-token "document only" query is actually doing the best job of following the instructions, because it may only be returning Supreme Court cases that are discussed in such a way that it is evident they are in fact copyright cases. After all, GPT has no way of actually knowing what most of these cases are about if the cited case isn't discussed in detail. I should probably go back and review the results from this perspective, as I relied on my own parametric data in knowing which of these cited cases are about copyright.

Notably, at 82,000 tokens gpt-4 fails to even identify the Warhol decision in its context. I ran this query twice, checked the response object and my code twice, and everything appears to have run correctly. So it was even weirder when, at 122,000 tokens, the answer was once again comparable to the 62,000-token result. Go figure. I don't have enough data points here (e.g. different token combinations and positions) to have any clue what is going on, except that the model behavior is very weird!

Extra Tokens Placed After the Target Document

This behavior is perhaps more predictable in that performance degrades as token count increases, until there is another failure to find the target document when 111,000 tokens are sent in the prompt. Due to the limitations of my Analyzer, I could only test an additional 69,000 tokens after the target document. At some point, I may try adding another document to the context to get closer to the 128,000 total token limit, particularly if I test more positional combinations or a shorter target document.

Note that in this 122,000-token output, GPT-4 returns United States v. Detroit Timber & Lumber Co., the opinion stating that the syllabus (headnotes) is not part of the Opinion. It does reinforce the trend that as context length increases, the model offers up more and more output, and is seemingly less adherent to the prompt.

Conclusions

I'm glad that I started testing gpt-4-turbo on my own, because I didn't think the "Needle in a Haystack" in-context retrieval results were very meaningful. They don't resemble the training data; they don't resemble useful LLM applications. I'm not confident those results can be extrapolated into an understanding of LLM performance on any real-world task. Using my own more complex prompts, along with evaluating manually, has given me an entirely different perspective on how performance relates to input token length, and it has led me to conclusions that the previous experiments never would have.

I think these results are pretty surprising, and overall I'm quite impressed with GPT-4's ability to provide meaningful results even as the input exceeds 100,000 tokens. You can see a substantial loss of relevance and specificity, which looks like the model losing the ability to keep concepts distinct, with answers losing granularity. Still, I thought the performance would be much worse, with far more failures above 80,000 input tokens.

I don’t have enough data points in terms of token length and positions to understand if there is a “shape” to where the in-context performance dips. But, these results really look nothing like previous test results in that I don't believe the findings from those previous experiments would have predicted these results. That's pretty cool.

"Document in the Middle" results were erratic and unpredictable, and require more investigation. Having gpt-4 fail catastrophically at 82,000 tokens but produce an answer at 122,000 is not possible to understand given my limited testing. I expected GPT to perform consistently regardless of tokens added after the target document. I was incorrect: quality drops significantly until a second catastrophic failure is encountered.
