LLMs and Contract Intelligence - How Many "GPT-4-level" models are there?
Leonard Park
Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs
Like a holiday miracle, towards the end of 2024 we saw a flurry of LLM releases from major developers, all boasting “near-SOTA” performance and each accompanied by a spectacle of impressive benchmark results. Q4 updates from OpenAI, Mistral AI, Meta, Anthropic, Google DeepMind, and Amazon Web Services (AWS) all entered the fray with models whose performance supposedly rivals GPT-4o. But do they deliver?
LLM Benchmarks are still not your tasks
Along with plenty of others, I’ve written that LLM benchmarks are not accurate representations of any real-world task we expect an LLM to do. They provide some heuristic for understanding changes to LLMs over time, but the best predictor of whether LLM behavior has changed or improved is performing a real-world task with quantitative scoring. This is more costly and time-consuming than running a bunch of MMLU questions through an LLM, but it also tells you much more about a model’s capabilities: specifically, how it performs, and what its weaknesses and failure cases look like.
Preference Optimization is Getting Out of Hand
The world of generative AI seemingly moves at a blistering pace. Contributing to that sense of whirlwind change is that each week, a new SOTA model appears that sets record-breaking benchmark scores, or reaches the top of the Chatbot Arena Leaderboard. What this suggests to me is that model developers are getting better at post-training techniques like RLHF and Direct Preference Optimization (DPO), allowing them to improve model alignment and ace benchmarks more efficiently. Aside from flashy announcements, what does this mean for real-world performance? Let’s find out with a real-world task!
What’s Changed in Meta’s Terms of Service? Let’s Ask Ten LLMs
Meta is updating their Terms of Service for Facebook and other Meta Products. This is great, because it gives us two versions of a Terms of Service agreement that can be compared for changes, where the updated version is too new to be included in web crawler-based training datasets. (I’m not overly concerned with the effect of both Terms of Service being in a model’s pre-training dataset, but if they were included in some post-training fine-tuning sets, then it might unfairly distort the results. Therefore, having at least one of the documents for our benchmark fall outside of the training datasets of every LLM is appealing.)
Using the two versions of the ToS, I asked each LLM to produce a four-part report that highlights the changes between the two documents, focusing on changes to the service that are relevant to Facebook users. This tests a number of LLM capabilities simultaneously.
Test Setup
I used the previous Meta Terms of Service and the revised terms to create a single-task benchmark for comparing the latest LLMs from OpenAI, Google, Mistral, Meta, Anthropic, and Amazon. I mainly tested high-parameter-count models from each vendor, including experimental models and recently updated model variants. Gemini-2.0-flash was included because Google made strong claims about its performance exceeding gemini-1.5-pro.
Prompt templates were used to provide each model with a system message explaining the task generally, followed by a user message containing both ToS documents and detailed instructions for returning a four-part report.
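For concreteness, here is a minimal sketch of how such a template can be assembled in Python. The wording, file names, and report-section names below (apart from "Key Changes") are placeholders of my own, not the exact prompts used in this test.

```python
# Sketch only: placeholder wording and file names, not the exact prompts used here.
from pathlib import Path

SYSTEM_MESSAGE = (
    "You are a legal analyst. You will be given two versions of a Terms of "
    "Service agreement and asked to report on how the terms have changed."
)

USER_TEMPLATE = """Below are two versions of Meta's Terms of Service.

<previous_terms>
{previous_terms}
</previous_terms>

<revised_terms>
{revised_terms}
</revised_terms>

Produce a four-part report covering changes relevant to Facebook users:
1. Executive Summary
2. Key Changes (quote the changed language from both versions)
3. Practical Impact on Users
4. Unchanged Terms of Note
"""

def build_messages(previous_path: str, revised_path: str) -> list[dict]:
    """Assemble the system + user messages for a single benchmark request."""
    user_content = USER_TEMPLATE.format(
        previous_terms=Path(previous_path).read_text(encoding="utf-8"),
        revised_terms=Path(revised_path).read_text(encoding="utf-8"),
    )
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_content},
    ]
```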
Model Configurations
All models were run using their native SDKs with temperature set to 0, and all other generation hyperparameters (top-p sampling, top-k) were left at their defaults by not defining them. Requests were repeated two to three times to test for answer stability. All models were queried from their original developers’ services (with Google models using the GenerativeAI endpoint), with the exception of the Meta models, where requests were sent to Together.ai; this means some degree of quantization was applied, which Together claims has no effect on performance.
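As an illustration, a single request looked roughly like the sketch below (shown here with OpenAI's Python client; the other vendors' SDKs differ in the details but were configured the same way, and the model name is just an example).

```python
# Sketch: one vendor's native SDK with temperature pinned to 0 and all other
# sampling parameters (top_p, top_k) left unset so service defaults apply.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_benchmark_request(messages: list[dict], model: str = "gpt-4o", runs: int = 3) -> list[str]:
    """Send the same request several times to check answer stability."""
    answers = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,  # top_p / top_k intentionally not set
        )
        answers.append(response.choices[0].message.content)
    return answers
```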
Grading used the following rubric and a model answer generated using my brain and ChatGPT:
Rubric
3 - Accurate, not misleading
2 - Mostly accurate, some inaccurate content
1 - Some accurate, mostly inaccurate content
0 - Only inaccurate content
Modifiers of +/- 0.5 were applied for exceptionally good or poor answers.
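Grading itself was done by hand against the model answer, but for readers who want to see how the rubric rolls up into the aggregate scores charted below, here is a minimal sketch. The category names other than "Key Changes" are placeholders, and the example numbers are made up.

```python
# Sketch: tallying rubric scores across the four report categories.
# Base scores are 0-3 per the rubric above; an optional +/-0.5 modifier is
# applied for exceptionally good or poor answers.

def category_score(base: int, modifier: float = 0.0) -> float:
    """Combine a 0-3 rubric score with an optional +/-0.5 modifier."""
    assert base in (0, 1, 2, 3), "rubric base score must be 0-3"
    assert modifier in (-0.5, 0.0, 0.5), "modifier must be -0.5, 0, or +0.5"
    return base + modifier

def aggregate_score(category_scores: dict[str, float]) -> float:
    """Sum the per-category scores into a model's aggregate score."""
    return sum(category_scores.values())

# Hypothetical example (not real benchmark data):
example = {
    "Key Changes": category_score(3),
    "Summary": category_score(2, +0.5),
    "User Impact": category_score(2),
    "Unchanged Terms": category_score(3, -0.5),
}
print(aggregate_score(example))  # -> 10.0
```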
Results!
Aggregate scores by model. GPT-4o leads with 12 points, followed by Gemini-2.0-flash and Claude 3.5 Sonnet (new).
Scores by individual report category. Note that Mistral Large and Llama 3.1 70B both appear to be missing "Key Changes" scores; they received zeros in this category because their answers contained no factually correct assertions.
Google Gemini Models
Gemini-1.5-pro-002: Scored very poorly overall, demonstrating a weak understanding of which contract terms matter to service delivery, and listing unchanged terms as key changes. This is consistent with my previous findings that current Gemini models are not that useful because they quickly lose performance as context length grows (at sizes nowhere near the 1M or 2M token limits).
Gemini-exp-1206: Near-astonishing levels of improvement over 1.5 Pro 002. Notably, Google DeepMind doesn’t say which base model (Flash or Pro) exp-1206 is built on; given its inference speed, I assume it is a Pro model.
Note: Gemini-exp-1206 reproduced portions of the prompt in its answers because answer descriptions were embedded in the example report format. I didn’t count off for this, as I believe it’s a quirk that could easily be fixed with some prompt optimization.
Gemini-2.0-flash: Laser-like precision in extracting changed contract language, paired with legally accurate descriptions of those changes.
Second Place Overall, and it receives the award for “Model Punching Most Above its Weight!” I don’t know how to interpret this result, other than that model parameter count doesn’t mean whatever I think it should mean. Very impressive.
OpenAI
GPT-4o: First Place Overall. The most comprehensive in correctly identifying changes between the two documents, and in correctly describing the effects of those changes without mistakes or mischaracterizations.
Mistral AI
Mistral Large: Displays an extremely high level of fidelity to the prompt instructions with regard to formatting, including quoting relevant passages from the Terms. In general, seeing this kind of steerability and instruction-following is very promising. Unfortunately, most answers were factually incorrect. I suspect there are prompt-based optimizations that might improve this score, but that’s beyond the scope of today’s benchmark.
Meta Llama Models
Llama 3.1 70B and 405B: Reviewed to provide a “baseline” for Llama 3.3 70B. Overall among the worst-performing models, and while I would like to lay the blame on parameter count, the 405B variant also performed poorly, and gemini-2.0-flash outperformed nearly everything.
Llama 3.3 70B: The release of Llama 3.3 is what inspired this entire review: to see whether Llama-405B or “GPT-4” levels of intelligence were now available in a 70B-parameter model.
It receives the Clown Award for Most Over-Hyped and Underperforming Model in the Benchmark. On top of displaying comparatively low fidelity and performance in contract understanding, Llama 3.3 70B exhibited high levels of answer instability across four repeated requests; it was the only model to do so during my experimentation.
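For anyone who wants to spot-check this kind of instability themselves, a crude but serviceable sketch is to compare repeated answers pairwise (my own illustration, not the method behind the scores above):

```python
# Sketch: rough pairwise similarity between repeated answers to the same request.
# Consistently low ratios across runs at temperature 0 suggest answer instability.
from difflib import SequenceMatcher
from itertools import combinations

def stability_report(answers: list[str]) -> list[float]:
    """Return pairwise similarity ratios (0.0-1.0) between repeated answers."""
    return [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(answers, 2)
    ]

# Hypothetical usage with four answers collected from repeated requests:
# print(stability_report([answer_1, answer_2, answer_3, answer_4]))
```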
Anthropic
Claude 3.5 Sonnet (new): Third Place Overall. I gave the medal to Claude over Gemini-exp-1206, which had the same score, because Claude's explanations contained fewer mistakes and mischaracterizations of change significance.
Amazon
Nova Pro: Amazon Bedrock’s newly released LLM, boasting performance that rivals GPT-4 on some benchmarks. Also of note, Pro may not be a particularly high-parameter model, given that Nova Premier, billed as the most capable model for complex reasoning tasks, is coming in early 2025.
Conclusions
Despite the seeming convergence in benchmark performance, it’s still very easy to find differences in real-world performance. This points to Benchmark Saturation - the LLMs are now all so good at the benchmarks we throw at them that the results no longer necessarily track with improvements in model capability.
GPT-4o remains at the top of this Leaderboard for Contract Understanding. It achieved this score by precisely identifying only terms that had changes between the two documents, and accurately relating their significance within the context of a Terms of Service revision. I think 4o still demonstrates a distinct edge in contract language understanding compared to its peers, particularly when minimally prompted.
The Gemini family is looking great. Both recent experimental models from the folks over at DeepMind performed very well, and demonstrated high levels of steerability and contract understanding.
For all of the rapid releases and impressive benchmarks, the Llama 3 family disappointed. In this one narrow task, Llama 3.3 70B did achieve the performance of Llama 3.1 405B, but only because both were bad.
Comments
Attorney at Roedel Parsons, Co-Founder at LexMagic.ai:
Please update when o1 hits the API! I have a Tier 5 account if you need access.
Leonard Park (author):
More charts
Leonard Park (author):
Here are some charts, courtesy of ChatGPT's analysis tool.
Product Strategy @ eBrevia | Lawyer Admitted in New York & New Jersey:
This is quite interesting! During my repeat testing focused on GPT-4o and Sonnet 3.5, I noticed GPT-4o showed variation in results on repeated runs of the same task (same document, run 2 or 3), which gave me pause regarding legal consistency and accuracy (e.g., shall v. may). Additionally, from a user-experience standpoint, it is better if the LLM's results maintain the same general wording and focus for the same document and same task even when the request is repeated, since human lawyers tend to do that. I didn't see that column on the chart, but would love to know if your experiment covered it across the various models!
Senior Director, Applied Science at Relativity:
Leonard Park - any explanation or hypothesis for why Llama does so well on the LegalBench benchmark but seems to completely stumble in real-world tests? Is the benchmark bad? Compromised? Are YOUR benchmarks bad?