LLMs and Contract Intelligence - How Many "GPT4-level" models are there?

Like a Holiday Miracle, towards the end of 2024, we saw a flurry of LLMs released from major developers that all boasted “near-SOTA” performance - each accompanied by a spectacle of impressive LLM benchmark results. Q4 updates from OpenAI, Mistral AI, Meta, Anthropic, Google DeepMind, and Amazon Web Services (AWS) all entered the fray with updated models whose performance supposedly rivals GPT-4o. But do they deliver?

LLM Benchmarks are still not your tasks

Along with plenty of others, I’ve written that LLM benchmarks are not accurate representations of any real-world tasks we expect an LLM to do. They provide some heuristic for understanding changes to LLMs over time, but the best predictor of whether LLM behavior has changed or improved is performing a real-world task with quantitative scoring. This is more costly and time-consuming than running a bunch of MMLU questions through an LLM, but it also tells you much more about LLM capabilities: specifically, how models perform on your task, and what their weaknesses and failure cases look like.

Preference Optimization is Getting Out of Hand

The world of generative AI seemingly moves at a blistering pace. Contributing to that sense of whirlwind change is that each week, a new SOTA model appears that sets record-breaking benchmark scores or reaches the top of the Chatbot Arena Leaderboard. What this suggests to me is that model developers are getting better at post-training techniques like RLHF and Direct Preference Optimization, allowing them to improve model alignment and ace benchmarks more efficiently. Aside from flashy announcements, what does this mean for real-world performance? Let’s find out with a real-world task!

What’s Changed in Meta’s Terms of Service? Let’s Ask Ten LLMs

Meta is updating their Terms of Service for Facebook and other Meta Products. This is great, because it gives us two versions of a Terms of Service agreement that can be compared for changes, where the updated version is too new to be included in web crawler-based training datasets. (I’m not overly concerned with the effect of both Terms of Service being in a model’s pre-training dataset, but if they were included in some post-training fine-tuning sets, then it might unfairly distort the results. Therefore, having at least one of the documents in our benchmark fall outside of the training datasets for every LLM is appealing.)

Using the two versions of the ToS, I asked each LLM to produce a four-part Report that highlights the changes between the two documents, focusing on changes to the service that are relevant to Facebook users. This tests a number of LLM capabilities simultaneously:

  • The ability to evaluate two medium-length documents in a single LLM call (about 5-6k tokens per document, and about 12,100 input tokens overall as measured by OpenAI’s tiktoken library with the GPT-4 encoding; a token-counting sketch follows this list).
  • Recognizing changes in contract language between the two documents, and explaining the legal relevance of those changes correctly.
  • Prioritizing changes that are substantial and meaningful to social media end users.
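
For reference, here is a minimal sketch of how the input size can be measured, assuming the two ToS documents are saved locally as plain text (the file names are hypothetical, not the files used in the benchmark):

```python
import tiktoken

# The GPT-4 encoding (cl100k_base), used here to count input tokens.
enc = tiktoken.encoding_for_model("gpt-4")

# Hypothetical local copies of the two agreements.
with open("meta_tos_previous.txt") as f:
    previous_tos = f.read()
with open("meta_tos_revised.txt") as f:
    revised_tos = f.read()

for name, text in [("previous", previous_tos), ("revised", revised_tos)]:
    print(f"{name}: {len(enc.encode(text))} tokens")  # roughly 5-6k tokens each
```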


Test Setup

I used the previous Meta Terms of Service and the revised terms to create a single-task benchmark for comparing the latest LLMs from OpenAI, Google, Mistral, Meta, Anthropic, and Amazon. I mainly tested high-parameter-count models from each vendor, including experimental models and recently updated model variants. Gemini-2.0-flash was included because Google made strong claims about its performance exceeding that of Gemini-1.5-pro.

Prompt templates were used to provide each model with a system message explaining the task generally, followed by a user message containing both ToS documents and detailed instructions for returning a four-part report, as sketched below.
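
The exact prompt wording isn’t reproduced here; the structure was roughly as follows (the message text is illustrative placeholder content, not the benchmark prompts):

```python
SYSTEM_MESSAGE = (
    "You are a legal analyst. Compare two versions of a Terms of Service "
    "agreement and report on the changes relevant to end users."
)

USER_TEMPLATE = """PREVIOUS TERMS OF SERVICE:
{previous_tos}

REVISED TERMS OF SERVICE:
{revised_tos}

Write a four-part report with these sections:
1. Summary
2. Key Changes
3. Additional Changes
4. Recommendations
Focus on changes that matter to Facebook users, especially data rights and privacy."""


def build_messages(previous_tos: str, revised_tos: str) -> list[dict]:
    """Assemble the system + user messages described above."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {
            "role": "user",
            "content": USER_TEMPLATE.format(
                previous_tos=previous_tos, revised_tos=revised_tos
            ),
        },
    ]
```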


Model Configurations

All models were run using their native SDKs with temperature set to 0, and all other generation hyperparameters (top-p sampling, top-k) were left undefined so that defaults applied. Requests were repeated two to three times to test for answer stability. All models were queried from their original developers’ services (Google models via the GenerativeAI endpoint), with the exception of the Meta models, where requests were sent to Together.ai; this means some quantization is being applied, which Together claims has no effect on performance.
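
As a concrete illustration, here is roughly what a single request looked like for the OpenAI models (a sketch under those settings, not the exact harness; the other vendors’ native SDKs were called analogously):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_report(messages: list[dict], model: str = "gpt-4o-2024-11-20") -> str:
    """One benchmark request: temperature 0, all other sampling knobs left at SDK defaults."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        # top_p and other sampling parameters intentionally omitted -> defaults apply
    )
    return response.choices[0].message.content
```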

Grading used the following rubric and a model answer generated using my brain and ChatGPT (a sketch of how category scores aggregate follows the rubric):

Rubric

3 - Accurate, not misleading

2 - Mostly accurate and some inaccurate content

1 - Some accurate, and mostly inaccurate content

0 - Only inaccurate content

Modifiers of +/- 0.5 were applied for exceptionally good or poor answers.
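
For clarity, each of the four report sections is scored against the rubric and the results are summed, so a perfect report totals 12 points (consistent with the aggregate chart below). A small sketch, with hypothetical example scores:

```python
CATEGORIES = ["Summary", "Key Changes", "Additional Changes", "Recommendations"]


def aggregate_score(scores: dict[str, float]) -> float:
    """Sum the per-category rubric scores (0-3, plus any +/-0.5 modifier)."""
    return sum(scores[category] for category in CATEGORIES)


# Hypothetical example, not actual benchmark results:
example = {"Summary": 3.0, "Key Changes": 2.5, "Additional Changes": 3.0, "Recommendations": 3.0}
print(aggregate_score(example))  # 11.5
```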

Model Answer


Results!

Aggregate Scores by model. GPT-4o leads with 12 points, followed by Gemini-2-flash and Claude 3.5 Sonnet (new).


Scores by individual Report Category. Note that Mistral Large and Llama 3.1 70B both appear to be missing "Key Changes" scores. They received zeros in this category because they contained no factually correct assertions.

Google Gemini Models

Gemini 1.5 Pro 002

Overall, it scored very poorly, demonstrating a poor understanding of contract term significance to service delivery, and including unchanged terms as key changes. This is consistent with my previous findings that present Gemini models are not that useful because they quickly lose performance as context windows grow (well before the 1M or 2M token limits).

  • Summary - Factual and free of mischaracterizations, but lacked specificity as to changes in terms.
  • Key Changes - Mistakenly lists “Explicit formation” as a new clause/requirement in the revised terms. (Obviously) both Terms require agreement, and contain identical formation clauses. Overstates significance of Data Collection and sales/licensing clauses. Includes minor changes as key changes.
  • Additional Changes - Mistakes web page formatting as part of the agreement language, and then mentions “minor language changes” where the substantive meaning hasn’t changed.
  • Recommendations - Doesn’t mention AI / Avatar supplemental terms, misunderstands significance of several changes.

Gemini-exp-1206


Gemini-Exp-1206 is currently at the top of the LMSys Chatbot Arena

Near-astonishing levels of improvement over 1.5 Pro 002. Notably, Google DeepMind doesn’t mention which base model (Flash or Pro) exp-1206 is based on. Given inference speeds, I assume it is a Pro model.

Note: Gemini-exp-1206 reproduced portions of the prompt in its answers because answer descriptions were embedded in the example report format. I didn’t count off for this, as I believe it’s a quirk that can easily be fixed with some prompt optimization.

It showed laser-like precision in extracting changed contract language, and included legally accurate descriptions of those changes.

  • Summary - Correctly reports that “core principles remain largely consistent” and most changes are more detailed explanations.
  • Key Changes - Correctly identifies new and refined terms related to automated data collection, circumvention attempts, and Meta’s use of fact-checkers. Overstates the significance of automated access and data collection prohibitions.
  • Additional Changes - Describes AI and Avatar term changes correctly, but identifies them as minor changes. Mistakenly lists Termination Survival clauses as a change from the previous agreement. Correctly identifies Data Sales clause as having changed, but doesn’t identify the actual change from previous terms.
  • Recommendations - Doesn’t mention AI / Avatar supplemental terms; generic user recommendations based mostly on terms that did not change.

Gemini-2.0-flash-exp

Second Place Overall, and receives the award for “Model Punching Most Above its Weight!” I don’t know how to interpret this result, other than that model parameter count doesn’t mean what I think it should mean. Very impressive.

  • Summary - Relevant details and correct substantive characterization of the changes, and devoid of mischaracterizations.
  • Key Changes - Mistakenly lists “Explicit formation” as a new clause/requirement in the revised terms. Correctly explains Automated Collection changes. Identifies and explains many smaller changes with great accuracy, although many are misclassified as Key Changes.
  • Additional Changes - Correctly describes fact-checking changes, but otherwise covered all changes in the Key Changes.
  • Recommendations - Reintroduces the Formation error, but mostly accurate and sound advice for users concerned with data and privacy.


OpenAI

OpenAI GPT-4o-2024-11-20

First Place Overall. Most comprehensive in correctly identifying changes between the two documents, and correctly describing the effects of those changes without mistakes or mischaracterizations.

  • Summary - Comprehensive and accurate, well written, and absent mischaracterizations.
  • Key Changes - Accurately describes and explains Avatar and Meta AI Supplemental agreement terms, Circumvention Restrictions, Data Collection, use of Fact Checking, and Termination obligation changes correctly. Mistakenly lists “investigation of illegal activity” as an exception to the 90 day data deletion window (same clause is present in the previous agreement).
  • Additional Changes - Correctly identifies change to “Feed” terminology and expanded scope of supplemental terms. Mentions both “User Feedback” and “Dispute Resolution” even though it recognizes that these terms have not changed. This might be over-indexing on the prompt to include information relevant to users concerned about data rights and privacy.
  • Recommendations - Excellent response. Provides guidance for understanding and complying with the terms in light of the identified changes.


Mistral AI

Mistral AI mistral-large-latest (mistral large 2)

source: https://mistral.ai/news/mistral-large-2407/

Displays an extremely high level of fidelity to prompt instructions with regard to formatting and including relevant passages from the Terms. In general, seeing this kind of steerability and instruction-following is very promising. Unfortunately, most answers were factually incorrect. I suspect there may be prompt-based optimizations that could improve this score, but that’s beyond the scope of today’s benchmark.

  • Summary - Mentions the Avatar Terms and Meta AI Terms, but otherwise primarily lists clauses that were not revised in the 2025 update.
  • Key Changes - All three mentioned changes, Content Deletion, Account Termination, and Dispute Resolution, are unchanged between the two documents. The extracted text incorrectly indicates changes through selective editing.
  • Additional Changes - Mentions Avatar Terms and Meta AI Terms, and then formatting changes. Missing other minor changes in the Revised Terms.
  • Recommendations - Correct in light of the changes identified, and correctly lists Avatar and AI terms changes. Doesn’t include recommended steps or specific concerns.


Meta Llama Models

Meta Llama 3.1 70B (Together: Meta-Llama-3.1-70B-Instruct-Turbo)

Reviewed to provide a “baseline” for Llama 3.3 70B. Overall, one of the worst-performing models, and while I would like to lay the blame on parameter count, the 405B variant also performed poorly, and Gemini-2.0-flash outperformed nearly everything.

  • Summary - Extremely generic and vague. Doesn’t provide any sense of what clauses have changed.
  • Key Changes - All three mentioned changes, Intellectual Property Rights, Account Suspension and Deletion, and Dispute Resolution, are incorrect and unchanged from the Previous agreement.
  • Additional Changes - All three changes, use of Fact-Checking, Avatar Terms, and Meta AI Terms, are factually correct changes in the Revised Terms.
  • Recommendations - Appears to be generic advice, much of it unrelated to the actual concerns raised by the changes identified.

Meta Llama 3.3 70B (Together: Llama-3.3-70B-Instruct-Turbo)

The release of Llama 3.3 inspired this entire review: to see whether Llama 405B or “GPT-4” levels of intelligence are now available in a 70B parameter model.

Receives the Clown Award for Most Over-Hyped and Underperforming Model in the Benchmark. On top of displaying comparatively low fidelity and performance in contract understanding, Llama 3.3 70B exhibited high levels of answer instability across four repeated requests. It was the only model to exhibit this during my experimentation.

  • Summary - Vague and does not reference actual changes in the Terms. Mentions “improve user experience” and “comply with evolving legal requirements”, but there is no indication of changes related to either of these in the remainder of the Report.
  • Key Changes - Mostly accurate changes regarding Automated Collection, Account Termination, and Supplemental Terms (Avatars and Meta AI). Incorrectly mentions Termination Survival clause as a change to the Revised Terms.
  • Additional Changes - Correctly identifies use of Fact Checkers as a change. Repeats some of the previously mentioned changes from Key Changes. Vague mention of changes to Automated Data Collection and Account Suspension, actual changes not mentioned. Incorrectly mentions Feedback usage as changed in the Revised Terms.
  • Recommendations - Extremely generic, and doesn’t seem to reference the identified changed terms.

Meta Llama 3.1 405B (Together: Meta-Llama-3.1-405B-Instruct-Turbo)

  • Summary - Miscategorizes the Revised Terms as containing “many significant” changes, then lists many very minor changes.
  • Key Changes - Lists changes to the Advertising, Content Licensing, and Dispute Resolution clauses as significant, when all three contain only minor clarifications in the Revised Terms.
  • Additional Changes - Correctly identifies Avatar and Meta AI supplemental terms. Incorrectly identifies Deleting User Content and Accounts, and User Responsibilities for Reporting violative content, as having changed in the Revised Terms.
  • Recommendations - Well-written recommendations for users in light of the identified changes.


Anthropic Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)

Third Place Overall. I gave the medal to Claude over Gemini-exp-1206, which had the same score, because Claude’s explanations contained fewer mistakes and mischaracterizations of change significance.

  • Summary - Miscategorizes the Revised Terms as a “significant update” when the changes are mostly minor clarifications. Refers to changes to control over user content and platform access, which remain largely the same.
  • Key Changes - Accurately and precisely identifies changed contract terms, with some mischaracterization of their significance. Doesn’t explain why Expanded Product coverage changes Meta’s control over user content (it just uses the same agreement in more places), and mischaracterizes changes to Automated Access as closing a previous loophole (the conduct was prohibited previously as well).
  • Additional Changes - Correctly mentions changes for Fact Checking and Supplemental Terms. Somewhat vague references to “Technical Specification” and “Privacy Control” changes, and I’m not certain what this refers to.
  • Recommendations - Accurate in light of the identified changes and provides sensible guidelines for users.


Amazon Nova Pro (us.amazon.nova-pro-v1:0)

Amazon Bedrock’s newly released LLM, boasting performance that rivals GPT-4 on some benchmarks. Also of note, Pro may not be a particularly high-parameter model, given that Nova Premier, billed as the most capable model for complex reasoning tasks, is coming in early 2025.

  • Summary - Vague description of changes that do not obviously relate to the identified changes.
  • Key Changes - Accurately describes minor changes and clarifications. Includes quoted language that has changed in the Revised document. Mistakenly identifies the Notifications clause, which has not changed in the Revised Terms.
  • Additional Changes - Correctly identifies Supplemental Agreement Terms. Misidentifies both User Name Change policy and Feedback Usage as changed in the Revised Terms.
  • Recommendations - Accurate in light of the perceived changes, and generally well-written guidance.


Conclusions


Despite the seeming convergence in benchmark performance, it’s still very easy to find differences in real-world performance. This indicates Benchmark Saturation - the LLMs are now all too good at the benchmarks we throw at them, and the results no longer necessarily track with model performance improvements.

GPT-4o remains at the top of this Leaderboard for Contract Understanding. It achieved this score by precisely identifying only terms that had changes between the two documents, and accurately relating their significance within the context of a Terms of Service revision. I think 4o still demonstrates a distinct edge in contract language understanding compared to its peers, particularly when minimally prompted.

The Gemini family is looking great. Both recent experimental models from the folks over at DeepMind performed very well, and demonstrated high levels of steerability and contract understanding.

For all of the rapid releases and impressive benchmarks, the Llama 3 family disappointed. In this one narrow task, Llama 3.3 70B did achieve the performance of Llama 3.1 405B, but that’s because both were bad.

Daniel Price

Attorney at Roedel Parsons, Co-Founder at LexMagic.ai

3 months ago

Please update when o1 hits the api! I have a tier 5 account if you need access.

Leonard Park

Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs

3 months ago

more charts

Leonard Park

Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs

3 months ago

Here are some charts courtesy of ChatGPT's analysis tool

Jungmee Lee

Product Strategy @ eBrevia | Lawyer Admitted in New York & New Jersey

3 months ago

This is quite interesting! During my repeat testing focused on GPT-4o and Sonnet 3.5, I noticed GPT-4o showed variation in results across repeated runs (same document, run 2 or 3), which gave me pause regarding legal consistency and accuracy (e.g., shall v. may). Additionally, from a user-experience standpoint, it's better if the LLM's results maintain the same general wording/focus for the same document and task, even on a repeat request, since human lawyers tend to do that. I didn't see that column on the chart but would love to know if your experiment covered it across the various models!

Aron Ahmadia

Senior Director, Applied Science at Relativity

3 months ago

Leonard Park - any explanation or hypothesis why Llama does so well on the LegalBench benchmark but seems to completely stumble in real world tests? Is the benchmark bad? Compromised? Are YOUR benchmarks bad ???

