LLMs and Contract Intelligence, Part II: Reasoning Models
Spring Blossom Energy


In the scant three months since my first article benchmarking contract intelligence, generative AI advancements have continued unabated. Like the spring crocus, numerous developments have popped up after a winter pause. OpenAI’s o1/o3 reasoning family of models, Google’s impressive Gemini 2 updates, and a dark horse in the form of DeepSeek R1 have upset, and updated, our understanding of LLM performance. Not to be left out, Anthropic joined the fracas with the recent release of Claude 3.7, boasting increased performance and “extended thinking”. We are diving once again into the same Contract Intelligence benchmark to answer: “Whose Reasoning Reigns Supreme?”

This first section focuses on the results themselves; a follow-up will discuss some observations regarding post-training and the two main schools of reasoning models (“Planners” vs. “Noodlers”).

The Contenders, "Reinforcement Learners" vs. "Unsupervised Learners"

Throughout this article, I'll refer to reasoning models as "Reinforcement Learners" and non-reasoning models as "Unsupervised Learners". I got tired of writing "thinking models" or "reasoning models" and then having the internal debate as to whether they actually reason or think, and I also did not like describing "non-reasoning" models in terms of behavior they lack.

OpenAI's recent articles indicate that the o1/o3 family of models derive their "reasoning" capabilities through a special kind of reinforcement learning, whereas the models that answer immediately, such as GPT-4.5 Preview, are the product of unsupervised learning. I'm pretty sure both types of models incorporate both types of training, but that's the basis for the naming.

For clarity, the Unsupervised Learners are:

  • Anthropic Claude 3.5 Sonnet November (as baseline)
  • Google Gemini 2.0 Pro Exp 02 05
  • Google Gemini 2.0 Flash
  • OpenAI GPT-4o-2024-11-20 (as baseline)
  • OpenAI GPT-4.5-preview

The Reinforcement Learners include:

  • Anthropic Claude 3.7 Sonnet
  • DeepSeek R1
  • Google Gemini 2.0 Flash Thinking Exp
  • OpenAI o1
  • OpenAI o3-mini

The Task and Benchmark

As a reminder, the chosen task for this benchmark requires comparing two versions of the Meta Facebook Terms of Service: July 2022 and January 2025. The system prompt sets the subject and tone, and asks for section references where possible.

system_message = """You are an expert commercial contracts analyst guiding Social Media users on their online rights. You analyze contracts and pay close attention to the substantive clauses, terms, definitions, rights, and obligations contained within them. Whenever possible, include in your answers references to sections and subsection numbering, or other references that describe what part of the source document the answer comes from."""        

The prompts instruct the LLM to answer in four parts:

tos_compare_user_message = """Previous Contract: {old_contract}

Revised Contract: {new_contract}

Instructions: You are comparing two different versions of the Meta Facebook agreement. Identify the \
differences between the Previous Contract and the Revised Contract, and return your answer in a well-organized Report.

Report Output format:

- Summary - a high-level description of the changes between the two agreements.

- Key Changes - the largest, most significant revisions found in the Revised Contract from a user experience \
or user rights perspective.

- Additional Considerations - less significant yet still noteworthy changes having some impact on the service.

- Recommendation - a conclusion of changes for users who are concerned with online rights and the scope of Facebook's \
agreement."""        

Documents at the Top

The prompt begins by inserting both the old and revised ToS using an f-string, so that the long documents appear first and the instructions follow below. There appears to be some consensus, both for model comprehension and for cost optimization (e.g., prefix caching), that placing static information at the top of one’s prompt is preferable (see Anthropic’s docs and Latent Space’s recommendations for OpenAI). This preference appears to extend to both Unsupervised Learners (non-thinking LLMs) and Reinforcement Learners (reasoning LLMs).
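For a sense of how the pieces fit together, here is a minimal sketch of the assembly and call, assuming an OpenAI-style chat client (the harness itself isn't published; the model name and document loading are illustrative):

from openai import OpenAI

client = OpenAI()

# Assumption: old_contract and new_contract hold the full text of each ToS version.
# The long, static documents are substituted first; the task instructions come last,
# which also maximizes prefix-cache hits across repeated runs.
prompt = tos_compare_user_message.format(
    old_contract=old_contract, new_contract=new_contract
)

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",  # swap in any benchmarked model
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)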

Formatting and Delimiters

Getting models to follow complex, multi-step instructions is easier with structure and formatting. I used newlines and dashes to further delineate background information from prompt instructions, and to reduce misalignment in the response. For more complex instructions, prompts may employ markdown or YAML for even more precise structuring, as in the hypothetical sketch below.
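As a purely hypothetical illustration (not a prompt from this benchmark), the same report spec could be expressed as YAML inside the prompt string, giving each section an unambiguous boundary:

# Hypothetical alternative: the report spec expressed as YAML inside the prompt.
structured_instructions = """
report_format:
  - section: Summary
    content: A high-level description of the changes between the two agreements.
  - section: Key Changes
    content: The largest, most significant revisions, from a user rights perspective.
  - section: Additional Considerations
    content: Less significant yet still noteworthy changes impacting the service.
  - section: Recommendation
    content: Conclusions for users concerned with online rights.
"""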

Revised Scoring Rubric

I switched to a scoring rubric that starts with a base score, then deducts for defined missing elements, or for extraneous included elements, scaled by the severity of the error. This was necessary because the previous benchmark was essentially saturated by the previous top-performing models. The revised scoring is more critical of incorrect assertions and less “vibes based” than my first effort, and it does a better job of differentiating high-performing models.

This unfortunately means that the previous results are not directly comparable to the present scores. I re-scored both GPT-4o-2024-11-20 and Claude 3.5 Sonnet (20241022), two of the top-scoring LLMs from the previous article, to provide a baseline for comparison. Additionally, updated versions of the Gemini 2 family of models were re-scored and included in the current lineup. As a result, the top four models from the previous article all appear in the new benchmark results (note that the Gemini 2 family has received updates since the previous test).


Cherry Blossom Break

Summary Results

Google DeepMind’s Gemini 2 Flash Thinking Exp came in first, with unmatched granularity and accuracy. Flash with Thinking managed to identify minute language changes (such as slightly more instances of “Access and Use” over the previous “Access”) that no other model found, while identifying substantive changes correctly and faithfully. Seeing this level of performance from the smaller Gemini family model makes me really excited to see what a Gemini 2 Pro Thinking could do.

Claude 3.7 Sonnet with Extended Thinking took the second and third spots. Despite excellent responses, these results may be unstable due to the weirdness I encountered in steering thinking effort. Both the base-model and “extended thinking” runs appear to improve on the fidelity and correctness of Claude’s previous answers. This is not only very high performance, but extremely affordable compared to OpenAI’s counterparts.
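For reference, Claude steers thinking through an explicit token budget rather than a named effort level; a minimal sketch using the Anthropic Python SDK (the budget value is an arbitrary assumption):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8192,
    # Extended thinking is enabled with a token budget, which must be
    # smaller than max_tokens; 4096 here is an arbitrary choice.
    thinking={"type": "enabled", "budget_tokens": 4096},
    system=system_message,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[-1].text)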

Gemini 2 Pro Exp and OpenAI o1 (high effort) both gave excellent answers. These models are very different from each other, and generated very different answers that were accurate and avoided making misleading statements, demonstrating that there are multiple avenues to making great LLMs for legal tech. Of note, o1’s answer costs around 15x that of Gemini 2 Pro, due to thinking tokens.

GPT-4.5 Preview did not impress, suggesting that massively scaling parameter count and training-time compute alone does not deliver the best LLM for legal applications. The answer quality was very close to that of GPT-4o. I don’t think this eval truly tells us much: as a preview model, it hasn’t undergone all of the post-training that refines performance for real-world applications. I expect high-parameter-count models to contain a lot of knowledge, and to excel at token diversity for things like synthetic text generation. In this instance it did not shine, but understanding how this model really performs requires more data.

DeepSeek R1 disappointed. Its thinking observations were mostly accurate, but when applying them to the user task, R1 introduced errors. Somewhere between the thinking tokens and the final answer, R1 produced numerous incorrect statements, suggesting that legal applications were not a training priority for this model (or its base model, V3).


Evaluations

The five changes I considered substantive included:

  • Section 3.2 - Discloses the use of independent fact-checkers, who may add notices to certain content (according to Zuckerberg, this no longer applies in the USA);
  • Section 3.2.3 - Prohibited activities now include attempts to circumvent technical measures that Meta uses to control or limit access;
  • Section 3.2.7 - Automated data collection is prohibited regardless of whether such data is obtained while logged in to a Facebook account;
  • Section 4.1 - Clarification that if you do not agree to the amended terms of service, you must stop using Meta Products;
  • Section 4.5.1 - Miscellaneous Terms - use of new products, Avatars, or AI Products, requires agreement to the Avatar Terms and Meta AI Terms, respectively.

The highest-scoring models managed to identify each substantive change between the two contracts, explain the user impact of those terms, and avoid both drawing inaccurate conclusions and flagging clauses present in both agreements as changed.

Reorganized Clauses

The updated ToS contains a number of organizational changes that moved existing clause language from one section to another, or placed it under a new section or subheading, which some LLMs interpreted as a new clause. Examples include:

  • Section 3.2.5 - the prohibition on selling, licensing, or purchasing data (except as provided in the Platform Terms);
  • Section 3.3.3 - exceptions to the 90-day content deletion terms in the case of legal investigations or compliance with existing laws;
  • Section 4.1 - the clause indicating that users are bound by the Terms if they continue to use Meta Products;
  • Section 4.5 - language stating that the Supplemental Terms govern when they conflict with the general Terms.

Misinterpreted Clauses

“When Account Deletion is Required”

Several models had trouble parsing the following sentence, concluding that it required users to delete their accounts if they did not accept updated terms of service:

We hope that you will continue using our Products, but if you do not agree to our updated Terms or wish to terminate your agreement to this contract, you can delete your account at any time and you must also stop accessing, or using Facebook and the other Meta Products.

Multiple LLMs misinterpreted this language to mean either that deleting one’s account was necessary when a user did not accept updates to the Terms (Claude 3.7 with Thinking Disabled, Gemini 2 Pro Exp, o3-mini high), or that you had to cease using Facebook if your account became deleted (Gemini 2 Flash Thinking).

“Data Scraping and Login Status Terms”

The 2025 terms updated the data scraping and misuse terms to apply “regardless of whether [the conduct] is undertaken while logged-in to a Facebook account.”

July 2022 Language:
You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.        
January 2025 Language:
You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access, regardless of whether such automated access or collection is undertaken while logged-in to a Facebook account.

This is likely a response to recent data scraping TOU decisions (see Meta v. Brightdata, holding that visitors to Meta’s websites did not violate the terms of service because they scraped data without being logged in). In light of these scraping decisions, Meta and other platforms would want to ensure that their anti-scraping language applies whether or not the individuals accessing their sites are logged in. Because the update was written as “regardless of whether … while logged-in,” LLMs had varying takes on whether it was meant to prohibit conduct while logged out (DeepSeek R1, Gemini 2 Pro Exp), or whether it simply “closed a potential loophole” without specifying how (Claude 3.7 Sonnet). Most of these interpretational variances were not penalized, because most did not provide incorrect explanations or directions.

“For example, Instagram Terms of Use”

Another common mistake was misinterpreting an edit intended to clarify existing language as creating new conditions. This occurred with the addition of an explanatory parenthetical to the Overview section of the Terms:

July 2022 Language: 
These Terms govern your use of Facebook, Messenger, and the other products, features, apps, services, technologies, and software we offer (the Meta Products or Products), except where we expressly state that separate terms (and not these) apply.        
January 2025 Language: 
These Terms of Service (the "Terms") govern your access and use of Facebook, Messenger, and the other products, websites, features, apps, services, technologies, and software we offer (the Meta Products or Products), except where we expressly state that separate terms (and not these) apply. (For example, your use of Instagram is subject to the Instagram Terms of Use).

The addition of an example mentioning the Instagram Terms is not meant to modify the scope of the Facebook Terms, but several LLMs interpreted this example as a new limitation (DeepSeek R1, OpenAI GPT-4.5 Preview). DeepSeek R1 suggested that this change had implications for EU users, a fabricated inclusion in its answer despite no mention of EU users in its thinking traces.



Pricing per Thousand Requests

Each request consisted of roughly 12,600 input tokens (the exact amount varies by tokenizer and the provider’s specific call formatting), plus whatever output tokens the model generates. For Reinforcement Learners, the “thinking” tokens are billed as output tokens, regardless of whether the user receives them. Costs were calculated per 1,000 requests (kilo-requests?) to make the units more comparable and to avoid counting fractions of pennies (I ran these only a few times each, so the whole experiment cost pennies).
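The arithmetic is straightforward; here is a sketch (the per-million-token prices are placeholders, not any provider’s actual rates):

# Cost per thousand requests, given per-million-token pricing.
INPUT_PRICE_PER_M = 1.00   # $ per 1M input tokens (placeholder, not a real rate)
OUTPUT_PRICE_PER_M = 4.00  # $ per 1M output tokens (placeholder, not a real rate)

def cost_per_kilo_request(input_tokens: int, output_tokens: int) -> float:
    # Output tokens include any "thinking" tokens, billed even when hidden.
    per_request = (
        input_tokens / 1e6 * INPUT_PRICE_PER_M
        + output_tokens / 1e6 * OUTPUT_PRICE_PER_M
    )
    return per_request * 1000

# ~12,600 input tokens per request; output varies by model and thinking budget.
print(cost_per_kilo_request(12_600, 3_000))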

Price-per-performance is all over the place, in some part because the industry hasn’t yet figured out how it wants to charge for reasoning tokens, and in large part because post-training appears to have an outsized influence on performance. The cheapest model (on a per-token basis) turned out to be the best, followed by a thinking model that doesn’t charge a premium for “thinking”.

Contract redlining isn’t particle physics. Still, a small amount of planning prior to answering appears to pay dividends, particularly for legal-domain tasks requiring multiple steps to complete. Even when instructed to utilize high effort settings, the additional token overhead was modest across all models tested. That suggests the added benefits of Reinforcement Learners won’t cost significantly more than their Unsupervised Learner counterparts (latency considerations aside).
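On OpenAI’s o-series, that effort setting is a single request parameter; a minimal sketch (messages abbreviated):

from openai import OpenAI

client = OpenAI()

# reasoning_effort accepts "low", "medium", or "high" on o-series models.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[{"role": "user", "content": prompt}],
)

# Billed output includes the hidden reasoning tokens:
print(response.usage.completion_tokens_details.reasoning_tokens)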

Pricing for Gemini 2 Flash Thinking Exp and Gemini 2.0 Pro Exp is extrapolated from current pricing for Gemini 2.0 Flash and Gemini 1.5 Pro. Google doesn’t bill for experimental model usage, providing access free of charge but with non-production rate limits.

Cost per THOUSAND Requests for Reinforcement Learners
Cost per THOUSAND Requests for Unsupervised Learners (and Claude 3.7 without thinking).


Cost per THOUSAND Requests for Chonky LLMs.

Coming Soon...

The second half of this article will contain a more detailed discussion of each model’s performance, along with some additional musings and observations on the behavior of Reinforcement Learners.

Appendix: Scoring Rubric

Revised Rubric:

Deductions

  • .25 deduction for minor but impactful errors.
  • .5 deduction for major errors.

Summary:

  • Must state that agreement is largely unchanged.
  • Must include most of the “big five” substantive changes: 1) Supplemental Terms, 2) Prohibited Activities, 3) Automated Data Collection, 4) Fact Checkers, 5) Termination.
  • Must not overstate the significance of changes.
  • Must not highlight terms that did not change as having changed.

Key Changes

  • Must include the “big five” substantive changes.
  • Must accurately describe the significance of each change.
  • Must not mischaracterize the purpose or effect of each change.
  • Must not include minor changes having no substantive effect.
  • Must not highlight terms that did not change as having changed.

Additional Considerations:

  • Can include any of the “big five” substantive changes.
  • Must contain at least two other minor changes that are more than organizational.
  • Must not overstate the significance of changes.
  • Must not highlight terms that did not change as having changed.

Recommendations:

  • Must be consistent with previously determined key changes.
  • Must provide accurate recommendations addressing user concerns over data protection, IP, and privacy.
  • Must not overstate the significance of changes.

Additionally, I awarded discretionary bonus points when answers were extremely high quality. Because errors can consist of just one or two misplaced words, bonuses were not mutually exclusive with deductions:

  • .25 bonus for answers with outstanding accuracy and alignment.
  • .5 bonus for best-in-class and surprisingly good answers.
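
Mechanically, the rubric reduces to base score, minus deductions, plus bonuses. A toy sketch of the arithmetic (the base score of 5 is a hypothetical value; the actual scoring was done by hand):

# Toy version of the rubric arithmetic; the real scoring was manual.
BASE_SCORE = 5.0  # hypothetical; the article does not publish the base value

def score_answer(minor_errors: int, major_errors: int,
                 accuracy_bonus: bool = False, best_in_class: bool = False) -> float:
    score = BASE_SCORE
    score -= 0.25 * minor_errors  # minor but impactful errors
    score -= 0.50 * major_errors  # major errors
    if accuracy_bonus:
        score += 0.25             # outstanding accuracy and alignment
    if best_in_class:
        score += 0.50             # best-in-class, surprisingly good answers
    return score

# Example: one minor and one major error, with an accuracy bonus -> 4.5
print(score_answer(minor_errors=1, major_errors=1, accuracy_bonus=True))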
