LLMs and Contract Intelligence, Part II: Reasoning Models
Leonard Park
Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs
In the scant three months since my first article benchmarking contract intelligence, generative AI advancements have continued unabated. Like the spring crocus, numerous developments have popped up after a winter pause. OpenAI’s o1/o3 reasoning family of models, Google’s impressive Gemini 2 updates, and a dark horse in the form of DeepSeek R1 have upset, and updated, our understanding of LLM performance. Not to be left out, Anthropic joined the fracas with the recent release of Claude 3.7, boasting increased performance and “extended thinking.” We are diving once again into the same Contract Intelligence benchmark to answer: “Whose Reasoning Reigns Supreme?”
This first section focuses on the results themselves; a follow-up will discuss some observations regarding post-training and the two main schools of reasoning models ("Planners" vs. "Noodlers").
The Contenders, "Reinforcement Learners" vs. "Unsupervised Learners"
Throughout this article, I'll refer to reasoning models as "Reinforcement Learners" and non-reasoning models as "Unsupervised Learners". I got tired of referring to "thinking models" or "reasoning models" and having the internal debate as to whether they actually reasoned, or thought, and I also did not like describing "non-reasoning" models in terms of behavior they lack.
OpenAI's recent articles indicate that the o1/o3 family of models derive their "reasoning" capabilities through a special kind of reinforcement learning, whereas the models that answer first, such as GPT-4.5 Preview, are the product of unsupervised learning. I'm pretty sure both types of models incorporate both types of training, but that's the basis for the naming.
For clarity, the Unsupervised Learners are:
The Reinforcement Learners include:
The Task and Benchmark
As a reminder, the chosen task for this benchmark requires comparing two versions of the Meta Facebook Terms of Service: July 2022 and January 2025. The system prompt generally sets the subject and tone, and asks for section references when possible.
system_message = """You are an expert commercial contracts analyst guiding Social Media users on their online rights. You analyze contracts and pay close attention to the substantive clauses, terms, definitions, rights, and obligations contained within them. Whenever possible, include in your answers references to sections and subsection numbering, or other references that describe what part of the source document the answer comes from."""
The prompts instruct the LLM to answer in four parts:
tos_compare_user_message = """Previous Contract: {old_contract}
Revised Contract: {new_contract}
Instructions: You are comparing two different versions of the Meta Facebook agreement. Identify the \
differences between the Previous Contract and Revised Contracts, and return your answer in a well-organized Report. \
Report Output format:
- Summary - a high-level description of the changes between the two agreements.
- Key Changes - the largest, most significant revisions found in the Revised Contract from a user experience \
or user rights perspective.
- Additional Considerations - less significant yet still noteworthy changes having some impact on the service.
- Recommendation - a conclusion of changes for users who are concerned with online rights and the scope of Facebook's \
agreement."""
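For context, here is a minimal sketch of how the two templates above are combined before being sent to a model. The contract strings are placeholders standing in for the full ToS texts:

```python
# Minimal sketch of prompt assembly; the contract texts are placeholders
# for the full July 2022 and January 2025 Terms of Service.
old_contract = "July 2022 Terms of Service text..."
new_contract = "January 2025 Terms of Service text..."

tos_compare_user_message = """Previous Contract: {old_contract}
Revised Contract: {new_contract}
Instructions: You are comparing two different versions of the Meta Facebook agreement. ..."""

# The long documents render first, followed by the instructions.
user_message = tos_compare_user_message.format(
    old_contract=old_contract,
    new_contract=new_contract,
)
print(user_message.splitlines()[0])
```

The resulting `user_message` is then paired with the `system_message` shown earlier in whatever chat format the provider expects.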
Documents at the Top
The prompt begins by inserting both the old and revised ToS using an f-string, so that the long documents appear first with the instructions below them. There appears to be some consensus, both for model comprehension and cost optimization (e.g., prefix caching), that placing static information at the top of one’s prompt is preferable (see Anthropic’s docs and Latent Space recommendations for OpenAI). This preference appears to extend to both unsupervised learners (non-thinking LLMs) and reinforcement learners (reasoning LLMs).
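A toy illustration of why documents-first ordering plays well with prefix caching: two requests that ask different questions about the same documents share a long identical prefix, which is what provider-side caching keys on. (Caching granularity and minimum prefix lengths vary by provider; this just shows the shared-prefix idea.)

```python
# With the long documents first, different instructions still share a
# long identical prefix across requests.
docs = "Previous Contract: ...\nRevised Contract: ...\n"

prompt_a = docs + "Instructions: summarize the key changes."
prompt_b = docs + "Instructions: list reorganized clauses."

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# The shared prefix covers at least the entire document block.
print(shared_prefix_len(prompt_a, prompt_b) >= len(docs))
```

Had the instructions come first and varied per request, the shared prefix would end before the documents even began, and the expensive document tokens could not be cached.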
Formatting and Delimiters
Getting models to follow complex, multi-step instructions is easier with structure and formatting. I used newlines and dashes as formatting to further delineate background information from prompt instructions, and to reduce misalignment in the response. For even more complex instructions, prompts may employ markdown or YAML for even more precise structuring.
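As a sketch of the heavier structuring option mentioned above, the same report instructions could be expressed with markdown headings instead of dashes. This is illustrative only, not the prompt actually used in the benchmark:

```python
# Illustrative only: the benchmark's dash-delimited report format,
# restructured with markdown headings for more complex instruction-following.
markdown_instructions = """## Report Output Format

### Summary
A high-level description of the changes between the two agreements.

### Key Changes
The largest, most significant revisions, from a user experience or user rights perspective.

### Additional Considerations
Less significant yet still noteworthy changes having some impact on the service.

### Recommendation
A conclusion of changes for users concerned with online rights.
"""
print(markdown_instructions.count("### "))
```

The tradeoff is a few extra tokens of structure in exchange for responses that map more reliably onto the requested sections.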
Revised Scoring Rubric
I switched to a scoring rubric that starts with a base score, then deducts for defined missing elements or included extraneous elements, based on the severity of the error. This was necessary because the previous benchmark was essentially saturated by the previous top-performing models. The revised scoring is more critical of incorrect assertions and less “vibes based” than my first effort, and it also does a better job of differentiating high-performing models.
This unfortunately means that the previous results are not directly comparable to the present scores. I re-scored both GPT-4o-2024-11-20 and Claude 3.5 Sonnet (20241024), two of the top-scoring LLMs from the previous article, to provide a baseline of comparison. Additionally, updated versions of the Gemini 2 family of models were rescored and included in the current lineup, so the top four models from the previous article all appear in the new benchmark results (note that the Gemini 2 models have received updates since the previous test).
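The deduction mechanics can be sketched as a small function. The severity weights below are illustrative placeholders, not the actual rubric values from the appendix:

```python
# Hypothetical sketch of a deduction-style rubric: start from a base score,
# subtract per-error penalties weighted by severity, and add any
# discretionary bonus. Weights are illustrative, not the real rubric.
SEVERITY_PENALTY = {"minor": 1, "moderate": 2, "major": 4}

def score_answer(base: int, errors: list[str], bonus: int = 0) -> int:
    """Base score minus severity-weighted deductions, floored at zero."""
    deductions = sum(SEVERITY_PENALTY[e] for e in errors)
    return max(base - deductions + bonus, 0)

# A strong answer with one minor slip and a discretionary bonus point:
print(score_answer(10, ["minor"], bonus=1))
```

Because bonuses and deductions are independent terms, a single answer can earn both, which matches how the appendix rubric treats them.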
Summary Results
Google DeepMind’s Gemini 2 Flash Thinking Exp came in first with unmatched granularity and accuracy. Flash with Thinking managed to identify minute language changes (such as slightly more instances of “Access and Use” over the previous “Access”) that no other model found, while identifying substantive changes correctly and faithfully. Seeing this level of performance from the smaller Gemini family model makes me really excited to see what a Gemini 2 Pro Thinking could do.
Claude 3.7 Sonnet with Extended Thinking took the second and third spots. Despite excellent responses, these results may be unstable due to the weirdness encountered in steering thinking effort. Both the base model and the “extended thinking” runs showed improved fidelity and correctness in their answers. This is not only very high performance, but extremely affordable compared to OpenAI’s counterparts.
Gemini 2 Pro Exp and OpenAI o1 - High both gave excellent answers. These models are very different from each other, and generated very different answers that were accurate and avoided making misleading statements, demonstrating that there are multiple avenues to making great LLMs for legal tech. Of note, o1’s answer is around 15x the cost of Gemini 2 Pro, due to thinking tokens.
GPT-4.5 Preview did not impress, suggesting that massively scaling parameter count and training-time compute alone does not deliver the best LLM for legal applications. The answer quality was very close to that of GPT-4o. I don’t think this eval truly tells us much. As a preview model, it hasn’t undergone all of the post-training that refines performance for real-world applications. I expect high-parameter-count models to contain a lot of knowledge, and to excel at token diversity for things like synthetic text generation. In this instance it did not shine, but understanding how this model really performs requires more data.
DeepSeek R1 disappointed. The thinking observations were mostly accurate, but when applying these to completing the user task, R1 introduced errors. Somewhere between the thinking tokens and generating an answer, R1 produced numerous incorrect statements, suggesting that legal applications were not a training priority for this model (or the base model, V3).
Evaluations
The five changes I considered substantive included:
The highest scoring models managed to identify each substantive change between the two contracts, explain the user impact of those terms, and avoid making either inaccurate conclusions, or including clauses as having changed that are present in both agreements.
Reorganized Clauses
The updated ToS contains a number of organizational changes that moved existing clause language from one section to another, or placed it under a section or subheading, which some LLMs interpreted as a new clause. Examples include:
Misinterpreted Clauses
“When Account Deletion is Required”
Several models had trouble parsing this sentence, concluding that it required users to delete their accounts if they did not accept updated terms of service:
We hope that you will continue using our Products, but if you do not agree to our updated Terms or wish to terminate your agreement to this contract, you can delete your account at any time and you must also stop accessing or using Facebook and the other Meta Products.
Multiple LLMs misinterpreted this language to mean that deleting one’s account was necessary when a user did not accept updates to the Terms (Claude 3.7 with Thinking Disabled, Gemini 2 Pro Exp, o3-mini high), or that you had to cease using Facebook if your account was deleted (Gemini 2 Flash Thinking).
“Data Scraping and Login Status Terms”
The 2025 terms updated the data scraping and misuse terms to apply “regardless of whether [the conduct] is undertaken while logged-in to a Facebook account.”
July 2022 Language:
You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.
January 2025 Language:
You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access, regardless of whether such automated access or collection is undertaken while logged-in to a Facebook account.
This is likely a response to recent data scraping TOU decisions (see Meta v. Bright Data, holding that visitors to Meta’s websites did not violate the terms of service because they scraped data without being logged in). In light of these recent scraping decisions, Meta and other platforms would want to ensure that their anti-scraping language applies whether or not the individuals accessing their sites are logged in. Because the update was written, “regardless of whether … while logged-in,” LLMs had varying takes on whether this was meant to prohibit conduct while logged out (DeepSeek R1, Gemini 2 Pro Exp), or whether it simply “closed a potential loophole” without specifying how (Claude 3.7 Sonnet). Most of these interpretational variances were not penalized, because most did not provide incorrect explanations or directions.
“For example, Instagram Terms of Use”
Another common mistake was misinterpreting an edit intended to clarify existing language as creating new conditions. This occurred with the example of adding an explanatory parenthetical to the Overview section of the Terms:
July 2022 Language:
These Terms govern your use of Facebook, Messenger, and the other products, features, apps, services, technologies, and software we offer (the Meta Products or Products), except where we expressly state that separate terms (and not these) apply.
January 2025 Language:
These Terms of Service (the "Terms") govern your access and use of Facebook, Messenger, and the other products, websites, features, apps, services, technologies, and software we offer (the Meta Products or Products), except where we expressly state that separate terms (and not these) apply. (For example, your use of Instagram is subject to the Instagram Terms of Use).
The addition of an example mentioning the Instagram Terms is not meant to modify the scope of the Facebook Terms, but several LLMs interpreted this example as a new limitation (DeepSeek R1, OpenAI GPT-4.5 Preview). DeepSeek R1 suggested that this change had implications for EU users, a fabricated inclusion in its answer despite no mention of EU users in its thinking traces.
Pricing per Thousand Requests
Each request consisted of roughly 12,600 input tokens (the exact amount varies by tokenizer and the provider’s specific call formatting), plus whatever output tokens the model generates. For Reinforcement Learners, the “thinking” tokens are billed as output tokens, regardless of whether the user receives them. Costs were calculated per 1,000 requests (kilo-requests?) to make units more comparable and avoid counting fractions of pennies (and since I ran these only a few times each, the whole experiment cost pennies).
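The cost arithmetic is simple enough to sketch directly. The per-token prices below are hypothetical placeholders, not any provider's actual rates:

```python
# Back-of-the-envelope cost per 1,000 requests ("kilo-requests").
# Prices are hypothetical $/1M-token figures, not real provider rates.
# For reasoning models, billed output tokens include any "thinking" tokens.
def cost_per_kilo_request(
    input_tokens: int,
    output_tokens: int,
    in_price_per_m: float,
    out_price_per_m: float,
) -> float:
    per_request = (
        (input_tokens / 1e6) * in_price_per_m
        + (output_tokens / 1e6) * out_price_per_m
    )
    return per_request * 1000

# ~12,600 input tokens per request; assume 3,000 output tokens
# (answer plus billed "thinking") at hypothetical $0.10 / $0.40 per 1M.
print(round(cost_per_kilo_request(12_600, 3_000, 0.10, 0.40), 2))
```

This also shows why thinking tokens matter so much to the bill: they ride on the output rate, which is typically several times the input rate.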
Price-per-performance is all over the place, in some part because the industry hasn’t yet figured out how it wants to charge for reasoning tokens, and in large part because post-training appears to have an outsized influence on performance. The cheapest model (on a per-token basis) turned out to be the best, followed by a thinking model that doesn’t charge a premium for “thinking.”
Contract redlining isn’t particle physics. Still, a small amount of planning prior to answering appears to pay dividends, particularly for legal domain tasks requiring multiple steps to complete. Even when instructed to utilize high effort settings, the additional token overhead is modest across all models tested. That suggests the added benefits of Reinforcement Learning Models won't cost significantly more than their Unsupervised Learner counterparts (latency considerations aside).
Pricing for Gemini 2 Flash Thinking Exp and Gemini 2.0 Pro Exp is extrapolated from current pricing for Gemini 2.0 Flash, and Gemini 1.5 Pro. Google doesn't bill for experimental model usage, providing access free of charge, but with non-production rate limits.
Coming Soon...
The next half of this article will contain a more detailed discussion of each model's performance, along with some additional musings and observations on the behavior of Reinforcement Learners.
Appendix: Scoring Rubric
Revised Rubric:
Deductions
Summary:
Key Changes:
Additional Considerations:
Recommendations:
Additionally, I awarded discretionary bonus points when answers were extremely high quality. Because errors can consist of just one or two misplaced words, bonus awards were not mutually exclusive with deductions: