Fine Tune like a Lawyer

Originally, I had titled this f"Fine Tune like a {job_title}", but I thought that it might be confusing, and also I am only speaking to domain expertise of one kind, namely training an LLM on a legal taxonomy of contract terminology to improve classification performance. I do hope that this provides some inspiration to others who are seeking to fine-tune for other domain-specific tasks.

Intro

Like a lot of good experiments, the idea behind this article was born of a series of failures and mistakes. Seeing how one of my fine-tune jobs went horribly wrong gave me ideas about how one might actually do it right.

In my previous experiment, I tried to figure out if fine-tuning gpt-4o-mini on the Contract Understanding Atticus Dataset (CUAD) might improve its performance on a different contracts benchmark. After several iterations, I came up with the idea of using a detailed Memo to have an LLM generate high-quality synthetic training data.

Identifying the Problem - CUAD Contracts Classification

A first step to fine-tuning is identifying a task where model performance needs improvement, such as conforming to a specific output behavior, or generalizing knowledge in order to perform new tasks. In my case, I wanted to improve LegalBench benchmark performance in some way. This is a somewhat artificial task, as it isn't improving a workflow or attacking a real-world problem. But, it did conceivably involve training a model to recognize legal concepts through new token associations, as LegalBench is full of domain-specific association tasks (such as labeling legal language) and legal reasoning tasks, such as interpreting the effects of certain kinds of language.

CUAD Cap on Liability was a dataset where GPT's performance was pretty atrocious. Tested against a 400-row training set, the base model accurately labeled about 71% of rows. In identifying true negatives, GPT was around 99% accurate. However, in identifying true positives, it only recognized 87/200, for a recall around 44%. Notably, GPT seemed to have very high precision in identifying liability cap clauses that involved numerical or aggregate maximum amounts, but it could not identify any other limitation of liability clauses, such as damages exclusions. This was great, because it gave me a very focused and narrow fine-tuning target: train GPT to recognize more types of Cap on Liability contract clauses.
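For the curious, the scoring itself is nothing fancy. Here is a minimal sketch of the kind of evaluation loop that produces numbers like these, assuming the CUAD Cap on Liability rows live in a JSONL file with hypothetical "text" and "answer" fields; the prompt wording and file name are illustrative stand-ins, not the exact notebook code.

```python
# Minimal sketch of the base-model evaluation loop (illustrative, not the exact
# notebook code). Assumes a JSONL file of CUAD rows with hypothetical "text"
# and "answer" fields, and the OpenAI Python SDK.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Does the clause specify a cap on liability upon the breach of a party's "
    "obligation? Answer Yes or No.\n\nClause:\n{clause}"
)

def classify(clause: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(clause=clause)}],
    )
    return resp.choices[0].message.content.strip()

rows = [json.loads(line) for line in open("cuad_cap_on_liability.jsonl")]
tp = fn = fp = tn = 0
for row in rows:
    pred_yes = classify(row["text"]).lower().startswith("yes")
    gold_yes = row["answer"].strip().lower() == "yes"
    if pred_yes and gold_yes:
        tp += 1
    elif gold_yes:
        fn += 1
    elif pred_yes:
        fp += 1
    else:
        tn += 1

print(f"accuracy: {(tp + tn) / len(rows):.1%}")  # base model: ~71%
print(f"recall:   {tp / (tp + fn):.1%}")         # base model: 87/200, ~44%
```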

Fine Tuning Approaches that don't work

At first, I took a "naive approach" to fine-tuning that consisted of giving a model several hundred correct question/answer pairs, and then seeing if it improved when benchmarked against another question/answer set. It took only a few attempts to identify that fine-tuning on benchmark datasets did not work well. I suspect that the problem is that benchmark datasets contain too few output token examples. Benchmark datasets typically have a task / query / answer format: the LLM is asked to perform a benchmark task given a context passage and question, and the expected answer is typically "Yes / No" or something very brief, such as a single letter or word.

Sample Rows of CUAD Cap on Liability

This dramatic imbalance between input and output tokens doesn't seem to positively shift token weights and create desired associations for token generation, and as a result, the benchmark scores and model behavior do not change. Since the benchmark datasets by themselves were not good at reinforcing the desired answers, I thought, "what would create a lot of desired input / output token associations for a clause classification task?" Perhaps lots of over-explaining of why a particular classification answer was correct?
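To make the imbalance concrete, here is roughly what a single benchmark-style training example looks like when expressed in OpenAI's chat fine-tuning format (the clause and prompt wording are illustrative, and the real training file is JSONL with one such object per line): hundreds of input tokens against a one-word assistant answer.

```python
# A single benchmark-style training example (illustrative). Each line of the
# fine-tuning JSONL file is json.dumps(example) for one such dict. Note the
# imbalance: a long clause as input, a one-token assistant answer as output.
example = {
    "messages": [
        {"role": "system",
         "content": "Does the clause specify a cap on liability? Answer Yes or No."},
        {"role": "user",
         "content": "IN NO EVENT SHALL EITHER PARTY'S AGGREGATE LIABILITY EXCEED "
                    "THE FEES PAID IN THE TWELVE (12) MONTHS PRECEDING THE CLAIM "
                    "... [several hundred more tokens of clause text]"},
        {"role": "assistant", "content": "Yes"},
    ]
}
```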

A second failed fine-tune occurred when I accidentally included a system message in my training data from the original CUAD task (instructing the model to answer "Yes or No") but then provided an assistant answer that was a five-part analysis. The result was a fine-tune model that scored a 0/400 on the benchmark because it was incapable of producing "Yes" or "No" answers! Oops.

This "catastrophic failure" actually gave me some insights into how the training data triplet formatting mattered for fine-tune model output. The fact that my assistant outputs contained bullet-point formatting, and the fact that the system message disagreed with the assistant output format had an overwhelming influence on the model's behavior, and I couldn't make it produce anything useful. But I did learn something from it.

The "Memo Approach"

What would a legal professional do if they had a team of legal reviewers who were consistently missing certain key concepts over and over? They would write a detailed, explanatory memo that reinforces the legal concepts, explains the "why" behind each task, and hopefully brings structure and method to the review process. I adopted a "memo approach" to generating synthetic data for fine-tuning a model. (Of course, lots of other folks in many other professions also write explanatory memos. Maybe we can fine-tune LLMs with those as well!)

The following approach was inspired by previous experiments I'd done with LlamaGuard, a fine-tuned version of Meta's smallest Llama model. LlamaGuard uses a long "taxonomy" prompt in order to perform content moderation classifications. What I noticed, and the Meta team has indicated, is that LlamaGuard will perform the taxonomy labeling with or without the taxonomy prompt, due to its fine-tuning. The model had learned the unsafe categories and labeling codes through the fine-tune process. So, perhaps teaching a model to label contracts could work similarly.

Steps to the Memo Approach - Generating Synthetic Data

  1. Build/Acquire a dataset of question / answer pairs. Unfortunately there's usually not an easy way to do this. If you have a specific legal topic in mind, there may be practice guides, multiple choice questions, or existing legal datasets to help get you started. I had LegalBench - CUAD Question Answers, but otherwise gathering this data may take a long time.
  2. Evaluate your LLM. Run questions through an LLM to find where its performance is lagging. This is most efficiently performed when the questions have boolean "Yes / No" answers, or there is an answer key that can be used with LLM evaluations to determine when correct and incorrect answers are being generated. Again, LegalBench provided this part.
  3. Write a memo. Based on the subjects and questions where the LLM shows deficiencies, generate an outline or practice guide that covers these topics in detail. Include features that will assist training the concepts you would like the model to learn, such as background facts, theory and doctrine, rules, examples, or exceptions. Be exhaustive and thorough here because the information will selectively appear in the training set when it's relevant to a particular question / answer pair.
  4. Select a Training Model. This is typically a model that is more capable and that possesses the superior generation quality we are trying to train our lightweight model to emulate. One of the touted benefits of fine-tuning is that you can train a lightweight LLM to produce output comparable to a high-powered model for some specific task, all the while paying a fraction of the inference cost. In this case, both gpt-4o and gpt-4o-mini have comparable (poor) performance at CUAD classifications, so what we are really seeking to do is surpass both models' performance by enabling new task performance.
  5. Generate Synthetic Data. I provided the Training Model with my Memo, along with Contract + Label Answers, and prompted the LLM to write an explanation for the labeling choice using information based on the memo. I actually wasn't too excited with the quality of explanations generated by GPT-4o. They were a bit repetitive and didn't quite have the "IRAC" structure I'd wanted. But, I figured the repetitive answers might create the token associations we needed for contract clause recognition. (A rough sketch of this generation loop appears below, after the memo link.)

The cap_guide training memo can be viewed here in plain text form.
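Here is a minimal sketch of that step-5 generation loop. The file names, the SYSTEM_PROMPT text, and the training-example system message are stand-ins, not the exact notebook code or the actual system prompt pictured below.

```python
# Minimal sketch of the step-5 synthetic data generation loop. File names,
# SYSTEM_PROMPT, and the training-example system message are stand-ins, not
# the exact notebook code or the actual system prompt pictured below.
import json
from openai import OpenAI

client = OpenAI()
MEMO = open("cap_guide.txt").read()  # the training memo linked above

SYSTEM_PROMPT = (
    "You are a senior contracts attorney. Using the practice memo below, explain "
    "in an IRAC-style analysis why the clause does or does not contain a Cap on "
    "Liability.\n\nMEMO:\n" + MEMO
)

def explain(clause: str, label: str) -> str:
    """Ask the Training Model (gpt-4o) to justify the known label using the memo."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Clause:\n{clause}\n\nCorrect label: {label}"},
        ],
    )
    return resp.choices[0].message.content

rows = [json.loads(line) for line in open("cuad_cap_on_liability.jsonl")]
with open("training_data.jsonl", "w") as out:
    # In the real run I filtered to the clause types the base model was missing
    # (see the Takeaways section below); this sketch loops over everything.
    for row in rows:
        triplet = {"messages": [
            # Keep this system message consistent with the long-form assistant
            # output -- the mismatched version described above scored 0/400.
            {"role": "system", "content": "Explain whether the clause contains a Cap on Liability."},
            {"role": "user", "content": row["text"]},
            {"role": "assistant", "content": explain(row["text"], row["answer"])},
        ]}
        out.write(json.dumps(triplet) + "\n")
```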

System Prompt used to Generate Synthetic Data for step 5.

Below is a sample training triplet of system message, CUAD clause, and the explanation generated by the Training Model. While it provides a justification for its answer that includes citations to the clause, its rules statement is actually a policy justification because the system prompt isn't quite fully baked. It also repeats itself a lot, but I figured that might be grist for the mill.

Mmm... gristy.

Steps to the Memo Approach - Fine Tune the Model

Google Colab Notebook is Here

  1. Kick off the Fine-Tune Job. Once I had the training data and validation sets properly formatted and uploaded to OpenAI, I fired off a fine-tune job. OpenAI provides extensive documentation on how to fine-tune a model, which is fairly easy to follow if you are familiar with basic Python programming and a bit of data manipulation. (A minimal sketch of this step appears right after the list.)
  2. Evaluate the Fine Tune Model. In this case, I just ran the fine-tune model against the same validation set I used to score the base model. The results blew me away! Below, I have some theories as to why it performed so well. Remember when I noted that GPT's recall for true positives was only 44%? The fine-tune model more than doubles recall, with performance in the low 90s.
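For reference, kicking off the job takes only a few lines with the OpenAI Python SDK. This is a minimal sketch with illustrative file names and hyperparameters rather than the notebook's exact values; the resulting ft: model name can then be dropped into the same evaluation loop used to score the base model.

```python
# Minimal sketch of step 1: upload the files and start the fine-tune job.
# File names and hyperparameters are illustrative, not the notebook's exact values.
from openai import OpenAI

client = OpenAI()

train_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("validation_data.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 3},  # illustrative
)
print(job.id)

# Check on the job; once it finishes, fine_tuned_model holds a name like
# "ft:gpt-4o-mini-2024-07-18:org::abc123" that can be passed anywhere a
# model name is accepted, including the evaluation loop from earlier.
status = client.fine_tuning.jobs.retrieve(job.id)
print(status.status, status.fine_tuned_model)
```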

Takeaways and Conclusions for fine tuning

I don't want to overstate any conclusion because this is one positive result. I could have gotten lucky for a lot of reasons, it may not generalize for various reasons, and therefore I don't know how well any of this works. I do think it's a worthwhile avenue for experimentation because it's a fairly low-effort method for boosting LLM performance, and PEFT fine tuning is really cheap!

The potential limitations are many and unknown here, so don't leave thinking we've solved LegalBench or cracked fine-tuning, because none of that is the case.

Acing LegalBench as a Fine Tuning task

In many cases, optimizing a model to crush a benchmark is not a particularly useful endeavor. LegalBench has some exceptions, as certain tasks within it are labeling or reasoning tasks involving legal language that have some applications in legal tech / legal informatics. In this instance, improving CUAD labeling performance has direct application in contract lifecycle and clause analysis software, so I think this actually approaches usefulness.

I also wonder if there are tradeoffs. Understanding LLM performance is so complex that it's difficult to determine if a fine-tune model that has mastered contract clauses has now forgotten how to talk like a pirate, or explain how hula hooping works. I kid, but messing with low-rank updates to the model's attention and feed-forward weight matrices means deprioritizing some pathways and capabilities, and presumably this means a hit to performance elsewhere.

In terms of implications for benchmarking, I don't think this really changes how I think about benchmarking. It's still important for evaluating off-the-shelf LLMs, and there are still limitations to what you can learn from them based on the benchmark design and composition. I am, however, much more optimistic about the kinds of performance deficiencies that can be addressed through fine-tuning.

The Memo Approach

I like this "double rag" approach of applying a research memo to a variety of contract clauses because it bakes a lot of domain specific knowledge into the training data, while providing token diversity at the same time. Providing many different types of examples and explanations, hopefully, results in fine-tuning that embeds the concepts in a more generalizable fashion, such that the LLM can not only regurgitate the rules or facts, but apply them correctly in labeling.

Base Model Performance

The base model's performance was evenly divided between tasks it performed near perfectly (true negatives, and certain cap on liability clauses) and tasks it performed uniformly incorrectly (other cap clauses). I didn't know whether or not my fine-tuning needed to reinforce the tasks the base model already did well, but I did not train on true negatives -or- the cap on liability clauses that the base model already labeled correctly. I still need to verify whether there are instances where the ft model is now getting things wrong that the base model labeled correctly. However, the ft model's score was so high that I don't think very many previously correct answers could have flipped post-fine-tuning.
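That flip check is easy to script once both models' predictions are saved. Here is a minimal sketch, assuming three aligned lists of "Yes"/"No" strings for the gold labels, base-model predictions, and ft-model predictions.

```python
# Minimal sketch of the flip check I still owe myself: did the ft model start
# missing anything the base model got right? Assumes three aligned lists of
# "Yes"/"No" strings (gold labels, base-model predictions, ft-model predictions).
def count_flips(gold, base_preds, ft_preds):
    regressions = improvements = 0
    for g, b, f in zip(gold, base_preds, ft_preds):
        if b == g and f != g:
            regressions += 1   # base was right, ft is now wrong
        elif b != g and f == g:
            improvements += 1  # base was wrong, ft is now right
    return regressions, improvements

# toy example
gold       = ["Yes", "Yes", "No", "No"]
base_preds = ["No",  "Yes", "No", "No"]
ft_preds   = ["Yes", "Yes", "No", "Yes"]
print(count_flips(gold, base_preds, ft_preds))  # (1, 1): one regression, one improvement
```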

Epilogue - Peering into the Robot's Brain

That's its eye, not brain.

I got curious about my new fine-tuned LLM and wanted to try and understand why it performed so much better than the base model gpt-4o-mini. I headed over to OpenAI's playground to test the models side-by-side...

There's a Twist

When I asked both models to explain "Cap on Liability", I was surprised - the base model gpt-4o-mini knew all about it, and in fact provided more examples of limitation of liability clauses than my fine-tune model:


So it isn't necessarily a lack of knowledge of, or exposure to, the contract concepts in the training data; rather, the base model likely was not trained to perform the task of recognizing kinds of contract clauses when a clause was presented to it. Amusingly, the ft model was producing answers that had more detailed examples as well as explanations of the clause's purpose in the context of commercial contracting, reflecting the content learned from the Memo.

When asked to label CUAD clauses, gpt-4o-mini was suddenly using a different definition of "cap on liability". I spoon-fed it several CUAD Cap on Liability clauses that it had failed to label correctly, to see if a long-form answer produced the same results. Consistent with the benchmark results, the base model failed to identify various types of limitations of liability as cap on liability clauses, despite having identified them when asked about 'cap on liability' generally. This demonstrates how weirdly brittle LLM performance can be with respect to generalizing knowledge for successful task completion.




I was particularly curious about an example involving a "notice and cure" clause, because in my mind, that's sort of at the far reaches of a limitation on liability clause (I don't even understand a Notice and Cure to be a cap on liability, but CUAD does), and I expected the models to have trouble with it. The fine-tune model was trained to recognize these, so I expected it to get it right. However, notice how both models identify the clause as a condition precedent, but the ft model then proceeds to apply the Memo logic to explain why it is a limitation of liability. This demonstrates some interesting intermingling of the pre-trained parametric knowledge and the newly imprinted fine-tune knowledge in a single answer analyzing this contract clause. Neat!


base model answer
fine tune model answer

Compare the ft model's answer above to the "Notice and Cure" description from the memo:

This clause serves to cap liability by creating procedural requirements that must be satisfied before a claim can be pursued or damages can be sought. Its effectiveness lies in its ability to provide a structured approach to addressing potential breaches, thereby potentially averting formal legal action and associated liabilities. By mandating notice within a specific timeframe, it can limit the period during which breaches can be claimed, effectively capping historical liability. The cure period offers an opportunity to remedy issues before they escalate, potentially reducing or eliminating damages altogether. This provision acts as a buffer against immediate litigation or other formal dispute resolution processes, encouraging communication and problem-solving between the parties. The clause's impact on liability can vary based on factors such as the length of notice and cure periods, the specificity required in the notice, and whether certain types of breaches (e.g., payment defaults) are excluded from these requirements. Ultimately, this clause can significantly mitigate liability by promoting early identification and resolution of issues, potentially preserving business relationships and avoiding costly legal proceedings.
