Fine Tune like a Lawyer

Originally, I had titled this f"Fine Tune like a {job_title}", but I thought that it might be confusing, and also I am only speaking to domain expertise of one kind, namely training an LLM on a legal taxonomy of contract terminology to improve classification performance. I do hope that this provides some inspiration to others who are seeking to fine-tune for other domain-specific tasks.

Intro

Like a lot of good experiments, the idea behind this article was born of a series of failures and mistakes. Seeing how one of my fine-tune jobs went horribly wrong gave me ideas about how one might actually do it right.

In my previous experiment, I tried to figure out if fine-tuning gpt-4o-mini on the Contract Understanding Atticus Dataset (CUAD) might improve its performance on a different contracts benchmark. After several iterations, I came up with the idea of using a detailed Memo to have an LLM generate high-quality synthetic training data.

Identifying the Problem - CUAD Contracts Classification

A first step to fine-tuning is identifying a task where model performance needs improvement, such as conforming to a specific output behavior, or generalizing knowledge in order to perform new tasks. In my case, I wanted to improve LegalBench benchmark performance in some way. This is a somewhat artificial task, as it isn't improving a workflow or attacking a real-world problem. But, it did conceivably involve training a model to recognize legal concepts through new token associations, as LegalBench is full of domain-specific association tasks (such as labeling legal language) and legal reasoning tasks, such as interpreting the effects of certain kinds of language.

CUAD Cap on Liability was a dataset where GPT's performance was pretty atrocious. Tested against a 400-row training set, the base model accurately labeled about 71% of rows. In identifying true negatives, GPT was around 99% accurate. However, in identifying true positives, it only recognized 87/200, for a recall around 44%. Notably, GPT seemed to have very high precision in identifying liability cap clauses that involved numerical or aggregate maximum amounts, but it could not identify any other limitation of liability clauses, such as damages exclusions. This was great, because it gave me a very focused and narrow fine-tuning target: train GPT to recognize more types of Cap on Liability contract clauses.
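For the curious, the scoring itself is nothing fancy. Here is a minimal sketch of the kind of evaluation loop that produces numbers like these, assuming the CUAD Cap on Liability rows live in a JSONL file with hypothetical "text" and "answer" fields; the prompt wording and file name are illustrative stand-ins, not the exact notebook code.

```python
# Minimal sketch of the base-model evaluation loop (illustrative, not the exact
# notebook code). Assumes a JSONL file of CUAD rows with hypothetical "text"
# and "answer" fields, and the OpenAI Python SDK.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Does the clause specify a cap on liability upon the breach of a party's "
    "obligation? Answer Yes or No.\n\nClause:\n{clause}"
)

def classify(clause: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(clause=clause)}],
    )
    return resp.choices[0].message.content.strip()

rows = [json.loads(line) for line in open("cuad_cap_on_liability.jsonl")]
tp = fn = fp = tn = 0
for row in rows:
    pred_yes = classify(row["text"]).lower().startswith("yes")
    gold_yes = row["answer"].strip().lower() == "yes"
    if pred_yes and gold_yes:
        tp += 1
    elif gold_yes:
        fn += 1
    elif pred_yes:
        fp += 1
    else:
        tn += 1

print(f"accuracy: {(tp + tn) / len(rows):.1%}")  # base model: ~71%
print(f"recall:   {tp / (tp + fn):.1%}")         # base model: 87/200, ~44%
```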

Fine Tuning Approaches that don't work

At first, I took a "naive approach" to fine-tuning that consisted of giving a model several hundred correct question/answer pairs, and then seeing if it improved when benchmarked against another question/answer set. It took only a few attempts to identify that fine-tuning on benchmark datasets did not work well. I suspect that the problem is that benchmark datasets contain too few output token examples. Benchmark datasets typically have a task / query / answer format: the LLM is asked to perform a benchmark task given a context passage and question, and the expected answer is typically "Yes / No" or something very brief, such as a single letter or word.

Sample Rows of CUAD Cap on Liability

This dramatic imbalance between input and output tokens doesn't seem to positively shift token weights and create desired associations for token generation, and as a result, the benchmark scores and model behavior do not change. Since the benchmark datasets by themselves were not good at reinforcing the desired answers, I thought, "what would create a lot of desired input / output token associations for a clause classification task?" Perhaps lots of over-explaining of why a particular classification answer was correct?
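To make the imbalance concrete, here is roughly what a single benchmark-style training example looks like when expressed in OpenAI's chat fine-tuning format (the clause and prompt wording are illustrative, and the real training file is JSONL with one such object per line): hundreds of input tokens against a one-word assistant answer.

```python
# A single benchmark-style training example (illustrative). Each line of the
# fine-tuning JSONL file is json.dumps(example) for one such dict. Note the
# imbalance: a long clause as input, a one-token assistant answer as output.
example = {
    "messages": [
        {"role": "system",
         "content": "Does the clause specify a cap on liability? Answer Yes or No."},
        {"role": "user",
         "content": "IN NO EVENT SHALL EITHER PARTY'S AGGREGATE LIABILITY EXCEED "
                    "THE FEES PAID IN THE TWELVE (12) MONTHS PRECEDING THE CLAIM "
                    "... [several hundred more tokens of clause text]"},
        {"role": "assistant", "content": "Yes"},
    ]
}
```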

A second failed fine-tune occurred when I accidentally included a system message in my training data from the original CUAD task (instructing the model to answer "Yes or No") but then provided an assistant answer that was a five-part analysis. The result was a fine-tune model that scored a 0/400 on the benchmark because it was incapable of producing "Yes" or "No" answers! Oops.

This "catastrophic failure" actually gave me some insights into how the training data triplet formatting mattered for fine-tune model output. The fact that my assistant outputs contained bullet-point formatting, and the fact that the system message disagreed with the assistant output format had an overwhelming influence on the model's behavior, and I couldn't make it produce anything useful. But I did learn something from it.

The "Memo Approach"

What would a legal professional do if they had a team of legal reviewers who were consistently missing certain key concepts over and over? They would write a detailed, explanatory memo that reinforces the legal concepts, explains the "why" behind each task, and hopefully brings structure and method to the review process. I adopted a "memo approach" to generating synthetic data for fine-tuning a model. (Of course, lots of other folks in many other professions also write explanatory memos. Maybe we can fine-tune LLMs with those as well!)

The following approach was inspired by previous experiments I'd done with LlamaGuard, a fine-tuned version of Meta's smallest Llama model. LlamaGuard uses a long "taxonomy" prompt in order to perform content moderation classifications. What I noticed, and the Meta team has indicated, is that LlamaGuard will perform the taxonomy labeling with or without the taxonomy prompt, due to its fine-tuning. The model had learned the unsafe categories and labeling codes through the fine-tune process. So, perhaps teaching a model to label contracts could work similarly.

Steps to the Memo Approach - Generating Synthetic Data

  1. Build/Acquire a dataset of question / answer pairs. Unfortunately there's usually not an easy way to do this. If you have a specific legal topic in mind, there may be practice guides, multiple choice questions, or existing legal datasets to help get you started. I had LegalBench - CUAD Question Answers, but otherwise gathering this data may take a long time.
  2. Evaluate your LLM. Run questions through an LLM to find where its performance is lagging. This is most efficiently performed when the questions have boolean "Yes / No" answers, or there is an answer key that can be used with LLM evaluations to determine when correct and incorrect answers are being generated. Again, LegalBench provided this part.
  3. Write a memo. Based on the subjects and questions where the LLM shows deficiencies, generate an outline or practice guide that covers these topics in detail. Include features that will assist training the concepts you would like the model to learn, such as background facts, theory and doctrine, rules, examples, or exceptions. Be exhaustive and thorough here because the information will selectively appear in the training set when it's relevant to a particular question / answer pair.
  4. Select a Training Model. This is typically a model that is more capable and that possesses the superior generation quality we are trying to train our lightweight model to emulate. One of the touted benefits of fine-tuning is that you can train a lightweight LLM to produce output comparable to a high-powered model for some specific task, all the while paying a fraction of the inference cost. In this case, both gpt-4o and gpt-4o-mini have comparable (poor) performance at CUAD classifications, so what we are really seeking to do is surpass both models' performance by enabling new task performance.
  5. Generate Synthetic Data. I provided the Training Model with my Memo, along with Contract + Label Answers, and prompted the LLM to write an explanation for the labeling choice using information based on the memo. I actually wasn't too excited with the quality of explanations generated by GPT-4o. They were a bit repetitive and didn't quite have the "IRAC" structure I'd wanted. But, I figured the repetitive answers might create the token associations we needed for contract clause recognition. (A rough sketch of this generation loop appears below, after the memo link.)

The cap_guide training memo can be viewed here in plain text form.
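Here is a minimal sketch of that step-5 generation loop. The file names, the SYSTEM_PROMPT text, and the training-example system message are stand-ins, not the exact notebook code or the actual system prompt pictured below.

```python
# Minimal sketch of the step-5 synthetic data generation loop. File names,
# SYSTEM_PROMPT, and the training-example system message are stand-ins, not
# the exact notebook code or the actual system prompt pictured below.
import json
from openai import OpenAI

client = OpenAI()
MEMO = open("cap_guide.txt").read()  # the training memo linked above

SYSTEM_PROMPT = (
    "You are a senior contracts attorney. Using the practice memo below, explain "
    "in an IRAC-style analysis why the clause does or does not contain a Cap on "
    "Liability.\n\nMEMO:\n" + MEMO
)

def explain(clause: str, label: str) -> str:
    """Ask the Training Model (gpt-4o) to justify the known label using the memo."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Clause:\n{clause}\n\nCorrect label: {label}"},
        ],
    )
    return resp.choices[0].message.content

rows = [json.loads(line) for line in open("cuad_cap_on_liability.jsonl")]
with open("training_data.jsonl", "w") as out:
    # In the real run I filtered to the clause types the base model was missing
    # (see the Takeaways section below); this sketch loops over everything.
    for row in rows:
        triplet = {"messages": [
            # Keep this system message consistent with the long-form assistant
            # output -- the mismatched version described above scored 0/400.
            {"role": "system", "content": "Explain whether the clause contains a Cap on Liability."},
            {"role": "user", "content": row["text"]},
            {"role": "assistant", "content": explain(row["text"], row["answer"])},
        ]}
        out.write(json.dumps(triplet) + "\n")
```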

System Prompt used to Generate Synthetic Data for step 5.

Below is a sample training triplet of system message, CUAD clause, and the explanation generated by the Training Model. While it provides a justification for its answer that includes citations to the clause, its rules statement is actually a policy justification because the system prompt isn't quite fully baked. It also repeats itself a lot, but I figured that might be grist for the mill.

Mmm... gristy.

Steps to the Memo Approach - Fine Tune the Model

Google Colab Notebook is Here

  1. Kick off the Fine-Tune Job. Once I had the training data and validation sets properly formatted and uploaded to OpenAI, I fired off a fine-tune job. OpenAI provides extensive documentation on how to fine-tune a model, which is fairly easy to follow if you are familiar with basic Python programming and a bit of data manipulation. (A minimal sketch of this step appears right after the list.)
  2. Evaluate the Fine Tune Model. In this case, I just ran the fine-tune model against the same validation set I used to score the base model. The results blew me away! Below, I have some theories as to why it performed so well. Remember when I noted that GPT's recall for true positives was only 44%? The fine-tune model more than doubles recall, with performance in the low 90s.
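For reference, kicking off the job takes only a few lines with the OpenAI Python SDK. This is a minimal sketch with illustrative file names and hyperparameters rather than the notebook's exact values; the resulting ft: model name can then be dropped into the same evaluation loop used to score the base model.

```python
# Minimal sketch of step 1: upload the files and start the fine-tune job.
# File names and hyperparameters are illustrative, not the notebook's exact values.
from openai import OpenAI

client = OpenAI()

train_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("validation_data.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 3},  # illustrative
)
print(job.id)

# Check on the job; once it finishes, fine_tuned_model holds a name like
# "ft:gpt-4o-mini-2024-07-18:org::abc123" that can be passed anywhere a
# model name is accepted, including the evaluation loop from earlier.
status = client.fine_tuning.jobs.retrieve(job.id)
print(status.status, status.fine_tuned_model)
```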

Takeaways and Conclusions for fine tuning

I don't want to overstate any conclusion because this is one positive result. I could have gotten lucky for a lot of reasons, it may not generalize for various reasons, and therefore I don't know how well any of this works. I do think it's a worthwhile avenue for experimentation because it's a fairly low-effort method for boosting LLM performance, and PEFT fine tuning is really cheap!

The potential limitations are many and unknown here, so don't leave thinking we've solved LegalBench or cracked fine-tuning, because none of that is the case.

Acing LegalBench as a Fine Tuning task

In many cases, optimizing a model to crush a benchmark is not a particularly useful endeavor. LegalBench has some exceptions, as certain tasks within it are labeling or reasoning tasks involving legal language that have some applications in legal tech / legal informatics. In this instance, improving CUAD labeling performance has direct application in contract lifecycle and clause analysis software, so I think this actually approaches usefulness.

I also wonder if there are tradeoffs. Understanding LLM performance is so complex that it's difficult to determine if a fine-tune model that has mastered contract clauses has now forgotten how to talk like a pirate, or explain how hula hooping works. I kid, but messing with low-rank updates to the model's attention and feed-forward weight matrices means deprioritizing some pathways and capabilities, and presumably this means a hit to performance elsewhere.

In terms of implications for benchmarking, I don't think this really changes how I think about benchmarking. It's still important for evaluating off-the-shelf LLMs, and there are still limitations to what you can learn from them based on the benchmark design and composition. I am, however, much more optimistic about the kinds of performance deficiencies that can be addressed through fine-tuning.

The Memo Approach

I like this "double rag" approach of applying a research memo to a variety of contract clauses because it bakes a lot of domain specific knowledge into the training data, while providing token diversity at the same time. Providing many different types of examples and explanations, hopefully, results in fine-tuning that embeds the concepts in a more generalizable fashion, such that the LLM can not only regurgitate the rules or facts, but apply them correctly in labeling.

Base Model Performance

The base model's performance was evenly divided between tasks it performed near perfectly (true negatives, and certain cap on liability clauses) and tasks it performed uniformly incorrectly (other cap clauses). I didn't know whether or not my fine-tuning needed to reinforce the tasks the base model already did well, but I did not train on true negatives -or- the cap on liability clauses that the base model already labeled correctly. I still need to verify whether there are instances where the ft model is now getting things wrong that the base model labeled correctly. However, the ft model's score was so high that I don't think very many previously correct answers could have flipped post-fine-tuning.
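That flip check is easy to script once both models' predictions are saved. Here is a minimal sketch, assuming three aligned lists of "Yes"/"No" strings for the gold labels, base-model predictions, and ft-model predictions.

```python
# Minimal sketch of the flip check I still owe myself: did the ft model start
# missing anything the base model got right? Assumes three aligned lists of
# "Yes"/"No" strings (gold labels, base-model predictions, ft-model predictions).
def count_flips(gold, base_preds, ft_preds):
    regressions = improvements = 0
    for g, b, f in zip(gold, base_preds, ft_preds):
        if b == g and f != g:
            regressions += 1   # base was right, ft is now wrong
        elif b != g and f == g:
            improvements += 1  # base was wrong, ft is now right
    return regressions, improvements

# toy example
gold       = ["Yes", "Yes", "No", "No"]
base_preds = ["No",  "Yes", "No", "No"]
ft_preds   = ["Yes", "Yes", "No", "Yes"]
print(count_flips(gold, base_preds, ft_preds))  # (1, 1): one regression, one improvement
```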

Epilogue - Peering into the Robot's Brain

That's its eye, not brain.

I got curious about my new fine-tuned LLM and wanted to try and understand why it performed so much better than the base model gpt-4o-mini. I headed over to OpenAI's playground to test the models side-by-side...

There's a Twist

When I asked both models to explain "Cap on Liability", I was surprised - the base model gpt-4o-mini knew all about it, and in fact provided more examples of limitation of liability clauses than my fine-tune model:


So it isn't necessarily a lack of knowledge of, or exposure to, the contract concepts in the training data; rather, the base model likely was not trained to perform the task of recognizing kinds of contract clauses when a clause was presented to it. Amusingly, the ft model was producing answers that had more detailed examples as well as explanations of the clause's purpose in the context of commercial contracting, reflecting the content learned from the Memo.

When asked to label CUAD clauses, gpt-4o-mini was suddenly using a different definition of "cap on liability". I spoon-fed it several CUAD Cap on Liability clauses that it had failed to label correctly, to see if a long-form answer produced the same results. Consistent with the benchmark results, the base model failed to identify various types of limitations of liability as cap on liability clauses, despite having identified them when asked about 'cap on liability' generally. This demonstrates how weirdly brittle LLM performance can be with respect to generalizing knowledge for successful task completion.




I was particularly curious about an example involving a "notice and cure" clause, because in my mind, that's sort of at the far reaches of a limitation on liability clause (I don't even understand a Notice and Cure to be a cap on liability, but CUAD does), and I expected the models to have trouble with it. The fine-tune model was trained to recognize these, so I expected it to get it right. However, notice how both models identify the clause as a condition precedent, but the ft model then proceeds to apply the Memo logic to explain why it is a limitation of liability. This demonstrates some interesting intermingling of the pre-trained parametric knowledge and the newly imprinted fine-tune knowledge in a single answer analyzing this contract clause. Neat!


base model answer
fine tune model answer

Compare the ft model's answer above to the "Notice and Cure" description from the memo:

This clause serves to cap liability by creating procedural requirements that must be satisfied before a claim can be pursued or damages can be sought. Its effectiveness lies in its ability to provide a structured approach to addressing potential breaches, thereby potentially averting formal legal action and associated liabilities. By mandating notice within a specific timeframe, it can limit the period during which breaches can be claimed, effectively capping historical liability. The cure period offers an opportunity to remedy issues before they escalate, potentially reducing or eliminating damages altogether. This provision acts as a buffer against immediate litigation or other formal dispute resolution processes, encouraging communication and problem-solving between the parties. The clause's impact on liability can vary based on factors such as the length of notice and cure periods, the specificity required in the notice, and whether certain types of breaches (e.g., payment defaults) are excluded from these requirements. Ultimately, this clause can significantly mitigate liability by promoting early identification and resolution of issues, potentially preserving business relationships and avoiding costly legal proceedings.
