Legal drafting by AI: Putting GPT-2 to practical use

Sergii Shcherbak

Founder & CEO @ Shcherbak AI | AI/ML engineer (NLP) | Lawyer (LL.M.)

Published: June 4, 2020

GPT-2, which stands for Generative Pre-trained Transformer 2, is a powerful language model architecture open-sourced by OpenAI, a renowned artificial intelligence research laboratory. This architecture has demonstrated state-of-the-art results in a variety of NLP tasks, particularly in text generation.

GPT-2 has been released in tiers, from a light version with a small number of parameters to a very large version with a much larger number of pre-trained weights. Less than a week before the publication of this post, OpenAI issued a GPT-3 research paper: it is very likely that the new model, which has a whopping 175 billion parameters (vs 1.5 billion parameters in the largest GPT-2 model), will be released in tiers as well. This post focuses on the application of the currently available GPT-2, and we are very much looking forward to the release of the first available version of GPT-3.

The GPT-2 model is very good at writing texts in any language domain, provided that it has been fine-tuned on a domain-specific dataset. To explore the potential of GPT-2 for legal language generation (i.e. writing legal texts), we fine-tuned the available GPT-2 architecture on the extensive GDPR dataset we have collected in the context of developing services such as Maigon PPC and Maigon DPA. The dataset consists of more than 1500 documents, including privacy policies, data processing agreements, the GDPR and official privacy guidelines from supervisory authorities.
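As a rough illustration of the data preparation step that usually precedes this kind of fine-tuning, the sketch below joins a set of documents into a single training string, separating them with GPT-2's `<|endoftext|>` token. The post does not describe the actual preprocessing, so the function name `build_training_corpus` and the sample documents are hypothetical.

```python
from typing import Iterable

# GPT-2's vocabulary reserves this token as a document boundary.
END_OF_TEXT = "<|endoftext|>"


def build_training_corpus(documents: Iterable[str]) -> str:
    """Join documents into one training string, each followed by END_OF_TEXT."""
    cleaned = []
    for doc in documents:
        text = doc.strip()  # drop leading/trailing whitespace from extraction
        if text:  # skip documents that are empty after cleaning
            cleaned.append(text)
    return "".join(text + END_OF_TEXT for text in cleaned)


# Hypothetical mini-corpus, for illustration only (not the real dataset).
sample_docs = [
    "This Privacy Policy explains how we process personal data.",
    "   ",
    "The processor shall process personal data only on documented instructions.",
]
corpus = build_training_corpus(sample_docs)
print(corpus.count(END_OF_TEXT))  # 2
```

Marking document boundaries this way lets the model learn where one privacy policy or agreement ends and the next begins, instead of treating the whole dataset as one continuous text.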

After fine-tuning the model, we put it to practical use. First, we took the current privacy policy from one of our websites and ran it through the AI evaluation by Maigon PPC. The result: 59% of the GDPR criteria are addressed. That is pretty good, since the other 41% are conditional GDPR criteria, such as automated decision-making and the legal ground of legal obligation, which are not relevant in the context of that privacy policy.
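The 59% / 41% split can be checked with simple arithmetic. The sketch below lists the seven criteria that the rest of this post walks through as "missing" and assumes a checklist of 17 criteria in total. That total is not stated anywhere in the post; it is only a guess that happens to reproduce the reported percentages, and `coverage_percent` is a hypothetical helper, not part of Maigon PPC.

```python
# Criteria reviewed as "missing" in the remainder of this post.
missing = [
    "Legal Ground - Consent",
    "Legal Ground - Contract",
    "Legal Ground - Legal Obligation",
    "Automated Decision-Making",
    "Special Categories of Personal Data",
    "Children's Personal Data",
    "Data Protection Officer (DPO)",
]

# ASSUMPTION: the checklist has 17 criteria in total (not stated in the post).
TOTAL_CRITERIA = 17


def coverage_percent(count: int, total: int) -> int:
    """Share of criteria, rounded to a whole percent."""
    return round(100 * count / total)


addressed = TOTAL_CRITERIA - len(missing)
print(coverage_percent(addressed, TOTAL_CRITERIA))     # 59
print(coverage_percent(len(missing), TOTAL_CRITERIA))  # 41
```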

Now that we have a result of 59% addressed GDPR criteria, let us assume that we need to get to 100%. In this case, why not use our trained GPT-2 model to fill in the “gaps”? We started by giving “prompts”, or starting points, to the model, so that it could “complete” (i.e. finish writing) the rest of the statements addressing the “missing” GDPR criteria. What you see below are the resulting statements: the underlined text is our prompts (starting points), and the wording in bold is fully generated by our trained GPT-2 model.

Let us review, statement by statement, each criterion highlighted as “missing” by the Maigon PPC tool.

We focus on the “missing” legal grounds for processing first, to complete the existing section “Purposes and legal grounds for the processing” of the analyzed privacy policy. We start with Legal Ground – Consent.

You consent to our terms and conditions ("Agreement") and the use of your personal information in accordance with this Privacy Policy. Consent must be voluntary, a specific and informed indication of your intention to consent to our use of personal information about you.

You can see how our model picked up the previous context of the privacy policy and completed the statement “You consent” with wording related to the user’s consent to the data processing. To introduce a bit more useful information into the privacy policy, we “asked” the model what the consent must be, and it “answered” correctly, in accordance with the GDPR and the absorbed knowledge of privacy guidelines.
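The prompt-and-complete workflow behind this statement can be sketched as follows. This is a minimal sketch, not our production code: `complete_statement` is a hypothetical helper, the context string is a placeholder for the existing policy text, and `canned_generate` is a stand-in for the fine-tuned model that simply returns the continuation quoted above.

```python
from typing import Callable, Tuple


def complete_statement(context: str, prompt: str,
                       generate: Callable[[str], str]) -> Tuple[str, str]:
    """Return (prompt, continuation) for one missing criterion.

    `generate` receives the full input text and returns only the newly
    generated continuation.
    """
    # The existing policy text goes first so the model can pick up its
    # context; the prompt is the beginning of the new statement.
    model_input = context.rstrip() + "\n\n" + prompt
    continuation = generate(model_input)
    return prompt, continuation


def canned_generate(model_input: str) -> str:
    """Stand-in for a fine-tuned GPT-2 model (returns text quoted in this post)."""
    return (' to our terms and conditions ("Agreement") and the use of your '
            "personal information in accordance with this Privacy Policy.")


# Placeholder context; in practice this would be the policy text so far.
policy_so_far = "Purposes and legal grounds for the processing"
prompt, generated = complete_statement(policy_so_far, "You consent", canned_generate)
print(prompt + generated)
```

Keeping the prompt and the continuation separate makes it easy to mark which part was written by a person and which part came from the model when the draft is handed to a lawyer for review.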

The next criterion we addressed is Legal Ground – Contract.

Information you submit is necessary to offer you our products, services, and to perform the services we require. Without this information, we cannot provide the content of our websites or otherwise provide our services.

Notice how correctly the model reasoned, considering the previous context, that we are talking about the necessity of the information for fulfilling a contract with the user, i.e. providing the services the user has requested through the website. Trying to make things a bit more “difficult” for the model, we “asked” it to write what will happen if no information is provided by the user, and the model “answered” correctly, stating that no services can be provided without it. The statement is grammatically correct, too.

Up next, Legal Ground – Legal Obligation.

The processing is necessary for compliance with our legal obligations. Such legal obligations may include a court order or other legal obligation.

This looks like a semantically correct wording. We prompted the GPT-2 model that we (probably) would talk about compliance with legal obligations, and we got a relevant wording back. We also "asked" the model to clarify what a legal obligation could be, and the model mentioned a court order as an example. Still, the resulting statement could be more comprehensive than that.

As regards Automated Decision-Making, it should be quite straightforward: we prompt that we do not use it and see what the model comes up with.

No automated decisions are taken about you. "Automated decision-making", in accordance with the GDPR, means "an automated decision which produces legal effects concerning a natural person, such as profiling".

This looks like the explanation we need. After completing the negating statement about our use of automated decision-making, we "asked" the model to introduce the definition of automated decision-making, which it did. The model even mentioned profiling, which is a widespread application of automated decision-making. However, the model did not explicitly mention that an automated decision is based solely on automated processing, as the GDPR states, so we asked the model again to see what it would come up with.

"Automated decision-making", in accordance with the GDPR, means 'any decision which produces legal effects concerning a data subject or similarly significantly affects a data subject; 'any decision which is based solely on automated processing of personal data'.

This is a more detailed description, although dropping the single quotes and putting an “and” between the two phrases would help. Profiling was also omitted from the statement this time, which is not critical. We can see that the model’s text generation capabilities are not ideal, and sometimes the resulting text needs to be slightly adjusted by a lawyer.

Next, we wrote about Special Categories of Personal Data.

No special categories of personal data are collected on this website. Special categories of personal data, in accordance with the GDPR, include such sensitive information as racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic and biometric data, and the processing of health data.

By starting the statement with “No special categories”, we wanted to guide the model towards writing a negating statement about the processing of special categories of personal data, which it did correctly. Notice how the model picked up the previous context and included “on this website” in the statement. To clarify to the users what special categories of personal data mean under the GDPR, we asked the model, and it gave us an explanation based on its GDPR domain knowledge obtained from the legislation and official privacy guidelines.

What about Children’s Personal Data? This is an interesting aspect, since countries define the “child” age differently (within the limits defined by the GDPR, of course). In Sweden, a “child”, for the purposes of data protection, is a person younger than 13 years of age, and parental consent is required for the processing of their personal data. Provided that our privacy policy mentions that we are a Swedish company, it was interesting to see what the model would come up with.

No personal data of children under the age of 13 will be collected on this site.

There we have it. The age was correctly set based on the previous context. We believe that the model deduced this legal limit from the array of legislative guidelines and privacy documents, some of which included data on Sweden or were drafted by companies established in Sweden.

The next wording is not related to any of the specific GDPR criteria, but it is good to have if we want to make sure that users do not accidentally submit information containing the above sensitive data to us.

Make sure such personal data is not included in the documents or communications you send to us.

Here, the model easily detected that we are talking about the information the user submits to us.

We complete the “missing criteria” list with Data Protection Officer (DPO).

We have appointed a data protection officer (DPO) who will be responsible for the handling of any questions you may have regarding our use of personal information. Please see the contact information below.

Assuming that we have appointed (or have to appoint) a data protection officer (which we have not), we asked the model to come up with wording about the role of our imaginary data protection officer. It not only did that; it also completed the statement with a reference to his/her contact details, which, statistically, are usually listed “below”.

Having drafted all the above statements to fill in the “gaps” highlighted by the Maigon PPC tool, let us check whether we achieved the 100% score by submitting the updated privacy policy to the tool for evaluation.

Unsurprisingly, we have now addressed all the GDPR requirements. And we achieved this with the help of our GPT-2 model trained on more than 1500 documents representing the GDPR domain knowledge. We could go further and say that our model served us as a legal consultant, which technically is exactly the role the model performed.

Interestingly, the GPT-2 model, having “absorbed” all the provided GDPR domain knowledge, does not generate the same deterministic text every time. Instead, it always comes up with new wording, while trying to make it contextually relevant and grammatically and semantically correct at the same time. We could describe it as a “reasoning” process which is not “memory”-based but “understanding”-based.
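One common reason for this variation is that the next word is sampled from the model's probability distribution rather than always taken as the single most likely option. The toy sketch below contrasts the two approaches. It assumes sampling-based decoding, which this post does not confirm for our setup, and the candidate words and scores are made up for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(tokens, logits):
    """Always return the highest-scoring token (deterministic)."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return tokens[best]


def sample_pick(tokens, logits, rng, temperature=1.0):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical next-word candidates after a prompt such as "Consent must be".
tokens = ["voluntary", "freely", "specific", "informed"]
logits = [2.0, 1.5, 1.0, 0.5]  # made-up scores, not real model output

rng = random.Random(0)  # seeded only so the demo is repeatable
greedy = {greedy_pick(tokens, logits) for _ in range(20)}
sampled = {sample_pick(tokens, logits, rng) for _ in range(200)}
print(greedy)            # {'voluntary'}
print(len(sampled) > 1)  # True: sampling produces varied wording
```

Greedy selection would yield identical wording on every run, while sampling trades that repeatability for variety, which is why each generated draft still needs a review.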

We see many potential applications of the GPT-2 (and the upcoming GPT-3) architecture in legal work. Our model, trained on troves of GDPR data, could serve as a helper tool for privacy lawyers in their contract drafting work, or as a reference source when a client has a question about a certain GDPR term or concept. To demonstrate this, let us ask our model about one of the GDPR principles.

"Purpose limitation" principle, in accordance with the GDPR, means that personal information will only be processed in accordance with the purpose(s) for which the information was originally collected.

The model has described the “grasped” concept “in its own words.”

If you would like to know more about our GDPR model, based on the GPT-2 architecture, send us a message at contact@maigon.io, or contact me directly on LinkedIn.
