Legal drafting by AI: Putting GPT-2 to practical use
Sergii Shcherbak
Founder & CEO @ Shcherbak AI | AI/ML engineer (NLP) | Lawyer (LL.M.)
GPT-2, which stands for Generative Pre-trained Transformer 2, is a powerful language model architecture open-sourced by OpenAI, a renowned artificial intelligence research laboratory. This architecture has demonstrated state-of-the-art results in a variety of NLP tasks, particularly in text generation.
GPT-2 has been released in tiers, from a light version with a small number of parameters to a very large version with far more pre-trained weights. Less than a week before the publication of this post, OpenAI published the GPT-3 research paper: it is very likely that the new model, which has a whopping 175 billion parameters (vs. 1.5 billion parameters in the largest GPT-2 model), will be released in tiers as well. This post focuses on the application of the currently available GPT-2, and we are very much looking forward to the release of the first available version of GPT-3.
The GPT-2 model is very good at writing texts in any language domain, provided that it has been fine-tuned on a domain-specific dataset. To explore the potential of GPT-2 for legal language generation (i.e. writing legal texts), we fine-tuned the available GPT-2 architecture on the extensive GDPR dataset we have collected while developing services such as Maigon PPC and Maigon DPA. The dataset consists of more than 1,500 documents, including privacy policies, data processing agreements, the GDPR and official privacy guidelines from supervisory authorities.
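How the documents were prepared for fine-tuning is not described here. As a minimal sketch, assuming each document is available as a plain-text string, one common approach is to join the documents into a single training text separated by GPT-2's end-of-text token `<|endoftext|>`. The function name `build_training_corpus` and the sample strings are hypothetical.

```python
END_OF_TEXT = "<|endoftext|>"  # GPT-2's end-of-text token


def build_training_corpus(documents):
    """Join non-empty documents into one training text.

    Each document is cleaned line by line, and documents are separated
    by the end-of-text token so the model can learn where one document
    stops and the next begins.
    """
    cleaned = []
    for doc in documents:
        # Strip each line and drop empty ones left over from extraction.
        lines = [line.strip() for line in doc.splitlines()]
        text = "\n".join(line for line in lines if line)
        if text:
            cleaned.append(text)
    separator = "\n" + END_OF_TEXT + "\n"
    return separator.join(cleaned)


# Hypothetical stand-ins for privacy policies, DPAs and guidelines.
sample_documents = [
    "Privacy Policy\n\n  We process your personal data ...  ",
    "   ",  # an empty document is skipped
    "Data Processing Agreement\nThe processor shall ...",
]
corpus = build_training_corpus(sample_documents)
```

The resulting text could then be passed to whichever fine-tuning tool is in use; that step is left out because it depends on the chosen library.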
After fine-tuning the model, we put it to practical use. First, we took the current privacy policy from one of our websites and ran it through the AI evaluation by Maigon PPC. The result is that 59% of the GDPR criteria are addressed, which is pretty good, since the other 41% are conditional GDPR criteria, such as automated decision-making and the legal ground of legal obligation, which are not relevant in the context of that privacy policy.
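To make the arithmetic concrete, here is a minimal sketch of how such a coverage score could be computed. The checklist size of 17 is purely an assumption, chosen because it reproduces the 59% / 41% split; the actual number of criteria and the scoring logic of Maigon PPC are not stated in this post. The seven unaddressed criteria are the ones discussed later in this post, and the names of the addressed ones are placeholders.

```python
def coverage_percent(criteria):
    """Return the share of addressed criteria as a rounded percentage."""
    if not criteria:
        raise ValueError("the checklist must contain at least one criterion")
    addressed = sum(1 for is_addressed in criteria.values() if is_addressed)
    return round(100 * addressed / len(criteria))


# The criteria reported as "missing" in this post.
missing = [
    "Legal Ground - Consent",
    "Legal Ground - Contract",
    "Legal Ground - Legal Obligation",
    "Automated Decision-Making",
    "Special Categories of Personal Data",
    "Children's Personal Data",
    "Data Protection Officer (DPO)",
]

# ASSUMPTION: ten addressed criteria with placeholder names (17 in total).
criteria = {f"addressed criterion {i}": True for i in range(1, 11)}
criteria.update({name: False for name in missing})

score_before = coverage_percent(criteria)

# Marking every criterion as addressed corresponds to a full score.
score_after = coverage_percent({name: True for name in criteria})
```

Under these assumptions, `score_before` is 59 and `score_after` is 100.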
Now that we have a result of 59% addressed GDPR criteria, let us assume that we need to get to 100%. In this case, why not use our trained GPT-2 model to fill in the "gaps"? We started by giving “prompts”, or starting points, to the model, so that it could “complete” (i.e. finish writing) the rest of the statements addressing the “missing” GDPR criteria. What you see below are the resulting statements: the underlined text is our prompts (starting points), and the wording in bold is fully generated by our trained GPT-2 model.
Let us review, criterion by criterion and statement by statement, what the Maigon PPC tool highlighted as “missing”.
We focus on the “missing” legal grounds for processing first, to complete the existing section “Purposes and legal grounds for the processing” of the analyzed privacy policy. We start with Legal Ground – Consent.
- You consent to our terms and conditions ("Agreement") and the use of your personal information in accordance with this Privacy Policy. Consent must be voluntary, a specific and informed indication of your intention to consent to our use of personal information about you.
You can see how our model picked up the previous context of the privacy policy and completed the statement “You consent” with wording related to the user’s consent to the data processing. To introduce a bit more useful information into the privacy policy, we “asked” the model what the consent must be, and it “answered” correctly in accordance with the GDPR and the absorbed knowledge of privacy guidelines.
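How the previous context reaches the model is not spelled out above. One plausible setup, sketched here under stated assumptions, is to prepend the tail of the existing policy text to the starting phrase so that the combined prompt stays well within GPT-2's 1,024-token context window. Word counts stand in for real token counts, and `build_prompt`, `max_prompt_words` and the sample text are hypothetical.

```python
def build_prompt(policy_text, starter, max_prompt_words=600):
    """Combine the end of the policy with a starting phrase.

    ASSUMPTION: whitespace-separated words are a rough proxy for tokens.
    GPT-2 accepts at most 1,024 tokens, so the default budget of 600
    words is meant to leave room for the generated continuation.
    Note that splitting on whitespace also flattens line breaks.
    """
    starter_words = starter.split()
    budget = max_prompt_words - len(starter_words)
    if budget < 0:
        raise ValueError("starter is longer than the prompt budget")
    context_words = policy_text.split()
    # Keep only the most recent context that still fits the budget.
    tail = context_words[-budget:] if budget > 0 else []
    return " ".join(tail + starter_words)


# Hypothetical excerpt standing in for the existing privacy policy.
policy_text = (
    "We are a Swedish company. Purposes and legal grounds for the "
    "processing: we process your contact details to answer your enquiries."
)
prompt = build_prompt(policy_text, "You consent", max_prompt_words=8)
```

The model would then be asked to continue `prompt`, so the wording it generates is conditioned both on the policy text and on the starting phrase.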
The next criterion we addressed is Legal Ground – Contract.
- Information you submit is necessary to offer you our products, services, and to perform the services we require. Without this information, we cannot provide the content of our websites or otherwise provide our services.
Notice how correctly the model reasoned, considering the previous context, that we are talking about the necessity of the information for fulfilling a contract with the user, i.e. providing the services the user has requested through the website. Trying to make things a bit more “difficult” for the model, we “asked” it to write what will happen if no information is provided by the user, and the model “answered” correctly stating that no services can be provided without it. Not to mention that the statement is grammatically correct too.
Up next, Legal Ground – Legal Obligation.
- The processing is necessary for compliance with our legal obligations. Such legal obligations may include a court order or other legal obligation.
This looks like a semantically correct wording. We prompted the GPT-2 model that we (probably) would talk about compliance with legal obligations, and we got a relevant wording back. We also "asked" the model to clarify what a legal obligation could be, and the model mentioned a court order as an example. Still, the resulting statement could be more comprehensive than that.
As regards Automated Decision-Making, it should be quite straightforward: we prompt that we do not use it and see what the model comes up with.
- No automated decisions are taken about you. "Automated decision-making", in accordance with the GDPR, means "an automated decision which produces legal effects concerning a natural person, such as profiling".
Looks like the explanation we need. After completing the negating statement about our use of automated decision-making, we "asked" the model to introduce the definition of automated decision-making, which it did. The model even mentioned profiling, which is a widespread application of automated decision-making. However, the model did not mention explicitly that an automated decision is based solely on automated processing, as the GDPR states. Therefore, we asked the model again to see what it would come up with.
- "Automated decision-making", in accordance with the GDPR, means 'any decision which produces legal effects concerning a data subject or similarly significantly affects a data subject; 'any decision which is based solely on automated processing of personal data'.
A more detailed description now, although dropping the single quotes and having an “and” between the two phrases would help. Also, profiling was omitted from the statement this time, which is not critical. We can see that the model's text generation capabilities are not ideal, and sometimes the resulting text needs to be slightly adjusted by a lawyer.
Next, we wrote about Special Categories of Personal Data.
- No special categories of personal data are collected on this website. Special categories of personal data, in accordance with the GDPR, include such sensitive information as racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic and biometric data, and the processing of health data.
By starting the statement with “No special categories”, we wanted to guide the model towards writing a negating statement about the processing of special categories of personal data, which it did correctly. Notice how the model picked up the previous context and included “on this website” in the statement. To clarify to the users what special categories of personal data mean under the GDPR, we asked the model, and it gave us an explanation based on its GDPR domain knowledge obtained from the legislation and official privacy guidelines.
What about Children’s Personal Data? This is an interesting aspect, since countries define the age of a “child” differently (within the limits defined by the GDPR, of course). In Sweden, a “child”, for the purposes of data protection, is a person younger than 13 years of age, and parental consent is required for the processing of their personal data. Given that our privacy policy mentions that we are a Swedish company, it was interesting to see what the model would come up with.
- No personal data of children under the age of 13 will be collected on this site.
There we have it. The age was correctly set based on the previous context. We believe that the model deduced this legal limit from the array of legislative guidelines and privacy documents, some of which included data on Sweden or were drafted by companies established in Sweden.
Now, this is not related to any of the specific GDPR criteria, but the next wording is good to have if we want to make sure that users do not accidentally submit information containing the above sensitive data to us.
- Make sure such personal data is not included in the documents or communications you send to us.
Here, the model easily detected that we are talking about the information the user submits to us.
We complete the “missing criteria” list with Data Protection Officer (DPO).
- We have appointed a data protection officer (DPO) who will be responsible for the handling of any questions you may have regarding our use of personal information. Please see the contact information below.
Assuming that we have appointed (or have to appoint) a data protection officer (which we don’t), we asked the model to come up with wording about the role of our imaginary data protection officer. And it not only did that – it also completed the statement with a reference to his/her contact details, which, statistically, are usually listed “below”.
Having drafted all the above statements to fill in the “gaps” highlighted by the Maigon PPC tool, let us check whether we achieved the 100% score by submitting the updated privacy policy to the tool for evaluation.
Unsurprisingly, we have now addressed all the GDPR requirements. And we achieved this with the help of our GPT-2 model trained on more than 1,500 documents representing the GDPR domain knowledge. We could go further and say that our model served us as a legal consultant, which technically is exactly the role the model performed.
Interestingly, the GPT-2 model, having “absorbed” all the provided GDPR domain knowledge, does not generate the same deterministic text every time. Instead, it always comes up with new wording, while trying to make it contextually relevant and grammatically and semantically correct at the same time. We could describe it as a “reasoning” process which is not “memory”-based but “understanding”-based.
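One likely reason for the varying output is the decoding step: text generators of this kind usually sample each next token from the model's predicted probability distribution rather than always taking the most probable one. The sketch below illustrates temperature and top-k sampling over a toy distribution. The candidate words and their scores are invented for illustration, and the sketch does not describe any specific decoding configuration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=3, rng=random):
    """Sample one token from raw scores using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring candidates.
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # Softmax with temperature; subtract the max for numerical stability.
    scaled = [score / temperature for _, score in top]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    tokens = [token for token, _ in top]
    # random.choices normalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy scores for the word following "You consent to"; values are invented.
toy_logits = {"our": 2.0, "the": 1.5, "receive": 0.5, "banana": -3.0}

rng = random.Random(0)  # seeded only to make the demo repeatable
samples = [sample_next_token(toy_logits, rng=rng) for _ in range(20)]

# Always taking the single best candidate is the deterministic alternative.
greedy = max(toy_logits, key=toy_logits.get)
```

Repeated calls return different words, with probabilities shaped by the scores, while `top_k=1` reduces to always picking the most probable word, which would make the output deterministic.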
We see many potential applications of the GPT-2 (and the upcoming GPT-3) architecture in legal work. Our model, trained on troves of GDPR data, could serve as a helper tool for privacy lawyers in their contract drafting work, or as a reference source when a client has a question about a certain GDPR term or concept. To demonstrate this, let us ask our model about one of the GDPR principles.
- "Purpose limitation" principle, in accordance with the GDPR, means that personal information will only be processed in accordance with the purpose(s) for which the information was originally collected.
The model has described the “grasped” concept “in its own words.”
If you would like to know more about our GDPR model, based on the GPT-2 architecture, send us a message at contact@maigon.io, or contact me directly on LinkedIn.