Working with ChatGPT o1 preview: Error recognition and classification in raw machine transcriptions of C17th legal depositions

Overview

MarineLives is running a project to machine transcribe and machine/human correct 50,000 pages of legal depositions from the C17th. That's roughly 30 million words. Our goal is to complete a high quality machine transcription by the end of 2025, which is pretty ambitious!

We are using a bespoke machine transcription model run on the Transkribus platform, where we also host a demo of the model's output. The model is trained on roughly 400,000 tokens.

We are exploring the use of LLMs to automate the cleanup of the raw machine transcriptions, and over the last two days we have had a chance to road-test ChatGPT o1 preview and to compare it with Aurelius-HTR, a tailored version of ChatGPT which we have developed. o1 preview wins hands down.

You can read about our work with small scale and large scale LLMs on our Hugging Face site.


Exploring ChatGPT o1 preview

Prompt one

I set the scene with two lines at the start of my first prompt and stated the task:

You are an excellent pattern recogniser, systems analyst, and coder. You are helping me design a process for ChatGPT o1 to improve the quality of raw-HTR output. Please analyse these data and identify different types of HTR error which these python snippets correct. Here are the snippets:

The Python snippets I provided contain a range of machine transcription errors, in no particular order. The first part of each snippet contains an error; the second part contains the corrected text. I gave no guidance on error recognition or classification beyond the prompt and the data.

r'\bamonighte them the goodes articulate\b': 'amongste them the goodes articulate',
r'\bAnd th hee saith is true\b': 'And this hee saith is true',
r'\bby tresse of weather\b': 'by stresse of weather',
r'\bof the befence of yt\b': 'of the defence of yt',
r'\brighte worshipfull hard Zouch\b': 'righte worshipfull Richard Zouch',
r'\bsaieth and depose as afolloweth\b': 'saieth and deposeth as followeth',
r'\bsawe the said shippe comaged and searched\b': 'sawe the said shippe romaged and searched', # 'romaged' or 'rumaged'
r'\bsawe the comageinge of the said shippe after all her goodes were take out of her\b': 'sawe the romageinge of the said shippe after all her goodes were take out of her', # 'romageing' or 'rumageinge'
r'\bthe articulate shippe the Mary ffse\b': 'the articulate shippe the Mary Rose',
r'\bthe barrells of Anthones and Capers\b': 'the barrells of Anchovies and Capers',
r'\bthe rese of the sd goodes\b': 'the reste of the said goodes',
r'\bthe take of the Blessinge\b': 'the takeinge of the Blessinge',
r'\bthe usuall and ordinant fraighte prices \b': 'the usuall and ordinary fraighte prices ',
r'\bthe yeare of on bord 1636 or 1637\b': 'the yeare of our Lord 1636 or 1637',
r'\bthis examats beefe remembrance \b': 'this examinats beste remembrance ',
r'\bto Mashforde\b': 'to Washforde',
r'\bto Mashforde in Ireland\b': 'to Washforde in Ireland',
r'\bto Pishforde\b': 'to Washforde',
r'\bto Wishford in Ireland\b': 'to Washford in Ireland',
r'\bviselicet\b': 'videlicet',
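As an aside, pairs like these can be applied mechanically as a regex substitution pass over a raw page. Here is a minimal sketch in Python; the dictionary holds just a few of the pairs above for brevity, and the function name is ours, not part of any existing pipeline:

```python
import re

# A few of the error/correction pairs listed above (full set abridged)
CORRECTIONS = {
    r'\bby tresse of weather\b': 'by stresse of weather',
    r'\bthe rese of the sd goodes\b': 'the reste of the said goodes',
    r'\bviselicet\b': 'videlicet',
}

def apply_corrections(text, corrections=CORRECTIONS):
    """Run each regex substitution in turn over a raw HTR transcription."""
    for pattern, replacement in corrections.items():
        text = re.sub(pattern, replacement, text)
    return text

raw = 'cast away by tresse of weather, viselicet in the rese of the sd goodes'
print(apply_corrections(raw))
# → 'cast away by stresse of weather, videlicet in the reste of the said goodes'
```

The word-boundary anchors (\b) keep each substitution from firing inside longer words, which matters as the cumulative correction list grows.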

o1 preview took thirty-seven seconds of "think time", in which it went through the following process:

[Screenshot: o1 preview's reasoning process]

Prompt two

The results o1 preview then shared were pretty good, but I wanted to see if we could do better. So I constructed a second prompt, this time containing an approach to error classification which I had developed with o1 preview the previous day.

Here is the start of my second prompt in my ongoing conversation with o1 preview:

Good but not perfect. You can do better. I am going to give you a framework to categorize errors. Please incorporate this framework into your thinking. Modify it if appropriate. Then reanalyze the 20 Python snippets I gave you earlier. Here is the framework to categorize errors:

The framework I shared within the second prompt contained five sections, each section containing sub-types of errors together with examples. o1 preview very helpfully summarised the framework in its response to this second prompt, allocating each section a letter, then used those letters to classify errors.

[Screenshot: o1 preview's summary of the error classification framework]

This time o1 preview took fifty-nine seconds of "think time", integrating the error recognition framework and reanalyzing the Python snippets within that framework.

[Screenshot: o1 preview's reanalysis of the snippets]

The output from this second prompt was highly structured and analytical. o1 preview walked through each of the snippets, listing the original text and the corrected text, isolating and analyzing the error, and assigning an error type. The assignment of the error type was the smart bit, and is an approach which is highly generalisable as we build our cumulative HTR error/correction dataset. We plan to upload a dataset of 10,000 HTR error/correction pairs to our Hugging Face site when complete.
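One way to accumulate such labelled pairs is a simple typed record per correction. A sketch follows; the single-letter error_type code is a purely illustrative stand-in, not the actual section letters of our framework:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class HTRCorrectionPair:
    """One record in a cumulative HTR error/correction dataset."""
    error_text: str
    corrected_text: str
    error_type: str   # hypothetical code, e.g. 'B' for a letter-substitution section
    note: str = ''

pair = HTRCorrectionPair(
    error_text='by tresse of weather',
    corrected_text='by stresse of weather',
    error_type='B',   # illustrative label, not the project's real taxonomy
    note="dropped initial letter: 'tresse' for 'stresse'",
)
print(json.dumps(asdict(pair), indent=2))
```

Serialising each record to JSON in this way would make the eventual 10,000-pair dataset easy to publish and to feed back into fine-tuning.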

Here is an extract of this part of o1 preview's response for the first two of the twenty Python snippets:

[Screenshot: o1 preview's classification of the first two snippets]

o1 preview then provided a summary of error patterns, which was thought provoking and provides the basis for scaling up and expanding our analysis of HTR errors. The analysis is not perfect. For example, the single letter insertion examples are actually double letter insertions. But, when provided with much larger datasets, or fine-tuned with such larger datasets, I would expect a high degree of accuracy.

[Screenshot: o1 preview's summary of error patterns]
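Insertion lengths of this kind can also be checked mechanically rather than by eye. A small sketch using Python's standard-library difflib, applied to one of the snippets above (char_edits is our own helper name):

```python
from difflib import SequenceMatcher

def char_edits(error_text, corrected_text):
    """Return the non-trivial character-level edit operations between
    an HTR error and its human correction."""
    sm = SequenceMatcher(None, error_text, corrected_text)
    return [(op, error_text[i1:i2], corrected_text[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != 'equal']

edits = char_edits('the take of the Blessinge', 'the takeinge of the Blessinge')
print(edits)  # a single four-character insertion, not a single-letter one
```

Running this over a whole error/correction dataset would give exact counts of insertions, deletions, and substitutions, against which an LLM's pattern summary could be validated.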

Finally, in its response to prompt two, o1 preview provided some general observations on error patterns, which were insightful, and made some process improvement suggestions, which were plausible and merit working on.

Generalizing from the specific examples; making process recommendations

For those of you unfamiliar with C17th legal depositions, this is what they look like.

Example of raw transcription

Here is an example of one page of raw transcription from a volume of depositions from the 1642 to 1644 period. The page is HCA 13/58 f.525v. The text is a so-called full diplomatic transcription, and preserves abbreviations and contractions. It is human readable, but not perfect: it needs cleanup, and the shortforms need to be expanded to assist readability.

HCA 13/58 f.525v: Deposition of a thirty-three year old mercer named John Ford, who lived at Saint Mary Le Bow in London in 1643
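To illustrate the shortform expansion step mentioned above: only 'sd' → 'said' is attested in the snippets earlier in this article; the other table entries below are illustrative guesses at common C17th contractions, not our project's actual expansion list.

```python
import re

# Hypothetical expansion table: 'sd' -> 'said' is attested in the snippets
# above; the other entries are illustrative, not the project's actual list.
SHORTFORMS = {
    'sd': 'said',
    'wch': 'which',
    'examt': 'examinant',
}

_PATTERN = re.compile(r'\b(' + '|'.join(map(re.escape, SHORTFORMS)) + r')\b')

def expand_shortforms(text):
    """Expand known shortforms to assist readability of a diplomatic transcription."""
    return _PATTERN.sub(lambda m: SHORTFORMS[m.group(1)], text)

print(expand_shortforms('the sd examt saieth'))  # → 'the said examinant saieth'
```

A pass like this would sit after error correction in the pipeline, producing a reader-friendly text while leaving the diplomatic transcription itself untouched.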


