Advanced Text Mining with OpenAI
In today's data-driven business landscape, the ability to derive actionable data insights from unstructured content can deliver a significant competitive advantage.
In a previous article, I described a methodology for mining data from unstructured content by applying Intelligent Automation capabilities. This approach can be used to automate the analysis of a wide variety of content including documents, reports, contracts, emails, case notes and websites. In addition to providing better business insights, text mining can reduce manual processing costs and assist in detecting high-risk or fraudulent activities.
The arrival of the ChatGPT service, Large Language Models (LLMs) and the associated Generative Pretrained Transformer (GPT) AI models provides powerful new tools that extend our approach to content mining, delivering more insights with less configuration, training or set-up effort.
In this blog post, I'll explore how GPT models complement this content-mining methodology to deliver richer insights faster and with less effort, as part of the large-scale processing of unstructured content.
Use Case: Research Automation
A classic use case for content mining (also known as text mining, text data mining or knowledge mining) is research automation. Research automation supports jobs that require sifting through large volumes of content to locate important data, from investment guidance in banks, criminal investigations by security services, insurance fraud detection and risk analysis through to medical research.
Research automation brings together:
Using a research automation scenario, I extended an existing workflow to test the benefits generative AI can bring to this use case. While there are lots of possible ways this groundbreaking technology could be applied, for this article I added calls to OpenAI GPT models and LLMs at various steps in the process to:
I'll explore some of the benefits I saw in the following sections.
First Things First: Keeping Control of Your Data
Before diving into the benefits of GPT models, let's address the crucial concern of data security. While OpenAI’s API usage policy offers robust protections and commitments about how data is processed and used, companies should not process data using a vendor without appropriate due diligence processes and contractual cover in place.
As our use case is mining large volumes of unstructured content for business insights, we can’t assume we won’t end up with personal data in the system, especially if we are mining content from public sources.
To address and mitigate concerns about data protection I've used OpenAI models hosted in Microsoft Azure, along with public domain content or public datasets available for testing machine learning (as ever, ensure you understand the ownership, consent, copyright, and legitimate interest required before processing any data or content, and apply appropriate security and risk management controls).
By hosting the models in Azure, we have complete control over where our data is sent, processed and stored, while retaining the right to delete that data without any concerns over data leakage.
At the time of writing, you need a corporate-level Azure account to request access to the latest OpenAI models in Azure, and to agree to the responsible use policy.
Semantic Classification and Semantic Search
While GPT models are often associated with chatbot interfaces such as ChatGPT, there's more to Large Language Models than just that. OpenAI's text Embeddings API generates a vector that numerically encodes how the model interprets the supplied text.
If you store these vectors in a database that supports vector search, and label them, you can search or classify content based on its meaning by measuring how closely related two pieces of text are.
This approach offers a powerful alternative to classification methods that rely on language structure, keywords, component sounds (e.g. phonemes), or phrases, as it comprehends the text's meaning. Semantic search enables the classification of new content by comparing its meaning to previously labelled content, while also facilitating the discovery of similar content.
The search result returns entries ranked by proximity (0-1), allowing us to identify how similar the new content is to previously classified content.
It also means we can search for related content. For example, if we were working with contracts broken into separate paragraphs or clauses, we can classify these by searching for similar, labelled contractual clauses in the database.
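The classify-by-similarity idea described above can be sketched in a few lines. This is a minimal illustration, not the workflow used for this article: the three-dimensional vectors and clause labels are made up, real embedding vectors have hundreds or thousands of dimensions and would come from an embeddings model, and a vector database would do the ranking for you. Cosine similarity is assumed as the relatedness measure; the score scale a real database reports depends on the metric it uses.

```python
import math

def cosine_similarity(a, b):
    """Relatedness of two vectors: values near 1.0 mean a very similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical labelled store. In practice each vector would be the embedding
# of a previously classified clause, held in a vector database.
LABELLED_CLAUSES = [
    ("termination", [0.9, 0.1, 0.0]),
    ("payment", [0.1, 0.9, 0.1]),
    ("confidentiality", [0.0, 0.2, 0.9]),
]

def rank_labels(vector, labelled=LABELLED_CLAUSES):
    """Return (score, label) pairs, most similar first."""
    scored = [(cosine_similarity(vector, stored), label) for label, stored in labelled]
    return sorted(scored, reverse=True)

# Made-up embedding of a new, unlabelled clause.
new_clause_vector = [0.8, 0.2, 0.1]
ranking = rank_labels(new_clause_vector)
best_score, best_label = ranking[0]
print(best_label, round(best_score, 2))
```

The top-ranked label classifies the new clause, and the rest of the ranking is the "related content" list. A score threshold could be added so that content unlike anything already labelled is routed to a person instead.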
Previously we would have needed large-scale, big-data infrastructure to break down content into searchable components and to map relationships between them. With an LLM we can bypass this expensive step, populating an off-the-peg cloud database with vectors from the embeddings API, without any need to understand or train the underlying algorithm!
Creating a semantic search service is a very cost-effective part of the broader content mining process, as the cost to process a token (roughly 4 characters of English text) using an Embeddings model is approx. 1/20th of the cost per token of using a GPT model. Using Embeddings and vector search allows us to narrow down the content that is sent for analysis by a GPT model, restricting it to only relevant sections of the overall documents or text block. This in turn reduces the number of expensive GPT tokens that are consumed.
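To make the arithmetic concrete, here is a rough sketch of the comparison. It uses relative cost units rather than real prices: the GPT cost per token is set to 1.0 as an assumption, the embedding cost is 1/20th of that as described above, and tokens are estimated with the 4-characters rule of thumb. It only counts input tokens, so treat the result as an illustration of the ratio, not a bill.

```python
CHARS_PER_TOKEN = 4                # rule of thumb for English text
GPT_COST = 1.0                     # assumed relative cost per GPT token
EMBEDDING_COST = GPT_COST / 20     # approx. 1/20th of the GPT cost

def estimate_tokens(text):
    """Very rough token estimate; a real tokenizer will differ."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def relative_cost(sections, relevant_indexes):
    """Compare sending everything to GPT with embedding everything first
    and sending only the relevant sections to GPT."""
    all_tokens = sum(estimate_tokens(s) for s in sections)
    relevant_tokens = sum(estimate_tokens(sections[i]) for i in relevant_indexes)
    send_everything = all_tokens * GPT_COST
    filter_first = all_tokens * EMBEDDING_COST + relevant_tokens * GPT_COST
    return send_everything, filter_first

# Ten made-up sections of 400 characters each, two of which are relevant.
sections = ["x" * 400] * 10
everything, filtered = relative_cost(sections, [0, 1])
print(everything, filtered)
```

In this made-up case, embedding all ten sections and sending two of them to GPT costs about a quarter of sending all ten. The saving shrinks as a larger share of the content turns out to be relevant.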
GPT Powered Locators & Extractors
Another powerful way to use LLMs in our content mining use case is as data locators and extractors.
Traditional locators tend to require time-consuming training, where humans teach the algorithms how to locate desirable data using curated, labelled examples, or rely on computer vision, which needs a visual anchor in the content’s layout to work from.
Pretrained Natural Language Processing (NLP) locators extended this, allowing us to locate sections of text based on the entities in the text (dates, places, companies etc.), the sentiment of the content, or the intent of the sentences used. GPT models take this to the next level, as they can use the entire document (or multiple documents) as context, rather than working with one or two sentences. More importantly, GPT models can be configured without any technical knowledge or pretraining, just by telling the system what to look for using normal language.
GPT models can also return results in data formats such as JSON or XML. Again, just give the model some examples of how the returned data should be formatted and mapped, using natural language, and the model does the rest.
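As a rough sketch of what this looks like in practice, the snippet below builds a plain-language extraction prompt that includes an example of the JSON shape wanted, and parses the reply defensively. The field names, prompt wording and sample reply are all illustrative assumptions, and the call to the model itself is deliberately left out, so this is not the configuration used for this article.

```python
import json

# Hypothetical fields to extract, e.g. from a CV.
FIELDS = ["name", "address"]

def build_prompt(document_text, fields=FIELDS):
    """Describe the task in plain language and show the JSON shape wanted."""
    example = {field: "..." for field in fields}
    lines = [
        "Extract the following fields from the document below: " + ", ".join(fields) + ".",
        "Reply with JSON only, in this shape, using null for anything not found:",
        json.dumps(example),
        "",
        "Document:",
        document_text,
    ]
    return "\n".join(lines)

def parse_reply(reply, fields=FIELDS):
    """Return a dict of the requested fields, or None if the reply is not a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only the fields that were asked for; anything missing becomes None.
    return {field: data.get(field) for field in fields}

prompt = build_prompt("Ada Lovelace, curriculum vitae ...")
# A made-up reply, standing in for what a model might return.
sample_reply = '{"name": "Ada Lovelace", "address": null, "notes": "ignored"}'
print(parse_reply(sample_reply))
```

Models do not always return valid JSON, so the parse step returns None instead of raising an error. A workflow can then retry the call or pass the document to a person.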
This opens up whole new use cases, both for “citizen developers” who can now configure complex, natural language document extraction in seconds without any specialist training and for automation professionals by massively short-cutting the effort required to train and configure complex locators.
Another benefit is for “hard to train” use cases, such as locating a name or address on a CV where there is no standard format to train from and often no “anchors” in the layout that indicate a name, other than perhaps a different font being used. In our testing, GPT models do an excellent job of identifying names and addresses, without needing to resort to format locators such as regular expressions to locate a postcode.
Content Creation & Summarisation
A third use case is summarising or creating content. For any use case where a report needs to be generated, GPT can summarise large amounts of text content in a wide variety of styles, depending on the prompts and examples provided.
In professions where the output of a workflow is some kind of document, especially where there is a high hourly cost of the knowledge worker creating that content, GPT offers some tantalising productivity boosts and time savers. For example, for caseworkers or legal teams, the combination of automation plus GPT AI can pre-read all the documentation associated with a case, summarise the main points, and highlight key areas, paragraphs, or aspects of the content that need to be focused on.
The approach to summarising content using an Azure GPT Model works in a very similar way to using the ChatGPT chatbot, with various options, or prompts, to guide the response you get. You can easily tune:
As the cost of an API call will depend on the total number of tokens used across both the request and response to the API, some pre-processing is recommended to only provide relevant information for summarisation. Here the separation and classification techniques discussed above can be applied to narrow down the content to be summarised.
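One simple way to do this pre-processing is to pick the most relevant sections that fit within a fixed token budget. The sketch below assumes each section already has a relevance score (for example from the semantic search step) and uses the rough 4-characters-per-token estimate. The scores, section texts and budget are made up, and the budget should leave room for the response, since response tokens count towards the cost too.

```python
def estimate_tokens(text):
    """Very rough token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)

def select_for_summary(scored_sections, token_budget):
    """Pick the highest-scoring sections that fit the budget.

    scored_sections is a list of (relevance_score, text) pairs.
    Returns the chosen texts in their original document order,
    plus the estimated number of tokens they use.
    """
    by_relevance = sorted(
        enumerate(scored_sections), key=lambda item: item[1][0], reverse=True
    )
    chosen_indexes = []
    used = 0
    for index, (_score, text) in by_relevance:
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # too big for the remaining budget, try a smaller section
        chosen_indexes.append(index)
        used += cost
    return [scored_sections[i][1] for i in sorted(chosen_indexes)], used

# Made-up sections with made-up relevance scores.
sections = [
    (0.2, "a" * 400),   # about 100 tokens, low relevance
    (0.9, "b" * 400),   # about 100 tokens, high relevance
    (0.7, "c" * 800),   # about 200 tokens, medium relevance
]
selected, tokens_used = select_for_summary(sections, token_budget=300)
print(len(selected), tokens_used)
```

Returning the chosen sections in document order, not score order, keeps the text sent for summarisation readable as a sequence.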
Legal Implications
We are already starting to see legal cases against AI vendors for breach of copyright in the content used to train models, as well as new laws like the EU AI Act, which classifies different AI use cases according to their potential risk, and introduces new obligations, transparency requirements, and legal redress that must be taken into account by any company deploying AI in the business.
Wrapping the use of AI in technology that provides guardrails controlling when it is called, and monitoring the results, will help meet many of these legal requirements, at least for non-high-risk scenarios. Key capabilities to consider:
Conclusions
As we have seen, Generative AI replaces or extends some aspects of a content mining process, with benefits such as:
There are also some downsides to the use of GPT models including additional legal / data privacy considerations, maintaining audit trails, and ensuring appropriate and consistent results.
There are huge potential benefits from applying Generative AI to increase productivity in content mining use cases. When layered on strong governance and automation foundations, Generative AI promises to turbocharge business transformation and text-mining programmes.
Technical Notes:
In researching this article use was made of: