Advanced Text Mining with OpenAI

In today's data-driven business landscape, the ability to derive actionable data insights from unstructured content can deliver a significant competitive advantage.

In a previous article, I described a methodology for mining data from unstructured content by applying Intelligent Automation capabilities. This approach can be used to automate the analysis of a wide variety of content including documents, reports, contracts, emails, case notes and websites. In addition to providing better business insights, text mining can reduce manual processing costs, as well as assisting the detection of high-risk or fraudulent activities.

The arrival of the ChatGPT service, Large Language Models (LLMs) and the associated Generative Pre-trained Transformer (GPT) AI models provides powerful new tools to extend our approach to content mining, delivering more insights with less configuration, training and set-up effort.

In this blog post, I'll explore how GPT models complement this content-mining methodology to deliver richer insights faster, and with less effort, as part of the large-scale processing of unstructured content.


Use Case: Research Automation

A classic use case for content mining (also known as text mining, text data mining or knowledge mining) is research automation. Research automation can be used to support jobs that require sifting through large volumes of content to locate important data: from banks providing investment guidance; security services doing criminal investigations; insurance fraud detection; risk analysis; to medical research.

Research automation brings together:

  • Automation tools like Robotic Process Automation (RPA) and API connectors to gather and source the content.
  • Low Code to quickly model the required workflow steps, define business rules, create user interfaces to train and guide the AI, and accelerate testing cycles.
  • Artificial Intelligence capabilities like computer vision, trained machine learning models and pre-trained models such as natural language processing.

Using a research automation scenario, I extended an existing workflow to test the benefits generative AI can bring to this use case. While there are many possible ways this groundbreaking technology could be applied, for this article I added calls to OpenAI GPT and Embeddings models at various steps in the process to:

  • Assist with classifying the content or sections of the content
  • Locate and extract key data points of interest
  • Generate a report summarising the content

I'll explore some of the benefits I saw in the following sections.


First Things First: Keeping Control of Your Data

Before diving into the benefits of GPT models, let's address the crucial concern of data security. While OpenAI's API usage policy offers robust protections and commitments about how data is processed and used, companies should not process data using a vendor without appropriate due diligence processes and contractual cover in place.

As our use case is mining large volumes of unstructured content for business insights, we can’t assume we won’t end up with personal data in the system, especially if we are mining content from public sources.

To address and mitigate concerns about data protection I've used OpenAI models hosted in Microsoft Azure, along with public domain content or public datasets available for testing machine learning (as ever, ensure you understand the ownership, consent, copyright, and legitimate interest required before processing any data or content, and apply appropriate security and risk management controls).

By hosting the models in Azure, we have complete control over where our data is sent, processed and stored, while retaining the right to delete that data without any concerns over data leakage.

At the time of writing, you need a corporate-level Azure account to request access to the latest OpenAI models in Azure, and to agree to the responsible use policy.


Semantic Classification and Semantic Search

While GPT models are often associated with chatbot interfaces such as ChatGPT, there's more to Large Language Models than just that. OpenAI's text Embeddings API generates a vector that numerically encodes how the model interprets the supplied text.

When combined with a database that can store and search vectors, and by labelling the vectors you store, you can search or classify content based on its meaning by measuring the relatedness of text strings.

This approach offers a powerful alternative to classification methods that rely on language structure, keywords, component sounds (e.g. phonemes), or phrases, as it comprehends the text's meaning. Semantic search enables the classification of new content by comparing its meaning to previously labelled content, while also facilitating the discovery of similar content.

The search result returns entries ranked by proximity (0-1), allowing us to identify how similar the new content is to previously classified content.

It also means we can search for related content. For example, if we were working with contracts broken into separate paragraphs or clauses, we can classify these by searching for similar, labelled contractual clauses in the database.
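The classification step above can be sketched in a few lines of Python. The vectors and labels below are toy stand-ins (real embeddings returned by the Embeddings API have far more dimensions, e.g. 1,536), and cosine similarity is one common way to measure the proximity of two vectors:

```python
import math

def cosine_similarity(a, b):
    """Relatedness of two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(new_vector, labelled_vectors):
    """Rank previously labelled vectors by proximity to the new content."""
    return sorted(
        ((label, cosine_similarity(new_vector, vector))
         for label, vector in labelled_vectors.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Toy vectors standing in for real embeddings of labelled contract clauses.
labelled = {
    "termination clause": [0.9, 0.1, 0.0],
    "payment terms": [0.1, 0.9, 0.1],
}
new_clause_vector = [0.8, 0.2, 0.1]  # embedding of the unclassified clause
best_label, proximity = classify(new_clause_vector, labelled)[0]
```

In practice a vector database performs this ranking at scale, but the principle is the same: the highest-proximity labelled entry gives us both a classification and a confidence score for the new content.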

Previously, we would have needed large-scale, big-data infrastructure to break down content into searchable components and to map relationships between them. With an LLM we can bypass this expensive step, populating an off-the-peg cloud database with vectors from the Embeddings API, without any need to understand or train the underlying algorithm!

Creating a semantic search service is a very cost-effective part of the broader content mining process, as the cost to process a token (roughly 4 English letters or characters) using an Embeddings model is approximately 1/20th of the cost per token of using a GPT model. Using Embeddings and vector search allows us to narrow down the content that is sent for analysis by a GPT model, restricting it to only the relevant sections of the overall documents or text block. This in turn reduces the number of expensive GPT tokens that are consumed.
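A rough, back-of-the-envelope sketch of the economics (the per-token prices below are hypothetical placeholders, not actual list prices; only the 1/20th ratio comes from the discussion above):

```python
# Hypothetical per-token prices; only the 1/20th ratio comes from the text.
GPT_COST_PER_1K_TOKENS = 0.03
EMBEDDING_COST_PER_1K_TOKENS = GPT_COST_PER_1K_TOKENS / 20

document_tokens = 100_000  # the full document set
relevant_tokens = 10_000   # sections surfaced by the vector search

# Naive approach: send every token to the GPT model.
naive_cost = document_tokens / 1000 * GPT_COST_PER_1K_TOKENS

# Two-stage approach: embed everything cheaply, then send only the
# relevant sections to the GPT model.
staged_cost = (document_tokens / 1000 * EMBEDDING_COST_PER_1K_TOKENS
               + relevant_tokens / 1000 * GPT_COST_PER_1K_TOKENS)
```

With these illustrative numbers, embedding everything and sending only a tenth of the content to the GPT model cuts the cost of the run by around 85%.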


GPT Powered Locators & Extractors

Another powerful way to use LLMs in our content mining use case is as data locators and extractors.

Traditional locators tend to require time-consuming training, where humans teach the algorithm how to locate desirable data using curated, labelled examples, or rely on computer vision, which needs a visual anchor in the content's layout to work from.

Pre-trained Natural Language Processing (NLP) locators extended this, allowing us to locate sections of text based on the entities in the text (dates, places, companies etc.), the sentiment of the content, or the intent of the sentences used. GPT models take this to the next level, as they can use the entire document (or multiple documents) as context, rather than working with one or two sentences. More importantly, GPT models can be configured without any technical knowledge or pre-training, just by telling the system what to look for in normal language.

GPT models can also return results in structured data formats like JSON or XML – again, just provide the model with some examples of how the returned data should be formatted and mapped, using natural language, and the model does the rest.
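As a sketch of what this looks like in practice, the prompt and model reply below are illustrative (the reply is hard-coded so the example runs offline), but they show why structured output is so convenient: the result can be parsed directly with a standard JSON library and mapped straight into downstream systems.

```python
import json

# A natural-language extraction instruction: no model training, just a
# description of the fields and the output shape we want back.
prompt = (
    "Extract the candidate's name and postcode from the CV below and "
    'return JSON with exactly these keys: "candidate_name", "postcode".\n\n'
    "CV:\nJane Smith, 12 High Street, Anytown AB1 2CD. Experienced analyst..."
)

# Hard-coded stand-in for the model's reply, so this example runs offline.
model_reply = '{"candidate_name": "Jane Smith", "postcode": "AB1 2CD"}'

extracted = json.loads(model_reply)
```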

This opens up whole new use cases: for “citizen developers”, who can now configure complex, natural language document extraction in seconds without any specialist training, and for automation professionals, by massively short-cutting the effort required to train and configure complex locators.

Another benefit is for “hard to train” use cases, such as locating a name or address on a CV, where there is no standard format to train from and often no “anchors” in the layout that indicate a name, other than perhaps a different font being used. In our testing, GPT models do an excellent job of identifying names and addresses, without needing to resort to format locators such as regular expressions to locate a postcode.


Content Creation & Summarisation

A third use case is summarising or creating content. For any use case where a report needs to be generated, GPT can summarise large amounts of text content in an infinite variety of styles and ways, depending on the prompts and examples provided.

In professions where the output of a workflow is some kind of document, especially where there is a high hourly cost of the knowledge worker creating that content, GPT offers some tantalising productivity boosts and time savers. For example, for caseworkers or legal teams, the combination of automation plus GPT AI can pre-read all the documentation associated with a case, summarise the main points, and highlight key areas, paragraphs, or aspects of the content that need to be focused on.

The approach to summarising content using an Azure GPT Model works in a very similar way to using the ChatGPT chatbot, with various options, or prompts, to guide the response you get. You can easily tune:

  • Tone of voice: by using different examples or role guidance via “messages” giving context to the model.
  • Creativity: by setting the “temperature” level to use (from 0-1 or 0-2 depending on the model version), with higher values resulting in less predictable outcomes.
  • Size of the response: by setting the max number of tokens to use, along with providing guidance in the prompt.
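The three knobs above can be brought together in a single request. The payload below follows the Chat Completions message format; the builder function and the prompt wording are my own illustration, not a prescribed API:

```python
def build_summary_request(text, tone="formal", temperature=0.2, max_tokens=300):
    """Assemble a Chat Completions-style payload for summarisation.

    Tone of voice comes from the system message, creativity from the
    temperature setting, and response size from max_tokens plus the
    guidance given in the prompt itself.
    """
    return {
        "messages": [
            {"role": "system",
             "content": f"You are a report writer. Summarise in a {tone} tone."},
            {"role": "user",
             "content": f"Summarise the following in under 200 words:\n\n{text}"},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

request = build_summary_request("Full text of the case documents goes here.",
                                tone="plain English", temperature=0.0)
```

The resulting dictionary is what you would pass to the Azure-hosted chat completion endpoint; setting the temperature to 0 makes repeated runs over the same case documents as consistent as possible.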

As the cost of an API call will depend on the total number of tokens used across both the request and response to the API, some pre-processing is recommended to only provide relevant information for summarisation. Here the separation and classification techniques discussed above can be applied to narrow down the content to be summarised.
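A minimal sketch of that narrowing step, assuming each section has already been embedded (toy two-dimensional vectors stand in for real embeddings, and 0.8 is an arbitrary threshold):

```python
import math

def cosine(a, b):
    """Proximity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def select_relevant(sections, query_vector, threshold=0.8):
    """Keep only the sections semantically close to the query, so far
    fewer (expensive) GPT tokens are spent on the summarisation call."""
    return [text for text, vector in sections
            if cosine(query_vector, vector) >= threshold]

# Each section of the case file paired with a toy embedding vector.
sections = [
    ("Claim history and prior incidents", [0.9, 0.1]),
    ("Office address and opening hours", [0.1, 0.9]),
]
relevant = select_relevant(sections, query_vector=[1.0, 0.0])
```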


Legal Implications

We are already starting to see legal cases against AI vendors for breach of copyright in the content used to train models, as well as new laws like the EU AI Act, which classifies different AI use cases according to their potential risk and introduces new obligations, transparency requirements, and rights of legal redress that must be taken into account by any company deploying AI in the business.

Wrapping the use of AI in technology that provides guardrails controlling when it is called, and monitoring the results, will mitigate many of these legal risks, at least for non-high-risk scenarios. Key capabilities to consider:

  1. Automate the redaction or anonymisation of personal data before it is passed to a model or included in a training data set.
  2. Maintain an audit trail of the data used to train or give context to a model (inc. any relevant copyright information), as well as the data sent to and from the API.
  3. Monitor the consistency or accuracy of results using techniques like benchmarking and sampling.
  4. Build “human-in-the-loop” intervention, correction and oversight of the AI into the process.
  5. Provide disclosure statements about when and how GPT models are being used.
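As an illustration of the first capability, a very minimal redaction pass might look like the sketch below. The patterns are illustrative only; a production system would use a proper PII-detection service, since simple patterns cannot catch names, for example:

```python
import re

# Illustrative patterns only; a production system would use a proper
# PII-detection service (simple regexes cannot catch names, for example).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "UK_POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
}

def redact(text):
    """Replace matched personal data with placeholder tags before the
    text is sent to a model or added to a training data set."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

redacted = redact("Contact jane.smith@example.com at AB1 2CD.")
```

Running the redaction step inside the automation layer, before any API call, also gives you a natural place to log what was removed for the audit trail described in point 2.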


Conclusions

As we have seen, Generative AI replaces or extends some aspects of the content mining process, with benefits such as:

  • Simplified configuration, making this AI highly accessible to citizen developers and IA professionals alike.
  • Massively lower costs and effort vs training bespoke machine learning models.
  • Accurately locating hard-to-train content extraction use cases.
  • Huge time savings for any workflow that requires generating content or summaries.

There are also some downsides to the use of GPT models including additional legal / data privacy considerations, maintaining audit trails, and ensuring appropriate and consistent results.

There are huge potential benefits from applying Generative AI to increase productivity in content mining use cases. When layered on strong governance and automation foundations, Generative AI promises to turbocharge business transformation and text-mining programmes.


Technical Notes:

In researching this article, use was made of:
