Using Positional Frequency to Identify High & Low-Value Words

John M.

Founder/CEO @ RedFile Technologies, Inc | Veteran, Patented Inventor, Author, Master of Smoke & Flame

Published: July 25, 2016

Positional word frequency involves identifying how many times individual words appear at the same relative locations within the pages or documents in a collection. Positional word frequency solves major problems that occur when performing three basic functions involving unstructured content:

Classification
Attribute Extraction / Coding
Search

Without positional word frequency, low-value words can cause clutter in text searches and misclassifications in text classification systems, and can make it difficult to extract document attributes. The use of positional information removes those difficulties.
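As a rough illustration of the counting step, here is a minimal Python sketch. It assumes that "position" means a word's token index within a document, which is a simplification of the relative page locations described above. The function name positional_frequency and the sample sentences are hypothetical.

```python
from collections import Counter

def positional_frequency(docs):
    """Count how many documents contain each word at each token position."""
    counts = Counter()
    for text in docs:
        for position, word in enumerate(text.lower().split()):
            counts[(position, word)] += 1
    return counts

# Hypothetical one-line "documents" that share boilerplate.
docs = [
    "This lease is between Acme and Bolt",
    "This lease is between Cargo and Delta",
    "This lease is between Echo and Foxtrot",
]
freq = positional_frequency(docs)

print(freq[(0, "this")])  # 3: same word at the same position in every document
print(freq[(4, "acme")])  # 1: a word that varies at that position
```

Words with a count equal to the number of documents are candidates for boilerplate; words with a count of one are candidates for document-specific variables.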

Let’s look at how positional frequency works. We’ll examine a process where files are first classified, select attributes are then extracted, and finally, the files are indexed for ad hoc searching. As we’ll see, the value of the same words changes during the process depending on their positional frequency and which of the three basic functions is being performed.

Classification

When classifying files, you want to group files that have common words occurring in the same patterns. For example, most contracts contain many boilerplate provisions. They can be grouped for classification by associating the ones that have the same boilerplate terms in the same relative positions. For classification purposes, the common or recurring word patterns have a high associative value.
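One way to picture this grouping is the sketch below. It treats each document as a set of (position, word) pairs and greedily groups documents whose sets overlap enough. The Jaccard similarity measure, the 0.5 threshold, and the function names are illustrative assumptions, not a description of any particular product.

```python
def positional_tokens(text):
    """Represent a document as a set of (token position, word) pairs."""
    return set(enumerate(text.lower().split()))

def jaccard(a, b):
    """Share of positional tokens that two documents have in common."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster(docs, threshold=0.5):
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    clusters = []
    for idx, text in enumerate(docs):
        tokens = positional_tokens(text)
        for members in clusters:
            if jaccard(tokens, positional_tokens(docs[members[0]])) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

docs = [
    "This lease is between Acme and Bolt",
    "Invoice number 1001 due on receipt",
    "This lease is between Cargo and Delta",
    "Invoice number 1002 due on receipt",
]
groups = cluster(docs)
print(groups)  # [[0, 2], [1, 3]]
```

The two leases share five of nine positional tokens and land together; the invoices form their own group because they share nothing positionally with the leases.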

Attribution

Once similar files are classified, the common words that occur in the same positions in all the files in the cluster have very low informational value because they don’t help discriminate among the documents in the cluster. The terms that change in the same positions become the high-value words for attribute extraction because they tell you the variables being used in a particular document or file, e.g., the names of the contracting parties, the beginning term, and the state for jurisdiction in the event of disputes. These low-frequency terms are the high-value words within the clusters for attribute extraction purposes.
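The split between low-value and high-value positions within a cluster can be sketched as follows. This assumes every document in the cluster has the same number of tokens so that positions line up exactly; real documents would need some form of alignment first. The function name split_by_position is hypothetical.

```python
def split_by_position(cluster_docs):
    """Separate positions whose word never changes from positions whose word varies."""
    rows = [text.split() for text in cluster_docs]
    boilerplate, variables = {}, {}
    # zip(*rows) yields the words found at each position across all documents.
    for position, words in enumerate(zip(*rows)):
        if len(set(words)) == 1:
            boilerplate[position] = words[0]
        else:
            variables[position] = list(words)
    return boilerplate, variables

cluster_docs = [
    "This lease is between Acme and Bolt",
    "This lease is between Cargo and Delta",
    "This lease is between Echo and Foxtrot",
]
boilerplate, variables = split_by_position(cluster_docs)

print(boilerplate)  # {0: 'This', 1: 'lease', 2: 'is', 3: 'between', 5: 'and'}
print(variables)    # {4: ['Acme', 'Cargo', 'Echo'], 6: ['Bolt', 'Delta', 'Foxtrot']}
```

Positions 4 and 6 hold the contracting parties, which is the kind of value attribute extraction is after.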

Note that some common words can serve as flags to indicate where particular types of document attributes occur. For example, an equipment lease may start off with a paragraph with the name of the party leasing the equipment followed by “(hereinafter the ‘Lessor’)” or may have a line at the end that says “Date:” followed by the date of the contract. Those flags in conjunction with their positional coordinates enable the extraction of high-value variables and the association of those data elements with specific attribute or field names.
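The flag idea can be sketched with regular expressions combined with a simple positional constraint. In this hypothetical example the Lessor flag is only looked for in the opening line and the "Date:" flag only in the last three lines; the patterns, field names, and sample contract are all assumptions made for illustration.

```python
import re

# Flag patterns: the variable sits just before "(hereinafter the 'Lessor')"
# or just after "Date:". Both straight and curly quotes are accepted.
LESSOR_FLAG = re.compile(r"(?P<lessor>[^,\n]+?)\s*\(hereinafter the ['‘]Lessor['’]\)")
DATE_FLAG = re.compile(r"Date:\s*(?P<date>.+)")

def extract_attributes(text):
    """Use flags plus rough positions to pull out named attributes."""
    lines = text.strip().splitlines()
    attributes = {}
    # Positional constraint: the Lessor flag is expected in the opening line.
    match = LESSOR_FLAG.search(lines[0])
    if match:
        attributes["lessor"] = match.group("lessor").strip()
    # Positional constraint: the date flag is expected near the end.
    for line in reversed(lines[-3:]):
        match = DATE_FLAG.match(line.strip())
        if match:
            attributes["date"] = match.group("date").strip()
            break
    return attributes

contract = """Acme Equipment Co. (hereinafter the 'Lessor') agrees to lease the equipment listed below.
One forklift, model X.
Date: 1 March 2016"""

attrs = extract_attributes(contract)
print(attrs)  # {'lessor': 'Acme Equipment Co.', 'date': '1 March 2016'}
```

Restricting each flag to where it is expected keeps a stray "Date:" in the body of the document from being mistaken for the contract date.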

Non-Attributed Linking of Related Files Across Classifications

Even without explicit labelling of file attributes, high-value variables can be used to associate specific documents that pertain to the same person, object, or transaction even though they fall in different clusters. In this approach, the common or recurring pattern words are dropped or negated leaving just the high-value variables. These high-value variables are then analyzed to see which files pertain to the same company, person, or other topic.

In our contract analogy, the boilerplate would be dropped leaving just key variables like the names and addresses of the contracting parties, and those can be analyzed to associate documents even from different clusters or classifications. Those documents could be other types of contracts or completely different types of documents like invoices or checks.
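A small sketch of this linking step follows. Within each cluster it drops the (position, word) pairs shared by every document, then indexes the remaining words to find terms that appear in more than one cluster. The cluster names, sample text, and function names are hypothetical, and the sketch assumes each cluster holds at least two documents (with only one, everything would count as boilerplate).

```python
from collections import defaultdict

def high_value_terms(cluster_docs):
    """Per document, keep words that are not at the same position in every document."""
    positional = [set(enumerate(text.lower().split())) for text in cluster_docs]
    boilerplate = set.intersection(*positional)
    return [{word for _, word in pairs - boilerplate} for pairs in positional]

def link_across_clusters(clusters):
    """Map each high-value term to the documents, in different clusters, that contain it."""
    index = defaultdict(set)
    for name, docs in clusters.items():
        for doc_id, terms in enumerate(high_value_terms(docs)):
            for term in terms:
                index[term].add((name, doc_id))
    return {
        term: sorted(refs)
        for term, refs in index.items()
        if len({name for name, _ in refs}) > 1  # must span more than one cluster
    }

clusters = {
    "leases": [
        "This lease is between Acme and Bolt",
        "This lease is between Cargo and Delta",
    ],
    "invoices": [
        "Invoice to Acme for 500 dollars",
        "Invoice to Delta for 900 dollars",
    ],
}
links = link_across_clusters(clusters)
print(links["acme"])   # [('invoices', 0), ('leases', 0)]
print(links["delta"])  # [('invoices', 1), ('leases', 1)]
```

No attribute labels are needed here: the lease and the invoice for Acme are associated simply because the same high-value term survives in both once the boilerplate is gone.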

Searching

Search precision can be improved substantially by the use of positional operators that enable users to specify where requested search terms occur either in absolute page positions or relative to other terms.

A well-designed text search architecture gives users search operators or qualifiers so they can search for just the high-value terms within specified file classifications. This approach eliminates most clutter by dropping the words that are essentially noise within the context of each cluster. Users can also find specific file or document types by searching for their classification labels directly, without having to construct search logic to locate them.
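To make the operators concrete, here is a minimal sketch of a search function with a classification filter, an absolute positional operator, and a relative one. The parameter names (label, max_position, near, window) are invented for this example and do not correspond to any specific search engine's syntax.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    label: str  # classification label assigned earlier in the process
    text: str

def search(docs, term, label=None, max_position=None, near=None, window=3):
    """Return ids of documents where `term` satisfies the given qualifiers."""
    term = term.lower()
    hits = []
    for doc_id, doc in enumerate(docs):
        if label is not None and doc.label != label:
            continue  # classification filter
        words = doc.text.lower().split()
        positions = [i for i, w in enumerate(words) if w == term]
        if max_position is not None:
            # Absolute operator: term must occur at or before this token index.
            positions = [i for i in positions if i <= max_position]
        if near is not None:
            # Relative operator: term must occur within `window` tokens of `near`.
            anchors = [i for i, w in enumerate(words) if w == near.lower()]
            positions = [i for i in positions
                         if any(abs(i - j) <= window for j in anchors)]
        if positions:
            hits.append(doc_id)
    return hits

docs = [
    Doc("lease", "This lease is between Acme and Bolt"),
    Doc("invoice", "Invoice to Acme for 500 dollars"),
    Doc("lease", "This lease is between Cargo and Acme"),
]

print(search(docs, "acme"))                                 # [0, 1, 2]
print(search(docs, "acme", label="lease"))                  # [0, 2]
print(search(docs, "acme", label="lease", max_position=4))  # [0]
print(search(docs, "acme", near="between", window=1))       # [0]
```

Each added qualifier narrows the result set, which is the precision gain described above: the unqualified search returns every document that mentions the term, while the qualified searches return only those where it appears in the expected classification and position.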

As explained above, using positional frequency to classify, attribute, and search unstructured content removes most of the major challenges in dealing with such content.

__________________________

BeyondRecognition has unique technology that enables it to classify documents based on visual appearance, i.e., without using any text, and to then identify common words within clusters of visually similar documents. It also provides fixed and relational positional operators for attribute extraction and search. The attribution and search functionalities operate on both text-related and non-textual glyphs.

The technology that enables the ability to drop out recurring word patterns within clusters of visually-similar files can also be extended to drop out recurring graphical elements like lines, logos, and other glyphs. The focus of the pattern matching can be adjusted down to individual graphical elements or zoomed out to higher levels like words, lines, blocks, pages, or documents.

For a free copy of our forthcoming book, Guide to Managing Unstructured Content, go to the following page:

https://beyondrecognition.net/guide-to-managing-unstructured-content/
