Using Positional Frequency to Identify High- and Low-Value Words
Positional word frequency involves identifying how many times individual words appear at the same relative locations within the pages or documents in a collection. Positional word frequency solves major problems that occur when performing three basic functions involving unstructured content:
- Classification
- Attribute Extraction / Coding
- Search
Without positional word frequency, low-value words can cause clutter in text searches and misclassifications in text classification systems, and can make it difficult to extract document attributes. The use of positional information removes those difficulties.
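As a rough illustration of the definition above, the following sketch counts how many documents contain each word at each relative position. The function name `positional_frequency`, the whitespace tokenization, and the idea of dividing each document into equal-sized position buckets are assumptions made for this example, not a description of any particular product.

```python
from collections import Counter

def positional_frequency(docs, buckets=10):
    """Count, for each (bucket, word) pair, how many documents contain it.

    Assumption: a document's "relative location" is approximated by
    splitting its tokens into `buckets` equal-sized position ranges.
    """
    counts = Counter()
    for text in docs:
        tokens = text.lower().split()
        seen = set()  # count each (bucket, word) once per document
        for i, tok in enumerate(tokens):
            bucket = i * buckets // len(tokens)
            seen.add((bucket, tok))
        counts.update(seen)
    return counts

# Hypothetical two-document collection
docs = [
    "this lease is made by acme corp",
    "this lease is made by zenith llc",
]
freq = positional_frequency(docs, buckets=7)
```

In this toy collection, `freq[(0, "this")]` is 2 because both documents start with the same word, while `freq[(5, "acme")]` is 1 because that position varies between documents.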
Let’s look at how positional frequency works. We’ll examine a process where files are first classified, select attributes are then extracted, and finally, the files are indexed for ad hoc searching. As we’ll see, the value of the same words changes during the process depending on their positional frequency and which of the three basic functions is being performed.
Classification
When classifying files, you want to group files that have common words occurring in the same patterns. For example, most contracts contain many boilerplate provisions. They can be grouped for classification by associating the ones that have the same boilerplate terms in the same relative positions. For classification purposes the common or recurring word patterns have a high associative value.
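One way to picture this grouping is to treat each document as a set of (position, word) pairs and cluster documents whose sets overlap heavily. The sketch below is only an illustration under simplifying assumptions: it uses absolute token positions, Jaccard similarity, a greedy single pass, and a hypothetical threshold of 0.5. None of these choices come from the article.

```python
def position_signature(text):
    """Represent a document as a set of (token position, word) pairs."""
    return {(i, tok) for i, tok in enumerate(text.lower().split())}

def jaccard(a, b):
    """Share of (position, word) pairs that two signatures have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []  # each cluster is a list of document indices
    for idx, text in enumerate(docs):
        sig = position_signature(text)
        for members in clusters:
            if jaccard(sig, position_signature(docs[members[0]])) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

# Hypothetical documents: two leases sharing boilerplate, one invoice
docs = [
    "this lease is made by acme corp",
    "this lease is made by zenith llc",
    "invoice number 1001 due on receipt",
]
groups = cluster(docs)
```

Here the two leases share five of nine distinct (position, word) pairs, so they land in one cluster, and the invoice forms its own.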
Attribution
Once similar files are classified, the common words that occur in the same positions in all the files in the cluster have very low informational value because they don’t help discriminate among the documents in the cluster. The terms that change in the same positions become the high-value words for attribute extraction because they tell you the variables being used in a particular document or file, e.g., the names of the contracting parties, the beginning term, and the state for jurisdiction in the event of disputes. These low-frequency terms are the high-value words within the clusters for attribute extraction purposes.
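The distinction between constant and changing positions can be sketched as follows. This example assumes the documents in a cluster line up token by token, which real documents rarely do without an alignment step; the function name and sample text are hypothetical.

```python
def split_constant_and_variable(cluster_docs):
    """Separate positions whose word never changes from positions whose word varies.

    Assumption: documents in the cluster align token by token.
    """
    token_lists = [doc.split() for doc in cluster_docs]
    length = min(len(tokens) for tokens in token_lists)
    constant, variable = {}, {}
    for pos in range(length):
        column = [tokens[pos] for tokens in token_lists]
        if len(set(column)) == 1:
            constant[pos] = column[0]   # boilerplate: low value within the cluster
        else:
            variable[pos] = column      # changing terms: candidate attribute values
    return constant, variable

# Hypothetical cluster of two similar contracts
cluster_docs = [
    "Lease between Acme and Bolt dated 2020",
    "Lease between Zenith and Cray dated 2021",
]
constant, variable = split_constant_and_variable(cluster_docs)
```

The `variable` mapping holds the party names and dates at positions 2, 4, and 6, while `constant` holds the boilerplate words that carry no discriminating value inside the cluster.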
Note that some common words can serve as flags to indicate where particular types of document attributes occur. For example, an equipment lease may start off with a paragraph with the name of the party leasing the equipment followed by “(hereinafter the ‘Lessor’)” or may have a line at the end that says “Date:” followed by the date of the contract. Those flags in conjunction with their positional coordinates enable the extraction of high-value variables and the association of those data elements with specific attribute or field names.
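A minimal sketch of flag-based extraction, using the two flags mentioned above, might look like this. The regular expressions, the field names `lessor` and `contract_date`, and the sample text are assumptions for illustration; the sketch also assumes the lessor's name starts the line that carries the flag, and it reports a line number as a simple stand-in for positional coordinates.

```python
import re

# Hypothetical flag patterns: each maps a field name to a regex whose
# "value" group captures the text adjacent to the flag.
FLAGS = {
    "lessor": re.compile(
        r"^(?P<value>.+?)\s*\(hereinafter the ['‘]Lessor['’]\)", re.MULTILINE
    ),
    "contract_date": re.compile(r"^Date:\s*(?P<value>.+?)\s*$", re.MULTILINE),
}

def extract_attributes(text):
    """Return {field: (value, line number)} for each flag found in the text."""
    found = {}
    for field, pattern in FLAGS.items():
        match = pattern.search(text)
        if match:
            line_no = text.count("\n", 0, match.start("value")) + 1
            found[field] = (match.group("value"), line_no)
    return found

sample = (
    "Acme Leasing, Inc. (hereinafter the 'Lessor') agrees to lease the equipment.\n"
    "The term is twelve months.\n"
    "Date: March 3, 2020\n"
)
attributes = extract_attributes(sample)
```

The result associates each extracted value with a field name and the line on which it was found.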
Non-Attributed Linking of Related Files Across Classifications
Even without explicit labelling of file attributes, high-value variables can be used to associate specific documents that pertain to the same person, object, or transaction even though they fall in different clusters. In this approach, the common or recurring pattern words are dropped or negated, leaving just the high-value variables. These high-value variables are then analyzed to see which files pertain to the same company, person, or other topic.
In our contract example, the boilerplate would be dropped, leaving just key variables like the names and addresses of the contracting parties, and those can be analyzed to associate documents even from different clusters or classifications. Those documents could be other types of contracts or completely different types of documents like invoices or checks.
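The linking step can be sketched as an inverted index over the high-value variables that remain after boilerplate is dropped. The document ids, the sample values, and the exact-match comparison (case-insensitive) are assumptions for this example; real party names would likely need fuzzier matching.

```python
from collections import defaultdict
from itertools import combinations

def link_documents(variables_by_doc):
    """Map each pair of documents to the high-value terms they share."""
    docs_by_term = defaultdict(set)
    for doc_id, terms in variables_by_doc.items():
        for term in terms:
            docs_by_term[term.lower()].add(doc_id)

    links = defaultdict(set)
    for term, doc_ids in docs_by_term.items():
        for a, b in combinations(sorted(doc_ids), 2):
            links[(a, b)].add(term)
    return dict(links)

# Hypothetical high-value variables left after dropping boilerplate,
# drawn from documents in three different clusters
variables_by_doc = {
    "lease_17": {"Acme Corp", "Ohio"},
    "invoice_204": {"Acme Corp", "$1,200"},
    "check_88": {"Zenith LLC"},
}
links = link_documents(variables_by_doc)
```

Here the lease and the invoice are linked through the shared party name even though they belong to different document types, and the check is linked to nothing.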
Searching
Search precision can be improved substantially by the use of positional operators that enable users to specify where requested search terms occur either in absolute page positions or relative to other terms.
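To make the two kinds of positional operators concrete, here is a small sketch of a relative operator (`near`) and an absolute one (`in_region`). Both function names, the token-based distance, and the use of fractions of the page to express absolute position are assumptions for illustration only.

```python
def positions(tokens, term):
    """Token indexes at which a term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def near(text, term_a, term_b, max_gap):
    """Relative operator: do the two terms occur within max_gap tokens of each other?"""
    tokens = text.lower().split()
    return any(
        abs(i - j) <= max_gap
        for i in positions(tokens, term_a.lower())
        for j in positions(tokens, term_b.lower())
    )

def in_region(text, term, start_frac, end_frac):
    """Absolute operator: does the term occur within a given fraction range of the page?"""
    tokens = text.lower().split()
    n = len(tokens)
    return any(
        start_frac <= i / n < end_frac
        for i in positions(tokens, term.lower())
    )

# Hypothetical page text
page = "Equipment lease between Acme Corp and Zenith LLC governed by Ohio law"
acme_near_zenith = near(page, "Acme", "Zenith", max_gap=3)
lease_in_top_quarter = in_region(page, "lease", 0.0, 0.25)
ohio_in_top_quarter = in_region(page, "Ohio", 0.0, 0.25)
```

A query such as "lease in the top quarter of the page" matches this text, while "Ohio in the top quarter" does not, which is the kind of precision the operators are meant to add.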
The best text search architecture would give users search operators or qualifiers that restrict a search to high-value terms within specified file classifications. This approach eliminates most clutter by dropping what are essentially noise words within the contexts of the clusters. Users can also retrieve specific file or document types by searching on the classification labels directly, rather than constructing search logic to locate those types.
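A classification-scoped search of this kind could be sketched as below. The index layout (a classification label plus a set of high-value terms per document) and all names are hypothetical; the sketch assumes boilerplate words were already dropped when the index was built and that stored terms are lowercase.

```python
def search(index, query_terms, classification):
    """Return ids of documents in one classification whose high-value terms include every query term."""
    wanted = {term.lower() for term in query_terms}
    return sorted(
        doc_id
        for doc_id, entry in index.items()
        if entry["classification"] == classification
        and wanted <= entry["high_value_terms"]
    )

# Hypothetical index: boilerplate is absent, only high-value terms remain
index = {
    "doc1": {"classification": "equipment lease", "high_value_terms": {"acme", "ohio"}},
    "doc2": {"classification": "invoice", "high_value_terms": {"acme", "1200"}},
    "doc3": {"classification": "equipment lease", "high_value_terms": {"zenith", "texas"}},
}
acme_leases = search(index, ["Acme"], "equipment lease")
```

The query returns only `doc1`: the invoice mentioning the same company is excluded by the classification qualifier, and the user never has to write search logic describing what an equipment lease looks like.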
As explained above, having the ability to use positional frequency to classify, attribute, and search unstructured content removes most of the major challenges in dealing with such content.
__________________________
BeyondRecognition has unique technology that enables it to classify documents based on visual appearance, i.e., without using any text, and to then identify common words within clusters of visually similar documents. It also provides fixed and relational positional operators for attribute extraction and search. The attribution and search functionalities operate on both text-related and non-textual glyphs.
The technology that drops out recurring word patterns within clusters of visually similar files can also be extended to drop out recurring graphical elements like lines, logos, and other glyphs. The focus of the pattern matching can be adjusted down to individual graphical elements or zoomed out to higher levels like words, lines, blocks, pages, or documents.
For a free copy of our forthcoming book, Guide to Managing Unstructured Content, go to the following page:
https://beyondrecognition.net/guide-to-managing-unstructured-content/