Text Inside the Computer can be Processed (in case you were curious)
Bill Inmon
Founder, Chairman, CEO, Best-Selling Author, University of Denver & Scalefree Advisory Board Member
PROCESSING TEXT INSIDE THE COMPUTER
By W H Inmon
Text has been there from the beginning. While keypunch operators were busy making holes in cards, text sat in a corner, like a little child in “time out”.
99% of those holes made in Hollerith cards were for the purpose of tabulating or calculating bills on highly structured (mostly) numeric data. Text was ignored from the beginning.
Then one day it dawned on some systems analyst that there was a lot of useful data to be found in text. Indeed, MOST useful data was embodied in the form of text. But the computer and its ubiquitous data base management system was impervious to text. Trying to put text into a computer was sort of like taking suntan lotion to the North Pole. In winter.
Or taking a pony to the Westminster Dog show. As an entrant to the show.
It was a non sequitur if there ever was one.
In order to process data, the computer demands that the data be organized into neatly defined, highly structured units of repetitious data. Any other organization befuddles the computer.
But there is a paradox here. The paradox is that much very important data – text – does not exist in a neatly defined, highly structured, repetitious format. There is a very basic conflict then between important data and the way the computer wants to handle data. This basic conflict is at the heart of one of the fundamental limitations of the computer.
This conflict has been present from the moment the first programmer wrote the first program until today.
There have been several attempts to resolve this conflict. Some of the attempts have been futile. Some attempts have been successful. In fact the attempts to resolve this conflict can be looked at as an evolution, or a progression.
This progression looks like this – comments, blobs, Soundex, tagging, NLP, and finally textual disambiguation.
The very first attempt to incorporate text into the world of automation was the inclusion of comments in a data structure. Raw text could be stuffed into a field of data. Then, when someone was reading the data, the text was there.
There were several obvious shortcomings to the comments approach. The first shortcoming was that fields were often fixed length. This meant that the programmer had to estimate the longest comment that someone might want to capture. This was always a haphazard guess. If the programmer guessed too long, much space was wasted. If the programmer guessed too short, the end user did not have enough space to write what was needed. In a word, the programmer could never win.
Or the programmer could create a variable length field. But early data base management systems were notorious for not being able to handle long variable length fields well.
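To make the trade-off concrete, here is a minimal sketch in Python (the field length and the comment text are purely hypothetical) of what the fixed length comments approach amounted to –

    # A fixed length comment field forces the programmer to guess a maximum
    # length up front (80 characters is a hypothetical guess).
    COMMENT_LENGTH = 80

    def store_comment(comment: str) -> str:
        # Guess too short and the user's text is silently truncated;
        # guess too long and every record carries wasted padding.
        return comment[:COMMENT_LENGTH].ljust(COMMENT_LENGTH)

    field = store_comment("Customer called twice about a billing error and asked for a supervisor.")
    print(len(field))   # always 80, no matter what was actually written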
The next step up the ladder was the blob (binary large object). The blob was a data type dedicated to storing text. The problem with the blob was that you could put text into it, but you were then stuck when it came to trying to read and make sense of the text it held. In general the user was stuck with having to manually read every blob. This was a very tedious thing to do when it came to reading thousands and thousands of blobs.
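The following sketch (an in-memory SQLite table with hypothetical names) shows the essential problem – the blob holds the text, but it hands it back as opaque bytes that can only be read one at a time –

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)")
    con.execute("INSERT INTO notes (body) VALUES (?)",
                ("Patient complains of chest pain, no prior cardiac history.".encode("utf-8"),))

    # The only way to analyze the text is to pull every blob back out
    # and read it manually; there is nothing structured to query against.
    for (body,) in con.execute("SELECT body FROM notes"):
        print(body.decode("utf-8"))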
The next approach was one that can be called the Soundex approach. Soundex attempted to match data based on its sound, so that values spelled differently but pronounced alike could be brought together. The Soundex approach is useful in some cases, but it is not a general purpose solution. It was, however, a first step toward standardization of certain forms of text.
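A stripped-down sketch of the idea (this is a simplified version of the classic Soundex coding, not a complete implementation) shows how spellings that sound alike collapse to the same code –

    # Simplified Soundex: keep the first letter, then encode consonants by
    # sound group, skipping vowels and collapsing repeats.
    def soundex(word: str) -> str:
        codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}
        word = word.upper()
        result = word[0]
        last = codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != last:
                result += code
            if ch not in "HW":      # H and W do not break up a run of repeats
                last = code
        return (result + "000")[:4]

    print(soundex("Smith"), soundex("Smyth"))   # both come out as S530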
Next came tagging. Tagging is the process of reading text and picking out certain words – typically keywords – for further reference. Tagging was a step in the right direction. With tagging, text could be read and certain words could be spotted.
But there are a lot of deficiencies with tagging. One of those deficiencies is that in order to do tagging you have to know what you are looking for before you start the tagging process. It is like saying that you have to know the answer to a question before you ask the question. Tagging presupposes that you know what you are looking for before you start to look for it. As such, tagging is better than nothing, but tagging – at best – gives only partially accurate and partially satisfactory answers.
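A minimal sketch makes the limitation visible; the keyword list below is hypothetical, and anything outside it simply goes unseen –

    # Tagging: pick out pre-declared keywords from raw text.
    KEYWORDS = {"refund", "cancel", "billing", "outage"}

    def tag(text: str) -> set:
        words = {w.strip(".,!?").lower() for w in text.split()}
        return words & KEYWORDS

    print(tag("Customer wants to cancel and is demanding a refund. The agent was rude."))
    # -> {'cancel', 'refund'}; the complaint about the rude agent is never
    #    spotted, because "rude" was not on the keyword list.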
After tagging comes NLP, or natural language processing. NLP is an academically oriented approach in which tagged words are classified, typically into taxonomies aimed at sentiment. NLP is a step up from tagging in that NLP recognizes the value of placing tagged words into broader classifications.
But NLP – which has been around for a long time – has its limitations. One of those limitations is that NLP, as typically applied, is essentially limited to taxonomies built for sentiment analysis. Typically only a few taxonomies are used, and even those look only at satisfaction or dissatisfaction.
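As a rough illustration of what that looks like in practice (the taxonomy below is a deliberately tiny, hypothetical one), the whole exercise reduces to counting which side of a word list a document falls on –

    # Taxonomy-driven sentiment scoring with two hypothetical word lists.
    SENTIMENT_TAXONOMY = {
        "satisfied":    {"great", "happy", "excellent", "resolved"},
        "dissatisfied": {"terrible", "angry", "broken", "refund"},
    }

    def classify(text: str) -> str:
        words = {w.strip(".,!?").lower() for w in text.split()}
        scores = {label: len(words & vocab)
                  for label, vocab in SENTIMENT_TAXONOMY.items()}
        return max(scores, key=scores.get)

    print(classify("The agent was great and my issue was resolved quickly."))
    # -> 'satisfied'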
The most sophisticated form of textual processing is that of textual disambiguation. There are many differences between textual disambiguation and NLP. The primary difference is that in addition to using taxonomies, textual disambiguation also determines the context of the text being analyzed. When doing analytical processing, the context of text is extremely important. So textual disambiguation goes one step beyond NLP.
But textual disambiguation also uses far more approaches than the taxonomical classification of data for the purpose of determining sentiment. Some of the other approaches used by textual disambiguation include the following (a brief sketch of two of them appears after the list) –
- Inline contextualization, where the context of a word is determined by the words immediately surrounding it,
- Proximity analysis, where the context of a word is determined by other words found in proximity to it,
- Custom variable contextualization, where the context of a word is determined by the actual structure of the word itself,
- Homographic resolution, where the context of a word is determined by recognizing who wrote the word,
- Negation resolution, where it is recognized that the meaning of a word has been negated by the text around it,
- And so forth.
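As an illustration, here is a minimal sketch of two of the approaches above, proximity analysis and negation resolution; the vocabulary, the window size, and the context labels are all hypothetical –

    # Look for known terms, assign a context, and check nearby words for negation.
    NEGATORS = {"no", "not", "denies", "without"}
    CONTEXT_TERMS = {"fracture": "injury", "discharge": "hospital event"}

    def disambiguate(text: str, window: int = 3) -> list:
        tokens = [w.strip(".,").lower() for w in text.split()]
        results = []
        for i, tok in enumerate(tokens):
            if tok in CONTEXT_TERMS:
                nearby = tokens[max(0, i - window):i]          # proximity analysis
                negated = any(w in NEGATORS for w in nearby)   # negation resolution
                results.append((tok, CONTEXT_TERMS[tok], negated))
        return results

    print(disambiguate("X-ray shows no fracture of the left wrist."))
    # -> [('fracture', 'injury', True)]  the term is found, assigned a context,
    #    and flagged as negated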
There are over 70 different algorithms that make up the processing done by textual disambiguation. And those algorithms have to be carefully orchestrated in order to be applied to raw text at the right time and in the right way.
The end result of textual disambiguation, however, is the creation of a standard data base that contains both the text and its context, which can then be used for analytical processing.
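What one row of such a data base might look like is simple enough to sketch (every column name below is hypothetical) –

    # One row of a text-and-context data base produced by textual disambiguation.
    row = {
        "document_id": "note-001",   # which source document the word came from
        "byte_offset": 112,          # where in the document the word was found
        "word":        "fracture",   # the raw text that was recognized
        "context":     "injury",     # the context assigned during disambiguation
        "negated":     True,         # negation resolution flagged "no fracture"
    }
    # Because every word now carries a context, ordinary query and reporting
    # tools can treat the table like any other structured data.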
At long last, text has been brought into a useful form that is compatible with the computer's need for structure.