Text Inside the Computer can be Processed (in case you were curious)

PROCESSING TEXT INSIDE THE COMPUTER

By W H Inmon

Text has been there from the beginning. While keypunch operators were busy making holes in cards, text sat in a corner, like a little child in “time out”.

Ninety-nine percent of the holes made in Hollerith cards were there to tabulate or calculate bills against highly structured, mostly numeric data. Text was ignored from the beginning.

Then one day it dawned on some systems analyst that there was a lot of useful data to be found in textual data. Indeed, MOST useful data was embodied in the form of text. But the computer and its ubiquitous data base management system was impervious to text. Trying to put text into a computer was sort of like taking suntan lotion to the North Pole. In winter.

Or taking a pony to the Westminster Dog show. As an entrant to the show.

It was a non sequitur if there ever was one.

In order to process data, the computer demands that the data be organized into neatly defined, highly structured units of repetitious data. Any other organization befuddles the computer.

But there is a paradox here. The paradox is that a great deal of very important data – text – does not exist in a neatly defined, highly structured, repetitious format. There is a very basic conflict, then, between important data and the way the computer wants to handle data. This basic conflict is at the heart of one of the fundamental limitations of the computer.

This conflict has been present from the moment the first programmer wrote the first program until today.

There have been several attempts to resolve this conflict. Some of the attempts have been futile. Some attempts have been successful. In fact the attempts to resolve this conflict can be looked at as an evolution, or a progression.

This progression looked like this: comments, then blobs, then sound-ex, then tagging, then NLP, and finally textual disambiguation.

The very first attempt to incorporate text into the world of automation was the inclusion of comments in a data structure. Raw text could be stuffed into a field of data. Then when someone was reading the data, the text was there.

There were several obvious shortcomings to the comments approach. The first shortcoming was that fields were often fixed length. This meant that the programmer had to estimate the longest comment that someone might want to capture. This was always a haphazard guess. If the programmer guessed too long, much space was wasted. If the programmer guessed too short, the end user did not have enough space to write what was needed. In short, the programmer could never win.
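The fixed-length guessing game is easy to picture. The field length below is a hypothetical guess, not anyone's real schema –

```python
# A minimal sketch of the fixed-length comment field problem.
# FIELD_LEN is the size the programmer had to guess up front.
FIELD_LEN = 40

def store_comment(text: str) -> str:
    """Pad or truncate a comment to fit a fixed-length field."""
    return text[:FIELD_LEN].ljust(FIELD_LEN)

short = store_comment("Customer called to complain.")
long_ = store_comment("Customer called to complain about a late shipment and a billing error.")

# Guessed too long: the short comment wastes 12 characters of padding.
# Guessed too short: the end of the long comment is silently lost.
```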

Or the programmer could create a variable-length field. Early DBMSs were notorious for handling long variable-length fields poorly.

The next step up the ladder was the blob. The blob was a data type that was dedicated to storing text. The problem with the blob was that you could put text into the blob but you then were stuck when it came to trying to read and make sense of the text found in the blob. In general the user was stuck with having to manually try to read every blob. This was a very tedious thing to do when it came to reading thousands and thousands of blobs.
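A sketch of the blob era, with an illustrative table, makes the point: the text goes in easily enough, but the only way to analyze it is to pull every blob back out and read it –

```python
import sqlite3

# Illustrative blob storage. The table and column names are assumptions,
# not any particular system's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)")
conn.execute("INSERT INTO notes (body) VALUES (?)",
             ("Patient reports chest pain after exercise.".encode(),))
conn.execute("INSERT INTO notes (body) VALUES (?)",
             ("Follow-up visit, no complaints.".encode(),))
conn.commit()

# The only way to "analyze" the blobs is to fetch each one and read it.
for note_id, body in conn.execute("SELECT id, body FROM notes ORDER BY id"):
    print(note_id, body.decode())
```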

The next approach was one that can be called the “sound-ex” (Soundex) approach. The sound-ex approach attempted to unify data based on its sound. The sound-ex approach is useful in some cases, but is not a general purpose solution. The sound-ex approach was the first step toward standardization of certain forms of text.
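The classic Soundex code illustrates the idea: a name is reduced to a letter and three digits so that similar-sounding names collide. This is a standard sketch of the published algorithm, not any particular product's implementation –

```python
def soundex(name: str) -> str:
    """Reduce a name to the standard 4-character Soundex code."""
    # Letter groups that share a digit because they sound alike.
    codes = {c: d for d, group in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result = name[0]          # the first letter is kept as-is
    prev = codes.get(name[0])
    for c in name[1:]:
        if c in "HW":
            continue          # H and W are transparent: adjacent codes still merge
        code = codes.get(c)   # vowels map to None and reset the run
        if code is not None and code != prev:
            result += str(code)
        prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # similar sounds, same code
```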

Next came tagging. Tagging is the process of reading text and picking out certain words – typically keywords – for further reference. Tagging was a step in the right direction. With tagging, text could be read and certain words spotted.

But there are a lot of deficiencies with tagging. One of those deficiencies is that in order to do tagging you have to know what you are looking for before you start the tagging process. It is like saying that you have to know the answer to a question before you ask the question. Tagging presupposes that you know what you are looking for before you start to look for it. As such tagging is better than nothing but tagging – at best – gives only partially accurate and partially satisfactory answers.
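A sketch of tagging shows the limitation directly: the keyword list has to exist before any text is read, so anything not on the list is invisible. The keywords below are illustrative –

```python
import re

# The tag list must be chosen BEFORE reading -- exactly the limitation
# described above. These keywords are illustrative assumptions.
KEYWORDS = {"fracture", "allergy", "diabetes"}

def tag(text: str) -> set[str]:
    """Return the known keywords that appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return KEYWORDS & words

print(tag("Patient has a wrist fracture and a penicillin allergy."))
# Finds fracture and allergy -- but "penicillin" goes unnoticed because
# nobody thought to put it on the list.
```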

After tagging comes NLP, or natural language processing. NLP is an academically oriented approach in which tagged words are classified, typically into taxonomies oriented toward sentiment. NLP is a step up from tagging in that NLP recognizes the value of placing tagged words into broader classifications.

But NLP – which has been around for a long time – has its limitations. One of those limitations is that NLP is essentially limited to taxonomies built for sentiment analysis. Typically only a few taxonomies are used, and even those look only at satisfaction or dissatisfaction.
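A sketch of taxonomy-based sentiment shows how thin this is: words are simply looked up in small satisfaction and dissatisfaction word lists. The lists below are illustrative, not taken from any NLP package –

```python
# Tiny satisfaction/dissatisfaction taxonomies -- illustrative only.
TAXONOMY = {
    "satisfaction": {"great", "helpful", "fast", "friendly"},
    "dissatisfaction": {"slow", "rude", "broken", "refund"},
}

def classify(text: str) -> dict[str, int]:
    """Count how many words of the text fall in each taxonomy."""
    words = text.lower().split()
    return {label: sum(w.strip(".,!") in vocab for w in words)
            for label, vocab in TAXONOMY.items()}

print(classify("Service was slow and the agent was rude, I want a refund."))
```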

The most sophisticated form of textual processing is that of textual disambiguation. There are many differences between textual disambiguation and NLP. The primary difference is that in addition to using taxonomies, textual disambiguation also determines the context of the text being analyzed. When doing analytical processing, the context of text is extremely important. So textual disambiguation goes one step beyond NLP.

But textual disambiguation also uses far more approaches than taxonomical classification of data for the purpose of determining sentiment. Some of the other approaches used by textual disambiguation include –

  - Inline contextualization, where the context of a word is determined by the words immediately surrounding it,

  - Proximity analysis, where the context of a word is determined by other words found in proximity to it,

  - Custom variable contextualization, where the context of a word is determined by the structure of the word itself,

  - Homographic resolution, where the context of a word is determined by recognizing who wrote it,

  - Negation resolution, where words whose meaning has been negated by surrounding text are recognized as such,

  - And so forth.

There are over 70 different algorithms that make up the processing done by textual disambiguation. And those algorithms have to be carefully orchestrated in order to be applied to raw text at the right time and in the right way.
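The orchestration itself can be pictured as an ordered pipeline of passes over the raw text, each pass refining the output of the last. The three toy passes below stand in for the seventy-odd real algorithms; their names and rules are illustrative only –

```python
def lowercase(tokens):        # normalization pass
    return [t.lower() for t in tokens]

def drop_stopwords(tokens):   # noise-removal pass (illustrative word list)
    return [t for t in tokens if t not in {"the", "a", "of"}]

def negate(tokens):           # toy negation pass: mark the word after "not"
    out, negating = [], False
    for t in tokens:
        if t == "not":
            negating = True
            continue
        out.append("NOT_" + t if negating else t)
        negating = False
    return out

# Order matters: negation must run after normalization, and so on.
PIPELINE = [lowercase, drop_stopwords, negate]

def run(text: str) -> list[str]:
    tokens = text.split()
    for stage in PIPELINE:
        tokens = stage(tokens)
    return tokens

print(run("The service was not a success"))
```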

The end result of textual disambiguation, however, is the creation of a standard data base that contains both the text and its context, which can then be used for analytical processing.
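What such a data base might look like can be sketched with ordinary SQL. The schema and rows below are illustrative assumptions –

```python
import sqlite3

# Once each word of interest carries the context assigned to it, raw text
# answers structured questions like any other data. Schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE text_context (doc_id INT, word TEXT, context TEXT)")
conn.executemany("INSERT INTO text_context VALUES (?, ?, ?)", [
    (1, "court", "court of law"),
    (2, "court", "tennis court"),
    (3, "court", "court of law"),
])

# An ordinary analytical query over what used to be unreadable text.
for context, n in conn.execute(
        "SELECT context, COUNT(*) FROM text_context "
        "GROUP BY context ORDER BY context"):
    print(context, n)
```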

At long last text has been brought into a useful form that is compatible with the need for structure by the computer.

  
