Text Inside the Computer can be Processed (in case you were curious)
Bill Inmon
Founder, Chairman, CEO, Best-Selling Author, University of Denver & Scalefree Advisory Board Member
PROCESSING TEXT INSIDE THE COMPUTER
By W H Inmon
Text has been there from the beginning. While keypunch operators were busy making holes in cards, text sat in a corner, like a little child in “time out”.
99% of those holes made in Hollerith cards were for the purpose of tabulating or calculating bills on highly structured (mostly) numeric data. Text was ignored from the beginning.
Then one day it dawned on some systems analyst that there was a lot of useful data to be found in text. Indeed, MOST useful data was embodied in the form of text. But the computer and its ubiquitous data base management system was impervious to text. Trying to put text into a computer was sort of like taking suntan lotion to the North Pole. In winter.
Or taking a pony to the Westminster Dog show. As an entrant to the show.
It was a non sequitur if there ever was one.
In order to process data, the computer demands that the data be organized into neatly defined, highly structured units of repetitious data. Any other organization befuddles the computer.
But there is a paradox here. The paradox is that much very important data – text – does not exist in a neatly defined, highly structured, repetitious format. There is a very basic conflict then between important data and the way the computer wants to handle data. This basic conflict is at the heart of one of the fundamental limitations of the computer.
This conflict has been present from the moment the first programmer wrote the first program until today.
There have been several attempts to resolve this conflict. Some of the attempts have been futile. Some attempts have been successful. In fact the attempts to resolve this conflict can be looked at as an evolution, or a progression.
This progression looks like this – comments, blobs, Soundex, tagging, NLP, and finally textual disambiguation.
The very first attempt to incorporate text into the world of automation was the inclusion of comments in a data structure. Raw text could be stuffed into a field of data. Then, when someone was reading the data, the text was there.
There were several obvious shortcomings to the comments approach. The first shortcoming was that fields were often fixed length. This meant that the programmer had to estimate the longest comment that someone might want to capture. This was always a haphazard guess. If the programmer guessed too long, much space was wasted. If the programmer guessed too short, the end user did not have enough space to write what was needed. In a word, the programmer could never win.
Or the programmer could create a variable length field. But early data base management systems were notorious for not being able to handle long variable length fields well.
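To make the trade-off concrete, here is a minimal sketch in Python (the field length and the comment text are purely hypothetical) of what the fixed length comments approach amounted to –

    # A fixed length comment field forces the programmer to guess a maximum
    # length up front (80 characters is a hypothetical guess).
    COMMENT_LENGTH = 80

    def store_comment(comment: str) -> str:
        # Guess too short and the user's text is silently truncated;
        # guess too long and every record carries wasted padding.
        return comment[:COMMENT_LENGTH].ljust(COMMENT_LENGTH)

    field = store_comment("Customer called twice about a billing error and asked for a supervisor.")
    print(len(field))   # always 80, no matter what was actually written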
The next step up the ladder was the blob (binary large object). The blob was a data type dedicated to storing text. The problem with the blob was that you could put text into it, but you were then stuck when it came to trying to read and make sense of the text it held. In general the user was stuck with having to manually read every blob. This was a very tedious thing to do when it came to reading thousands and thousands of blobs.
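The following sketch (an in-memory SQLite table with hypothetical names) shows the essential problem – the blob holds the text, but it hands it back as opaque bytes that can only be read one at a time –

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)")
    con.execute("INSERT INTO notes (body) VALUES (?)",
                ("Patient complains of chest pain, no prior cardiac history.".encode("utf-8"),))

    # The only way to analyze the text is to pull every blob back out
    # and read it manually; there is nothing structured to query against.
    for (body,) in con.execute("SELECT body FROM notes"):
        print(body.decode("utf-8"))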
The next approach was one that can be called the Soundex approach. Soundex attempted to match data based on its sound, so that values spelled differently but pronounced alike could be brought together. The Soundex approach is useful in some cases, but it is not a general purpose solution. It was, however, a first step toward standardization of certain forms of text.
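A stripped-down sketch of the idea (this is a simplified version of the classic Soundex coding, not a complete implementation) shows how spellings that sound alike collapse to the same code –

    # Simplified Soundex: keep the first letter, then encode consonants by
    # sound group, skipping vowels and collapsing repeats.
    def soundex(word: str) -> str:
        codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}
        word = word.upper()
        result = word[0]
        last = codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != last:
                result += code
            if ch not in "HW":      # H and W do not break up a run of repeats
                last = code
        return (result + "000")[:4]

    print(soundex("Smith"), soundex("Smyth"))   # both come out as S530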
Next came tagging. Tagging is the process of reading text and picking out certain words – typically keywords – for further reference. Tagging was a step in the right direction. With tagging, text could be read and certain words could be spotted.
But there are a lot of deficiencies with tagging. One of those deficiencies is that in order to do tagging you have to know what you are looking for before you start the tagging process. It is like saying that you have to know the answer to a question before you ask the question. Tagging presupposes that you know what you are looking for before you start to look for it. As such, tagging is better than nothing, but tagging – at best – gives only partially accurate and partially satisfactory answers.
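A minimal sketch makes the limitation visible; the keyword list below is hypothetical, and anything outside it simply goes unseen –

    # Tagging: pick out pre-declared keywords from raw text.
    KEYWORDS = {"refund", "cancel", "billing", "outage"}

    def tag(text: str) -> set:
        words = {w.strip(".,!?").lower() for w in text.split()}
        return words & KEYWORDS

    print(tag("Customer wants to cancel and is demanding a refund. The agent was rude."))
    # -> {'cancel', 'refund'}; the complaint about the rude agent is never
    #    spotted, because "rude" was not on the keyword list.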
After tagging comes NLP, or natural language processing. NLP is an academically oriented approach in which tagged words are classified, typically into taxonomies aimed at sentiment. NLP is a step up from tagging in that NLP recognizes the value of placing tagged words into broader classifications.
But NLP – which has been around for a long time – has its limitations. One of those limitations is that NLP, as typically applied, is essentially limited to taxonomies built for sentiment analysis. Typically only a few taxonomies are used, and even those look only at satisfaction or dissatisfaction.
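As a rough illustration of what that looks like in practice (the taxonomy below is a deliberately tiny, hypothetical one), the whole exercise reduces to counting which side of a word list a document falls on –

    # Taxonomy-driven sentiment scoring with two hypothetical word lists.
    SENTIMENT_TAXONOMY = {
        "satisfied":    {"great", "happy", "excellent", "resolved"},
        "dissatisfied": {"terrible", "angry", "broken", "refund"},
    }

    def classify(text: str) -> str:
        words = {w.strip(".,!?").lower() for w in text.split()}
        scores = {label: len(words & vocab)
                  for label, vocab in SENTIMENT_TAXONOMY.items()}
        return max(scores, key=scores.get)

    print(classify("The agent was great and my issue was resolved quickly."))
    # -> 'satisfied'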
The most sophisticated form of textual processing is that of textual disambiguation. There are many differences between textual disambiguation and NLP. The primary difference is that in addition to using taxonomies, textual disambiguation also determines the context of the text being analyzed. When doing analytical processing, the context of text is extremely important. So textual disambiguation goes one step beyond NLP.
But textual disambiguation also uses far more approaches than the taxonomical classification of data for the purpose of determining sentiment. Some of the other approaches used by textual disambiguation include the following (a brief sketch of two of them appears after the list) –
- Inline contextualization, where the context of a word is determined by the words immediately surrounding it,
- Proximity analysis, where the context of a word is determined by other words found in proximity to it,
- Custom variable contextualization, where the context of a word is determined by the actual structure of the word itself,
- Homographic resolution, where the context of a word is determined by recognizing who wrote the word,
- Negation resolution, where it is recognized that the meaning of a word has been negated by the text around it,
- And so forth.
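As an illustration, here is a minimal sketch of two of the approaches above, proximity analysis and negation resolution; the vocabulary, the window size, and the context labels are all hypothetical –

    # Look for known terms, assign a context, and check nearby words for negation.
    NEGATORS = {"no", "not", "denies", "without"}
    CONTEXT_TERMS = {"fracture": "injury", "discharge": "hospital event"}

    def disambiguate(text: str, window: int = 3) -> list:
        tokens = [w.strip(".,").lower() for w in text.split()]
        results = []
        for i, tok in enumerate(tokens):
            if tok in CONTEXT_TERMS:
                nearby = tokens[max(0, i - window):i]          # proximity analysis
                negated = any(w in NEGATORS for w in nearby)   # negation resolution
                results.append((tok, CONTEXT_TERMS[tok], negated))
        return results

    print(disambiguate("X-ray shows no fracture of the left wrist."))
    # -> [('fracture', 'injury', True)]  the term is found, assigned a context,
    #    and flagged as negated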
There are over 70 different algorithms that make up the processing done by textual disambiguation. And those algorithms have to be carefully orchestrated in order to be applied to raw text at the right time and in the right way.
The end result of textual disambiguation, however, is the creation of a standard data base that contains both the text and its context, which can then be used for analytical processing.
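What one row of such a data base might look like is simple enough to sketch (every column name below is hypothetical) –

    # One row of a text-and-context data base produced by textual disambiguation.
    row = {
        "document_id": "note-001",   # which source document the word came from
        "byte_offset": 112,          # where in the document the word was found
        "word":        "fracture",   # the raw text that was recognized
        "context":     "injury",     # the context assigned during disambiguation
        "negated":     True,         # negation resolution flagged "no fracture"
    }
    # Because every word now carries a context, ordinary query and reporting
    # tools can treat the table like any other structured data.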
At long last, text has been brought into a useful form that is compatible with the computer's need for structure.