BIG DATA AND TEXT: A REALITY CHECK by W H Inmon


Several years ago, when Big Data was a new thing, I read that one of the features of Big Data was that it could store and manage unstructured data. I was most happy to see this advancement in technology. It has long been known that text is a big part of unstructured data and that text makes up a tremendous amount of the data found in the corporation. Furthermore, despite the volume of text that existed, no one seemed to be doing much with it. So, when I read that someone was doing something about text, I was excited.

There is much data wrapped up in text that is quite useful for making decisions. A small sampling of valuable textual sources includes:

  Emails

  Call center conversations

  Contracts

  Warranties

  Insurance claims

  Customer feedback

  The Internet

  And many more sources.

And it was true that this information was being ignored for the most part in most organizations.

I thought that it was about time that someone was doing something about all of this valuable data.

Then I looked into it more deeply and found that what was actually being said about Big Data was that storage had become so inexpensive that it was now possible to store and manage textual data. But as far as turning text into a useful form, Big Data was doing nothing particularly useful except storing the data economically.

In order to turn text into a useful form, text needs to undergo a significant transformation. Text needs to be disambiguated before it is useful for analysis.

Disambiguation is a difficult subject. As a simple example of disambiguation, you read the word “bridge”. Now what is being said? Bridge could refer to a card game. Bridge could be a way to cross a stream without getting wet. Bridge could be something the dentist puts in your mouth. Bridge could be a way to have a four-way telephone conversation. And there are probably even more interpretations of the meaning of bridge. And there are literally thousands of other words that also have multiple meanings.

Or consider another conversation. Two guys are standing on a corner. A young lady passes by. One guy says to the other – “She’s hot.”

Now what is being said here?

One interpretation is that the young lady is attractive and the guy would like to have a date with her. Or it could be Houston, Texas in July and the temperature is 95 degrees and the humidity is 100%. The lady is covered in sweat. Or it could be that the two guys are doctors in a hospital and one doctor has just taken the temperature of the lady. She is sick and has a temperature of 104 degrees. The meaning of – she’s hot – depends entirely on the context of how the words were said and who said them.

And these examples are just the tip of the tip of the tip of a big iceberg. It is one thing to store text. It is quite something else to actually understand what that text is saying.
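To make the point concrete, here is a minimal Python sketch of what the simplest possible disambiguation might look like: pick the sense of a word whose cue words overlap most with the surrounding sentence. Everything in it is an assumption made up for illustration (the `SENSES` inventory, its cue words, and the `disambiguate` function), and it is a toy, not a description of how textual ETL or any product actually works.

```python
import re

# Hypothetical sense inventory: each sense lists context words that hint at it.
SENSES = {
    "bridge": {
        "card game": {"cards", "trick", "bid", "partner", "deck"},
        "river crossing": {"river", "stream", "cross", "traffic", "span"},
        "dental appliance": {"dentist", "tooth", "teeth", "crown", "mouth"},
        "conference call": {"call", "phone", "dial", "conference", "line"},
    },
    "hot": {
        "attractive": {"date", "dinner", "gorgeous"},
        "weather": {"houston", "july", "humidity", "sweat", "sun"},
        "fever": {"doctor", "hospital", "thermometer", "fever", "sick"},
    },
}


def disambiguate(word, sentence):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when the sentence offers no cue at all, which is the
    honest answer when the context is missing.
    """
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(cues & tokens) for sense, cues in SENSES[word].items()}
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return best if scores[best] > 0 else None


print(disambiguate("bridge", "The dentist fitted a bridge over the missing tooth"))
print(disambiguate("hot", "The doctor read the thermometer: she's hot"))
print(disambiguate("hot", "She's hot"))
```

Note what happens on the last call: with nothing but the words “She’s hot”, the sketch has no basis for choosing and returns `None`. The text alone, however cheaply it is stored, does not carry its own meaning. Real disambiguation has to cope with thousands of words, missing cues, and context that lives outside the sentence entirely, which is why it is so much harder than this toy suggests.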

At about that time I happened to be going to Silicon Valley, and I had conversations with several large companies (IBM, Teradata), several startups (Cloudera, Hortonworks) and several venture capitalists about text and Big Data. I mentioned to them that in order to be useful, text has to be disambiguated. You would have thought that I was speaking Greek to them. They didn’t understand a word I said.

In fact, I was treated as an interloper. I offered to share with them knowledge of how text could be made useful and they were rather rude in their response. Over and over I was told that I just didn’t “get it” (whatever that meant). What I did get was that merely storing text economically was not the same thing as actually using and understanding text.

The same sentiments go for accessing text. It is one thing to access text efficiently. But if text has not first been disambiguated, the smoothest access to text in the world is a pointless exercise.

The fact that you had to transform text before it was useful was unwelcome news to the early proponents and vendors of Big Data.

In fact, merely storing text economically is only one step in solving a long and complex problem. Storing text economically and proclaiming that you have made text useful is sort of like saying that having ocean water will turn the Sahara into a lush landscape. If you want to use ocean water to turn the Sahara into a lush landscape, you first need to desalinate the water. And desalination is a complex, time-consuming, expensive process. But only after you desalinate the water is it potable and useful for life.

And it’s the same with text. If you want to make text really useful, you have to go through a complex, arduous, expensive process of disambiguation of text. And disambiguation is very different from storing text or efficiently accessing text.

Fortunately, there is technology today for disambiguation of text: textual ETL. The world is discovering, one application at a time and one company at a time, the profound value of disambiguating text. No thanks to the Big Data vendors.

Bill Inmon lives in Denver, Colorado. Two of Bill’s latest books are TURNING TEXT INTO GOLD and HEARING THE VOICE OF THE CUSTOMER, Technics Publications, 2018. Bill’s company – Forest Rim Technology – disambiguates text.

David Holcomb

Author | Data & Operations Executive | Culture & Organizational Design Expert | 2 x PhD

5 years ago

As usual, Bill, I enjoy your thoughts. The basics of data can be lost on folks: 1) Data has two parts, content and context. Content = "She's Hot!" - Context = the girl's appearance, the girl's temperature, the car she is walking by, something else. To your point, Bill, context cannot be ignored. 2) Data is the artifact of a process - all data comes from a process and if we drive our understanding back to the process, we get the context. Perhaps the "unstructured-ness" gains a little structure. 3) Data is always in the past; there is no "future" data to analyze. A person said to me, "I have a picture of you when you were younger." I responded, "Good, because all pictures of me are from when I was younger" (followed by a good laugh). That is data. 4) All data has the universal element of time, and it singularly ties together all data. It happened and begins to age immediately. For example, I went to high school with Johnny Depp and Nicolas Cage, just at different high schools. Time is the binder; neither of these guys knows me. Bill, I appreciate your thoughts and agree we need to get back to basics on data and data analysis, especially when we discuss text or unstructured data. Thanks for the insights and letting me comment.

James McHugh

IBM and Gordion Knot Performance slayer

5 years ago

Text is the simplest data to deal with; indexing is different. B-tree has too many limitations.

Jan Ruzicka

Assoc Director Production at IQVIA

5 years ago

It seems that the big data buzzword very often just means having unstructured data ready for future applications, with no care for what those future applications are.

Russell Beech BSc (Hons), MSc, DipM, MCIM

Data Architect @PrivateContractor | Agile Data Model Requirements Workshops, Dimensional Modelling, Data Architecture, Strategic Services, Business Intelligence, Data Warehousing, Cloud Computing

5 years ago

Another good article from Bill. In 2005, my ex-boss Carlos Martinez MD, a world leader in pharmacoepidemiology, published a paper regarding a breakthrough. They had managed to extract drug dosage levels from medical text documents for research. This required high levels of ingenuity and development. Free text has always been a challenge and storage alone is not enough. https://onlinelibrary.wiley.com/doi/abs/10.1002/pds.1151

Himanshu Tiwari

Senior Delivery Manager @ Material | Business Analytics * Decision Intelligence | Data Tech Delivery Solution * Change Management

5 years ago

Interestingly, it reminds me of the following statement (it's more than 2 decades old now ;) - "You can catch all the minnows in the ocean and stack them together and they still do not make a whale". In other words (and as in the current day), a great deal of effort (read: money) can be spent in bringing together unrelated data without the context and then hoping to get something hugely useful and insightful out of it.
