BIG DATA AND TEXT: A REALITY CHECK by W H Inmon
Bill Inmon
Founder, Chairman, CEO, Best-Selling Author, University of Denver & Scalefree Advisory Board Member
Several years ago, when Big Data was a new thing, I read that one of the features of Big Data was that it could store and manage unstructured data. I was most happy to see this advancement in technology. It has long been known that text is a big part of unstructured data and that text makes up a tremendous amount of the data found in the corporation. Furthermore, despite the volume of text that existed, no one seemed to be doing much with it. So, when I read that someone was doing something about text, I was excited.
There is much data wrapped up in text that is quite useful for making decisions. A small sampling of valuable textual sources includes:
Emails
Call center conversations
Contracts
Warranties
Insurance claims
Customer feedback
The Internet
And many more sources.
And it was true that this information was being ignored for the most part in most organizations.
I thought that it was about time that someone was doing something about all of this valuable data.
Then I looked into it more deeply and found that the real claim being made about Big Data was that storage had become so inexpensive that it was now possible to store and manage textual data. But as far as actually turning text into a useful form, Big Data did nothing particularly useful except store the data economically.
In order to turn text into a useful form, text needs to undergo a significant transformation. Text needs to be disambiguated before it is useful for analysis.
Disambiguation is a difficult discussion. As a simple example of disambiguation, you read the word “bridge”. Now what is being said? Bridge could refer to a card game. Bridge could be a way to cross a stream without getting wet. Bridge could be something the dentist puts in your mouth. Bridge could be a way to have a four-way telephone conversation. And there are probably even more interpretations of the meaning of bridge. And there are literally thousands of other words that also have multiple meanings.
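To make the "bridge" example concrete, here is a minimal Python sketch of one very simple way to pick a meaning: count how many cue words for each sense appear in the surrounding sentence. The sense labels, the cue-word lists, and the function name `disambiguate` are all assumptions made up for illustration. This is not how textual ETL or any particular product works, and real disambiguation requires far more than keyword overlap.

```python
import re

# Hypothetical cue words for each sense of "bridge" (illustrative only;
# a real vocabulary would be much larger and domain specific).
BRIDGE_SENSES = {
    "card game": {"cards", "trick", "bid", "partner", "trump"},
    "river crossing": {"river", "stream", "cross", "traffic", "span"},
    "dental appliance": {"dentist", "tooth", "teeth", "crown", "mouth"},
    "conference call": {"call", "phone", "dial", "conference", "line"},
}


def disambiguate(sentence, senses):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears, i.e. the text alone
    does not give enough context to decide.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("The dentist fitted a bridge over the missing tooth", BRIDGE_SENSES))
    print(disambiguate("We played bridge and my partner won every trick", BRIDGE_SENSES))
    print(disambiguate("I saw a bridge", BRIDGE_SENSES))
```

Note that the last sentence comes back as `None`: with no surrounding context, the word stays ambiguous, which is exactly the problem described above.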
Or consider another conversation. Two guys are standing on a corner. A young lady passes by. One guy says to the other – “She’s hot.”
Now what is being said here?
One interpretation is that the young lady is attractive and the guy would like to have a date with her. Or it could be Houston, Texas in July and the temperature is 95 degrees and the humidity is 100%. The lady is covered in sweat. Or it could be that the two guys are doctors in a hospital and one doctor has just taken the temperature of the lady. She is sick and has a temperature of 104 degrees. The meaning of – she’s hot – depends entirely on the context of how the words were said and who said them.
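The point that meaning depends on who spoke and where can also be sketched in code. The Python example below is a toy, under stated assumptions: the `Utterance` fields, the 90-degree threshold, and the rule order in `interpret_hot` are all invented for illustration, and the final fallback is simply a default guess. It only shows that the same words produce different readings once context is attached to them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """Text plus the context it was spoken in (all fields are hypothetical)."""
    text: str
    speaker_role: str                        # e.g. "doctor", "bystander"
    setting: str                             # e.g. "hospital", "street"
    ambient_temp_f: Optional[float] = None   # outdoor temperature, if known


def interpret_hot(u):
    """Pick a reading of "hot" from context using naive, hand-written rules."""
    if "hot" not in u.text.lower():
        return "not applicable"
    # A doctor speaking in a hospital most likely means a fever.
    if u.speaker_role == "doctor" and u.setting == "hospital":
        return "fever"
    # Assumed threshold: very warm weather suggests the person is overheated.
    if u.ambient_temp_f is not None and u.ambient_temp_f >= 90:
        return "overheated"
    # With no other context, fall back to the colloquial reading (a guess).
    return "attractive"


if __name__ == "__main__":
    print(interpret_hot(Utterance("She's hot.", "doctor", "hospital")))
    print(interpret_hot(Utterance("She's hot.", "bystander", "street", 95.0)))
    print(interpret_hot(Utterance("She's hot.", "bystander", "street")))
```

The text of all three utterances is identical; only the attached context changes the result.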
And these examples are just the tip of the tip of the tip of a big iceberg. It is one thing to store text. It is quite something else to actually understand what that text is saying.
At about that time I happened to be going to Silicon Valley, and I had conversations with several large companies (IBM, Teradata), several startups (Cloudera, Hortonworks) and several venture capitalists about text and Big Data. I mentioned to them that in order to be useful, text has to be disambiguated. You would have thought that I was speaking Greek to them. They didn’t understand a word I said.
In fact, I was treated as an interloper. I offered to share with them knowledge of how text could be made useful and they were rather rude in their response. Over and over I was told that I just didn’t “get it” (whatever that meant). What I did get was that merely storing text economically was not the same thing as actually using and understanding text.
The same sentiments go for accessing text. It is one thing to access text efficiently. But if text has not first been disambiguated, the smoothest access to text in the world is a pointless exercise.
The fact that you had to transform text before it was useful was unwelcome news to the early proponents and vendors of Big Data.
In fact, merely storing text economically is only one step in solving a long and complex problem. Storing text economically and proclaiming that you have made text useful is like saying that having ocean water will turn the Sahara into a lush landscape. If you want to use ocean water to turn the Sahara into a lush landscape, you first need to desalinate the water. And desalination is a complex, time-consuming, expensive process. But only after you desalinate the water is it potable and useful for life.
And it’s the same with text. If you want to make text really useful, you have to go through a complex, arduous, expensive process of disambiguation of text. And disambiguation is very different from storing text or efficiently accessing text.
Fortunately, there is technology today for disambiguation of text: textual ETL. The world is discovering, one application and one company at a time, the profound value of disambiguating text. No thanks to the Big Data vendors.
Bill Inmon lives in Denver, Colorado. Two of Bill’s latest books are TURNING TEXT INTO GOLD and HEARING THE VOICE OF THE CUSTOMER, Technics Publications, 2018. Bill’s company – Forest Rim Technology – disambiguates text.
Author | Data & Operations Executive | Culture & Organizational Design Expert | 2 x PhD
5 years ago: As usual, Bill, I enjoy your thoughts. The basics of data can be lost on folks: 1) Data has two parts, content and context. Content = "She's Hot!" - Context = the girl's appearance, the girl's temperature, the car she is walking by, something else. To your point, Bill, context cannot be ignored. 2) Data is the artifact of a process - all data comes from a process, and if we drive our understanding back to the process, we get the context. Perhaps the "unstructured-ness" gains a little structure. 3) Data is always in the past; there is no "future" data to analyze. A person said to me, "I have a picture when you were younger." I responded, "Good, because all pictures of me are from when I was younger" (followed by a good laugh). That is data. 4) All data has the universal element of time, and it singularly ties together all data. It happened and begins to age immediately. For example, I went to high school with Johnny Depp and Nicolas Cage, just at different high schools. Time is the binder; neither of these guys knows me. Bill, I appreciate your thoughts and agree we need to get back to basics on data and data analysis, especially when we discuss text or unstructured data. Thanks for the insights and for letting me comment.
IBM and Gordion Knot Performance slayer
5 years ago: Text is the simplest data to deal with; indexing is different. BTree has too many limitations.
Assoc Director Production at IQVIA
5 years ago: It seems that the big data buzzword very often means just having unstructured data ready for future applications, with no regard for what those future applications are.
Data Architect @PrivateContractor | Agile Data Model Requirements Workshops, Dimensional Modelling, Data Architecture, Strategic Services, Business Intelligence, Data Warehousing, Cloud Computing
5 years ago: Another good article from Bill. In 2005, my ex-boss Carlos Martinez MD, a world leader in pharmacoepidemiology, published a paper regarding a breakthrough. They had managed to extract drug dosage levels from medical text documents for research. This required high levels of ingenuity and development. Free text has always been a challenge and storage alone is not enough. https://onlinelibrary.wiley.com/doi/abs/10.1002/pds.1151
Senior Delivery Manager @ Material | Business Analytics * Decision Intelligence | Data Tech Delivery Solution * Change Management
5 years ago: Interestingly, it reminds me of the following statement (it's more than two decades old now): "You can catch all the minnows in the ocean and stack them together and they still do not make a whale". In other words (and as in the current day), a great deal of effort (read: money) can be spent bringing together unrelated data without the context and then hoping to get something hugely useful and insightful out of it.