DATA QUALITY - STRUCTURED AND TEXTUAL
Bill Inmon
Founder, Chairman, CEO, Best-Selling Author, University of Denver & Scalefree Advisory Board Member
By W H Inmon
The issues of data quality have been around since the first program was written. And if there is one concept that overshadows all others in data quality it is – GIGO – garbage in/garbage out. Furthermore, new advances in technology rely on the fact that the data that they operate on is “clean” – complete, up to date and accurate. And of course the data is not clean, neat and accurate.
You would think that data quality would be high on the list of the IT organization to get right. But strangely, it isn’t. Data quality is relegated to the back bench – to be addressed when there is time, and there is never time. Something more pressing comes along and butts in front of data quality.
Data quality is based in no small part on the works of Larry English, the original pioneer of data quality. Larry did his work in the day and age of structured data. The English approach always made the assumption that if you found incorrect data you should simply correct it.
Of course, there was a lot more to data quality in the structured environment than the correction of faulty data. But at the heart of managing data quality in the structured world was the idea that data found to be incorrect should be corrected.
Today, the world is different than it was in Larry English’s day.
Today – in addition to having structured data – we also have textual data. And with textual data comes a whole different understanding of the meaning of data quality. In many ways textual data quality is diametrically different than structured data quality.
The first difference in the meaning of textual data quality concerns the notion that data should be corrected if found to be faulty. In the world of text, you absolutely do not correct text, even if it is incorrect. If the author of the text says that 1 + 1 = 5, then that is what you work with in the world of text, even if the proposition is flawed. Suppose that I write on a bank loan application that I make $1,000,000 a year. When I wrote that number I forgot the decimal point, and I actually make $10,000 a year. I have written the data incorrectly on my loan application. So what happens to the incorrect data? Absolutely nothing. The bank is obligated by law to not change anything on my request for a loan, even if it is incorrect. Stated differently, you are breaking the law if you correct someone else’s bank loan application.
Textual data must be managed and manipulated as recorded, regardless of the correctness of the text.
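The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `StructuredRecord` and `TextualRecord`, the `annotate` method and the sample note are all hypothetical, invented here to show a structured value being overwritten while a piece of text stays exactly as recorded and any observation about it is kept alongside.

```python
from dataclasses import dataclass


@dataclass
class StructuredRecord:
    """Hypothetical structured row: a faulty value is simply overwritten."""
    annual_income: float

    def correct(self, new_value: float) -> None:
        self.annual_income = new_value


@dataclass(frozen=True)
class TextualRecord:
    """Hypothetical text record: the text is immutable once recorded."""
    text: str
    notes: tuple = ()

    def annotate(self, note: str) -> "TextualRecord":
        # Return a new record carrying the note; the text itself is untouched.
        return TextualRecord(self.text, self.notes + (note,))


# Structured world: the incorrect value is replaced in place.
row = StructuredRecord(annual_income=1_000_000.0)
row.correct(10_000.0)

# Textual world: the statement is kept as written, and the doubt is a note.
doc = TextualRecord("I make $1,000,000 a year")
flagged = doc.annotate("stated income looks implausible")
```

Marking the text record as frozen is one possible design choice: any attempt to assign to `doc.text` raises an error, so the only way to record a doubt is through a separate note.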
But the differences between text and structured data do not end there. Another difference relating to data quality is the need for context to accompany text. Context is needed to assign meaning to the text that has been encountered. In structured data, there is always – hidden or in plain sight – metadata that supplies context to the structured data. In structured data there are elements of metadata such as table name, key name, attribute name and so forth. But with text, context is implicit, not explicit. You have to have context in order to make sense of text.
For example – what does the word “fire” mean? Without context fire can mean many things. Fire may be a conflagration. Fire may be the firing of a gun. Fire may be a dismissal of an employee at work. Without context you don’t know what fire means.
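A toy Python sketch shows how surrounding words can supply the missing context. Everything here is an assumption made for illustration: the cue-word lists, the sense labels and the function name `sense_of_fire` are hypothetical, and simple word overlap stands in for whatever a real textual-analytics tool would do.

```python
# Hypothetical cue words for three senses of "fire". A real system would
# need a far richer vocabulary than this.
SENSE_CUES = {
    "conflagration": {"smoke", "burn", "flames", "building", "alarm"},
    "gunfire": {"gun", "shot", "weapon", "rifle", "trigger"},
    "dismissal": {"employee", "manager", "job", "boss", "hr"},
}


def sense_of_fire(sentence: str) -> str:
    """Guess which sense of "fire" a sentence uses from nearby cue words."""
    # Lowercase each word and strip surrounding punctuation.
    words = {w.strip(".,;:!?\"'").lower() for w in sentence.split()}
    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue words present there is no context, so no meaning is assigned.
    return best if scores[best] > 0 else "unknown"
```

With these assumed cues, `sense_of_fire("The manager decided to fire the employee.")` resolves to the dismissal sense, while the bare sentence `"Fire!"` returns `"unknown"` because nothing around the word supplies context.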
And there are many other essential differences between the structured world and the textual world, all of which are relevant to the processing that is done in the different environments and all of which relate to the idea of data quality.
Data quality is vitally important in both the structured and the textual environment. But the implementation and the manifestation of quality - the very meaning of data quality - is very different in the two environments.
Bill Inmon lives in Denver with his wife and his two Scotty dogs – Jeb and Lena. Bill wakes Jeb and Lena up every morning. Jeb always greets Bill with his morning howl. If you didn’t know better you would think that Jeb was dying. But that is just his way of waking up.
I was a friend and a peer of Larry English. I spoke at many conferences with Larry over the years. Larry was passionate about his contribution to our profession and he was truly the pioneer of data quality. In addition, Larry was a truly wonderful human being. We all remember and miss Larry.