TEXTUAL METADATA INFRASTRUCTURE
Bill Inmon
Founder, Chairman, CEO, Best-Selling Author, University of Denver & Scalefree Advisory Board Member
By W H Inmon
A recent post about the metadata infrastructure required for textual data stated that converting text into a structured format calls for a different approach to structuring the data. The reason is the inherent flexibility of text. Text is fundamentally different from structured data. Anyone can say or write anything they want. There is no prescribed format and there are no constraints on speech or the written word. Text is inherently free form. As such, a different structuring of data is needed when text is converted to a structured format.
It is true that text needs to be placed into a structured format in order to be analyzed using standard analytical processing. The need for a structured format is dictated by analytical software such as Excel, Tableau, and knowledge graph data bases. But even so, there are some inherent differences between the classical structuring of data and the textual structuring of data.
To understand these differences, consider a simple classical data base structure with columns such as school and city.
The simple structured data base shows that the column headers directly specify, and so constrain, what data is to be found in the column. In the column for school, only a school belongs there. In the column for city, only cities and towns belong there. The column header directly describes and constrains the contents of the column.
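As a minimal sketch of this idea, the Python below treats each column header as a rule that a value must satisfy. The column names school and city come from the text. The name column, the reference sets, and the sample rows are illustrative assumptions and are not taken from the original figure.

```python
# Hypothetical reference sets; the values are illustrative assumptions only.
KNOWN_SCHOOLS = {"University of Denver", "Yale"}
KNOWN_CITIES = {"Denver", "New Haven"}

# Each column header carries its own constraint on the column's contents.
COLUMN_RULES = {
    "name": lambda v: isinstance(v, str) and v.strip() != "",
    "school": lambda v: v in KNOWN_SCHOOLS,
    "city": lambda v: v in KNOWN_CITIES,
}


def validate_row(row):
    """Return the columns whose value violates the constraint of its header."""
    return [col for col, ok in COLUMN_RULES.items() if not ok(row.get(col))]


good = {"name": "Bill", "school": "University of Denver", "city": "Denver"}
bad = {"name": "Bill", "school": "Denver", "city": "University of Denver"}

print(validate_row(good))  # no violations
print(validate_row(bad))   # the school and city values are in the wrong columns
```

The point of the sketch is only that the header alone decides what may appear in the column. A city placed in the school column is rejected without consulting any other column.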
Now consider a data base approach for textual data, with a column for word and a column for context.
In the textual data base, the column heading for word only indirectly describes what is in the column. The column for word contains only words, but the words can be anything. The column heading for word places no constraints on what the content of the column might be. The direct description of the contents of the word column is supplied by another column: context. The data found in context describes the data found in word in the same row.
The column “context” thus provides the direct description of the contents of the word column.
In a structured data base there are only direct column headings. In the textual data base there are indirect and direct descriptors of data.
The ability to have many types of data in the word column allows the structured data base for text to accommodate undisciplined and unstructured text.
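A minimal sketch of such a textual data base, using Python's built-in sqlite3 module, might look like the following. The table name textual, the context labels, and the sample rows are assumptions chosen for illustration; the text specifies only that there is a word column and a context column.

```python
import sqlite3

# In-memory data base with the two columns described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE textual (word TEXT NOT NULL, context TEXT NOT NULL)")

# The word column accepts any word; the context column in the same row
# says what kind of thing that word is. Labels here are hypothetical.
rows = [
    ("Denver", "city"),
    ("Jeb", "dog name"),
    ("Lena", "dog name"),
    ("snowed", "weather event"),
]
conn.executemany("INSERT INTO textual (word, context) VALUES (?, ?)", rows)


def words_in_context(context):
    """Return all words whose row carries the given context, sorted by word."""
    cur = conn.execute(
        "SELECT word FROM textual WHERE context = ? ORDER BY word", (context,)
    )
    return [r[0] for r in cur.fetchall()]


print(words_in_context("dog name"))
print(words_in_context("city"))
```

Here the header word constrains nothing, so very different kinds of words share one column. Analysis becomes possible by filtering on context, which plays the role that the column header plays in the classical structure.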
Bill Inmon lives in Denver with his wife and his two Scotty dogs, Jeb and Lena. It is a nice cool winter day in Denver. It snowed last night but melted off by noon. Jeb and Lena went out to the back yard in the sun and played. Lena is faster than Jeb because she is younger than him. But Lena puts up with Jeb in any case.
Metadata Everyone says it Nobody does it
Disambiguation Specialist
3 months ago: Bill - With respect, the column 'Name' by itself doesn't appear to have any constraints either. 'Jeb' and 'Lena' qualify as members of the 'Name' set and you could likely get away with 'Denver' as a City value associated with those two records. School value would be a challenge though.
Bill Inmon... apologies in advance. The words "Textual Metadata Infrastructure" sound to me like they could use a Megazord suffix... like some monstrous mechanical armed vehicle robot Zord thing that the Power Rangers would summon to action to combat intergalactic villainy. In all seriousness... I recognize that your work to take advantage of the monstrous volume of undisciplined and unstructured text is impactful and essential. How could we have been saving and yet not using unstructured data all these years? With modern tools and a little INMONnovation... it looks like the advantages of making use of unstructured data are just around the corner.
Interesting. As the context column is filled in, it will add relations. I would say an LLM is well suited to fill the context with business terminology, and then it is a matter of denormalizing and querying.
Data Warehouse Architect in Hach
3 months ago: Hi Bill Inmon, as you mention, free text is more complex than structured data. In real life, words can be misspelled. For example, “desert” can have a different meaning depending on the context, but at the same time it can be misspelled as “dessert”. What is your suggestion for dealing with these problems, misspellings and ambiguous meanings? Thanks