For the Sake of Codd, Please stop saying "Unstructured Data"

Edgar Codd could arguably be called the father of modern data science. As most computer scientists and DBAs know, he invented the relational model of database design. It is a tightly structured paradigm, based on set theory and solid mathematics. Modern "relational" databases (I use that term loosely, since few products actually implement the relational model fully) represent data in tabular form, in which there is no ambiguity and each data item is stored in one place only (ignoring deliberate denormalization for now). Since Codd, many other data-management models have been developed, and a discussion over which model is "best" is a good way to start a religious world war.

But the bottom line is that all data - if it is to be called data - is structured, to some degree or in some way. It is common for budding data scientists to refer to text documents, social-media posts, medical-imaging metadata, and other non-tabular information stores as "unstructured" data. But this is a bad practice for a number of reasons.

The most obvious is that the term "unstructured" imposes a binary distinction on a property that is continuous by nature. Data is, in fact, structured to various degrees and in various ways. But it is always "structured" - otherwise it is not "data".

Wait, you might ask - what about a newspaper story? Or a movie script? Or my friend's latest Facebook status? I don't see any explicit, unambiguous mapping of symbols to semantics. How can you call such free-form text "structured" in any reasonable sense of the word?

Quite simply - I can call it "structured" because you are able to read and understand it.

Although you may not be able to see or quantify "structure" in a text document, a part of your brain is, in fact, capable of imposing a structural context on it. If that were not the case, the document would be nonsense. Thus, any collection of symbols that is semantically interpretable by the brain of some sentient being, somewhere - or even by some non-sentient machine - is in fact "structured".

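Claude Shannon, the father of information theory, developed a methodology for quantifying the information content of a collection of symbols. Re-purposing the concept of thermodynamic entropy used in the physical sciences (specifically, statistical mechanics), he formulated the following equation:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$

where H is the capital form of the Greek letter Eta (i.e., "e" as in "entropy"), P(x_i) is the probability of the i-th possible symbol x_i, and b is the base of the logarithm (base 2 gives a result in bits).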

Without going into the details of this mathematical formulation (you can study the Wikipedia entry for more insight or reference the large number of online scholarly papers which discuss this concept), this function assigns a non-negative real number to a data set X. Dividing that number by its maximum possible value (the logarithm of the number of distinct symbols) rescales it to a value between 0 and 1 - with a value close to 0 meaning that the data's structural constraints are such that nearly its entire information content can be inferred from just one of its entries x, and a value approaching 1 indicating a reduced ability to use one x to predict any other x.
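To make the 0-to-1 scale concrete, here is a minimal Python sketch of a normalized entropy calculation. It assumes the probabilities are estimated from how often each symbol appears in the input, and that the result is divided by the logarithm of the number of distinct symbols; the function name `normalized_entropy` is my own, not a standard one.

```python
import math
from collections import Counter


def normalized_entropy(symbols):
    """Shannon entropy of the observed symbol frequencies, scaled to [0, 1]."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if len(counts) < 2:
        # Zero or one distinct symbol: nothing is left to predict.
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # The maximum possible entropy for this many distinct symbols is log2(n).
    return entropy / math.log2(len(counts))


print(normalized_entropy("aaaaaaaa"))  # 0.0
print(normalized_entropy("abcdabcd"))  # 1.0
print(normalized_entropy("aaaaaaab"))  # roughly 0.54
```

Note that this simple version looks only at symbol frequencies and ignores their order, so the perfectly periodic string "abcdabcd" still scores 1. Capturing ordering would require a model that conditions on context, which is beyond this sketch.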

A value of 1 indicates that the collection of symbols is maximally entropic - that is, pure chaos or "white noise". This means that no number of items extracted from the set can provide any information about any of the unextracted items. If the set contained, say, 1000 items - and you extracted and examined 999 of those items - that would tell you absolutely nothing about the remaining 1 item.

Such a document would, in fact, be "unstructured". It would be the equivalent of a monkey pounding away at the keyboard. Yet this is hardly the picture that most data-analysis professionals have in mind when they use the term "unstructured data".
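One rough way to see the difference between ordinary prose and the monkey's output is to hand both to a general-purpose compressor. This is a stand-in of my own choosing rather than Shannon's formula: a compressor can only shrink data in which it finds regularities, so the ratio of compressed size to original size serves as a crude proxy for structure. The sample text and the helper name `compression_ratio` are assumptions made for illustration.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower = more redundancy found)."""
    return len(zlib.compress(data, 9)) / len(data)


# Illustrative sample of ordinary English prose.
prose = (
    "Edgar Codd invented the relational model of database design. "
    "It is a tightly structured paradigm based on set theory. "
    "Since Codd, many other data-management models have been developed, "
    "and a discussion over which model is best is a good way to start an argument. "
    "But all data, if it is to be called data, is structured to some degree."
).encode("utf-8")

# The "monkey at the keyboard": uniformly random bytes of the same length.
rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.randrange(256) for _ in range(len(prose)))

prose_ratio = compression_ratio(prose)
noise_ratio = compression_ratio(noise)
print(f"prose: {prose_ratio:.2f}")
print(f"noise: {noise_ratio:.2f}")
```

The prose should come out well below 1, while the random bytes should stay at or slightly above 1, because the compressor finds nothing in them to exploit. Even so-called "unstructured" free text sits a long way from the truly unstructured end of the scale.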

As we learn more about how the brain processes data - and develop better computational algorithms for modeling this process - the moniker "unstructured", when applied to data, will become less and less relevant. Until then, if you want to use the term "unstructured data" in a casual sense with your professional or academic colleagues when discussing free-form or loosely constrained information sources, that is perfectly fine (I do it all the time). Just be aware that when you are submitting official publications or giving formal presentations, referring to any theoretically comprehensible data model as "unstructured" is misleading and potentially confusing.
