For the Sake of Codd, Please stop saying "Unstructured Data"

Cade Bryant, MS CSc

Staff Software Engineer @ Origence | Full-stack software engineer specializing in C#, .NET Core, SQL, Q#, AWS, Azure, and React/TypeScript

Published: February 21, 2015

Edgar Codd could arguably be called the father of modern data science. As most computer scientists and DBAs know, he invented the relational model of database design. It is a tightly structured paradigm, based on set theory and solid mathematics. Modern "relational" databases (I use that term loosely, since few products actually implement the relational model fully) represent data in tabular form, in which there is no ambiguity and each data item is stored in one place only (ignoring deliberate denormalization for now). Since Codd, many other data-management models have been developed, and a debate over which model is "best" is a good way to start a religious world war.

But the bottom line is that all data - if it is to be called data - is structured, to some degree or in some way. It is common for budding data scientists to refer to text documents, social-media posts, medical-imaging metadata, and other non-tabular information stores as "unstructured" data. But this is a bad practice for a number of reasons.

The most obvious is that the term "unstructured" imposes a binary distinction on a property that is continuous by nature. Data is, in fact, structured to various degrees and in various ways. But it is always "structured" - otherwise it would not be "data".

Wait, you might ask - what about a newspaper story? Or a movie script? Or my friend's latest Facebook status? I don't see any explicit non-ambiguous mapping of symbols to semantics. How can you call such free-form text "structured" in any reasonable sense of the word?

Quite simply - I can call it "structured" because you are able to read and understand it.

Although you may not be able to see or quantify "structure" in a text document, a part of your brain is, in fact, capable of imposing a structural context on it. If that were not the case, the document would be nonsense. Thus, any collection of symbols that is semantically interpretable by the brain of some sentient being, somewhere - or even by some non-sentient machine - is in fact "structured".

Claude Shannon, the father of information theory, developed a methodology for quantifying the information content of a collection of symbols. Repurposing the concept of thermodynamic entropy used in the physical sciences (specifically, statistical mechanics), he formulated the following equation:

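$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$

Here H is the capital Greek letter Eta (i.e., "e" as in "entropy"), x_1, ..., x_n are the possible values of X, P(x_i) is the probability of the value x_i, and b is the base of the logarithm (base 2 gives a result in bits).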

Without going into the details of this mathematical formulation (you can study the Wikipedia entry for more insight or consult the many online scholarly papers that discuss this concept), this function, once normalized by its maximum possible value (the logarithm of the number of distinct symbols), assigns a real-valued number between 0 and 1 to a data set X - with a value close to 0 meaning that the data's structural constraints are such that nearly its entire information content can be inferred from just one of its entries x, and a value approaching 1 indicating a reduced ability to use one x to predict any other x.

A value of 1 indicates that the collection of symbols is maximally entropic - that is, pure chaos or "white noise". This means that no number of items extracted from the set can provide any information about any of the unextracted items. If the set contained, say, 1000 items - and you extracted and examined 999 of those items - that would tell you absolutely nothing about the remaining 1 item.

Such a document would, in fact, be "unstructured". It would be the equivalent of a monkey pounding away at the keyboard. Yet this is hardly the picture that most data-analysis professionals have in mind when they use the term "unstructured data".
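To make the 0-to-1 scale concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `normalized_entropy` is hypothetical, probabilities are estimated from raw symbol counts, and the result is divided by the logarithm of the number of distinct symbols actually observed so that it falls between 0 and 1.

```python
import math
from collections import Counter


def normalized_entropy(symbols):
    """Shannon entropy of the symbol frequencies, scaled to the range [0, 1].

    Assumption: each symbol's probability is estimated as its relative
    frequency in `symbols`, and the maximum entropy is log2(k), where k
    is the number of distinct symbols observed.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    k = len(counts)
    if k <= 1:
        # Empty input or a single repeated symbol: nothing is uncertain.
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(k)


print(normalized_entropy("aaaaaaaa"))  # 0.0: one symbol, fully predictable
print(normalized_entropy("aaab"))      # about 0.81: skewed frequencies
print(normalized_entropy("abcd"))      # 1.0: every symbol equally likely
```

Note that this sketch looks only at how often each symbol occurs, not at the order in which symbols appear. A string such as "abababab" scores 1 here even though it is perfectly predictable; capturing that kind of structure would require a measure that conditions on context, such as conditional entropy.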

As we learn more about how the brain processes data - and develop better computational algorithms for modeling this process - the moniker "unstructured", when applied to data, will become less and less relevant. Until then, if you want to use the term "unstructured data" in a casual sense with your professional or academic colleagues when discussing free-form or loosely constrained information sources, that is perfectly fine (I do it all the time). Just be aware that when you are submitting official publications or giving formal presentations, referring to any theoretically comprehensible data model as "unstructured" is misleading and potentially confusing.
