The Anatomy of Data

Kurt Cagle

Editor In Chief @ The Cagle Report | Ontologist | Author | Iconoclast

Published: June 25, 2015

To hear it from marketing people, data is the stream from a high-pressure fire hose, coming fast and furious. Or there's the metaphor of data lakes, where you store data in great big data systems that people can tap into as needed. Then of course there are the ever-popular data silos, which hold data that, while not water, nonetheless has a lot of fluid characteristics. I guess that "data water towers" just doesn't have the same panache.

The danger with metaphors is that if taken too far they can actually obscure rather than elucidate. In my experience, data systems tend to be a lot more like oil distillation systems than anything else, where each spigot releases reagents, each segregation tank holds different grades of petroleum by-products, and mixing the wrong lines together is likely to result in an explosion.

The reality of data management is that all data is structured, and moreover is structured in a staggering number of potential forms. Some of these are forms that emulate real-world entities - books, chapters, sections, paragraphs. They may have formal "markup" or simply follow a convention, but the structure is nonetheless there (to prove the validity of this, try reading a book that's printed out as one long continuous string of characters without white space beyond blank characters).

In other cases, data has some form of metadata embedded within it that identifies specific properties and relationships. Programmers tend to have religious wars about the superiority of XML vs. JSON (and before that between XML and YAML, and before that between COM and CORBA) but the reality is that these markup tools are intended primarily just to provide some kind of semantic distinction within a line of characters or bytes.

A relational database has this same structure, but broken down along the lines of tables, rows and fields. In a marked up document, the semantics (albeit very limited) exist within a list, a hash, or some combination of the same, along with a container-contained relationship that is so intrinsic to the markup that most people don't even think about the partOf/hasPart relationship that would ordinarily be encapsulated in foreign key relationships.
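As a minimal sketch of that point, the Python below flattens a nested document into relational-style rows. The book and chapter data, the field names and the flatten helper are all illustrative assumptions, not taken from any particular system. The containment that the document expresses purely by nesting turns into an explicit book_id foreign key once it is broken into tables.

```python
# Hypothetical nested document: containment is implied by nesting alone.
book_doc = {
    "id": "book-1",
    "title": "Example Book",
    "chapters": [
        {"id": "ch-1", "title": "First Chapter"},
        {"id": "ch-2", "title": "Second Chapter"},
    ],
}


def flatten(doc):
    """Split a nested book document into relational-style rows.

    The implicit hasPart relationship becomes an explicit foreign key
    (book_id) on each chapter row.
    """
    book_row = {"id": doc["id"], "title": doc["title"]}
    chapter_rows = [
        {"id": ch["id"], "title": ch["title"], "book_id": doc["id"]}
        for ch in doc["chapters"]
    ]
    return book_row, chapter_rows


book_row, chapter_rows = flatten(book_doc)
```

Nothing is added by the flattening except the key: the relationship was already in the document, just never named.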

Changing Metaphors

Put another way, most data storage or data transmission formats are designed first and foremost to identify relationships, and only secondarily to identify properties. Those relationships - how one table relates to another, how one object or concept relates to another - can be thought of via an altogether different metaphor. These are the bones and muscles of data.

Most data modeling efforts start out wrong, especially when business analysts get into the process. This is no knock on business analysts - most of them identify the bones - the classes of items - but then jump directly to attributes - properties that identify specific internal characteristics. This is because these properties are things that are quantifiable or measurable, and as such are more important to them. The problem with this approach is that it is a lot like a person who has bones, nerves and flesh, but no muscles. The muscles connect the bones, nerves and blood vessels interweave within the muscles, and when a nerve is stimulated, the body responds by moving an arm or a leg.

This is the anatomy of data:

Bones identify the things that are relevant - a person, a house with an address, a job, an insurance policy or policies, children or a spouse, all those things that are described in the system to model the business. Without understanding all of the bones, all of the types of objects, you will end up with significant pieces missing or improperly placed.

Muscles and ligaments identify the relationships between bones and connect them. Muscles describe different kinds of relationships - the long muscle of the gluteus maximus has a different kind of relationship with both the pelvis and the femur than muscles that rotate the wrist. Once these relationships are known, then you actually have a pretty clear idea about the mechanics of your data system.

Nerves establish identity within the body, the elusive but very important kinesthetic sense. Each nerve ending establishes a specific locus, the equivalent of an id that the brain uses to identify parts of the body. In some cases, ids are only given to very large structures, such as in the upper back, where nerve distribution is sparse. In other places (such as the hands or the tongue) nerves are clustered close together, because these systems have very specialized duties. It is not, in general, necessary for nerves (or ids) to be bound to properties, but anything that effectively acts under volition as a unit is identified by the body, and should likewise be identified within a data structure.

Attributes, on the other hand, are generally emergent. A knee can be described as arthritic, but arthritis is not a thing - it is a condition applied towards several things within a system. Similarly, you can't talk about revenue in a business system as a thing, because revenue is simply a state - it is the state of income that a business unit was able to generate within a specified interval.
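One way to picture this, as a hedged sketch rather than a prescription: in the Python below the sales are the things, and revenue is never stored at all. It is computed for a business unit over a stated interval. The sales list, its field names and the revenue function are assumptions made up for illustration.

```python
from datetime import date

# Hypothetical ledger: the sales are the things; revenue is not stored anywhere.
sales = [
    {"id": "s1", "unit": "east", "amount": 120.0, "on": date(2015, 1, 15)},
    {"id": "s2", "unit": "east", "amount": 80.0, "on": date(2015, 2, 3)},
    {"id": "s3", "unit": "west", "amount": 50.0, "on": date(2015, 1, 20)},
]


def revenue(sales, unit, start, end):
    """Derive revenue for one business unit over the interval [start, end]."""
    return sum(
        s["amount"]
        for s in sales
        if s["unit"] == unit and start <= s["on"] <= end
    )


january_east = revenue(sales, "east", date(2015, 1, 1), date(2015, 1, 31))
```

Change the unit or the interval and the answer changes, which is what marks revenue as a state of the system and not an object within it.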

There is another type of property, however: classification. A person object may be classified by gender (either the binary pair of "female" and "male" or the FBI version in NIEM that has one of seventeen distinct states). A classification usually provides a shorthand for some cluster of properties - a template shorthand. Classifications become problematic in modeling because classification systems tend to evolve independently, and so may not consistently define the same cluster of properties from one classification scheme to the next. This means that most models, while not "incorrect" per se, tend to have distinct biases in the classifications - what are often called controlled vocabularies - that suit them to one or another purpose.

From Foundations to Finials

Changing up metaphors for a moment, you can argue that the process of creating a good data model is akin to the way that a visual artist creates a painting. Typically the basic shapes are swathed in with a broad brush, with borders indistinct and arbitrary, and with colors that are flat. The artist then corrects for composition (muscle tension - the relationships), to keep the overall piece balanced but dynamic, and then begins to delineate the various forms (the categorizations). Once these are established, then and only then is the detail (the attributes) added in, refined and shaded with metadata of its own.

Most people can recognize the hallmarks of an amateur artist - there may be a great deal of detail (perhaps too much, in fact) but proportions are off, hands may have too many or too few fingers, figures appear stilted and two dimensional. It's usually not as obvious in data models, but there are similar hallmarks.

These modelers focus upon one object at a time, and put the relationships in only at the end of the modeling process, rather than identifying the bones (the types of objects) first and working out the relationships between them as the next step.
The model has too many properties. Do you need to retain every component of a street address - house or building number, street number, street name, apartment number, and so forth - or is a simple line called Street_Address sufficient? Unless there is a clear business case for the detail, it should not end up in the model.
There are few or no globally unique identifiers on distinct objects - too few nerves for the body. This is especially problematic for XML or JSON, where hasPart/isPart joins essentially fold identifiers away, removing them from the document.
There is a great deal of explicit redundancy that is not captured by abstractions. A physical address and a shipping address are distinct objects, but they are still both addresses, and as such should also be subclasses of a generic address class.
References are underutilized. References are associational links to other entities in the system - the primary form of relationships in most models. Even when they do occur, they are often paired with descriptive content (such as titles) that may end up becoming out of sync if the resources in question are updated.
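The last hallmark is easy to demonstrate. In the sketch below, two links point at the same resource: one carries a copy of the title, the other carries only the identifier. The urn:example identifier, the titles and the title_of helper are hypothetical, chosen only to show the drift.

```python
# Hypothetical resources keyed by globally unique identifiers.
resources = {
    "urn:example:doc:42": {"title": "Draft Title"},
}

# Fragile: the reference carries a copy of the title, which can go stale.
link_with_copy = {"ref": "urn:example:doc:42", "title": "Draft Title"}

# Sturdier: the reference carries only the identifier.
link_by_id = {"ref": "urn:example:doc:42"}


def title_of(link, resources):
    """Resolve a reference to the current title of its target."""
    return resources[link["ref"]]["title"]


# The target is renamed after both links were created.
resources["urn:example:doc:42"]["title"] = "Final Title"

stale = link_with_copy["title"]            # still the old title
current = title_of(link_by_id, resources)  # follows the rename
```

The copied title is convenient for display, but it is a second source of truth that nobody remembers to update.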

I also talk about this in the post Seven Data Modeling Mistakes.

This can be countered by remembering the anatomy of data: establish the bones (the object types), add in the muscles (create the references to other resources first), set up the nerves (identify the compositional breakdown within the model and create global identifiers), identify systems and symmetries (determine and utilize abstract structures to create commonality where appropriate), and finally establish the skin (add in the individual atomic properties).
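That ordering can be sketched with Python dataclasses. This is one possible reading of the steps, not a definitive recipe: the class names, the new_id helper and the sample values are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List
from uuid import uuid4


def new_id() -> str:
    """Nerves: give every independently addressable object a global id."""
    return f"urn:uuid:{uuid4()}"


@dataclass
class Address:
    """Systems and symmetries: one abstraction shared by every address."""
    id: str = field(default_factory=new_id)
    street_address: str = ""  # Skin: atomic properties are added last.


@dataclass
class PhysicalAddress(Address):
    pass


@dataclass
class ShippingAddress(Address):
    pass


@dataclass
class Person:
    """Bones: a type of thing the business cares about."""
    id: str = field(default_factory=new_id)
    # Muscles: relationships are held as references (ids), not copies.
    address_ids: List[str] = field(default_factory=list)
    name: str = ""  # Skin


home = PhysicalAddress(street_address="123 Example St")
ship = ShippingAddress(street_address="PO Box 1")
person = Person(name="Jane Doe", address_ids=[home.id, ship.id])
```

Note that the person holds address ids, not address objects, and that the single Street_Address-style line is the only atomic property the addresses carry until a business case asks for more.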

By the way, this also has an impact upon development. Inexperienced programmers working with inexperienced models will tend to focus too early on details, and utilize programming methodologies that are too fragile in the face of evolving schemas (and schemas always evolve over time).

Programming for Exceptions

If you, as a developer or application architect, go into a data project knowing that new properties and relationships will be added over time, you can use a kind of programming known as exception oriented programming - focus on the most general aspects of a model first, use patterns and annotations to facilitate discovery of new model changes, and then, as possible, fill in details at the edges. This approach is also consistent with a change of focus of databases from a closed world model, where the structure of information is generally assumed as fully known, to an open world model in which the structure (and the associated schemas) are unknown.
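A rough sketch of what this might look like in practice, under the assumption that records arrive as simple key/value maps: handle the properties the model knows about today, and set aside anything unfamiliar instead of failing on it. The KNOWN_HANDLERS table, the read_record function and the sample record are hypothetical.

```python
# Hypothetical handlers for the parts of the model known today.
KNOWN_HANDLERS = {
    "id": str,
    "name": str.strip,
    "age": int,
}


def read_record(raw):
    """Handle the general, known properties first; set aside the rest.

    Unknown keys are not errors. They are collected so that new model
    changes can be discovered and handled later.
    """
    known, unknown = {}, {}
    for key, value in raw.items():
        handler = KNOWN_HANDLERS.get(key)
        if handler is None:
            unknown[key] = value  # the exception: keep it, do not fail
        else:
            known[key] = handler(value)
    return known, unknown


known, unknown = read_record(
    {"id": "p1", "name": " Jane ", "age": "41", "loyalty_tier": "gold"}
)
```

When loyalty_tier turns out to matter, it gets its own handler; until then the code keeps working and the new property is visible instead of silently dropped.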

This also gives rise to a common misconception within the database community (especially the NoSQL portion). In a relational model, the schema, or organization, of the data is baked into the database. In the NoSQL world, however, schema is "optional" - you do not need to explicitly define a schema in order to put a document into a NoSQL database - instead you associate schemas indirectly, and any given document may in fact be bound to arbitrary schemas. This doesn't mean such documents are schema-less - they still have a definite structure and rules that indicate what is and is not valid. However, the documents/data do not require that the schemas be present to be useful.
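To make the distinction concrete, here is a small sketch in plain Python, with no real database involved. Documents go into a store with no schema attached, and schemas kept elsewhere are applied after the fact. The store, the schemas table (field name mapped to required type) and both functions are illustrative assumptions.

```python
# Documents are stored with no schema attached.
store = {
    "doc-1": {"title": "The Anatomy of Data", "year": 2015},
    "doc-2": {"title": "Untitled", "tags": ["draft"]},
}

# Hypothetical schemas kept apart from the data: field name -> required type.
schemas = {
    "article": {"title": str, "year": int},
    "titled": {"title": str},
}


def conforms(doc, schema):
    """Return True if doc has every field the schema requires, correctly typed."""
    return all(
        name in doc and isinstance(doc[name], expected)
        for name, expected in schema.items()
    )


def matching_schemas(doc):
    """List every schema the document happens to satisfy."""
    return [name for name, schema in schemas.items() if conforms(doc, schema)]
```

Here doc-1 satisfies both schemas and doc-2 only the looser one, yet both documents were stored and are usable without either schema being consulted.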

In practice, however, the use of implicit schemas puts more of the onus on the solid identification of the bones and muscles of the data, especially when there is the potential (growing daily) that multiple data systems may end up feeding into the modeled system, and when data systems are increasingly distributed and federated. It is these requirements that are driving the rise of semantic data systems, to be discussed in an upcoming post.

Summary

Understanding that data has structure, whether explicitly declared or not, goes a long way towards dispelling the myth that data management is easy so long as you know how data "flows" from one system to another. In reality, this liquid view of information, while compelling from a marketing standpoint, describes only the easy part of data management.

The complexity comes when you have to integrate two or more models together, have to deal with the bones and muscles of the data. If you have several hundred databases and you want to extract these from their respective silos, then there is no magic bullet that will make this process easy, only tools that can make it easier. Most of these tools require moving away from a fixation on attributes and towards a better awareness of logical structures ... the anatomy of data.

As always, your comments and questions are welcome.

Kurt Cagle is a writer and information architect. Today is his birthday. Be nice to him. :-)

Fer O'Neil

Knowledgebase Manager at ESET, PhD ABD Texas Tech

9 years ago

I think it's important to point out the converse as well--that the metaphor one chooses to frame their approach to something (in this case, using big data), can also cause problems for how we evaluate and ultimately use that information. Metaphors, in this way, can be useful as well as important for contributing to how we make meaning from things.

Edward W.

Growth & SEO consultant

9 years ago

A great analogy can do magic. Thanks for this one.

Flavio Tosi

Cialdini Certified Coach & Speaker | Consultant: Commercial Strategy, Sales Processes EPC ETO OEM CM | Co-Author: "At the Negotiation Table" | Industrial Sales Representative GCC - Middle East

9 years ago

Auguri Kurt, Today you give a chilled welcome to those fluids analogy... ;-)

Mary Gee

VP Business Development

9 years ago

Thanks for the share Jeffrey Strickland, Ph.D., CMSP. Great article and Happy Birthday to Kurt!

