Seven Data Modeling Mistakes
Data modelers tend to be odd ducks (says this particular data modeler). A data modeler is a lot like the more cerebral kind of detective, carrying large amounts of information about the various parts of a business in their head, trying to figure out how everything fits. Ontologists, who use semantic tools, tend to be even more like the figure of Sherlock in the TV series of the same name, because what they are building has to not only fit a database requirement, but also has to be logically consistent (the two are not necessarily the same). It's an attractive analogy (and yes, I really want a Belstaff coat like he has - that's a great Great Coat!), but all too often data modeling ends up a disaster, one that can often be avoided by steering clear of a few fairly easy mistakes.
#1. Starting Too Late
One of the biggest problems I've seen when working on big enterprise models is that all too often modeling is done concurrently with (or even after) application development. This is roughly analogous to laying down the foundation of an office building before completing the blueprints.
In data applications (and most enterprise applications are data applications), understanding the shape and structure of that data is critical. Starting development without understanding the schemas, how those schemas get extended, and how data gets persisted and serialized often means that significant work ends up being rework instead. As a rule of thumb, assume that the first few sprints should be heavily devoted to data design, with design then tapering off to mostly development by about 25% of the way into the project.
#2. Getting Too Abstract Too Early
I've found that a good way to start a design is to identify a particular class of object - a person, a film, a policy - and create a few "samples" that have specific stories. For a person, as an example, I'll create two or three base characters - say, a woman who has divorced and remarried - which tests, among other things, whether the issue of a person changing names over time is important.
With these story characters, I'll then examine which relationships are important to the overall domain - an actress who has been in a number of films and TV productions will have very different relevant relationships than a woman who has had two or three different medical coverage plans and insurance policies.
One goal of this exercise is to identify what is immediately important to the model and what is not, and I'll usually concentrate more on relationships and their characteristics than on properties, at least at first. Only after I identify this web of relationships will I start formally identifying classes.
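One lightweight way to capture such samples is sketched below in Python. Every class, field, and character name here is hypothetical - these are probes to argue over, not a proposed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Name:
    given: str
    family: str
    valid_from: str  # a remarriage makes "names over time" an explicit question

@dataclass
class Person:
    names: list[Name] = field(default_factory=list)  # ordered name history
    relationships: list[tuple[str, str]] = field(default_factory=list)

# Character 1: divorced and remarried - two family names over time.
jane = Person(
    names=[Name("Jane", "Harmon", "1998-06-01"),
           Name("Jane", "Okafor", "2012-09-15")],
    relationships=[("marriedTo", "R. Harmon"), ("divorcedFrom", "R. Harmon"),
                   ("marriedTo", "D. Okafor")],
)

# Character 2: an actress - a very different set of relevant relationships.
mira = Person(
    names=[Name("Mira", "Patel", "1990-01-01")],
    relationships=[("actedIn", "Film A"), ("actedIn", "TV Series B")],
)
```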
#3. Mixing Conceptual, Logical and Physical Models
Do you know what my favorite tool is for creating models?
A whiteboard, a camera, yellow legal pads and sticky notes.
Most enterprise architecture tools are great at creating entity-relationship diagrams, but an ER diagram is in fact well on its way to being a logical model. The distinction between the two is important. A conceptual model describes the way that things relate to one another "in the real world", with the full awareness that this may very well change based upon the business requirements you are trying to model. A logical model is more formal, but not yet fully tied down to a specific representation. A physical model, on the other hand, is an expression of the model in a given format - UML2, relational ER diagrams, XML Schema, OWL/RDFS, or whatever is used to translate these relationships into something that is meaningful to computer systems and computer programmers.
A conceptual model needs to be fluid, needs to allow for input, and most importantly needs to provide a basis for experimentation. This is where you put together scenarios and use cases, and see what breaks. Too often, this phase of true design is given short shrift because it's easier to start drawing boxes in a UML tool, freezing the design long before it's fully tested. Changing a model at the logical stage is harder; at the physical stage (where there are explicit links and dependencies) and in code, harder still.
This means that you're better off spending time with a magic marker, a whiteboard, and a couple of other designers, arguing the merits and attempting to knock holes in the logic, business requirements, or design - and only then getting out the heavy tools.
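For a concrete sense of what the physical layer looks like, here is one way the whiteboard statement "a person holds a policy" might land in RDFS, sketched with Python's rdflib. The ex: namespace and the class and property names are illustrative assumptions, not part of any real model:

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/model#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# The conceptual statement "a person holds a policy", made physical:
g.add((EX.Person, RDF.type, RDFS.Class))
g.add((EX.Policy, RDF.type, RDFS.Class))
g.add((EX.holdsPolicy, RDF.type, RDF.Property))
g.add((EX.holdsPolicy, RDFS.domain, EX.Person))
g.add((EX.holdsPolicy, RDFS.range, EX.Policy))

# At this point there are explicit links and dependencies -
# exactly why changes are now costlier than on the whiteboard.
print(g.serialize(format="turtle"))
```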
#4. Not Thinking About Inheritance
From a modeling perspective, 99% of what models deal with falls into the following categories: people, places, organizations, events, contracts, intellectual property, named artifacts, collections, categorizations, named identifiers, vectors, and joins. Named artifacts are classes of things that have some formal identifier - a car with a VIN, a book with an ISBN, and so forth. Categorizations are ways of describing specific attributes - gender, color, genre, etc. A named identifier is a specific way of specifying a unique entity, and will typically be associated with both an "id" and an authorizing agent (an organization) that determines that id. Vectors are sets of numeric values or equivalent categorization values (northeast, southwest) that provide a magnitude, along with a qualifier (mph, US dollars, degrees latitude or longitude relative to Greenwich, England). Finally, joins are composite objects that relate two or more other kinds of objects. Many tend to have "hyphenated names", such as character roles, which relate a particular character with the person (or people) who create the expression of that character.
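To make a few of these archetypes concrete, here is a rough Python sketch; the class and field names are my own inventions for illustration, not a standard vocabulary:

```python
from dataclasses import dataclass

@dataclass
class NamedIdentifier:  # an id plus the agent that authorizes it
    id: str             # e.g. "978-0-13-468599-1"
    scheme: str         # e.g. "ISBN"
    authority: str      # the organization that governs the scheme

@dataclass
class Vector:           # a magnitude plus a qualifier
    magnitude: float    # e.g. 65.0
    qualifier: str      # e.g. "mph", "US dollars"

@dataclass
class CharacterRole:    # a join: relates a character, a performer, and a work
    character: str
    performer: str
    work: str
```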
The point here is that once you recognize that most classes fall into one of these categories, you can use this to find common properties for those objects. All individuals have names, for instance, though there are many variations of names. Individuals have lives made up of one long event (their life) and smaller subsets, not always distinct, of lives that are relevant to a model (when an actor worked on a film may be very relevant to a scheduling application, but likely less so for a movie database).
By identifying these common characteristics of a given class, you can create patterns of usage in which multiple properties derive from the same base property type. This means that, except in very specialized situations, you can use the broader properties, and your model becomes simpler. This is, in semantics especially, one of the most elementary forms of inferential reasoning.
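Here is a minimal sketch of that inference in RDF, using Python's rdflib and invented property names: two specialized name properties derive from a broad hasName, and a single query against the base property finds both.

```python
from rdflib import Graph, Namespace, RDFS, Literal

EX = Namespace("http://example.org/model#")  # hypothetical namespace
g = Graph()

# Specialized name properties all derive from one broad base property.
g.add((EX.hasStageName, RDFS.subPropertyOf, EX.hasName))
g.add((EX.hasMaidenName, RDFS.subPropertyOf, EX.hasName))

g.add((EX.mira, EX.hasStageName, Literal("Mira Vale")))
g.add((EX.jane, EX.hasMaidenName, Literal("Jane Harmon")))

# One query over the broad property finds both specialized ones,
# via the rdfs:subPropertyOf* property path - an elementary inference.
q = """
    SELECT ?who ?name WHERE {
        ?who ?p ?name .
        ?p rdfs:subPropertyOf* ex:hasName .
    }"""
for who, name in g.query(q, initNs={"ex": EX, "rdfs": RDFS}):
    print(who, name)
```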
However, once you do identify such base classes, you will almost certainly extend them, creating more specialized classes that model things more relevant to your business needs. This process of subclassing and building sub-properties forms the bulk of what a good modeler should do, as well as determining when something shifts from being a role or an aspect to being a full-blown subclass.
A lot of models tend to have too many distinct, unrelated parts rather than looking closely at inheritance. This increases the number of property names, decreases consistency even where consistency would be helpful, and makes the model more complex than it needs to be. In a good conceptual model you'll have three dimensions: the relationships of things to one another (which are two-dimensional) plus, as a third dimension, the relationship of classes of things to their superclasses. This makes things more cohesive, and in the long run easier to work with during search and navigation.
#5. Using Physical Models to Build Conceptual Models
This is so common that finding situations where modeling is done differently is hard. Department X produced an ER model when they built their relational database fifteen years ago, and if the model was good enough for them, it's good enough for us.
There are a few big assumptions being made here, however, that need to be examined more closely. First, business requirements change and technologies evolve. New storage mechanisms and NoSQL data systems such as MarkLogic change how information can be both stored and transformed, meaning that models can be richer than they were before. At the very least, revisiting these assumptions is a worthwhile exercise.
Additionally, the translation from conceptual model to physical model is almost invariably lossy. Nuances and constraints considered in the design tend not to get translated, or limitations in the modeling language force complex workarounds that are not always accounted for. Semantics is better at capturing conceptual models - it is itself a conceptual framework, and has the flexibility to manage arbitrarily complex models - but it is also a fairly complex framework that requires some deep technical knowledge to work with.
Use existing work to help inform the model, but treat such physical models as a starting point, not a template to slavishly follow.
#6. Treating Modeling as a Programming Issue
Ontologists and data modelers fit into a weird position. They often have deep technical knowledge. However, most ontologists spend their time in meetings working with business analysts (BAs) to determine how to represent business needs at a logical level, and they are typically involved well before the first line of code is written.
They work closely with architects, and it is not surprising that many ontologists are also information architects. Typically, ontologists and modelers will spend much of their time translating business requirements, while also pushing back on analysts to identify recurrent patterns under different guises in order to reduce the complexity of the model. Most BAs see their role largely as identifying the properties they want captured in the model to support their business, but this close focus often comes at the cost of missing bigger-picture issues.
A good modeler, on the other hand, should constantly be evaluating these requests for properties as being only one piece of the puzzle. Are two properties by two different BAs equivalent? Are keys and identifiers actually hiding the need to define new objects? (most often, yes!) Are there sufficient separations of concern? This becomes critical when developing mapping strategies, especially when you start looking at incorporating dozens or even hundreds of data systems into your model.
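As a hypothetical illustration of the first question: when two BAs request what turns out to be the same property under different names, recording the equivalence once keeps the model from sprouting duplicates. The namespaces and property names below are invented:

```python
from rdflib import Graph, Namespace, OWL

# Two departments, two names for what analysis shows is one property.
CLAIMS = Namespace("http://example.org/claims#")
BILLING = Namespace("http://example.org/billing#")

g = Graph()
# One assertion captures the mapping for every downstream system.
g.add((CLAIMS.memberNumber, OWL.equivalentProperty, BILLING.subscriberId))
```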
I often like to describe ontologists as politically astute programmers - they are, in a very real sense, creating a language, and what is (and more importantly what is not) in a language has very real business (political) impacts upon different groups within any organization.
#7. Modeling One Piece At a Time
This is one of the big areas where modeling and programming differ. Programmers like to work with discrete modules or functional blocks. This makes perfect sense: most programmers are reductionists - they like to break complex problems down into modules or components that can work independently of one another.
The problem with reductionism is that it tends to hide analogies that are often indicative of deep relationships. As an example, loading and saving a file is almost universally seen as a basic action, and as such is usually consigned to a specific library. However, in many applications there are structures within the files that represent common information, and these should be handled by a subclass of the core save function that says, "When you save files of this type, you should create these data structures to make sure that other aspects of the program can retrieve the relevant information."
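A minimal sketch of that idea, with all class and structure names invented for illustration: a generic saver, plus a subclass that knows which common structures files of its type must carry.

```python
import json

class FileSaver:
    """Generic save action, the kind usually consigned to a library."""
    def save(self, path: str, payload: dict) -> None:
        with open(path, "w") as f:
            json.dump(payload, f)

class ProjectFileSaver(FileSaver):
    """Files of this type always carry a manifest and an index, so
    other parts of the program can retrieve the relevant information."""
    def save(self, path: str, payload: dict) -> None:
        payload.setdefault("manifest", {"schema": "project/v1"})
        payload.setdefault("index", sorted(payload.get("sections", [])))
        super().save(path, payload)
```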
When a single person, or even a single team, is working on an application, this usually will get caught. When you have large, distributed sets of teams, each working on some aspect of this problem (which will usually be the case for enterprise data applications), this is no longer a given.
Thus, one of the roles of the ontologist is to identify when and where there are common structures and intents, and to work with an architect (also a systems person) to make sure that the APIs for the applications reflect this. It means ensuring that processes of inheritance are understood, while also recognizing when inheritance is becoming too forced and a new kind of object should be constructed instead. In that regard, an ontologist is a lot like a composer writing for an orchestra. While they write the model out note by note, they have to keep the overall structure of the piece in the back of their mind, or it will fail.
Summary
Data modeling and ontology work is not hard, but it is evolving. In the entity-relationship era, ironically, the focus was on properties first and only secondarily on relationships. As tables in databases are replaced by complex objects connected via semantics, modeling itself has had to become more adaptive as well, with the broader model shaping not only how data gets saved in a database but also how large-scale enterprises think about their information, about moving out of data silos, and about building ontologies that are canonical for the whole organization, not just one database.
This in turn is reshaping the best practices of data modeling, and what may have been considered an inviolate rule about how models were created twenty years ago is increasingly considered obsolete, especially in the world of adaptive modeling.
Kurt Cagle is an information architect and principal evangelist for Avalon Consulting, LLC.