Becoming a Data Modeler
I've had a few people lately inquire about what it takes to become a data modeler in the first place. While everyone's route to achieving mastery is different, there are usually a few things that you can do to get started.
A programming background does not hurt, but keep in mind that programming and data modeling are actually fairly different things, though they can sometimes overlap. In general, programmers make things go, modelers describe things. Put another way, most programmers think imperatively (with a few critical exceptions in the functional programming space), while modelers think declaratively.
This distinction can be seen in thinking about the difference between JSON and a JavaScript object. A JavaScript developer, if asked to design a car sprite for a game, will focus on function - writing a function that will move the car from a starting point 0 to some point farther down the track, consuming a certain amount of gas (g) in the process. They would tend to see things like the color of the car as a property that can be changed for different instances, and would see the process of creating such a car as instantiation.
The example below works on the assumption that you put gas into the tank, and that you have an acceleration function which has the value 1 as long as there's gas in the tank, but becomes a negative percentage of the car's mass (representing friction) once the gas is used up, slowing the car down.
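A minimal sketch of such a simulation, assuming the behavior just described (the class and method names here are illustrative, not from any particular library):

// Illustrative sketch: a car that accelerates at 1 unit per tick while
// it has gas, then decelerates by a small fraction of its mass (friction).
class Car {
  constructor(color, gas, mass = 1000) {
    this.color = color;
    this.gas = gas;    // fuel remaining in the tank
    this.mass = mass;  // used to approximate friction
    this.x = 0;        // position along the track
    this.vx = 0;       // velocity
    this.time = 0;     // simulation ticks elapsed
  }
  acceleration() {
    // 1 while there's gas; a negative percentage of mass once it's gone
    return this.gas > 0 ? 1 : -0.0001 * this.mass;
  }
  step() {
    this.vx = Math.max(this.vx + this.acceleration(), 0);
    this.x += this.vx;
    if (this.gas > 0) this.gas -= 1;  // burn one unit of gas per tick
    this.time += 1;
    // each step yields a state record: {time, x, vx, gas, color}
    return { time: this.time, x: this.x, vx: this.vx,
             gas: this.gas, color: this.color };
  }
}

const car = new Car("blue", 5);
const states = [];
do { states.push(car.step()); } while (car.vx > 0);
console.table(states);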
When this is run, it creates a simulation of the car moving, recording the state of that car from one step in the sequence to the next.
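The resulting report might look something like this (the values shown are illustrative output from the sketch above):

time    x       vx     gas    color
1       1.0     1.0    4      blue
2       3.0     2.0    3      blue
3       6.0     3.0    2      blue
4       10.0    4.0    1      blue
5       15.0    5.0    0      blue
6       19.9    4.9    0      blue
...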
Significantly, this is where the programmer leaves off and the data modeler begins. There is a data model in the application, one that keeps track of the temporary state of the simulation, but this data model is constantly in flux. The data model that is of more concern to the modeler is the report that is generated as part of this process. In effect, each state is a record, and the collection of states creates a local database.
The modeler is not (at least here) in the least interested in process, and isn't really all that interested in the internal variables used by the programmers. Instead, the modeler is more interested in the data generated as part of the process, and how that data could in turn be shared with others. The data in the table above could just as easily have been generated from an actual car with an unknown acceleration function. Determining what that acceleration function actually is falls to the data scientist; the modeler is much more concerned with the shape of the data.
So what exactly do I mean by "shape"? The modeler is going to be thinking in abstract terms (at least at first). The programmer modeled a car. However, the data modeler will recognize that what's actually in the table above is a representation of the car's state at any given time. The class being modeled, then, is CarState. It has five attributes - {time, x, vx, gas and color}. Time is an integer (it can be thought of as one "tick" of the simulation), x and vx are most likely floating point numbers, as is gas (gas remaining in the tank), while color is a string. The specific types are not that important in the abstract representation, but they do come into play later when talking about physical modeling.
Now, there is a major point of redundancy here - notice that "blue" is repeated all throughout this initial dataset. Blue by itself is not that significant, but the fact that all of the data refers to a blue car is a pretty telling indication that this particular car is its own entity - a car that has been painted blue. We may have a red car and a green car, each of which has a different dataset associated with it, but what's important here is that this actually hints at a more complex relationship between a car and a car's state:
car:BlueCar is an instance of a class of Cars.
car:BlueCar is painted "blue".
carState:BlueCar_1 is a state of car:BlueCar.
carState:BlueCar_2 is a state of car:BlueCar.
carState:BlueCar_1 has a time of 1 tick.
carState:BlueCar_2 has a time of 2 ticks.
These are all instance models, with statements 3 and 4 describing a relationship between an instance of a car state and an instance of a car. Instance relations usually describe explicit connections between things, rather than classes of things:
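By contrast, a class-level statement constrains every instance at once. The distinction might be sketched like this (the notation is illustrative):

carState:BlueCar_1 is a state of car:BlueCar. (instance level - links two individuals)
class:CarState has a "state of" relationship to class:Car. (class level - holds for all instances)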
Here's a situation that often trips up beginning modelers - which way do the arrows go in a relationship? Put another way, is it more proper to talk about a car state being a "state of" a car, or a car "having" a car state? In general, when modeling abstractly, it is usually better to minimize the number of connections going out from any one object: a car may accumulate thousands of states, so it is cleaner for each state to point at its one car ("state of") than for the car to point at every one of its states ("has state"). This also aligns well with most relational databases, where the "many" side of a relationship is the one that carries the foreign key, as the sketch below illustrates.
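A hedged sketch in SQL DDL terms (the table and column names here are my own, purely illustrative):

-- Illustrative DDL: car_state, the "many" side, holds the foreign key
-- pointing at car, the "one" side - the relational analog of "state of".
CREATE TABLE car (
    car_id VARCHAR(32) PRIMARY KEY,
    color  VARCHAR(16)
);

CREATE TABLE car_state (
    car_id VARCHAR(32) NOT NULL REFERENCES car(car_id),
    time   INTEGER NOT NULL,
    x      REAL,
    vx     REAL,
    gas    REAL,
    PRIMARY KEY (car_id, time)
);

Each car_state row points back at exactly one car, while a car can have any number of states.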
This holds especially in the case where there's a natural sort order within the data (such as the time attribute within the dataset above, or an alphabetic key). The situation becomes considerably more complex when you talk about a sequence of paragraphs in a chapter, where this sequential metadata is not implicit in the data set, but this is fodder for another article.
Entity Relationship diagrams, or ER Diagrams, are often used to capture these kinds of relationships in a graphical shorthand. For instance, the class relationship identified above would be drawn as:
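(Rendered here as a rough text approximation; exact ER notation varies by tool.)

+----------------+           +------------------+
|      Car       | 1     1:∞ |     CarState     |
+----------------+-----------+------------------+
| color : string |           | time : integer   |
+----------------+           | x    : float     |
                             | vx   : float     |
                             | gas  : float     |
                             +------------------+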
This indicates that the Car class has a one to many relationship with the CarState class. Attributes are shown in a list, usually with their associated (abstract) base types such as strings or integers. The 1:∞ symbol is shorthand for one to many (indicating that you must have at least one car state but could have more). These markers - 0:1, 1:1, 0:∞ and 1:∞ - are cardinality indicators, and they are typically the biggest headaches that modelers end up facing.
Note that there are other ways of expressing this same information. In the Resource Description Framework (RDF), the class relationships are usually expressed through property definitions. For instance,
class:Car
    a rdfs:Class.

class:CarState
    a rdfs:Class.

property:stateOf
    a rdf:Property;
    rdfs:domain class:CarState;
    rdfs:range class:Car;
    minCount 1.
(I'm cheating on the last line just a bit. Cardinality relationships in OWL, a logical superset of the RDF Schema language, can get complex. I'll be talking about SHACL, a language designed to simplify this, in an upcoming post.)
This is how semantics looks at these relationships, and also shows that a property (a predicate) can in fact be modeled as an object in its own right, which provides a rather astonishing degree of power for modeling relationships. RDF also places no constraints upon cardinality - everything is assumed to be zero to many unless otherwise explicitly constrained.
These are all typically described as logical models, because they describe the relationships between classes of entities without any specific restriction on how this data is represented within a data store. A physical model, on the other hand, is typically implementation specific, and will usually end up referring to the DDL (or data definition language) that SQL uses, or the XML Schema Definition Language (XSD) that XML uses. If you assume that the model instance reflects the logical model, your XML would look something like this:
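<!-- An illustrative sketch; the element and attribute names are my own. -->
<car id="BlueCar" color="blue">
   <carState time="1" x="1.0" vx="1.0" gas="4"/>
   <carState time="2" x="3.0" vx="2.0" gas="3"/>
   <carState time="3" x="6.0" vx="3.0" gas="2"/>
   <!-- ... one carState per tick ... -->
</car>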
The XSD for this structure, its physical model, involves identifying the two entity classes and then tying them together. The following is a simplified version of the XSD in question.
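<!-- Simplified and illustrative; annotations and documentation omitted. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="car">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="carState" minOccurs="1" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="time" type="xs:integer"/>
            <xs:attribute name="x" type="xs:float"/>
            <xs:attribute name="vx" type="xs:float"/>
            <xs:attribute name="gas" type="xs:float"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID"/>
      <xs:attribute name="color" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>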
The structure reflects the logical structure above, though because XML is hierarchical, the structure reflects the dominant container/contained relationship. Note that you could also go with a more normalized structure (one where you keep the XML, but also retain the tabular structure) by going with an RDF/XML-like approach:
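<!-- Illustrative only; namespaces deliberately stripped (see note below). -->
<Car about="car:BlueCar">
   <color>blue</color>
</Car>
<CarState about="carState:BlueCar_1">
   <stateOf resource="car:BlueCar"/>
   <time>1</time>
   <x>1.0</x>
   <vx>1.0</vx>
   <gas>4</gas>
</CarState>
<CarState about="carState:BlueCar_2">
   <stateOf resource="car:BlueCar"/>
   <time>2</time>
   <x>3.0</x>
   <vx>2.0</vx>
   <gas>3</gas>
</CarState>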
This approach, where you're working with decomposed components, is one that's employed by such standards as NIEM, the National Information Exchange Model. For transport purposes, the classes would be collectively wrapped in some kind of an envelope, but what you're dealing with internally to the database are separate records and subrecords that bear a lot of resemblance to a database table. The stateOf element contains a pointer to the resource identified in the @about attribute of the blue car entry itself.
(Note, I've deliberately stripped out namespaces here, which are necessary but not necessarily designed for elucidation.)
Becoming a data modeler for the most part involves stepping away from the content of the data and focusing on the patterns. To be a good modeler, you can focus on one expression (many data modelers start out as database designers or doing XSD; others come from more of an RDF background), but you should be cognizant that logical models, the shape of the data, must come first. In today's world this is especially important, because data will move through the systems that you are involved with in many different forms, and being able to understand at an abstract level how the data relates is crucial to ensuring that you lose as little fidelity of your information as possible.
Note that there are tools out there that can help. Most modelers should have a base familiarity with UML (Unified Modeling Language), which is typically a graphical approach bearing a lot of similarity to ER diagrams. It's not perfect - UML doesn't handle abstraction or class inheritance well - but it's a good starting point. There are a number of high quality modeling tools as well, such as IBM's Rational Rose Enterprise software, as well as open source projects such as ArgoUML, StarUML and others.
For XSD development, I'd recommend OxygenXML or Altova's XML Schema Editor. I've still not found an RDF modeler that I'm wild about, though SmartLogic's toolset is good for basic modeling, while the modeler that's part of AllegroGraph is reasonably complete, if a bit clunky. (A big part of this has to do with how flexible RDF can be, which makes models difficult to capture without imposing a pre-existing structure.)
Finally, going into data modeling, remember to keep an open eye on the structure of information coming into your domain. Simply because data is produced in a certain way does not necessarily mean that it was well modeled, or even modeled at all. Your purpose, as a modeler, is to increase the utility of the data - to make it more consistent, more readily accessible to others, and more efficiently stored and transmitted.
Kurt Cagle is the founder of Semantical LLC, a smart data company.