Whose Flight Is It Anyway?
Christian Kaul
Data Modeling Aficionado and Senior Technical Consultant at virtual7 GmbH
If you’re familiar with data vault modeling or data modeling in general, you probably know the dreaded Flight example. It comes up again and again in books, presentations and trainings, confuses everybody and is then brushed aside until the next time it rears its ugly head.
I’m not sure why people keep choosing this example (maybe it’s just because it is relatively easy to get flight data online) but I have some ideas why they have difficulties modeling it.
The Flight example exposes two crucial mistakes many data modelers (and also some data modeling educators) make.
Mistake #1: Using words without stating their meaning
The first, and probably also the worst, mistake is that they think it’s not necessary to properly define the terms they are working with. They assume everybody shares their understanding of the word “flight” while often they don’t even have a clear understanding of its meaning themselves:
- Is a flight something an airline schedules regularly, e.g. weekly, twice a week or daily (a new daily flight from Munich to Ljubljana) ...
- ... or is it a concrete instance of the scheduled flight (the flight from Munich to Ljubljana that started this morning at 10:45)?
- Is it a direct connection between a departure airport and an arrival airport (so that Munich to Stockholm via Copenhagen actually is two flights) ...
- ... or can the plane start and land multiple times along the way (so that Munich to Stockholm via Copenhagen is one flight)?
Depending on your understanding of the word “flight”, your data model will look very different. Your version of the data model might be perfectly valid for your definition of a flight, but since nobody knows what your definition is, you will most likely run into acceptance problems.
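To make the difference tangible, here is a minimal sketch that counts the flights in the Munich to Stockholm via Copenhagen example under two of the definitions above. The `Hop` class, both counting functions, the airport codes and the date are illustrative assumptions, not part of any particular modeling method.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hop:
    """One movement of a plane from one airport to the next (hypothetical type)."""
    origin: str
    destination: str
    day: date


def count_flights_direct(hops):
    """Definition A: every direct connection between two airports is a flight."""
    return len(hops)


def count_flights_journey(hops):
    """Definition B: hops that connect at the same airport on the same day
    belong to one flight."""
    flights = 0
    previous = None
    for hop in hops:
        connects = (
            previous is not None
            and previous.destination == hop.origin
            and previous.day == hop.day
        )
        if not connects:
            flights += 1
        previous = hop
    return flights


# Munich -> Copenhagen -> Stockholm on an arbitrary, made-up date.
hops = [
    Hop("MUC", "CPH", date(2019, 12, 11)),
    Hop("CPH", "ARN", date(2019, 12, 11)),
]

print(count_flights_direct(hops))   # 2
print(count_flights_journey(hops))  # 1
```

The same data yields two different answers, and neither is wrong. The only thing that is wrong is leaving the definition unstated.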
The lack of an explicit definition for the terms you use in your model can lead to heated arguments with business people or fellow modelers or, even worse, to silent misunderstandings that you will only discover when inconsistencies in the usage of your data model blow up in your face.
Some modeling approaches (like fact-based modeling or BEAM) use concrete examples, making it easier to spot that you’re working with different meanings of the same word.
Mistake #2: Equating things and their identifiers
The second, also very prevalent, mistake is to assume that there is always a one-to-one relationship between things and the numbers or strings we commonly use to identify them (often called the “business key”).
In the case of flights, this assumption leads to an unfortunate equation of actual flights and the flight numbers used to identify them:
- People get nervous when they learn that flight numbers are recycled all the time and usually are only unique in combination with the departure date (and sometimes not even then). If there is no nice, simple identifier for it, maybe the actual instance of a flight isn’t really a thing?
- They get even more nervous when they hear about codeshare agreements. If the same flight can have multiple flight numbers, maybe it actually is multiple flights by different airlines that just happen to share the same airport, time, aircraft, crew and passengers?
This happens because the thing-identifier identity assumption has some, usually unstated, implications that can lead you into all kinds of rabbit holes:
- If there is no commonly used unique identifier for a thing, then it isn’t a proper thing worthy of your attention and you must ignore it even though it plays an important role in the processes of the organization.
- If there are multiple identifiers for one thing, one of them must be the correct one, the treasured “enterprise-wide business key”. You must always use the right one, the true key, and discard all the others.
- If there are multiple commonly used identifiers and it’s not possible to elevate one of them to the status of “enterprise-wide business key”, they must identify different things.
One of the strengths of modeling approaches like anchor or focal modeling is that they separate the thing from its identifiers, making it harder to fall into this trap.
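A minimal sketch of that separation, assuming a surrogate `instance_id` for the thing and a separate assignment record for each identifier. The class names, the flight numbers `XX100` and `YY200`, and the dates are all hypothetical, and this is not the notation of anchor or focal modeling, only the underlying idea.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlightInstance:
    """The thing itself; instance_id is a surrogate without business meaning."""
    instance_id: int
    origin: str
    destination: str
    departure_date: date


@dataclass(frozen=True)
class FlightNumberAssignment:
    """One identifier attached to one thing; many-to-many by design."""
    instance_id: int
    flight_number: str


instances = [
    FlightInstance(1, "MUC", "LJU", date(2019, 12, 11)),
    FlightInstance(2, "MUC", "LJU", date(2019, 12, 12)),
]

assignments = [
    FlightNumberAssignment(1, "XX100"),
    FlightNumberAssignment(1, "YY200"),  # codeshare: second number, same thing
    FlightNumberAssignment(2, "XX100"),  # recycled: same number, different thing
]


def numbers_for(instance_id):
    """All flight numbers used for one flight instance."""
    return sorted(a.flight_number for a in assignments if a.instance_id == instance_id)


def instances_for(flight_number):
    """All flight instances a flight number has been used for."""
    return sorted(a.instance_id for a in assignments if a.flight_number == flight_number)


print(numbers_for(1))          # ['XX100', 'YY200']
print(instances_for("XX100"))  # [1, 2]
```

Neither the codeshare nor the recycled number causes trouble here, because nothing in the structure claims that one identifier equals one thing.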
Solution: Looking at things before modeling them
Now that we know all this, how can we escape these mistakes and model the dreaded Flight example in a satisfactory way?
My suggestion would be to grab some binoculars, drive to the nearest airport and look at what happens there. Surprisingly, you don’t see any flight numbers or flights; you see departures and arrivals of planes.
So, why don’t you model what you see? A plane lifts off from an airport with some people inside and some time later, it touches the ground again (hopefully at another airport, in one piece and with all the people inside alive and well). You shouldn’t have any difficulties with providing a definition for a liftoff and a landing if anyone asks you. Chances are, it won’t be necessary to ask.
If you want to, you can associate these liftoffs and landings with flight numbers, flight legs, flight segments and all kinds of numbers or strings that might be used to identify them. Ideally, you can even explain what flight legs or flight segments are.
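As a rough sketch of this idea, the observable events can be modeled first and the identifiers attached afterwards. Everything named here is an assumption made for illustration: the `Liftoff` and `Landing` classes, the `pair_legs` helper, the aircraft registration `D-HYPO`, the flight number `XX100` and the times.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Liftoff:
    aircraft: str
    airport: str
    at: datetime


@dataclass(frozen=True)
class Landing:
    aircraft: str
    airport: str
    at: datetime


def pair_legs(liftoffs, landings):
    """Pair each liftoff with the next landing of the same aircraft."""
    legs = []
    for liftoff in sorted(liftoffs, key=lambda event: event.at):
        later = [
            landing for landing in landings
            if landing.aircraft == liftoff.aircraft and landing.at > liftoff.at
        ]
        if later:
            legs.append((liftoff, min(later, key=lambda event: event.at)))
    return legs


# What you would see through the binoculars (made-up data).
liftoffs = [
    Liftoff("D-HYPO", "MUC", datetime(2019, 12, 11, 10, 45)),
    Liftoff("D-HYPO", "CPH", datetime(2019, 12, 11, 13, 10)),
]
landings = [
    Landing("D-HYPO", "CPH", datetime(2019, 12, 11, 12, 20)),
    Landing("D-HYPO", "ARN", datetime(2019, 12, 11, 14, 20)),
]

legs = pair_legs(liftoffs, landings)

# Identifiers are attached to the events afterwards, not the other way round.
flight_numbers = {"XX100": legs}

for liftoff, landing in flight_numbers["XX100"]:
    print(liftoff.airport, "->", landing.airport)
```

Whether `XX100` counts as one flight or two is now a question about how the identifier is used, not about what happened on the runway.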
In any case, now you have an understandable model that is based on events that actually happen in physical reality (planes lifting off somewhere and landing somewhere) and doesn’t leave you with the uneasy feeling you usually get from the Flight example.
As always, I’m very interested in your questions, comments and suggestions.
Senior data architect
5 years ago: pst...: https://airtechzone.iata.org/industry-programs/aidm/?_ga=2.109024709.655416328.1576075567-1287853382.1576075567#model
Senior data architect
5 years ago: "to silent misunderstandings that you will only discover when inconsistencies in the usage of your data model blow up in your face." Amen! We have all fallen into this trap. More than once probably. It is also the single reason why data modelling is so hard and where methods like fact-oriented modelling show their added value.
Dáta Modeller / Architect / AUDM Jedi
5 years ago: Interesting. I have used the flight example for over 30 years, but from a different angle. To the engine tech, a flight begins when the engine is turned on and ends when the engine is turned off. To an airframe tech, a flight begins when the weight comes off the wheels and ends when the weight comes back on the wheels. To the pilot, it is between checklists for takeoff and landing. For a dispatcher, it's from start to final destination. There is a tendency to assume only one is right, when in fact all are right. A good model respects all of these events in proper context. Semantic definitions are the key to good models.
Founder and DWH Consultant at Mumplini
5 years ago: Regarding Mistake #2: How do you feel about assuming that the existence of a surrogate key in a source system is equivalent to the existence of an object (as it is defined in the area where this system is in operation)? In my view this holds in 99% of cases, as it's simply difficult to program things any other way. The DWH definition can of course differ from the one used in the source system. And I agree about business keys: I consider them nice-to-have, human-readable pointers used a) to communicate with business departments, like "please check if LH1234 is shown correctly in the report" (yes, LH1234 may have been a completely different flight 2 years ago, but in a known context it is a good enough pointer), and b) when there is no other choice.
Senior Researcher, Semantics & Reasoning - "The things we notice only become what they are because of the relationships." (Iain McGilchrist)
5 years ago: I realize we are in a privileged position having access to dedicated teams of law experts and system developers to properly model our data structures for the automatized implementation of tax law. Our challenge, however, is to find the right balance between "detailed enough vs over-specified", despite or just exactly because we have access to experts. It seems that this has something to do with the "fractal nature of information" as someone in this community once put it (my apologies for not being able to attribute this poignant expression to the right author): the more our teams dig into a concept and try to come up with official definitions and conceptual models, the more questions arise. Thus I have lately started to get interested in the dynamics of the generation of opinion, consensus and shared knowledge within groups. So if anyone in this community here has any recommendations for relevant literature, I would be very grateful.