Whose Flight Is It Anyway?

Christian Kaul

Data Modeling Aficionado and Senior Technical Consultant at virtual7 GmbH

Published: November 13, 2019

If you’re familiar with data vault modeling or data modeling in general, you probably know the dreaded Flight example. It comes up again and again in books, presentations and trainings, confuses everybody, and is then brushed aside until the next time it rears its ugly head.

I’m not sure why people keep choosing this example (maybe it’s just because it is relatively easy to get flight data online) but I have some ideas why they have difficulties modeling it.

The Flight example exposes two crucial mistakes many data modelers (and also some data modeling educators) make.

Mistake #1: Using words without stating their meaning

The first, and probably also the worst, mistake is that they think it’s not necessary to properly define the terms they are working with. They assume everybody shares their understanding of the word “flight” while often they don’t even have a clear understanding of its meaning themselves:

Is a flight something an airline schedules regularly, e.g. weekly, twice a week or daily (a new daily flight from Munich to Ljubljana) ...
... or is it a concrete instance of the scheduled flight (the flight from Munich to Ljubljana that started this morning at 10:45)?
Is it a direct connection between a departure airport and an arrival airport (so that Munich to Stockholm via Copenhagen actually is two flights) ...
... or can the plane start and land multiple times along the way (so that Munich to Stockholm via Copenhagen is one flight)?

Depending on your understanding of the word “flight”, your data model will look very different. Your version of the data model might be perfectly valid for your definition of a flight, but since nobody knows what your definition is, you will most likely run into acceptance problems.
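
To make the four readings above concrete, here is a minimal Python sketch. The class names (ScheduledService, FlightInstance, Leg, Journey), the count_flights helper and the sample date are assumptions chosen for illustration; they are not taken from any standard or from a specific modeling method.

```python
from dataclasses import dataclass
from datetime import date, time


@dataclass(frozen=True)
class ScheduledService:
    """Reading 1: something an airline schedules regularly."""
    origin: str
    destination: str
    departure_time: time


@dataclass(frozen=True)
class FlightInstance:
    """Reading 2: one concrete occurrence of a scheduled service."""
    service: ScheduledService
    departure_date: date


@dataclass(frozen=True)
class Leg:
    """Reading 3: a direct connection between two airports."""
    origin: str
    destination: str


@dataclass(frozen=True)
class Journey:
    """Reading 4: one trip that may start and land several times."""
    legs: tuple

    @property
    def origin(self) -> str:
        return self.legs[0].origin

    @property
    def destination(self) -> str:
        return self.legs[-1].destination


def count_flights(journeys, definition):
    """Count 'flights' under an explicitly named definition."""
    if definition == "leg":
        return sum(len(j.legs) for j in journeys)
    if definition == "journey":
        return len(journeys)
    raise ValueError(f"unknown definition of 'flight': {definition}")


daily = ScheduledService("Munich", "Ljubljana", time(10, 45))
this_morning = FlightInstance(daily, date(2019, 11, 13))  # illustrative date

via_copenhagen = Journey((Leg("Munich", "Copenhagen"),
                          Leg("Copenhagen", "Stockholm")))

# The same trip is two flights under reading 3 and one under reading 4.
as_legs = count_flights([via_copenhagen], "leg")
as_journeys = count_flights([via_copenhagen], "journey")
```

Forcing the caller to name the definition is the point of the sketch: the question "how many flights?" has no answer until the meaning of the word is stated.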

The lack of an explicit definition for the terms you use in your model can lead to heated arguments with business people or fellow modelers or, even worse, to silent misunderstandings that you will only discover when inconsistencies in the usage of your data model blow up in your face.

Some modeling approaches (like fact-based modeling or BEAM) use concrete examples, making it easier to spot that you’re working with different meanings of the same word.

Mistake #2: Equating things and their identifiers

The second, also very prevalent, mistake is to assume that there is always a one-to-one relationship between things and the numbers or strings we commonly use to identify them (often called the “business key”).

In the case of flights, this assumption leads to an unfortunate equation of actual flights and the flight numbers used to identify them:

People get nervous when they learn that flight numbers are recycled all the time and usually are unique only in combination with the departure date (and sometimes not even then). If there is no nice, simple identifier for it, maybe the actual instance of a flight isn’t really a thing?
They get even more nervous when they hear about codeshare agreements. If the same flight can have multiple flight numbers, maybe it actually is multiple flights by different airlines that just happen to share the same airport, time, aircraft, crew and passengers?
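
The recycling problem from the first point can be shown in a few lines of Python. This is only a sketch: the flight number "XX100" and all dates and routes are invented for illustration.

```python
from datetime import date

# Hypothetical departures: (flight number, departure date, origin, destination)
departures = [
    ("XX100", date(2019, 11, 12), "Munich", "Ljubljana"),
    ("XX100", date(2019, 11, 13), "Munich", "Ljubljana"),
    ("XX100", date(2017, 11, 13), "Munich", "Stockholm"),  # number reused later
]

# Keying on the flight number alone silently collapses three departures into one.
by_number = {d[0]: d for d in departures}

# Adding the departure date keeps them apart in this sample. As noted above,
# even this composite key is not guaranteed to be unique in real data.
by_number_and_date = {(d[0], d[1]): d for d in departures}

lost_rows = len(departures) - len(by_number)
```

Here two of the three rows disappear without any error being raised, which is exactly the kind of silent inconsistency that surfaces much later.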

This happens because the thing-identifier identity assumption has some, usually unstated, implications that can lead you into all kinds of rabbit holes:

If there is no commonly used unique identifier for a thing, then it isn’t a proper thing worthy of your attention and you must ignore it even though it plays an important role in the processes of the organization.
If there are multiple identifiers for one thing, one of them must be the correct one, the treasured “enterprise-wide business key”. You must always use the right one, the true key, and discard all the others.
If there are multiple commonly used identifiers and it’s not possible to elevate one of them to the status of “enterprise-wide business key”, they must identify different things.

One of the strengths of modeling approaches like anchor or focal modeling is that they separate the thing from its identifiers, making it harder to fall into this trap.
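
The idea of keeping the thing apart from its identifiers can be sketched in Python as follows. This is a loose illustration of the principle, not an implementation of anchor or focal modeling; the names Departure, Identifier and IdentifierRegistry, the airlines and the flight numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)


@dataclass
class Departure:
    """The thing itself. It exists whether or not anyone has labeled it."""
    origin: str
    scheduled_at: str
    surrogate_id: int = field(default_factory=lambda: next(_next_id))


@dataclass(frozen=True)
class Identifier:
    """A label somebody uses for a thing, together with who issued it."""
    issuer: str
    value: str


class IdentifierRegistry:
    """Maps any number of identifiers to the surrogate id of one thing."""

    def __init__(self):
        self._index = {}

    def assign(self, identifier, thing):
        self._index[identifier] = thing.surrogate_id

    def resolve(self, identifier):
        return self._index.get(identifier)


registry = IdentifierRegistry()

# A codeshare: one departure, two flight numbers from two airlines.
shared = Departure("Munich", "2019-11-13T10:45")
registry.assign(Identifier("Airline A", "AA100"), shared)
registry.assign(Identifier("Airline B", "BB2000"), shared)

# A departure nobody has given a flight number is still a departure.
unlabeled = Departure("Munich", "2019-11-13T11:30")

same_thing = (registry.resolve(Identifier("Airline A", "AA100"))
              == registry.resolve(Identifier("Airline B", "BB2000")))
```

With this split, none of the three implications listed above is forced on you: a thing without an identifier still exists, and several identifiers can point at one thing without any of them having to be the "true" key.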

Solution: Looking at things before modeling them

Now that we know all this, how can we escape these mistakes and model the dreaded Flight example in a satisfactory way?

My suggestion would be to grab some binoculars, drive to the nearest airport and look at what happens there. Surprisingly, you don’t see any flight numbers or flights; you see departures and arrivals of planes.

So, why don’t you model what you see? A plane lifts off from an airport with some people inside and some time later, it touches the ground again (hopefully at another airport, in one piece and with all the people inside alive and well). You shouldn’t have any difficulties with providing a definition for a liftoff and a landing if anyone asks you. Chances are, it won’t be necessary to ask.

If you want to, you can associate these liftoffs and landings with flight numbers, flight legs, flight segments and all kinds of numbers or strings that might be used to identify them. Ideally, you can even explain what flight legs or flight segments are.

In any case, now you have an understandable model that is based on events that actually happen in physical reality (planes lifting off somewhere and landing somewhere) and doesn’t leave you with the uneasy feeling you usually get from the Flight example.
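
A minimal Python sketch of such an event-based model might look like this. Everything in it is an assumption made for illustration: the names Liftoff, Landing, Hop and pair_events, the aircraft label, the times and the flight numbers. The pairing rule (each liftoff is matched with the next landing of the same aircraft) is one simple choice among several.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Liftoff:
    aircraft: str
    airport: str
    at: datetime


@dataclass(frozen=True)
class Landing:
    aircraft: str
    airport: str
    at: datetime


@dataclass
class Hop:
    """One liftoff and the landing that followed it, plus optional labels."""
    liftoff: Liftoff
    landing: Optional[Landing] = None  # None while the plane is still airborne
    flight_numbers: set = field(default_factory=set)


def pair_events(liftoffs, landings):
    """Pair each liftoff with the next landing of the same aircraft."""
    remaining = sorted(landings, key=lambda e: e.at)
    hops = []
    for lift in sorted(liftoffs, key=lambda e: e.at):
        match = next((l for l in remaining
                      if l.aircraft == lift.aircraft and l.at > lift.at), None)
        if match is not None:
            remaining.remove(match)
        hops.append(Hop(lift, match))
    return hops


day = (2019, 11, 13)
liftoffs = [
    Liftoff("plane-1", "Munich", datetime(*day, 10, 45)),
    Liftoff("plane-1", "Copenhagen", datetime(*day, 13, 10)),
]
landings = [
    Landing("plane-1", "Copenhagen", datetime(*day, 12, 20)),
    Landing("plane-1", "Stockholm", datetime(*day, 14, 20)),
]

hops = pair_events(liftoffs, landings)

# Identifiers are attached afterwards; a hop may carry none, one or several.
hops[0].flight_numbers.update({"AA100", "BB2000"})
```

The events come first and the flight numbers are annotations on them, so recycled numbers and codeshares no longer threaten the identity of anything in the model.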

As always, I’m very interested in your questions, comments and suggestions.

Martijn ten Napel

Senior data architect

5 years ago

pst...: https://airtechzone.iata.org/industry-programs/aidm/?_ga=2.109024709.655416328.1576075567-1287853382.1576075567#model

Martijn ten Napel

Senior data architect

5 years ago

"to silent misunderstandings that you will only discover when inconsistencies in the usage of your data model blow up in your face." Amen! We have all fallen into this trap. More than once probably. It is also the single reason why data modelling is so hard and where methods like fact-oriented modelling show their added value.

Bruce Laidlaw

Dáta Modeller / Architect / AUDM Jedi

5 years ago

Interesting. I have used the flight example for over 30 years, but from a different angle. To the engine tech, a flight begins when the engine is turned on and ends when the engine is turned off. To an airframe tech, a flight begins when the weight comes off the wheels and ends when the weight comes back on the wheels. To the pilot, it is between checklists for takeoff and landing. For a dispatcher, it's from start to final destination. There is a tendency to assume only one is right, when in fact all are right. A good model respects all of these events in proper context. Semantic definitions are the key to good models.

Oleg Ivanov

Founder and DWH Consultant at Mumplini

5 years ago

Regarding Mistake #2: How do you feel about assuming that the existence of a surrogate key in a source system is equivalent to the existence of an object (as it is defined in the area where this system is in operation)? In my view this holds in 99% of cases, as it's simply difficult to program things any other way. The DWH definition can of course differ from the one used in the source system. And I agree about business keys: I consider them nice-to-have, human-readable pointers used a) to communicate with business departments, as in "please check if LH1234 is shown correctly in the report" (yes, LH1234 may have been a completely different flight 2 years ago, but in a known context it is a good enough pointer), and b) when there is no other choice.

Veronika Haderlein-Høgberg, PhD

Senior Researcher, Semantics & Reasoning - "The things we notice only become what they are because of the relationships." (Iain McGilchrist)

5 years ago

I realize we are in a privileged position having access to dedicated teams of law experts and system developers to properly model our data structures for the automatized implementation of tax law. Our challenge, however, is to find the right balance between "detailed enough vs over-specified", despite or just exactly because we have access to experts. It seems that this has something to do with the "fractal nature of information" as someone in this community once put it (my apologies for not being able to attribute this poignant expression to the right author): the more our teams dig into a concept and try to come up with official definitions and conceptual models, the more questions arise. Thus I have lately started to get interested in the dynamics of the generation of opinion, consensus and shared knowledge within groups. So if anyone in this community here has any recommendations for relevant literature, I would be very grateful.
