Data standardization: even when it's easy...
Often, the biggest challenge in data standardization is having the ability to force everyone to conform. So let's say you have that because, well, you are the law. Great. Now let me pose a layup: how many ways can you say "Report Period Begin Date" and "Report Period End Date"?
Turns out, a lot, at least if you look at version 1.6 of the ABS XML Technical Specification, dated as of this month (March 2017, presumably supplanting the earlier versions that were already in effect). Should it be a lot? Probably not. But a lot it is.
Why so many? Who knows. Maybe someone in the Federal Register printing back office was bored. Maybe there is a hidden meaning to "ing" that I am missing. I leave the question of "why?" as an exercise for you; I merely lay out the information for you to base your decision on.
And so, without further ado, here we go. Let's start with the singular concept of "Report Period Begin Date", which shows up in no fewer than 4 different ways. This is not a joke! These are all in the actual technical spec.
- reportPeriodBeginDate
- reportingPeriodBeginDate
- reportPeriodBeginningDate
- reportingPeriodBeginningDate
That's right. Every combination of adding "ing" to the word "Report" and to the word "Begin" has representation. Respect!
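For the curious, the four spellings above are exactly what you get by crossing two forms of "Report" with two forms of "Begin". A minimal Python sketch, where the variable names are mine and not anything from the spec:

```python
from itertools import product

# Two forms of each word: with and without the "ing" ending.
# Note that "Begin" + "ing" doubles the "n" to give "Beginning".
report_forms = ["report", "reporting"]
begin_forms = ["Begin", "Beginning"]

# Cross the two lists to build every possible spelling of the tag.
begin_names = [f"{r}Period{b}Date" for r, b in product(report_forms, begin_forms)]

for name in begin_names:
    print(name)
```

Two choices times two choices gives four names, and all four show up in the spec.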
Now let's move on to "Report Period End Date":
- reportingPeriodEndDate
- reportPeriodEndDate
- reportingPeriodEndingDate
Aww shucks, they left one out (reportPeriodEndingDate)! But do not despair, our journey is not yet over! Begin Date and End Date both appear in each dataset, so in addition to having 3 or 4 different names for each of them, we also have many different ways to combine them (4 x 3 = 12, to be exact). So how did we do on the "combination" front? Let's take a look:
- ReportPeriodBeginDate + ReportPeriodEndDate
- ReportPeriodBeginningDate + ReportPeriodEndDate
- ReportingPeriodBeginDate + ReportingPeriodEndDate
- ReportingPeriodBeginningDate + ReportingPeriodEndDate
- ReportingPeriodBeginningDate + ReportingPeriodEndingDate
So there you have it: 5 different standards. That's four too many by my count. It's like going to a movie by yourself, which should cost you $12, and having to pay $60 instead (not to mention that it's XML, which is about 10x or more voluminous than the equivalent CSV format, so it's more like $600).
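If you have to consume this data anyway, one way to cope is to collapse every spelling onto a single name before doing anything else. Here is a minimal Python sketch of that idea. The function name `canonical_name`, the regular expression, and the choice of the shortest spelling as the canonical one are all my own assumptions, not anything the spec prescribes:

```python
import re

# Hypothetical pattern covering the spellings quoted in this post:
# an optional "ing" after "report", "Begin" or "Beginning", "End" or "Ending",
# and either case for the first letter.
_VARIANT = re.compile(
    r"^[Rr]eport(?:ing)?Period(?:(Begin)(?:ning)?|(End)(?:ing)?)Date$"
)


def canonical_name(tag: str) -> str:
    """Map a spelling variant to one canonical name; leave other tags unchanged."""
    match = _VARIANT.match(tag)
    if match is None:
        return tag
    # Exactly one of the two groups matched: "Begin" or "End".
    kind = match.group(1) or match.group(2)
    return f"reportPeriod{kind}Date"


print(canonical_name("ReportingPeriodBeginningDate"))
print(canonical_name("reportingPeriodEndingDate"))
```

Both calls print the short form (`reportPeriodBeginDate` and `reportPeriodEndDate`), and any tag that does not match the pattern passes through untouched.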
Look on the bright side. They could have thrown in the words Open and Close as substitutes for Begin and End, all permutations for each, and combined them in every combination, to get 8x8 = 64 different "standards". You think I jest? Try looking at ABS data from sources other than EDGAR... (and feel free to give me a holler if you need some help...)
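In case the 8x8 arithmetic looks like hand-waving, here is a quick Python sketch that counts it out. The Open/Close spellings are hypothetical, as in the paragraph above, and the variable names are mine:

```python
from itertools import product

report_forms = ["report", "reporting"]
# "Open" stands in as the hypothetical substitute for "Begin", "Close" for "End".
start_forms = ["Begin", "Beginning", "Open", "Opening"]
end_forms = ["End", "Ending", "Close", "Closing"]

# 2 forms of "report" x 4 forms of each boundary word = 8 names per date.
start_names = [f"{r}Period{s}Date" for r, s in product(report_forms, start_forms)]
end_names = [f"{r}Period{e}Date" for r, e in product(report_forms, end_forms)]

# Every start name paired with every end name.
pairings = list(product(start_names, end_names))

print(len(start_names), len(end_names), len(pairings))  # 8 8 64
```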
Strategic Planner at U.S. Department of Energy (DOE) / Associate Professor at MIT
7 years ago: I feel more motivated to clean my basement than I do new datasets.
Alternatives, Private Markets and Multi Asset investing
8 years ago: Good post Matt Wong, CFA - gentle reminder that big picture ideas can break down when you look at the details.