Zen and the Art of Data Quality
What is “Quality”?
In his 1974 book “Zen and the Art of Motorcycle Maintenance”, Robert Pirsig considers this a philosophical question. He describes a fictional journey with his friend, John Sutherland, who has a brand new motorcycle - which he does not maintain, as he prefers to be ‘in the moment’ and to enjoy riding his machine. When it goes wrong, he seeks professional help. By contrast, Robert rides an older machine which he maintains himself. This requires knowledge of the inner mechanics, but allows him to make proactive adjustments as they travel. John thinks Robert is very boring! The book demonstrates that motorcycle maintenance may be dull and tedious drudgery, or an enjoyable and pleasurable pastime; it all depends on attitude. It concludes that to truly experience Quality one must embrace both sides and apply whichever best fits the situation.
When searching for Truth in data (or at least a single version of it), as in philosophy, there are several dimensions to consider.
How is Data Quality defined?
In October 2017, the National Bank of Belgium issued some very detailed guidance on Data Quality. Defining it as “the adequacy of the data for the ultimate user goal”, the Bank specified six dimensions against which it would be assessed - as described below.
Accuracy, or - “Get it Right”
This represents the extent to which data values correctly describe the underlying concept / data definition. Has the correct calculation been applied? Is that an actual ISO country code? Is it displayed in the right currency and unit of measure (millions / thousands)? Were there any errors in the calculation process?
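As a purely illustrative sketch (the NBB circular does not prescribe any particular tooling), an accuracy check of this kind might look something like the following in Python, using an invented reference list of country codes and a made-up exposure dataset:

```python
# Illustrative accuracy check: are country codes valid ISO codes, and are
# monetary amounts reported in the expected unit (thousands of EUR)?
VALID_ISO_CODES = {"BE", "NL", "FR", "DE", "GB", "US"}  # small subset, for the example only

records = [
    {"counterparty": "Acme NV",  "country": "BE", "exposure_kEUR": 1250},
    {"counterparty": "Foo Ltd",  "country": "UK", "exposure_kEUR": 900},        # 'UK' is not an ISO code (should be 'GB')
    {"counterparty": "Bar GmbH", "country": "DE", "exposure_kEUR": 3_400_000},  # suspiciously large: EUR rather than kEUR?
]

for rec in records:
    if rec["country"] not in VALID_ISO_CODES:
        print(f"Accuracy issue: {rec['counterparty']} has invalid country code {rec['country']!r}")
    if rec["exposure_kEUR"] > 1_000_000:
        print(f"Accuracy issue: {rec['counterparty']} exposure looks like it was reported in EUR, not kEUR")
```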
Reliability, or - “Spot the Difference”
This refers to the difference between ‘versions’ of data submitted. What was the extent of the change? Do we know / understand the difference between the old values and the new?
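A minimal sketch of such a reliability check, assuming two hypothetical submissions of the same report held as pandas DataFrames (the item names and values are invented):

```python
import pandas as pd

# Two versions of the same report, keyed by line item.
v1 = pd.DataFrame({"item": ["total_assets", "own_funds"], "value": [100_000, 12_000]})
v2 = pd.DataFrame({"item": ["total_assets", "own_funds"], "value": [101_500, 12_000]})

# Join the versions and report any item whose value changed, with the size of the change.
diff = v1.merge(v2, on="item", suffixes=("_old", "_new"))
diff["delta"] = diff["value_new"] - diff["value_old"]
changed = diff[diff["delta"] != 0]
print(changed)  # one row: total_assets moved by +1,500 - do we understand why?
```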
Completeness, or - “You Missed a Bit”
This dimension is about ensuring that all the relevant data items have been supplied. A key factor here is actually knowing which of your data items are relevant - i.e., that they are ultimately used for producing the prudential reports. If you know your data lineage, you know the scope for checking that all of your reporting inputs are complete.
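Again purely for illustration, with a made-up list of in-scope items, a completeness check can be as simple as a set difference:

```python
# Hypothetical scope: the data items your lineage says feed the prudential reports.
REQUIRED_ITEMS = {"total_assets", "own_funds", "lcr", "leverage_ratio"}

# The items actually supplied this reporting period.
supplied_items = {"total_assets", "own_funds", "lcr"}

missing = REQUIRED_ITEMS - supplied_items
if missing:
    print(f"Completeness issue: missing {sorted(missing)}")
else:
    print("All required items supplied")
```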
Consistency, or - “One Version of the Truth”
Described as “logical concordance between data subsets”, this basically means that where the same value (or set of values) appears in more than one report, they should not contradict. Are you using the same version of source data as your colleagues in the other department? Does “Total Assets” (for example) match across reports produced by different systems?
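A hedged example of a cross-report consistency check, with invented figures and an arbitrary rounding tolerance:

```python
# The same figure produced by two different systems should agree (within rounding).
report_a = {"total_assets": 100_000_000}  # e.g. from the finance system
report_b = {"total_assets": 100_000_250}  # e.g. from the risk system

tolerance = 1_000  # acceptable rounding difference, in EUR (illustrative)
gap = abs(report_a["total_assets"] - report_b["total_assets"])
if gap > tolerance:
    print(f"Consistency issue: Total Assets differs by {gap} between reports")
else:
    print("Total Assets is consistent across reports")
```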
Plausibility, or - “That Doesn’t Usually Happen”
Using time series analysis across reporting periods, it should be possible to identify variables that have deviated significantly from their usual values.
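One common way to implement such a plausibility check (not mandated by the circular) is a simple z-score against the historical series; the figures below are invented:

```python
from statistics import mean, stdev

# Hypothetical quarterly time series for one reported variable.
history = [98.2, 101.5, 99.8, 100.9, 102.1, 100.4]  # previous periods
current = 137.0                                      # this period's value

# Flag the value if it sits more than 3 standard deviations from the historical mean.
mu, sigma = mean(history), stdev(history)
z = (current - mu) / sigma
if abs(z) > 3:
    print(f"Plausibility issue: value {current} is {z:.1f} sigma away from the usual range")
```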
Timeliness, or - “Why are we waiting?”
This ultimately refers to the amount of time between the end of the reporting period and the point at which results are submitted. On a more individual / systems level, it refers to the point at which the last required input is received, and the final output is produced. This is especially important for those data processes that are on the ‘critical path’. Is your data produced in a reasonable timeframe?
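A small illustrative timeliness check, assuming a hypothetical 28-day submission deadline after period end:

```python
from datetime import date

# Hypothetical deadline: results due 28 calendar days after the period end.
period_end = date(2018, 3, 31)
submitted  = date(2018, 5, 4)
deadline_days = 28

elapsed = (submitted - period_end).days
if elapsed > deadline_days:
    print(f"Timeliness issue: submitted {elapsed} days after period end "
          f"({elapsed - deadline_days} days late)")
```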
Principles of Data Quality
Whilst John would get his motorcycle fixed after it went wrong, Robert would proactively inspect his machine on a regular basis. In this regard, the NBB defines a set of principles with some specific action points that might be considered as “Preventative Maintenance” for Data Quality.
Principle 1 - Governance
The process of preparing, verifying, and submitting the prudential data to the Bank should be supported by a robust, documented governance system. The main impact of this for regular users is the identification of roles / responsibilities - who is responsible for the data? Given that such information is always changing, it is important that this evidence can be captured in an automated way.
In addition, there ought to be a separation between those who prepare the data, and those who validate it (sign it off) - which can be described as a ‘4 eyes’ principle.
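As a toy illustration of the ‘4 eyes’ principle (the names and fields are invented), the control ultimately boils down to checking that the preparer and the approver are different people:

```python
# Minimal sketch of a '4 eyes' control: the person approving a change
# must not be the person who prepared it.
change = {"table": "own_funds", "prepared_by": "alice", "approved_by": "alice"}

if change["approved_by"] == change["prepared_by"]:
    print("4-eyes breach: preparer and approver must be different people")
else:
    print("4-eyes control satisfied")
```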
Principle 2 - Technical Capacities
Institutions should design, establish and manage such data architecture and IT infrastructure as are appropriate for producing and verifying prudential reporting. There are a number of points here - firstly, do your systems have the capacity to ensure reporting can be produced even in times of stress / crisis? What monitoring do you have in place to ensure this? Performance tuning solutions such as ESM® are invaluable when running heavy analytic workloads in a shared environment during a busy month end.
Secondly, are tools in place to ensure timely detection and resolution of Data Quality errors and inconsistencies? Is this information archived, and appropriately followed up? Are these tools periodically reviewed and maintained?
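One way such detection and follow-up might be evidenced (a sketch only, with invented fields) is a simple issue log that records each finding, its owner, and its status, so it can be archived and reviewed later:

```python
from datetime import date

# Hypothetical issue log: each detected data quality error is recorded with
# enough detail to evidence follow-up and periodic review.
issue_log = []

def log_issue(check, detail, owner):
    issue_log.append({
        "detected": date.today().isoformat(),
        "check": check,
        "detail": detail,
        "owner": owner,
        "status": "open",
    })

log_issue("accuracy", "Invalid country code 'UK' on counterparty Foo Ltd", "alice")
print(issue_log)
```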
Finally, it is important that tools for information management are as automated and integrated as possible. Each unconnected End-User Computing (EUC) application requires a secured, verified and documented process to ensure its reliability. Where that EUC involves manual data processing, it is also necessary to document the reason for manual processing, as well as the associated risks, and measures taken to compensate for those risks. Now if only there were an easy way to integrate and validate that EUC data...
Principle 3 - Process
The process of preparing, verifying and submitting the prudential data to the Bank should follow a documented internal process. This is by far the most onerous of the principles! A long list of requirements ensues.
As if that wasn’t enough - accredited statutory external auditors are also required to examine Data Quality.
Where is the Zen?
If your company is based in Belgium, or a country with equivalent requirements for Data Quality, you’d be forgiven for thinking the above requirements are horrifically manual and not particularly zen-like.
The good news is that if you have a SAS® BI platform, the chances are that a significant chunk of the above can be automated. The Data Controller is a modern HTML5 web application for real-time data modification and approval (‘4 eyes’) workflow with a full audit trail. Data Quality rules are applied at source, and ‘hook scripts’ allow SAS jobs to execute after data updates are approved, enabling full automation and integration with existing systems.
More information on how this tool can help automate the requirements of the NBB_2017_27 circular is available below:
“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”