Ideas and guidelines on how businesses can improve the quality of their data
Krish Pillai
Helping people land their first data job | Successful mentee count: 107/500 | Australia ???? | Loves teaching essential data skills | Free resources to land a data job, fast | LinkedIn Top Voice ?
Data and evidence-based decision-making play an indispensable role in a company's, (whatever size or form) success and growth. The caveat nonetheless is that the insights gathered from a dataset are only as good as their quality.
Data analytics and science teams, irrespective of how advanced they are, spend a good amount of time (some say close to a whopping 90%) cleaning their datasets to ensure that the data they ingest to build advanced models or even for a simple bar chart is of high quality.
Bad data can have colossal business consequences for organizations. Poor data quality is cited often as the source of operational blunders, flop business strategies, and faulty insights. Examples of financial damage that data quality issues can cause include additional expenses when a product is shipped to the wrong customer address, fraudulent bank transactions to ghost account numbers, missed sales because of wrong or incomplete customer details, and hefty fines for improper regulatory and financial crime reporting.
There is a famous estimate calculated by IBM which states that the annual cost of data quality issues in the U.S. amounted to $3.1 trillion in 2016.
As part of a company’s data management process and data quality checks, how good would it be to have a centralized way that measures and checks the data quality of all datasets and fields continuously? A dream come true for all data analysts and scientists, isn't it? Through this article, I will share some ideas and guidelines on how businesses can improve the quality of their data.
1) Measure your data quality across these dimensions
It's important to first understand what are the different dimensions one can use to assess data quality. There are 6 major ones and they are shown below (source Precisely Blog)
Accuracy
How well does a piece of information reflect reality? E.g. customer's age is 32 but the system says 34.
Completeness
Does it fulfill your expectations of what’s comprehensive? The opposite of this is called 'missingess' - a check to see if there are any data points missing - fully or partially. E.g. customer name column having some blanks, or not present at all though it should have. It could also be certain rows are missing e.g. when a daily ETL process fails and goes unnoticed.
Consistency
Does information stored in one place match relevant data stored elsewhere? Ie. Does data stored at different locations tell the same story? E.g. if the HR system says an employee does not work here but the payroll says the employee is still active and receives a paycheck.
Timeliness
Is your information available when you need it? Say you need data every quarter to make a particular decision, if it is ready when it is supposed to be then this dimension is fulfilled.
Validity
Is the information in a specific format, does it follow business rules, or is it in an unusable format? A good example would be a phone number and the values are alpha and not numeric. Another one would be the date of birth field, with each customer entering theirs in different formats.
Uniqueness
Is this the only instance in which this information appears in the database? Some data points are now supposed to be duplicated e.g. in a table that holds the current credit card information of customers, the credit card number column should be unique. A duplicate suggests that something is out of place, and a data quality issue exists.
2. Involve subject matter experts to set data quality rules and measure your DQ performance
Now that we have seen what the different dimensions are, we can look into some of the rules that can be applied under each of these 6 dimensions.
Rules are usually drafted with the help of a number of people within an organization. Common rules such as checking for NULLs are easy to come up with, but there might be organization, domain end system-specific rules. For E.g. What characters can a customer ID take? Each organization will have different rules for this particular data element. Some companies allow alphanumeric only, others allow on numeric, etc.
Hence, it is very important to include subject matter experts from various departments. To devise comprehensive data quality rules, one should correctly define all the subject matter experts and make sure to include all their requirements. A team’s rule may not seem important to another team and vice versa.
3. Don’t go overboard with the number of rules set
It is important that you don’t get too carried away and blindly include rules just because you can. Too many rules will throw exceptions that are not manageable and sustainable. It also reduces system performance as you need more computing power and time required to do the check. It is not a good idea to overwhelm your critical databases with these data quality checks. As the number of data records grows, the number of rules you set should
4) Treat each field of the database individually and create rules accordingly
As you have seen, there are multiple dimensions out there. Each field within a database should be treated separately and the most appropriate rule and dimension have to be applied. For e.g. if your company is into shipping some product, “Customer Address” is more critical than say “Customer Gender”. A critical field should satisfy all the five dimensions, whereas customer gender needs only validity and completeness. These characteristics should be reflected in data quality rules. Another example will be “Employee Full Name” and “Employee Contact Number” - and depending on the situation, one will be more important than the other. Let’s look at some rules that could be applied to these fields (source: TDAN.com)
- Employee full name must not be N/A (to ensure completeness).
- Only one ‘Employee full name’ must correspond to one ‘Social security number’ (to ensure uniqueness).
- Employee full name must include at least one space, must consist only of letters, no figures or other characters allowed (to ensure accuracy and completeness).
- Employee contact number must include only digits (to ensure accuracy and orderliness).
- Employee contact number must be in the format +61-04-XXX XX XXX, where X means digits from 0 till 9 (to ensure accuracy and validity).
5) Have a DQ Monitoring Dashboard in place
A DQ monitoring dashboard gives the company a clear picture of what amount of its data asset is of bad and good quality. It shows you the realities of data. With a dashboard that continuously monitors and quantifies data quality - there are no more vague estimates or guesswork. The real culprits can be found and bringing that into the forefront leads to continuous improvement of data.
An example is shown below (source: Talend.com)
Summary level — overall data quality
The summary level shows the overall score across all data that are being profiled and monitored. It provides the ability to filter by:
- department (finance, risk, operations, etc.)
- data quality dimension (the six dimensions cited above)
Record level (rule level) data quality
This level view will allow us to make actionable recommendations on how data quality can be improved. We can pinpoint the real culprits down to a field and database level. Once these failed assessments are identified, necessary controls can be kept in place to improve the data quality score.
Business Intelligence Manager
3 年Great article Krish!