Ideas and guidelines on how businesses can improve the quality of their data
https://www.redpointglobal.com/wp-content/uploads/2017/12/data-quality-management.jpg

Ideas and guidelines on how businesses can improve the quality of their data

Data and evidence-based decision-making play an indispensable role in a company's, (whatever size or form) success and growth. The caveat nonetheless is that the insights gathered from a dataset are only as good as their quality.

Data analytics and science teams, irrespective of how advanced they are, spend a good amount of time (some say close to a whopping 90%) cleaning their datasets to ensure that the data they ingest to build advanced models or even for a simple bar chart is of high quality.

Bad data can have colossal business consequences for organizations. Poor data quality is cited often as the source of operational blunders, flop business strategies, and faulty insights. Examples of financial damage that data quality issues can cause include additional expenses when a product is shipped to the wrong customer address, fraudulent bank transactions to ghost account numbers, missed sales because of wrong or incomplete customer details, and hefty fines for improper regulatory and financial crime reporting.

There is a famous estimate calculated by IBM which states that the annual cost of data quality issues in the U.S. amounted to $3.1 trillion in 2016.

As part of a company’s data management process and data quality checks, how good would it be to have a centralized way that measures and checks the data quality of all datasets and fields continuously? A dream come true for all data analysts and scientists, isn't it? Through this article, I will share some ideas and guidelines on how businesses can improve the quality of their data.

No alt text provided for this image

1) Measure your data quality across these dimensions

It's important to first understand what are the different dimensions one can use to assess data quality. There are 6 major ones and they are shown below (source Precisely Blog)

No alt text provided for this image

Accuracy

How well does a piece of information reflect reality? E.g. customer's age is 32 but the system says 34.

Completeness

Does it fulfill your expectations of what’s comprehensive? The opposite of this is called 'missingess' - a check to see if there are any data points missing - fully or partially. E.g. customer name column having some blanks, or not present at all though it should have. It could also be certain rows are missing e.g. when a daily ETL process fails and goes unnoticed.

Consistency

Does information stored in one place match relevant data stored elsewhere? Ie. Does data stored at different locations tell the same story? E.g. if the HR system says an employee does not work here but the payroll says the employee is still active and receives a paycheck.

Timeliness

Is your information available when you need it? Say you need data every quarter to make a particular decision, if it is ready when it is supposed to be then this dimension is fulfilled.

Validity

Is the information in a specific format, does it follow business rules, or is it in an unusable format? A good example would be a phone number and the values are alpha and not numeric. Another one would be the date of birth field, with each customer entering theirs in different formats.

Uniqueness

Is this the only instance in which this information appears in the database? Some data points are now supposed to be duplicated e.g. in a table that holds the current credit card information of customers, the credit card number column should be unique. A duplicate suggests that something is out of place, and a data quality issue exists.

2. Involve subject matter experts to set data quality rules and measure your DQ performance

Now that we have seen what the different dimensions are, we can look into some of the rules that can be applied under each of these 6 dimensions. 

Rules are usually drafted with the help of a number of people within an organization. Common rules such as checking for NULLs are easy to come up with, but there might be organization, domain end system-specific rules. For E.g. What characters can a customer ID take? Each organization will have different rules for this particular data element. Some companies allow alphanumeric only, others allow on numeric, etc. 

Hence, it is very important to include subject matter experts from various departments. To devise comprehensive data quality rules, one should correctly define all the subject matter experts and make sure to include all their requirements. A team’s rule may not seem important to another team and vice versa. 

No alt text provided for this image

3. Don’t go overboard with the number of rules set

It is important that you don’t get too carried away and blindly include rules just because you can. Too many rules will throw exceptions that are not manageable and sustainable. It also reduces system performance as you need more computing power and time required to do the check. It is not a good idea to overwhelm your critical databases with these data quality checks. As the number of data records grows, the number of rules you set should 

No alt text provided for this image

4) Treat each field of the database individually and create rules accordingly

As you have seen, there are multiple dimensions out there. Each field within a database should be treated separately and the most appropriate rule and dimension have to be applied. For e.g. if your company is into shipping some product, “Customer Address” is more critical than say “Customer Gender”. A critical field should satisfy all the five dimensions, whereas customer gender needs only validity and completeness. These characteristics should be reflected in data quality rules. Another example will be “Employee Full Name” and “Employee Contact Number” - and depending on the situation, one will be more important than the other. Let’s look at some rules that could be applied to these fields (source: TDAN.com)

  • Employee full name must not be N/A (to ensure completeness).
  • Only one ‘Employee full name’ must correspond to one ‘Social security number’ (to ensure uniqueness).
  • Employee full name must include at least one space, must consist only of letters, no figures or other characters allowed (to ensure accuracy and completeness).
  • Employee contact number must include only digits (to ensure accuracy and orderliness).
  • Employee contact number must be in the format +61-04-XXX XX XXX, where X means digits from 0 till 9 (to ensure accuracy and validity).
No alt text provided for this image

5) Have a DQ Monitoring Dashboard in place

A DQ monitoring dashboard gives the company a clear picture of what amount of its data asset is of bad and good quality. It shows you the realities of data. With a dashboard that continuously monitors and quantifies data quality - there are no more vague estimates or guesswork. The real culprits can be found and bringing that into the forefront leads to continuous improvement of data.

An example is shown below (source: Talend.com)

Summary level — overall data quality

The summary level shows the overall score across all data that are being profiled and monitored. It provides the ability to filter by:

  • department (finance, risk, operations, etc.)
  • data quality dimension (the six dimensions cited above)
No alt text provided for this image

Record level (rule level) data quality

This level view will allow us to make actionable recommendations on how data quality can be improved. We can pinpoint the real culprits down to a field and database level. Once these failed assessments are identified, necessary controls can be kept in place to improve the data quality score.

No alt text provided for this image


Vidhya Rao

Business Intelligence Manager

3 年

Great article Krish!

要查看或添加评论,请登录

社区洞察

其他会员也浏览了