登录查看更多内容

Ideas and guidelines on how businesses can improve the quality of their data

Krish Pillai

Helping people land their first data job | Successful mentee count: 107/500 | Australia ???? | Loves teaching essential data skills | Free resources to land a data job, fast | LinkedIn Top Voice ?

发布日期: 2021年6月20日

Data and evidence-based decision-making play an indispensable role in a company's, (whatever size or form) success and growth. The caveat nonetheless is that the insights gathered from a dataset are only as good as their quality.

Data analytics and science teams, irrespective of how advanced they are, spend a good amount of time (some say close to a whopping 90%) cleaning their datasets to ensure that the data they ingest to build advanced models or even for a simple bar chart is of high quality.

Bad data can have colossal business consequences for organizations. Poor data quality is cited often as the source of operational blunders, flop business strategies, and faulty insights. Examples of financial damage that data quality issues can cause include additional expenses when a product is shipped to the wrong customer address, fraudulent bank transactions to ghost account numbers, missed sales because of wrong or incomplete customer details, and hefty fines for improper regulatory and financial crime reporting.

There is a famous estimate calculated by IBM which states that the annual cost of data quality issues in the U.S. amounted to $3.1 trillion in 2016.

As part of a company’s data management process and data quality checks, how good would it be to have a centralized way that measures and checks the data quality of all datasets and fields continuously? A dream come true for all data analysts and scientists, isn't it? Through this article, I will share some ideas and guidelines on how businesses can improve the quality of their data.

1) Measure your data quality across these dimensions

It's important to first understand what are the different dimensions one can use to assess data quality. There are 6 major ones and they are shown below (source Precisely Blog)

Accuracy

How well does a piece of information reflect reality? E.g. customer's age is 32 but the system says 34.

Completeness

Does it fulfill your expectations of what’s comprehensive? The opposite of this is called 'missingess' - a check to see if there are any data points missing - fully or partially. E.g. customer name column having some blanks, or not present at all though it should have. It could also be certain rows are missing e.g. when a daily ETL process fails and goes unnoticed.

Consistency

Does information stored in one place match relevant data stored elsewhere? Ie. Does data stored at different locations tell the same story? E.g. if the HR system says an employee does not work here but the payroll says the employee is still active and receives a paycheck.

Timeliness

Is your information available when you need it? Say you need data every quarter to make a particular decision, if it is ready when it is supposed to be then this dimension is fulfilled.

Validity

Is the information in a specific format, does it follow business rules, or is it in an unusable format? A good example would be a phone number and the values are alpha and not numeric. Another one would be the date of birth field, with each customer entering theirs in different formats.

Uniqueness

Is this the only instance in which this information appears in the database? Some data points are now supposed to be duplicated e.g. in a table that holds the current credit card information of customers, the credit card number column should be unique. A duplicate suggests that something is out of place, and a data quality issue exists.

2. Involve subject matter experts to set data quality rules and measure your DQ performance

Now that we have seen what the different dimensions are, we can look into some of the rules that can be applied under each of these 6 dimensions.

Rules are usually drafted with the help of a number of people within an organization. Common rules such as checking for NULLs are easy to come up with, but there might be organization, domain end system-specific rules. For E.g. What characters can a customer ID take? Each organization will have different rules for this particular data element. Some companies allow alphanumeric only, others allow on numeric, etc.

Hence, it is very important to include subject matter experts from various departments. To devise comprehensive data quality rules, one should correctly define all the subject matter experts and make sure to include all their requirements. A team’s rule may not seem important to another team and vice versa.

3. Don’t go overboard with the number of rules set

It is important that you don’t get too carried away and blindly include rules just because you can. Too many rules will throw exceptions that are not manageable and sustainable. It also reduces system performance as you need more computing power and time required to do the check. It is not a good idea to overwhelm your critical databases with these data quality checks. As the number of data records grows, the number of rules you set should

4) Treat each field of the database individually and create rules accordingly

As you have seen, there are multiple dimensions out there. Each field within a database should be treated separately and the most appropriate rule and dimension have to be applied. For e.g. if your company is into shipping some product, “Customer Address” is more critical than say “Customer Gender”. A critical field should satisfy all the five dimensions, whereas customer gender needs only validity and completeness. These characteristics should be reflected in data quality rules. Another example will be “Employee Full Name” and “Employee Contact Number” - and depending on the situation, one will be more important than the other. Let’s look at some rules that could be applied to these fields (source: TDAN.com)

Employee full name must not be N/A (to ensure completeness).
Only one ‘Employee full name’ must correspond to one ‘Social security number’ (to ensure uniqueness).
Employee full name must include at least one space, must consist only of letters, no figures or other characters allowed (to ensure accuracy and completeness).
Employee contact number must include only digits (to ensure accuracy and orderliness).
Employee contact number must be in the format +61-04-XXX XX XXX, where X means digits from 0 till 9 (to ensure accuracy and validity).

5) Have a DQ Monitoring Dashboard in place

A DQ monitoring dashboard gives the company a clear picture of what amount of its data asset is of bad and good quality. It shows you the realities of data. With a dashboard that continuously monitors and quantifies data quality - there are no more vague estimates or guesswork. The real culprits can be found and bringing that into the forefront leads to continuous improvement of data.

An example is shown below (source: Talend.com)

Summary level — overall data quality

The summary level shows the overall score across all data that are being profiled and monitored. It provides the ability to filter by:

department (finance, risk, operations, etc.)
data quality dimension (the six dimensions cited above)

Record level (rule level) data quality

This level view will allow us to make actionable recommendations on how data quality can be improved. We can pinpoint the real culprits down to a field and database level. Once these failed assessments are identified, necessary controls can be kept in place to improve the data quality score.

Vidhya Rao

Business Intelligence Manager

3 年

Great article Krish!

1 次回应

要查看或添加评论，请登录

查看全部

Ideas and guidelines on how businesses can improve the quality of their data

Krish Pillai

Helping people land their first data job | Successful mentee count: 107/500 | Australia ???? | Loves teaching essential data skills | Free resources to land a data job, fast | LinkedIn Top Voice ?

1) Measure your data quality across these dimensions

2. Involve subject matter experts to set data quality rules and measure your DQ performance

3. Don’t go overboard with the number of rules set

4) Treat each field of the database individually and create rules accordingly

更多精彩文章

社区洞察

其他会员也浏览了

Data quality doesn't matter

Data Quality beyond the Metrics..!!

Navigating Data Quality: Exploring Issue Types, Impacts, and Solutions

The Vital Link Between Data Validity and Data Reliability

What data?

The Art of Data Cleaning: Ensuring Accuracy in Your Analysis

"External data, a game changer or just an additional data layer?"

Data Quality is a Money Pit

How do you assess the value of your data?

Data Sharing: The best way to better data

1) Measure your data quality across these dimensions

2. Involve subject matter experts to set data quality rules and measure your DQ performance

3. Don’t go overboard with the number of rules set

4) Treat each field of the database individually and create rules accordingly

[MAJOR UPDATE??] WEEK 44: DATA JOB OPENINGS IN AUSTRALIA

2024年11月11日

??WEEK 43: DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年11月3日

??WEEK 42: [390 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年10月28日

??WEEK 41: [392 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年10月21日

??WEEK 40: [269 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年10月13日

??WEEK 39: [412 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年10月8日

??WEEK 38: [426 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年9月30日

??WEEK 37: [403 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年9月23日

??WEEK 36: [388 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年9月16日

??WEEK 35: [393 jobs] DATA ANALYST JOB OPENINGS IN AUSTRALIA??

2024年9月9日

社区洞察

其他会员也浏览了

Data quality doesn't matter

Data Quality beyond the Metrics..!!

Navigating Data Quality: Exploring Issue Types, Impacts, and Solutions

The Vital Link Between Data Validity and Data Reliability

What data?

The Art of Data Cleaning: Ensuring Accuracy in Your Analysis

"External data, a game changer or just an additional data layer?"

Data Quality is a Money Pit

How do you assess the value of your data?

Data Sharing: The best way to better data