SMEs: A Practical Approach to Managing Your Data Quality Part 1
Peter Domanski
Co-Founder of Kegsoft.com, co-founder Chefswarehouse.co.uk, Lead Business & Data Architect
The importance of good data quality (DQ) in business is a frequent topic, with a strong push for proactive management. Granted, it's not the most thrilling subject and often requires a nudge or a mishap to capture our attention. With a multitude of more urgent tasks, DQ doesn't quite have the allure of, for instance, AI. However, I recently encountered a CV where the candidate professed a 'passion for Data Quality.' Not words I would use! Yet, considering the vast amount and variety of data we amass, along with the growing network of interconnected and automated systems we employ, the focus on DQ is justified.
In Part 1 of this article, I explore the fundamental aspects of data quality and aim to clarify the crucial concepts that may seem intimidating to managers who are not well-versed in data theory or its management. If you persevere beyond the initial paragraphs, I hope to make these ideas more accessible. Part 2 will address practical approaches to enhancing data quality standards, emphasizing common sense, cost-effectiveness, and simplicity.
To begin, consider this 'if only' scenario from a recent experience: If only I had shifted the jam jar from the edge of the table, it wouldn't have shattered on the floor. I would have spared myself 20 minutes of cleaning, the dog wouldn't have cut its tongue on the shards, and we would have avoided an hour at the vet and a £200 bill. Moving the jar promptly would have prevented so much trouble and expense.
Similarly, indifference to DQ has consequences, but a few 'stitches in time' will help avoid many 'if onlys'. The cost-benefit of a proactive approach can be explained with a familiar industrial paradigm: the 1:10:100 rule, where every £1 spent preventing a data defect saves roughly £10 in correction and £100 in failure costs.
Inaction can lead to data-related failures, resulting in unnecessary costs and complications: remediation, compensation and apologies are just a few of the repercussions. Poor data quality not only wastes valuable time and money but also damages reputation, which is particularly harmful in the digital age. Moreover, the risk of feeding unreliable data into analytics for decision-making is often underestimated and can have dire consequences. The effects may be subtle; for instance, sophisticated AI modelling techniques might inadvertently draw incorrect inferences from questionable data, leading to bias and distortion. Maintaining high DQ therefore benefits your bottom line, enhances productivity, and minimises the business risk of misleading analysis in strategic planning and decision-making.
Data quality in regulated sectors like banking is now rightly treated with the utmost seriousness. Overlooking or neglecting poor data is negligence, as defects can lead to unpredictable and potentially catastrophic outcomes, including breaches of security, safety and confidentiality. In financial services, substandard DQ has resulted in punitive fines, suggesting that the 1:10:100 ratios could realistically escalate to 1:10:1000.
Data Quality: 50 Shades of Grey
A simple question: Do you trust your data? If so, is your confidence backed by evidence? Are you able to accept a certain degree of imperfection? Data errors in 'transactions' tend to reveal themselves quickly, for instance, through customer complaints about incorrect charges or delivery details. However, when it comes to reporting and analytics, errors and inconsistencies can be hidden and elusive, making them insidious and potentially damaging. If you cannot substantiate the reliability of your data, it may be prudent to consider some form of audit.
In reality, most business data is not graded 'good' or 'bad' in absolute terms. It need not be perfect, but it has to be fit for purpose, and knowing that your DQ is at least 'good enough' gives some peace of mind that your operational error rate will stay within acceptable bounds. The measure of quality, therefore, is not black or white but a greyscale. Determining what 'good enough' means for you, and gauging where your data sits on that quality spectrum, is crucial, because improving and sustaining data quality requires consistent effort.
What is Data?
Data comes in many sizes, shapes and formats, varying from the simple to the highly structured; it's often time-dependent and is stored using a variety of technologies. Many organisations have evolved their IT and processes over time and so store and manage data in everything from spreadsheets to cloud-based databases (actually, more likely to be under the ocean); it may be highly dispersed or consolidated in data warehouses and lakes. For the purpose of this article, we'll assume that operational business data is primarily 'transactional' in style and tabular, structured in rows and columns as in spreadsheets or database tables. You may have huge volumes of customer or patient records stored as rows, each with their attributes in columns, and when we talk of DQ, it applies at whole-record (row) level, to individual column values AND to relationships between datasets, since 'transactions' are assemblages of linked and related records.
What are the Causes of Poor DQ?
Where to start? For now, we'll just touch on a few obvious areas. We're all human and fallible, and there are numerous points in systems and processes where data can be mis-recorded, omitted or made ambiguous, either at data entry or in downstream processing. We've all experienced mistakes with our own data: telephone number, email address, postcode, account number and so on. Paradoxically, as so much data is now self-entered by us as agents or customers using online applications and websites, there are more opportunities for errors, both accidental and deliberate (how many people have lied about their age or income?).
DQ is not just about careful data entry and acquisition; information can also become stale. Data collected last year may now be well out of date and can be as misleading as accidental errors. Sometimes misunderstandings between departments lead to ambiguity and fudge. It happens when a business is 'siloed': one department uses and updates data for its own purposes and is not aware of the effects and nuances in another. This stems from a lack of shared understanding of data structure, definitions and governance.
Measuring DQ: 6 Pillars of Wisdom?
Now we're getting to the nitty-gritty. Google the term 'DQ' and a diverse, confusing subject is revealed. Notice the many references to 'pillars' and 'dimensions' of data quality, which vary in number between 3 and 12 depending on the pundit. Bear in mind that DQ is another IT 'growth industry' with numerous technology-oriented 'practitioners' avidly competing for your attention; they pay handsomely for advertising clicks to push their products and services to the top of search results. Best not to be swayed by promises of AI solutions and applications that offer you the moon on a stick. DQ processes are fundamentally straightforward, and it's worth scrolling down the listings for more impartial (and free) advice that concentrates on methods rather than technology. Look for something that sits between blatant advertising and academic waffle.
DQ is a catch-all term (like 'sustainable', 'robust' or 'scalable'), and the pillars or dimensions are terms used to differentiate aspects of quality, helping to make the approach objective. Below are six commonly quoted categories that I like to work with. They could be subdivided further, but for me they form a good basis for questioning the quality of data.
1 Accuracy
Are the data values you are using correct? Stored values can meet all the standards required by the five subsequent DQ categories and yet still be inaccurate, e.g. holding the wrong date of birth, postcode, current cost price or parameter value. Was there guesswork involved? Were the customer's name and address recorded correctly, and are they still correct? Accuracy is a measure of the truth of recorded values, i.e. whether they represent current and contemporary reality.
So how can we check for accuracy? We can do a lot before we have to inspect records manually by testing that they are 'reasonable and sensible' against criteria set out by the other DQ categories. By 'parsing' data against basic rules we can rapidly identify the most heinous errors. For instance, a customer's date of birth indicating an age over 120 or under 12 would certainly raise suspicion. Only after these reasonability checks would we conduct spot checks to determine the actual error rate.
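To make this concrete, here is a minimal sketch of such a reasonability scan in Python with pandas. The sample table, column names and the 12-120 age bounds are illustrative assumptions, not a prescription from any particular tool.

```python
import pandas as pd

# Hypothetical customer extract; in practice this would be pulled from a
# database table or a CSV export.
customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003"],
    "date_of_birth": ["1985-06-14", "1891-01-02", "2019-11-30"],
})

dob = pd.to_datetime(customers["date_of_birth"], errors="coerce")
age_years = (pd.Timestamp.today() - dob).dt.days / 365.25

# Flag records that fail the 'reasonable and sensible' test: unparseable
# dates, or ages outside a plausible 12-120 year range.
suspect = customers[dob.isna() | (age_years > 120) | (age_years < 12)]
print(suspect)
```

The output is a shortlist of suspect records; only these need the manual spot checks described above.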
2 Completeness
Completeness is achieved when all necessary values and records are present. This does not mean every data field must be populated, as some are optional; it is crucial to distinguish mandatory components from those that are not. Data fields critical to business processes, often called Key Data Elements (KDEs), must be free of nulls, blanks, spaces or undefined entries such as 'N/A' or 'unknown'. Conversely, fields marked as optional, like a delivery note, are not vital to the transaction and do not require consistent completion.
By differentiating between mandatory and optional data components, we can swiftly identify missing values and records. The 'consistency' category below reinforces the significance of completeness. It's also important to establish how optional fields are marked when unused (a null, a blank or a specific value) to prevent ambiguous values from disrupting analysis. For instance, if the number of dependents for an individual is unknown, assigning a value of 0 is unhelpful, as it may be misinterpreted as 'no children' rather than 'unknown', potentially skewing the analysis.
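As a sketch of how such a check might be scripted (continuing the illustrative Python/pandas approach, with hypothetical column names and an assumed list of placeholder values):

```python
import pandas as pd

# Hypothetical order extract with two Key Data Elements (KDEs).
orders = pd.DataFrame({
    "order_no":    ["O100", "O101", "O102", "O103"],
    "customer_id": ["C001", None,   "  ",   "C004"],
    "order_date":  ["2024-03-01", "2024-03-02", "N/A", "unknown"],
})

KDE_COLUMNS = ["customer_id", "order_date"]
PLACEHOLDERS = {"", "n/a", "unknown", "none", "null"}

for col in KDE_COLUMNS:
    # Treat nulls, blanks and undefined placeholder entries as incomplete.
    cleaned = orders[col].astype(str).str.strip().str.lower()
    incomplete = orders[col].isna() | cleaned.isin(PLACEHOLDERS)
    print(f"{col}: {incomplete.sum()} of {len(orders)} records incomplete")
    print(orders[incomplete], "\n")
```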
3 Uniqueness
Each data record should represent a unique instance of an entity and appear only once in a dataset, such as an individual in a customer table. Duplicates (multiple records for the same person) can lead to confusion, such as parcels sent to incorrect addresses, or to problems with accounts and analytics. To ensure uniqueness, we look for an attribute or combination of attributes that can serve as a unique key. However, this is often complex, especially with personal records: names are not unique and can change, not everyone has a National Insurance Number (NINO), and email addresses and phone numbers may also vary over time. The common solution is to assign an 'artificial' key, like a customer reference number or ID. This is effective in theory, as it allows other record attributes to change while still identifying the same individual, but it doesn't always prevent business users from accidentally creating multiple records for the same person.
To address uniqueness failures, it's crucial first to ensure a robust identifier, such as a customer ID, is in place. However, logically identifying an individual can be complex: similar-sounding names with the same date of birth, identical names within the same postcode, repeated use of an email address, and nicknames are just a few examples. Some degree of fuzzy matching is often necessary when dealing with human data (AI is particularly helpful in this area). It's vital to prevent duplicate entries in your applications from the outset, as rectifying them later can be challenging, particularly when the records are intertwined with transactions such as sales orders. Historically, finding strong, unique identifiers for individuals has been problematic for authorities, which is why there's an increasing reliance on biometrics to minimise the risks of duplication and misidentification.
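A simple sketch of a first-pass duplicate hunt is shown below (Python/pandas for illustration, with made-up records). It only normalises case and whitespace to build a crude match key; real de-duplication usually needs genuinely fuzzy matching and human confirmation before records are merged.

```python
import pandas as pd

# Hypothetical customer table where the same person may have been entered
# twice with small variations in name and postcode.
customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003"],
    "name":        ["Jane  Smith", "jane smith", "John Brown"],
    "postcode":    ["AB1 2CD", "ab12cd", "ZZ9 9ZZ"],
    "dob":         ["1985-06-14", "1985-06-14", "1970-01-01"],
})

def normalise(series: pd.Series) -> pd.Series:
    # Lower-case and strip all whitespace so trivial variations match.
    return series.str.lower().str.replace(r"\s+", "", regex=True)

# A crude 'match key'; records sharing it are candidate duplicates only and
# still need fuzzier logic or human review before merging.
customers["match_key"] = (normalise(customers["name"]) + "|"
                          + normalise(customers["postcode"]) + "|"
                          + customers["dob"])

candidates = customers[customers.duplicated("match_key", keep=False)]
print(candidates.sort_values("match_key"))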
4 Consistency
Consistency is achieved when there is harmony across and between datasets; it's rooted in database theory and practice. A typical 'transaction' has a structure comprising several related datasets (aka tables) that might look like this:
Customers --> Orders --> Order Lines <-- Products
In Orders, each record has a unique order number together with a customer id value linking it to its customer record in Customers. A single sales order may have multiple Order Lines records, each carrying an order number to link it to its sales order. As each order line is for a specific product or service, it will also contain a product code linking it to a record in Products. And so on: there can be numerous datasets (or tables) in a transaction, linked by such relationships through their unique record 'keys'.
Perfect consistency is achieved when all component records for transactions and related datasets are correctly linked through their relationship keys, a state known as referential integrity. System flaws or poor data management can compromise integrity, resulting in incomplete transactions, often referred to as 'widows & orphans' (e.g. order lines without sales orders, or sales orders without customers). Inconsistency leads to clutter, which in turn causes increasingly inaccurate and unpredictable outcomes, especially in reporting and analytics. Analytical models and 'lenses' may fail to detect imperfect structures and poor data quality, leading to errors. Be mindful of the principle 'Garbage In, Garbage Out' (GIGO) when making significant decisions based on reports from tools like Power BI, and always verify the DQ status first. Thankfully, the referential structures within your datasets are governed by rules; once these rules are defined, you can profile data for consistency. The initial step, however, is to document those rules. Caution is advised, as data-specific relationships can be complex and may be misunderstood or overlooked in analytical reporting.
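For illustration, an orphan check of Order Lines against Orders could be sketched as follows, again in Python/pandas with made-up sample data (the same test is a LEFT JOIN with a null-key filter in SQL):

```python
import pandas as pd

# Hypothetical minimal Orders and Order Lines extracts.
orders = pd.DataFrame({"order_no": ["O100", "O101"],
                       "customer_id": ["C001", "C002"]})
order_lines = pd.DataFrame({"order_no": ["O100", "O100", "O999"],
                            "product_code": ["P10", "P11", "P12"]})

# An 'orphan' order line references an order number that has no parent
# record in Orders, i.e. a referential integrity failure.
check = order_lines.merge(orders[["order_no"]], on="order_no",
                          how="left", indicator=True)
orphans = check[check["_merge"] == "left_only"]
print(orphans[["order_no", "product_code"]])
```

The same pattern, run in the other direction, finds the 'widows' (orders with no customer, orders with no lines, and so on).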
5 Timeliness
Many data components are not age-sensitive; they are either non-changing attributes or fixed 'snapshots' in time. A person's NINO or NHS number is 'static' data, and a sales transaction, once completed (after any corrections), is fixed. However, some elements can become outdated. Again, people-related information is notoriously volatile, with attributes such as name, address and telephone number, as well as 'life stage' (relationship and health status, dependents, income type), liable to change without warning.
If you are an online retailer, your main concerns are keeping prices and inventory up to date (prices can change daily, inventory by the minute), and for this type of business your customers will usually tell you where to deliver when ordering. But if, say, you run a rental business or management service where people data is long term, 6-8% of your customers may change address each year but may not tell you when they do. It's wise to identify the likely rate of change for important attributes and records, and to differentiate between fast- and slow-moving data, so you can design procedures to actively check for change. My earlier example (checking whether customers over a certain age should still be on file) is not that far-fetched for organisations who keep the same customers for decades; and of course, if you use a stored value for 'number of children under 18 living at home', watch out: it can change within a year.
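As a sketch of how fast-moving data might be flagged for re-verification (illustrative Python/pandas again; the 'last_verified' column and the two-year threshold are assumptions for the example, not a rule):

```python
import pandas as pd

# Hypothetical contact records with a 'last_verified' date recording when
# each address was last confirmed with the customer.
contacts = pd.DataFrame({
    "customer_id":   ["C001", "C002", "C003"],
    "address":       ["1 High St", "2 Low Rd", "3 Mid Ln"],
    "last_verified": ["2024-11-01", "2021-05-20", "2018-02-14"],
})

age_days = (pd.Timestamp.today()
            - pd.to_datetime(contacts["last_verified"])).dt.days

# Anything not confirmed within roughly two years is flagged for an
# active re-check; the 730-day threshold is purely illustrative.
stale = contacts[age_days > 730]
print(stale)
```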
6 Validity
Does a data value conform to what you would expect? For example, is it in the right format, within a specified value range, or one of a defined list of values? Additionally, values for some attributes have to obey 'semantic' logic, conforming to specific business rules where there is co-dependence between data values. Once you know what these rules and conditions are, you can apply logic to check validity.
Simple examples of validity include: an Email address must contain text, then '@', then text with at least one '.'; an Exam result percentage must be between 0 and 100; a Customer type may only be 'Personal', 'Business' or 'Charity'. An example of a semantic data rule might be: when the Customer type is 'Personal' or 'Charity', the Credit Limit value must be £0.00, because a business rule states "Only Business Customers can have credit".
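The sketch below shows how those example rules could be expressed as profiling checks (Python/pandas for illustration; the regular expression is a deliberately simple approximation of the 'text, @, text, dot' rule, not a full email validator):

```python
import pandas as pd

# Hypothetical customer extract for validity profiling.
customers = pd.DataFrame({
    "customer_id":   ["C001", "C002", "C003"],
    "email":         ["jo@example.com", "not-an-email", "amy@shop.co.uk"],
    "customer_type": ["Personal", "Business", "Charity"],
    "credit_limit":  [0.00, 500.00, 250.00],
})

# Format rule: text, then '@', then text containing at least one '.'.
bad_email = ~customers["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# List-of-values rule for customer type.
bad_type = ~customers["customer_type"].isin(["Personal", "Business", "Charity"])

# Semantic rule: only Business customers can have a non-zero credit limit.
bad_credit = ((customers["customer_type"] != "Business")
              & (customers["credit_limit"] != 0))

print(customers[bad_email | bad_type | bad_credit])
```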
Validity can be one of the more challenging DQ categories, as multiple rules can apply to a single attribute and detailed business knowledge may be required to define the semantic-type rules. There may be some grey areas where judgement is called for; with numeric data, for example, you might consider a value suspect, but not necessarily incorrect, if it falls outside a norm or statistical margin of error.
In the past (OK, it still happens), when users ran up against deficiencies and limitations in applications, it was commonplace to create workarounds by squeezing more information into a data field than good design dictates. One example is the 'overloaded' field: as a business grew, it was expedient to divide customer accounts into territories, e.g. North and South, but without a specific territory field a bright spark would suggest manually prefixing account IDs with 'N' or 'S' (e.g. N012986 and S034581) so a user would know instantly who 'owned' the account. Of course, customers moved, other territories were created, and maybe the system was later upgraded with a separate territory code. The point here is that such semantic rules may or may not matter for data quality assessment or analytics, and historical 'exception' business logic may need to be taken into account.
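A small sketch of how that kind of legacy convention might be surfaced during profiling (illustrative Python/pandas; the account IDs and the N/S prefix rule simply mirror the hypothetical example above):

```python
import pandas as pd

# Hypothetical accounts where territory was historically 'overloaded' into
# the account id as an N/S prefix (e.g. N012986, S034581).
accounts = pd.DataFrame({"account_id": ["N012986", "S034581", "012777"]})

prefix = accounts["account_id"].str[0]
accounts["legacy_territory"] = prefix.where(prefix.isin(["N", "S"]))

# Rows without a recognised prefix predate (or post-date) the convention
# and need the separate territory field or a manual review instead.
print(accounts)
```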
A Practical and Proportionate approach to DQ
What most enterprises and organisations really need is DQ management that is appropriate to their scale, practical and economical; the last thing they want is an extra layer of IT complexity and cost.
It should be obvious by now that effective DQ management hinges on meticulous attention to detail, guided by straightforward rules that can be translated into logical tests to identify issues. While most data rules are simple and apparent, the more tailored or unique your business processes and data are, the more particular the 'business rules' you need to apply will be. This encapsulates the essence of the DQ industry. Businesses have been doing this type of 'data profiling' for years, typically with basic methods such as scans with SQL scripts and spreadsheet cell formulas.
If your organisation is data-centric, policies are likely in place to foster a strong DQ culture and management system, and key managers and users should be well versed in data governance principles, stewardship, metadata, data catalogues and other DQ-related concepts. However, if you are new to this field, you'll want to understand how to manage your DQ practically without it becoming an overwhelming task.
Ultimately, you want to see evidence to show your data holdings are well managed and identify any areas for improvement. Tip: Compiling an 'Operating Model' document is an invaluable method for outlining the workings of your enterprise. It encapsulates your business's intellectual property (IP) and should certainly include a dedicated section on data holdings, structure, and DQ management.
In Part 2, we will explore straightforward methods to assess your Data Quality and practical steps to elevate it to the required standard. As a brief overview, the work falls into two stages:
Investigation
Action
Kegsoft
Our Kegsoft partnership has deep commercial and technical experience, including data architecture and process design. We are the creators of KBL (Kegsoft Business Language), designed to close the gap between business users and IT specialists. KBL is ideal for building processes to translate, manipulate and interface data, particularly where different technologies have to be bridged, and KBL-based solutions, processes and procedures can easily be turned into tactical (or strategic) applications. Our expertise (and satisfaction) lies in problem solving, particularly 'left-field' thinking and turning around difficult, intricate projects where there is time pressure to deliver.
If Kegsoft (or myself) can be of assistance in an advisory or technical capacity to help you with data quality or architecture, please contact us via www.kegsoft.com or me via LinkedIn.