Data Science Approaches to Data Quality: From Raw Data to Datasets

Crossing the Data Quality Chasm. Diagram: Visual Science Informatics, LLC

Good data quality is a necessary prerequisite to building an accurate Machine Learning (ML) model, in addition to the considerations in "Architectural Blueprints—The “4+1” View Model of Machine Learning."

  • What is data quality?
  • How do you measure data quality?
  • How can you improve your data quality?

The article’s objectives are to:

  • Articulate the challenges in turning the data quality problem into a manageable solution.
  • List recent approaches, techniques, and best practices for managing data quality by organizations, processes, and technologies.
  • Discuss the solutions of these approaches, techniques, and best practices.


Let's start with what data quality is, and what the potential causes of poor data quality are.

  • Data Quality (Goodness of Data)

"Quality data is, simply put, data that meets business needs. There are many definitions of data quality, but data is generally considered high quality if it is 'fit for its intended uses in operations, decision making, and planning' [1] [2]. Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers." [3]

Data Quality Dimensions (Representative). Diagram: Visual Science Informatics, LLC

  • Six Dimensions of Data Quality

Data Quality Assessment (DQA) is the process of scientifically and statistically evaluating data to determine whether it meets the required quality and is of the right type and quantity to support its intended use. Data quality can be assessed against six characteristics:

Determining the quality of a dataset by six factors. Diagram: TechTarget

1) Accuracy

  • The measure or degree of agreement between a data value, or set of values, and a source assumed to be correct.
  • A qualitative assessment of freedom from error.
  • How well does the data reflect reality?

2) Completeness

  • The degree to which values are present in the attributes that require them.
  • Is all the required data present?

3) Consistency/Uniformity

  • Data are maintained so they are free from variation or contradiction and consistent within the same dataset and/or across multiple datasets.
  • The degree to which a set of data satisfies a set of constraints, or is specified using the same unit of measure.
  • Is the data consistent?

4) Timeliness/Currency

  • The extent to which a data item or multiple items are provided at the time required or specified.
  • A synonym for currency, the degree to which specified values are up to date.
  • Is the data up to date?

5) Uniqueness

  • The ability to establish the uniqueness of a data record and data key values.
  • Are all features unique?

6) Validity

  • The degree to which maintained data satisfies the acceptance requirements of the classification criteria, defined business rules, or constraints.
  • A condition where the data values pass all edits for acceptability, producing desired results.
  • Is the data valid?

Data Quality Concept Overview. Mind map: Carsten Oliver Schmidt et al.

DQA can also discover technical issues such as mismatches in data types, different dimensions of data arrays, and a mixture of data values. Data quality issues can often be resolved and maintained by data scientists’ best practices, data governance processes, and data quality management tools.
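For illustration, here is a minimal sketch, assuming a hypothetical customer table, of how several of these dimensions (completeness, uniqueness, validity, and consistency) can be checked programmatically with pandas:

```python
# Minimal sketch of automated checks for several data quality dimensions,
# using a small hypothetical customer table.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@y.com", "not-an-email"],
    "signup_date": ["2024-01-02", "2024-02-30", "2024-03-05", "2024-03-06"],
})

# Completeness: share of non-missing values per column.
completeness = 1 - df.isna().mean()

# Uniqueness: duplicate key values violate the uniqueness dimension.
duplicate_keys = df["customer_id"].duplicated().sum()

# Validity: values must pass an acceptability rule (here, a simple email pattern).
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Consistency: unparseable dates (e.g., 2024-02-30) become NaT and flag format issues.
parsed_dates = pd.to_datetime(df["signup_date"], errors="coerce")

print(completeness)
print("duplicate keys:", duplicate_keys)
print("invalid emails:", (~valid_email).sum())
print("unparseable dates:", parsed_dates.isna().sum())
```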

Data Quality Examples. Table: Unknown Author

  • What are the common data quality problems?

Lack of information standards

  • Different formats and structures across different systems

Data surprises in individual fields

  • Data misplaced in the database

Data myopia

  • Lack of consistent identifiers inhibits a single view

The redundancy nightmare

  • Duplicate records with a lack of standards

Information buried in free-form unstructured fields

Potential sources for poor data quality are:

Data Quality and the Bottom Line. Graph: Wayne Eckerson, Data Warehousing Institute

A significant percentage of the time allocated to a machine learning project goes to data preparation tasks. Data preparation is one of the most challenging and time-consuming processes in ML projects.

Effort distribution of organizations, ML projects, and data scientists on data quality:

Organizations' data quality effort distribution. Graph adapted: Baragoin, Corinne, et al. Mining Your Own Business in Health Care

Organizations spend most of their time understanding data sources and managing data (cleaning, standardizing, and harmonizing).

Percentage of time allocated to ML project tasks. Chart: TechTarget

ML projects spend most of their time on data cleaning (25%), labeling (25%), augmentation (15%), and aggregation (15%).

Data scientists allocated time distribution. Chart: Vincent Tatan

Data scientists spend most of their time on data preparation. Data collection and preparation account for seventy-nine percent (79%) of the time spent on data analytics.

Therefore, it is important for you to know about Data Science Lifecycle (DSL), Data Governance (DG), Data Quality Frameworks (DQFs), Information Architecture (IA), and Data Standards (DSs). After that, it is essential to understand the most common data preparation steps, methods, and techniques.

The Data Science Lifecycle (DSL) refers to the series of steps data scientists follow to extract knowledge and insights from data. It is a structured approach that enables a project to progress efficiently and avoid common pitfalls.

Data Science Lifecycle (DSL). Diagram: Visual Science Informatics

Here is a breakdown of the typical stages in a DSL:

  1. Problem Definition: This initial stage involves clearly defining the business problem you are trying to solve with data science. It includes understanding the goals, success metrics, and any relevant background information.
  2. Data Acquisition and Preparation: Here, you gather the data needed for your analysis. This may involve collecting data from various sources, cleaning and pre-processing the data to ensure its quality and consistency.
  3. Data Exploration and Analysis: In this stage, you explore the data to understand its characteristics, identify patterns, and relationships. Data visualization techniques and statistical analysis are commonly used here.
  4. Model Building and Evaluation: This is where you build a machine learning model or other analytical solution based on the insights gained from exploration. You then evaluate the model's performance to assess its effectiveness in solving the problem.
  5. Deployment and Monitoring: If the model performs well, it is deployed into production where it can be used to make predictions or generate insights. This stage also involves monitoring the model's performance over time to ensure it continues to be effective.

By following a structured DSL, data scientists can ensure their projects are well-defined, efficient, and deliver valuable results that address real-world business problems.

Data Governance (DG) is essentially a set of rules and practices that ensure an organization's data is accurate, secure, and usable. It is similar to having a constitution for your data, outlining how it should be handled throughout its lifecycle, from creation to disposal.

Here is a breakdown of what DG entails:

  • Data policies: Define how data is collected, stored, shared, and disposed of. They establish things like data ownership, access controls, and security measures.
  • Data standards: Ensure consistency in how data is defined, formatted, and documented. Data standards enable the integration of data from different sources and avoid confusion.
  • Data processes: Develop Standard Operation Procedures (SOPs) for handling data, such as cleansing, transformation, and analysis. These processes and procedures ensure data quality and reliability.
  • Data people: Assign Roles and Responsibilities (R&R) within the organization for data governance. It includes a data governance council, data stewards (owners of specific datasets), and data analysts.
  • Data security: Protect data from unauthorized access, breaches, and other threats. Data security measures make sure that information is only accessed by authorized R&R. Data access establishes protocols for granting and controlling access to data (Create, Reference, Update, and Delete as a "CRUD matrix"), internally and externally, in line with classified security requirements, and comply with data privacy regulations.

The benefits of DG are numerous:

  • Improved data quality: Ensures data is accurate, complete, and consistent, leading to better decision-making.
  • Enhanced data security: Protects data from unauthorized access, misuse, and breaches.
  • Increased data accessibility: Enables authorized users to find and use relevant data.
  • Boosted compliance: Helps organizations meet regulatory requirements for data privacy and security.

Data Governance is crucial for organizations that rely on data for informed decision-making. It helps turn enterprise data into a valuable asset and promotes trust in data-driven insights.

Enterprise data is a collection of data from many different sources. It is often described by five Vs characteristics: Volume, Variety, Velocity, Veracity, and Value.

  • Volume: Captures the massive amount of data that is generated. Volume is measured in units such as terabytes (TB), petabytes (PB), or exabytes (EB).
  • Variety: Describes the wide range of data structures and formats. Data can be structured, semi-structured, or unstructured. Data variety makes data analysis complex but more informative. Examples of formats include text, numerical, or binary; categories include qualitative or quantitative; and classifications include text, multimedia, log, and metadata.
  • Velocity: Refers to the speed at which data is generated and needs to be processed. Examples include the constant stream and rapid pace of social media updates, sensor data, or stock market trades, which require analysis in near real-time.
  • Veracity: Affects the accuracy and trustworthiness of the data. With so much data coming from various sources, it is critical to ensure the information is reliable before basing decisions on it.
  • Value: Makes big data important. The goal is to extract meaningful insights from big data, which can help businesses make better decisions, improve customer service, develop new products, and gain a competitive edge. Value is what makes all the other V's worthwhile.

Some definitions also include a sixth V:

  • Variability: Acknowledges the constantly changing nature of data over time. For instance, the meaning of words and phrases can evolve, especially in social media analysis. Data governance needs to account for this variability to ensure accurate analysis.

Enterprise Architecture (EA) as a Strategy. Diagram: Visual Science Informatics

Enterprise Data Architecture (EDA) is the blueprint for how an organization manages its data. It is essentially a high-level plan that defines how data will flow throughout the company, from its origin to its final use in analytics and decision-making. EA creates a foundation for business execution. EA defines your operating model. Integration and Standardization are significant dimensions of operating models.

There are four types of operating models:

  1. Diversification: Independence with shared services
  2. Coordination: Seamless access to shared data
  3. Replication: Standardized independence
  4. Unification: Standardized, integrated process


A Data Quality Framework (DQF) is a structured approach to managing and improving the quality of data within an organization's Data Governance (DG). It provides a set of guidelines, methods, and tools that can be used to assess, monitor, and improve data accuracy, completeness, consistency, and timeliness.

DQF Benefits:

  • Improved decision-making: By ensuring that your data is accurate and reliable, you can make better decisions based on that data.
  • Reduced costs: Data quality issues can lead to a number of costs, such as rework, lost productivity, and missed opportunities. A data quality framework can help you to identify and address these issues, which can save your organization money.
  • Increased efficiency: When your data is clean and consistent, it is easier to work with and analyze. This can lead to increased efficiency and productivity.
  • Improved customer satisfaction: If your data is accurate, you can provide better service to your customers. This can lead to increased customer satisfaction and loyalty.


A Data Quality Maturity Model (DQMM) is a framework that helps organizations assess the maturity of their Data Quality Management (DQM) practices. It provides a structured approach for identifying areas for improvement and developing a roadmap for achieving data quality excellence.

DQMM. Diagram: SSA Analytics Center of Excellence (CoE)

A DQMM typically consists of five maturity levels, each representing a distinct stage in an organization's DQM journey:

Data Maturity Assessment Model. Diagram: HUD

  • Initial: At this level, there is no formal DQM program in place. Data quality issues are identified and addressed on an ad-hoc basis.
  • Recognizing: The organization recognizes the importance of data quality and begins to take some initial steps to improve it. This may involve establishing data quality policies and procedures, or identifying critical data elements.
  • Specifying: The organization defines specific data quality requirements and starts to implement processes to measure and monitor data quality.
  • Managing: The organization has a well-defined DQM program in place, with processes for managing data quality throughout the data lifecycle.
  • Optimizing: The organization continuously monitors and improves its DQM program, using data quality metrics to drive decision-making.

Organizations can use a DQMM to benchmark their current DQM practices against industry best practices and identify areas for improvement. The model can also be used to develop a roadmap for implementing a DQM program or improving an existing one.

DQMM Benefits:

  • Provides a structured approach for assessing data quality maturity
  • Helps to identify areas for improvement
  • Provides a roadmap for implementing or improving a DQM program
  • Helps to benchmark data quality practices against industry best practices
  • Promotes a data-driven approach to decision-making


Information Architecture (IA) is the art and science of organizing and labeling the content of websites, intranets, online communities, and software to make it findable and understandable. IA is different from knowledge and data architecture, but IA must be an integral part of enterprise architecture. It's essentially the blueprint that helps users navigate through information effectively.

Data, Information, and Knowledge concepts. Diagram: Packet

IA Principles:

  • Clarity: The organization of information should be clear and logical, making it easy for users to find what they're looking for.
  • Consistency: The labeling and navigation should be consistent throughout the website or application.
  • Usability: The IA should be designed to be intuitive and easy to use, even for novice users.
  • Accessibility: The IA should be accessible to all users, regardless of their abilities.

Types of Informatics Schemes. Diagram adapted: Louis Rosenfeld & Peter Morville

IA is the organization, labeling, metadata, and navigation schemes within an information system [Information Architecture. Visual Science Informatics]. IA organization schemes include:

  • Metadata
  • Classification vs. Categorization
  • Controlled Vocabulary
  • Taxonomy
  • Thesaurus
  • Ontology
  • Meta-Model

IA Topologies based on IA Organizations. Diagram adapted: Leo Orbst

For example, in health science informatics, there are several vocabularies of vocabularies. To improve your data quality, you need to understand the definitions and visualization of informatics architecture vocabularies.

From Searching to Knowing – Spectrum for Knowledge Representation and Reasoning Capabilities. Diagram adapted: Leo Orbst

  • Gold labels

"Data manually labeled by human beings is often referred to as gold labels, and is considered more desirable than machine-labeled data for analyzing or training models, due to relatively better data quality. This does not necessarily mean that any set of human-labeled data is of high quality. Human errors, bias, and malice can be introduced at the point of data collection or during data cleaning and processing. Check for them before analyzing.

Any two human beings may label the same example differently. The difference between human raters' decisions is called inter-rater agreement. You can get a sense of the variance in raters' opinions by using multiple raters per example and measuring inter-rater agreement." [Google]
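As one brief illustration, inter-rater agreement between two hypothetical raters can be quantified with Cohen's kappa, for example using scikit-learn:

```python
# Minimal sketch: Cohen's kappa for inter-rater agreement between two
# hypothetical human raters labeling the same six examples.
from sklearn.metrics import cohen_kappa_score

rater_a = ["spam", "spam", "ham", "ham", "spam", "ham"]
rater_b = ["spam", "ham",  "ham", "ham", "spam", "spam"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance-level agreement
```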

  • Silver labels

"Machine-labeled data, where categories are automatically determined by one or more classification models, is often referred to as silver labels. Machine-labeled data can vary widely in quality. Check it not only for accuracy and biases but also for violations of common sense, reality, and intention." [Google]

"Data Standards (DSs) are created to ensure that all parties use the same language and the same approach to sharing, storing, and interpreting information. In healthcare, standards make up the backbone of interoperability — or the ability of health systems to exchange medical data regardless of domain or software provider."

In the context of health care, the term data standards encompass methods, protocols, terminologies, and specifications for the collection, exchange, storage, and retrieval of information associated with health care applications, including medical records, medications, radiological images, payment and reimbursement, medical devices and monitoring systems, and administrative processes. [Washington Publishing Company]

Levels of Interoperability in Healthcare. Diagram: Awantika

Levels of Interoperability:

  • Level 1: Foundational (Syntactic)
  • Level 2: Structural (Relationships)
  • Level 3: Semantic (Terminology)
  • Level 4: Organizational (Functional)


It is necessary but not sufficient to have physical and logical compatibility and interoperability. Standards also need to provide an explicit representation of data semantics and lexicon.

Data Standards Define:

  • Structure: Defines general classes and specializations
  • Data Types: Determines structural format of attribute data
  • Vocabulary: Controls possible values in coded attributes

Health Data Standards. Diagram: Visual Science Informatics

Types of Health Data Standards (HDSs):

  • Content: "Define the structure and organization of the electronic message or document’s content. This standard category also includes the definition of common sets of data for specific message types."
  • Identifier: "Entities use identifier standards to uniquely identify patients or providers."
  • Privacy & Security: "Aim to protect an individual's (or organization's) right to determine whether, what, when, by whom and for what purpose their personal health information is collected, accessed, used or disclosed. Security standards define a set of administrative, physical, and technical actions to protect the confidentiality, availability, and integrity of health information."
  • Terminology/Vocabulary: "Health information systems that communicate with each other rely on structured vocabularies, terminologies, code sets, and classification systems to represent health concepts."
  • Transport: "Address the format of messages exchanged between computer systems, document architecture, clinical templates, user interface, and patient data linkage. Standards center on “push” and “pull” methods for exchanging health information."

Source: HIMSS

"Thankless data work zone". Diagram: Jung Hoon Son

"Thankless data work zone". Diagram: Jung Hoon Son

Standard terminologies are helpful, but that does not mean everyone employs or follows them. As a healthcare example, the same column might contain ICD-9, ICD-10, and potentially CPT codes. [Jung Hoon Son]

  • Anonymized Data

Acquiring Real-World Data (RWD) is challenging because regulations such as HIPAA protect sensitive, confidential Protected Health Information (PHI) and Personally Identifiable Information (PII). To use RWD, you must anonymize it by removing all PHI and PII, or de-identify, encrypt, and hide protected data. In addition to data acquisition costs, the HIPAA Privacy Rule's de-identification methods add further cost. These additional costs can include off-the-shelf anonymization tools or hiring a de-identification expert with proven experience. Another barrier is the high cost of medical practitioners with the domain expertise to annotate and label the raw data, images, or audio to train ML models. Although you would think there is plenty of data generated by patient care, you cannot get that data directly. [Open Health Data]

  • Synthetic Data

Synthetic raw data, artificially generated rather than produced by real-world events, must also be preprocessed. Synthetic raw data is created using algorithms. Synthetic data can be artificially generated from real-world data, noisy data, handcrafted data, duplicated data, resampled data, bootstrapped data, augmented data, oversampled data, edge case data, simulated data, univariate data, bivariate data, multivariate data, multimodal data. [Cassie Kozyrkov] Artificial raw data can be deployed to validate mathematical models and train machine learning models. Also, data generated by a computer simulation is considered manufactured raw data. Generative AI (GAI) is the driving force behind the creation of synthetic content. GAI models are trained on massive amounts of data. Once trained, they can generate new synthetic content that is similar to the data they were trained on, such as: text, images, audio, and video.

ML Life Cycle with Synthetic Data. Diagram: Gretel

Synthetic raw data can be utilized for anonymizing data, augmenting data, or optimizing accuracy. Anonymized data can filter information to prevent the compromise of the confidentiality of particular aspects. Augmented data is generated to meet specific needs, conditions, or situations for the simulation of theoretical values, realistic behavior profiles, or unexpected results. If the collected data is imbalanced, has missing values, or is insufficiently numerous, synthetic data can be used to set a baseline, fill gaps, and optimize accuracy.
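As a rough illustration (not a production recipe), one simple way to generate synthetic rows from collected data is resampling with small Gaussian noise; the values below are hypothetical:

```python
# Minimal sketch of noise-based data augmentation: resample collected rows
# with replacement and perturb them to create synthetic samples.
import numpy as np

rng = np.random.default_rng(seed=42)
real = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])  # collected samples (hypothetical)

def augment(samples, n_copies=3, noise_scale=0.05):
    """Resample rows with replacement and add small Gaussian noise."""
    idx = rng.integers(0, len(samples), size=n_copies * len(samples))
    return samples[idx] + rng.normal(0, noise_scale, size=(len(idx), samples.shape[1]))

synthetic = augment(real)
dataset = np.vstack([real, synthetic])  # mix synthetic rows with collected data
print(dataset.shape)                    # (12, 2): 3 real + 9 synthetic rows
```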

Let's continue with understanding the concepts of Goodness of Measurement vs. Goodness-of-Fit, and statistical biases before covering the data preparation process.

  • Goodness of Measurement vs. Goodness-of-Fit

During data acquisition, your data quality depends on your instrument goodness of measurement, statistical bias sources, and your data goodness of fit.

  • Goodness of Measurement: Reliability and Validity

The two most important and fundamental characteristics of any measurement procedure are reliability and validity, which lie at the heart of a competent and effective study. [Accuracy: The Bias-Variance Trade-off]

Bullseye Diagram: The distribution of model predictions. Diagram adapted: Domingo

Validity is a test of how well an instrument that is developed measures the particular concept it is intended to measure. In other words, validity is concerned with whether we measure the right concept or not.

An instrument is considered reliable if the measurement device or procedure consistently assigns the same score to individuals or objects with equal values. In other words, the reliability of a measure indicates the extent to which it is without bias and hence ensures consistent measurement across time and across the various items in the instrument.

Several types of validity and reliability tests are used to test the goodness of measures. Data scientists use various forms of reliability and validity and different terms to denote them.

Goodness of Data Measurement - Forms of reliability and validity. Diagram: Shweta Bajpai et al.

  • Statistical Bias

“A major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables.”

Statistical bias is a systematic tendency that causes differences between results and facts. Statistical bias may be introduced at all stages of data analysis: data selection, hypothesis testing, estimator selection, analysis methods, and interpretation. [Interpretability/Explainability: “Seeing Machines Learn”]

Statistical bias sources from stages of data analysis. Diagram: Visual Science Informatics, LLC

Systematic error (bias) introduces noisy data with high bias but low variance. Although measurements are inaccurate (not valid), they are consistent (reliable). Repeatable systematic error is associated with faulty equipment or a flawed experimental design and influences a measurement's accuracy.

Reproducibility error (variance) introduces noisy data with low bias but high variance. Although measurements are accurate (valid), they are inconsistent (not reliable). This error is due to the measurement process and primarily influences a measurement's precision. Reproducibility refers to the variation in measurements made on a subject under changing conditions.

  • Sample Size

"Sample complexity of a machine learning algorithm represents the number of training samples it needs in order to successfully learn a target function." The two main categories are probability sampling and non-probability sampling methods. In probability sampling, every member of the target population has a known chance of being included. This allows researchers to make statistically significant inferences about the population from the sample. Non-probability sampling methods are used when it's not possible or practical to get a random sample, but they limit the ability to generalize the findings to the larger population. [Scenarios: Which Machine Learning (ML) to choose?]

Sampling Methods and Techniques. Diagram: Asad Naveed

Understanding your ML model should start with the data collection, transformation, and processing because, otherwise, you will get “Garbage In, Garbage Out” (GIGO).

  • Goodness-of-Fit

"The term goodness-of-fit refers to a statistical test that determines how well sample data fits a distribution from a population with a normal distribution. Put simply, it hypothesizes whether a sample is skewed or represents the data you would expect to find in the actual population." [4]
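A minimal sketch of such a check, assuming a hypothetical numeric sample, uses SciPy's Kolmogorov-Smirnov and Shapiro-Wilk tests against a normal distribution:

```python
# Minimal sketch of a goodness-of-fit check for normality using SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)  # hypothetical measurements

# Kolmogorov-Smirnov test against a normal distribution fitted to the sample.
ks_stat, ks_p = stats.kstest(sample, "norm", args=(sample.mean(), sample.std()))

# Shapiro-Wilk test for normality.
shapiro_stat, shapiro_p = stats.shapiro(sample)

print(f"KS p-value: {ks_p:.3f}, Shapiro-Wilk p-value: {shapiro_p:.3f}")
# Small p-values suggest the sample does not fit the assumed distribution.
```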

  • Data Quality Scorecard (DQS)

A Data Quality Scorecard (DQS) is a tool used to measure and track the health of your data. It provides a quick and clear way to understand how well your data meets specific quality dimensions.

Data Quality Scorecard Repository Executive Summary Dashboard. Graph: Proceedings of the MIT 2007 Information Quality Industry Symposium

DQS offers:

  • Measurement: A DQS translates complex data quality metrics into a score or set of scores. These scores represent how well your data adheres to various quality dimensions like accuracy, completeness, consistency, and timeliness.
  • Tracking: A scorecard allows you to track these quality measurements over time. This helps identify trends and monitor the effectiveness of any data quality improvement initiatives.
  • Context: DQS reports go beyond just a number. They provide context for interpreting the scores by highlighting specific data elements or areas that are falling short of quality expectations. This helps data users understand the limitations of the data and make informed decisions.

DQS Benefits:

  • Improved Transparency: DQS fosters trust in data by providing clear visibility into its quality.
  • Prioritization: It helps identify the most critical data quality issues for your organization to focus on.
  • Communication: The scorecard provides a common language for discussing data quality across different teams within an organization.
  • Data-driven Decisions: By understanding data quality, businesses can make more informed decisions based on reliable information.

There are different ways to implement a DQS. Some data management platforms offer built-in DQS features, while others may require custom development. The specific design of your DQS will depend on your organization's needs and the type of data you manage.
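As a simple, hypothetical sketch of such a custom implementation, a scorecard can be computed as a weighted roll-up of per-dimension scores; the dimensions, weights, and values below are illustrative only:

```python
# Minimal sketch of a data quality scorecard: weighted roll-up of per-dimension scores.
import pandas as pd

scorecard = pd.DataFrame({
    "dimension": ["accuracy", "completeness", "consistency", "timeliness"],
    "score":     [0.92,       0.87,           0.95,          0.70],   # 0..1, hypothetical
    "weight":    [0.4,        0.3,            0.2,           0.1],    # business priorities
})

scorecard["weighted"] = scorecard["score"] * scorecard["weight"]
overall = scorecard["weighted"].sum() / scorecard["weight"].sum()

print(scorecard)
print(f"Overall data quality score: {overall:.2f}")
```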

  • Data Quality Dashboard (DQD)

A Data Quality Dashboard is a visual representation of the health of your data using key metrics and charts. It provides a centralized view of various data quality tests and their results, allowing you to monitor data quality trends and identify areas needing improvement.

Data Quality Dashboard. Chart: iMerit

DQD Elements:

1. Data Quality Dimensions:

The dashboard should represent various data quality dimensions like accuracy, completeness, consistency, timeliness, and validity.

2. Data Quality Tests:

Each dimension can be broken down into specific data quality tests, e.g., null value tests, freshness checks, and uniqueness tests (refer to the list of tests in the Data Quality Tests section discussed later).

3. Data Visualization:

The dashboard should use charts and graphs to represent the results of each test. Examples include:

  • Bar charts: Compare null value percentages across different columns.
  • Line graphs: Track data freshness over time (e.g., how long it takes to update sales data).
  • Scatter plots or scattergram charts: Visualize data distribution (e.g., percentage of customers in each state).

4. Key Performance Indicators (KPIs) & Service Level Indicators (SLIs):

KPIs and SLIs can be incorporated to set specific targets for data quality.

  • KPIs: Overall data quality score for a specific data source or table.
  • SLIs: Define acceptable thresholds for metrics like freshness (data shouldn't be older than 24 hours) or volume (number of customer records shouldn't exceed 1 million).

5. Alerts & Notifications:

The dashboard can be configured to send alerts or notifications when data quality metrics fall outside acceptable ranges, prompting investigation and corrective action.
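A minimal sketch of such threshold-based alerting, with hypothetical table metadata and the SLIs mentioned above, might look like this:

```python
# Minimal sketch of freshness and volume SLI checks that could feed dashboard alerts.
from datetime import datetime, timedelta, timezone

last_loaded_at = datetime.now(timezone.utc) - timedelta(hours=30)  # from pipeline metadata (hypothetical)
row_count = 1_250_000                                              # from the warehouse (hypothetical)

FRESHNESS_SLI = timedelta(hours=24)   # data should be no more than 24 hours old
VOLUME_SLI = 1_000_000                # customer records should not exceed 1 million

alerts = []
if datetime.now(timezone.utc) - last_loaded_at > FRESHNESS_SLI:
    alerts.append("Freshness SLI violated: data is older than 24 hours")
if row_count > VOLUME_SLI:
    alerts.append(f"Volume SLI violated: {row_count:,} rows exceed the 1M limit")

for alert in alerts:
    print("ALERT:", alert)   # in practice, route to email, Slack, or a pager
```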

DQD Benefits:

  • Improved Visibility: Provides a clear view of data health across different dimensions.
  • Proactive Monitoring: Enables early detection of data quality issues.
  • Prioritization: Helps identify critical areas for data quality improvement.
  • Communication: Facilitates communication of data quality status across teams.
  • Data-driven Decisions: Supports data-driven decision making based on reliable information.

DQD Design Requirements:

  • Identify data sources & quality needs: Start by understanding your data sources and the specific quality requirements for each.
  • Select data quality tests: Choose the most relevant tests based on your data and needs (refer to the list of tests in the Data Quality Tests section discussed later).
  • Choose visualization tools: Select appropriate data visualization tools to represent test results effectively.
  • Set KPIs & SLIs: Define clear KPIs and SLIs to measure success.
  • Schedule updates: Determine how often the dashboard will be updated with fresh data.

By implementing a DQD, you can proactively manage your data quality and verify that it meets your organization's needs.

There is a high cost of poor data quality to the success of your ML model. You will need a systematic method to improve your data quality. Most of the work is in your data preparation, and consistency is key to data quality.

  • Data Science 'Hierarchy of Needs'

"Collecting better data, building data pipelines, and cleaning data can be tedious, but it is very much needed to be able to make the most out of data." The Data Science Hierarchy of Needs, by Sarah Catanzaro, is a checklist for "avoiding unnecessary modeling or improving modeling efforts with feature engineering or selection." [Serg Masis]

Data Science Hierarchy of Needs. Diagram: Serg Masis

Note: The "Data Science Hierarchy of Needs" is attributed to Monica Rogati; Sarah Catanzaro is known for revisiting and discussing this concept, but the original idea is credited to Rogati.

  • Bloom’s Taxonomy Adapted for ML

Bloom's Revised Taxonomy. Diagram: Vanderbilt University Center for Teaching

"Bloom's taxonomy is a set of three hierarchical models used for the classification of educational learning objectives into levels of complexity and specificity. The three lists cover the learning objectives in the cognitive, affective, and psychomotor domains. There are six levels of cognitive learning according to the revised version of Bloom's Taxonomy. Each level is conceptually different. The six levels are?remembering, understanding, applying, analyzing, evaluating, and creating." [Anderson & Krathwohl, 2001, pp. 67-68]

This Bloom's taxonomy was adapted for machine learning.

Bloom’s Taxonomy Adapted for ML. Diagram: Visual Science Informatics, LLC

There are six levels of model learning in the adapted version of Bloom's Taxonomy for ML. Each level is a conceptually different learning model. The levels are ordered from lower-order learning to higher-order learning. The six levels are Store, Sort, Search, Descriptive, Discriminative, and Generative. Bloom's Taxonomy terms adapted for ML are defined as:

  • Store?models capture three perspectives: Physical, Logical, and Conceptual data models. Physical data models describe the physical means by which data are stored. Logical data models describe the semantics represented by a particular data manipulation technology. Conceptual data models describe a domain's semantics in the model's scope. Extract, Transform, and Load (ETL) operations are a three-phase process where data is extracted, transformed, and loaded into store models. Collected data can be from one or more sources. ETL data can be stored in one or more models.
  • Sort?models arrange data in a meaningful order and systematic representation, which enables searching, analyzing, and visualizing.
  • Search?models solve a search problem to retrieve information stored within some data structure, or calculated in the search space of a problem domain, either with discrete or continuous values.
  • Descriptive?models specify statistics that quantitatively describe or summarize features and identify trends and relationships.
  • Discriminative?models focus on a solution and perform better for classification tasks by dividing the data space into classes by learning the boundaries.
  • Generative?models understand how data is embedded throughout space and generate new data points.


Next, you should check and analyze your data even before you train a model because you might discover data quality issues in your data. Identifying common data quality issues such as missing data, duplicated data, and inaccurate, ambiguous, or inconsistent data can help you find data anomalies and perform feature engineering.

Data Quality Strategy Outline. Flowchart: Gary McQuown

Crossing the data quality chasm from raw data to a good quality dataset requires you to consider the full equation of objectives, cause, assessment, and techniques. A Data Quality Assessment (DQA) identifies potential causes of poor data quality. These data quality causes can link to your objectives to improve data quality. These objectives and the assessment can help you attain and select techniques that resolve your data quality causes.

Crossing the data quality chasm from raw data to a good quality dataset. Diagram: Visual Science Informatics, LLC

  • ABC of Data Science

ML is a form of Artificial Intelligence (AI), which makes predictions and decisions from data. It is the result of training algorithms and statistical models to analyze and draw inferences from patterns in data, which are able to learn and adapt without following explicit instructions. However, you need to:

  1. Check - double check your Assumptions,
  2. Mitigate - make sure you mitigate your Biases, and
  3. Validate - take your time to validate your Constraints.

The Assumptions, Biases, and Constraints (ABC) of data science, Data, and Models of ML can be captured in this formula:

Machine Learning = {Assumptions/Biases/Constraints, Data, Models}


Let’s view the data pipeline workflow. After articulating a problem statement and defining the required data, the three main phases of an initial data pipeline are Data Acquisition, Data Exploration, and Data Preparation.

Traditional Machine Learning Workflow. Diagram adapted: Jankiram Msv

  • Data Preparation

Data preparation (preprocessing) is the process of cleaning, reducing, and transforming raw data before training a machine learning model. The data preparation process includes, for example, standardizing data formats, correcting errors, enriching source data, and removing outliers and biases. Common data preprocessing tasks include reformatting data, making corrections to data, and combining datasets to enrich data. Each data preparation step is essential and requires specific methods, techniques, and functionalities. Although data preparation is a lengthy and tedious process that takes a significant amount of time and effort, it mitigates "garbage in, garbage out" and enhances model performance because preprocessing improves data quality.

  • Feature Engineering

Feature engineering (also called feature extraction or feature discovery) is the use of domain knowledge to extract or create features (characteristics, properties, and attributes) from raw data. If feature engineering is performed effectively, it improves the machine learning process. Consequently, it increases the predictive power and improves ML model accuracy, performance, and quality of results by creating extra features that effectively represent the underlying model. The feature engineering step is a fundamental part of the data pipeline, which leverages data preparation, in the machine learning workflow.

  • Data Wrangling

Data wrangling is the process of restructuring, cleaning, and enriching raw data into the preferred format for easy access and analysis. The data preparation (preprocessing) steps are performed once, before any iterative model training, validation, and evaluation, while the wrangling process is performed at the time of feature engineering during iterative analysis and model building. For instance, data cleaning focuses on removing erroneous data from your dataset. In contrast, data wrangling focuses on changing a data format by translating raw data into a usable form and structure, as in the sketch below.
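For example, a minimal wrangling sketch might reshape a wide, raw export into a tidy long format; the table contents and column names are hypothetical:

```python
# Minimal sketch of a wrangling step: reshape wide raw data into a tidy long format.
import pandas as pd

wide = pd.DataFrame({
    "patient_id": [101, 102],
    "bp_2024_01": [120, 135],   # blood pressure readings by month (hypothetical)
    "bp_2024_02": [118, 130],
})

tidy = wide.melt(id_vars="patient_id", var_name="measurement", value_name="bp")
tidy[["metric", "year", "month"]] = tidy["measurement"].str.split("_", expand=True)
tidy = tidy.drop(columns=["measurement", "metric"])
print(tidy)   # one row per patient per month, ready for analysis
```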

In this section, we will focus on the Data Preparation (Preprocessing) phase. Within this phase, there are six key data preprocessing steps.

Data preprocessing steps. Diagram: TechTarget

  • Steps for Data Preprocessing

1. Data profiling is the process of examining, analyzing, and reviewing data as part of a data quality assessment. The assessment objectives are to discover and investigate structure, content, and relationship data quality issues. Before any modification of a collected dataset, data scientists should baseline a data inventory by surveying and summarizing the data characteristics. Additionally, data scientists can generate a descriptive statistics summary that quantitatively describes or summarizes features from the collection of datasets (see the sketch after this list).

2. Data cleansing is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

3. Data reduction is the process that reduces the volume of original raw data. Data reduction techniques are used to obtain a reduced representation while maintaining the integrity of the original raw data.

4. Data transformation is the process of converting, restructuring, and mapping data from one format into a more usable form, typically from the format of a source system into the required format and structure of a destination system.

5. Data enrichment or augmentation enhances existing information by supplementing missing or incomplete data. Data enrichment can append and expand collected raw data with relevant context obtained from external data sources.

6. Data validation, in data preprocessing (as distinct from model validation during training), is checking that data modifications (cleansing, reduction, transformation, and enrichment/augmentation) are both correct and useful.
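Referring back to step 1, here is a minimal profiling sketch that baselines a data inventory and a descriptive-statistics summary before any modification; the source file and columns are hypothetical:

```python
# Minimal sketch of data profiling: baseline structure, content, and summary statistics.
import pandas as pd

df = pd.read_csv("raw_data.csv")    # hypothetical collected dataset

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),          # structure: data types per column
    "non_null": df.notna().sum(),            # content: populated values
    "missing_pct": df.isna().mean().round(3),
    "unique_values": df.nunique(),
})
print(profile)                               # data inventory baseline
print(df.describe(include="all"))            # descriptive statistics summary
```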

  • Data Version Control (DVC)

Note that throughout the data preprocessing steps, we highly recommend deploying Data Version Control (DVC) utilizing data versioning tools for MLOps to help you track changes to all your datasets: raw, training, testing, evaluation, and validation. [Operations: MLOps, Continuous ML, & AutoML] Moreover, DVC allows version control of model artifacts, metadata, notations, and models. Furthermore, data must be properly labeled and defined to be meaningful. Metadata, the information describing the data, must be accurate and clear. Lastly, data preprocessing issues must be logged, tracked, and categorized. Data preprocessing issues can be captured by major data categories, such as quantities, encoded data, structured text, and free text.

  • Data Preprocessing Methods

Data preprocessing methods, which enhance data quality, include data cleaning, data transformation, and data reduction.

Data Preprocessing Methods. Diagram: Yulia Gavrilova & Olga Bolgurtseva

Data cleaning methods handle missing values by discarding them or applying imputation methods to replace missing data with inferred values. They also detect and mitigate biases (imbalances) in the data. Additionally, data cleaning methods remove noise, outliers, duplicate/redundant records, inconsistencies, and false null values from data. Furthermore, data cleaning methods ensure data values fall within defined domains, resolve conflicts in data, ensure proper definition and use of data values, and establish and apply data standards.

Data transformation methods include scaling/normalization, standardization, smoothing, and pivoting. Also, they include attribute selection, discretization of numeric variables into categorical/encoded attributes, decomposing categorical attributes, reframing numerical quantities, and concept hierarchy generation. In the case of a highly skewed distribution, a log transform spreads out the curve to resemble a less skewed Gaussian distribution and reduces the complexity of a model. [Complexity - Time, Space, & Sample]

Data reduction methods include decomposition, aggregation, and partitioning. Also, they include attribute subset selection and numerosity, dimensionality, and variance threshold reduction. Data sampling and partitioning can reduce a large amount of data by techniques such as sampling with/without replacement, stratified and progressive sampling, or randomly split data.
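As a small sketch of a transformation and reduction pipeline on synthetic data, the example below applies a log transform to a skewed feature, standardizes all features, and drops a near-constant column with a variance threshold:

```python
# Minimal sketch: log transform and standardization (transformation),
# then variance-threshold feature selection (reduction), on synthetic data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.lognormal(mean=0.0, sigma=1.0, size=200),   # highly skewed feature
    rng.normal(size=200),                           # roughly Gaussian feature
    np.full(200, 3.0),                              # constant, uninformative feature
])

X[:, 0] = np.log1p(X[:, 0])                         # log transform spreads out the skewed curve
X_scaled = StandardScaler().fit_transform(X)        # zero mean, unit variance

X_reduced = VarianceThreshold(threshold=1e-8).fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)               # the constant column is dropped
```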

Data augmentation techniques can be applied in case of data scarcity. Data augmentation generates synthetic data that can be mixed with the collected data to enhance generalization accuracy and performance.

Data preprocessing tasks for building operational data analysis. Diagram: Fan Chang, et al. [5]

  • Types of Missing Values and Imputation Techniques

Missing values are a common challenge in data analysis. Understanding the different types of missing data is crucial for selecting the appropriate imputation method.

Types of Missing Values and Suitable Imputation Techniques. Table: Gemini

Note: While KNN is listed under both MCAR and MAR, its effectiveness can vary based on the specific dataset and missing data pattern.

Important Considerations:

  • Imputation is generally not recommended for MNAR data.
  • Always explore the reasons for missing data before imputing.
  • Evaluate the impact of imputation on your analysis.
  • Consider using multiple imputation methods and comparing results.

Additional Tips:

  • The choice of imputation technique should be based on the specific characteristics of the dataset, such as data type, distribution, and the amount of missing data.
  • If you have a large amount of missing data, consider removing the variable or observations with missing values.
  • Explore the distribution of missing values to gain insights into the missing data pattern.
  • Use domain knowledge to inform the imputation process.

By carefully considering the type of missing data and the characteristics of your dataset, you can choose the most appropriate imputation method to improve the quality of your analysis.
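To make this concrete, here is a minimal sketch of two common techniques from scikit-learn, mean imputation (often applied under MCAR) and KNN imputation (often applied under MAR), on a small hypothetical matrix:

```python
# Minimal sketch of mean imputation and KNN imputation with scikit-learn.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # column-mean fill
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)        # fill from nearest rows

print(mean_imputed)
print(knn_imputed)
```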


  • Data Quality Tests

Essential Data Quality Tests. List: Tim Osborn

There are numerous data quality tests, such as:

1. NULL Values Test:

  • Description: Checks for missing data entries represented by NULL values in columns.
  • Why it matters: Excessive NULL values can hinder analysis and lead to inaccurate results.
  • Example: Identifying the percentage of NULL values in a "customer email" column.

2. Freshness Checks:

  • Description: Assesses how recently data has been updated.
  • Why it matters: Stale data can lead to poor decision-making.
  • Example: Verifying if yesterday's sales data has been loaded into the system today.

3. Freshness SLIs (Service Level Indicators):

  • Description: Defines acceptable timeframes for data updates.
  • Why it matters: SLIs provide a benchmark for data freshness expectations.
  • Example: Setting an SLI that sales data should be no more than 24 hours old.

4. Volume Tests:

  • Description: Analyzes the overall amount of data in a dataset or table.
  • Why it matters: Extremely high or low volumes can indicate data integration issues.

5. Missing Data:

  • Description: Broader category encompassing NULL values and any other types of missing entries.
  • Why it matters: Missing data can skew analysis and limit the usefulness of your data.
  • Example: Identifying rows missing customer addresses or product descriptions.

6. Too Much Data:

  • Description: Checks for situations where data volume exceeds expectations or storage capacity.
  • Why it matters: Excessive data can slow down processing and increase storage costs.

7. Volume SLIs:

  • Description: Similar to Freshness SLIs, but define acceptable data volume ranges.
  • Why it matters: Ensures data volume stays within manageable and expected levels.
  • Example: Setting an SLI that the customer table shouldn't contain more than 1 million records.

8. Numeric Distribution Tests:

  • Description: Analyzes the distribution of numerical values in a column (e.g., average, standard deviation).
  • Why it matters: Identifies outliers or unexpected patterns in numerical data.
  • Example: Checking for negative values in a "price" column, which could indicate errors.

9. Inaccurate Data:

  • Description: A broader category encompassing various tests to identify incorrect data entries.
  • Why it matters: Inaccurate data leads to misleading conclusions and poor decision-making.
  • Example: Verifying if a customer's email address follows a valid email format.

10. Data Variety:

  • Description: Assesses the range of unique values present in a categorical column.
  • Why it matters: Limited data variety can restrict the insights you can extract from the data.
  • Example: Ensuring a "customer state" column captures all possible state abbreviations.

11. Uniqueness Tests:

  • Description: Checks for duplicate rows or entries within a dataset.
  • Why it matters: Duplicates inflate data volume and can skew analysis results.
  • Example: Identifying duplicate customer records based on email address or phone number.

12. Referential Integrity Tests:

  • Description: Verifies that foreign key values in one table reference existing primary key values in another table (relational databases).
  • Why it matters: Ensures data consistency and prevents orphaned records.
  • Example: Checking if "order ID" in the order details table corresponds to a valid "order ID" in the main orders table.

13. String Patterns:

  • Description: Evaluates text data for specific patterns or formats (e.g., postal codes, phone numbers).
  • Why it matters: Ensures consistency and facilitates data analysis based on specific patterns.
  • Example: Validating if all phone numbers in a "customer phone" column follow a standard format (e.g., +1-###-###-####).

By implementing these data quality tests, you can improve your data accuracy, completeness, consistency, timeliness, uniqueness, and validity, leading to better decision-making and improved outcomes.
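Here is a minimal sketch of a few of the tests above (NULL values, uniqueness, referential integrity, and string patterns), implemented with pandas on hypothetical customers and orders tables:

```python
# Minimal sketch of several data quality tests with pandas.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "phone": ["+1-202-555-0101", "555-0102", None]})
orders = pd.DataFrame({"order_id": [10, 11, 11, 12],
                       "customer_id": [1, 2, 2, 99]})

# NULL values test: percentage of missing phone numbers.
null_pct = customers["phone"].isna().mean()

# Uniqueness test: duplicate order IDs.
duplicate_orders = orders["order_id"].duplicated().sum()

# Referential integrity test: orders pointing at non-existent customers.
orphans = ~orders["customer_id"].isin(customers["customer_id"])

# String pattern test: phone numbers that do not match +1-###-###-####.
bad_phones = ~customers["phone"].str.match(r"^\+1-\d{3}-\d{3}-\d{4}$", na=True)

print(f"NULL phones: {null_pct:.0%}, duplicate order IDs: {duplicate_orders}, "
      f"orphaned orders: {orphans.sum()}, malformed phones: {bad_phones.sum()}")
```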


  • Interactive Visualization Tools

Raw data for ML modeling might require massive amounts of data that can be difficult and slow to sort through, explore, and process. Identifying common data quality issues such as missing data, duplicated data, and inaccurate, ambiguous, or inconsistent data can help you find data anomalies and perform feature engineering. Interactive visualization lets you establish a visual baseline, explore vast amounts of data, examine data quality, and harmonize data.

TensorFlow Data Validation and Visualization Tools. Table: TensorFlow.org

TensorFlow Data Validation provides tools for visualizing the distribution of feature values. By examining these you can identify your data distribution, scale, or label anomalies.
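As a brief, hedged sketch of a typical TFDV workflow (the exact API may vary by version), assuming a hypothetical training CSV:

```python
# Minimal sketch of TensorFlow Data Validation: statistics, schema inference,
# and anomaly detection on a hypothetical dataset.
import pandas as pd
import tensorflow_data_validation as tfdv

df = pd.read_csv("training_data.csv")                 # hypothetical training set

stats = tfdv.generate_statistics_from_dataframe(df)   # descriptive statistics
tfdv.visualize_statistics(stats)                      # interactive distribution plots (notebook)

schema = tfdv.infer_schema(stats)                     # expected types, domains, and presence
anomalies = tfdv.validate_statistics(stats, schema)   # compare data against the schema
tfdv.display_anomalies(anomalies)                     # report drift, missing values, etc.
```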

OpenRefine (previously Google Refine). Table: OpenRefine.org

OpenRefine data tool capabilities include data exploration, cleaning and transforming, and reconciling and matching data.

WinPure Clean & Match. Table: WinPure.com

WinPure data tool capabilities include statistical data profiling, data quality issues discovery, cleaning processes (clean, complete, correct, standardize, and transform), and data matching reports and visualizations.

Data Preparation and Cleaning in Tableau. Animation: Tableau

This animated process illustrates data-cleaning steps:

Step 1: Remove duplicate or irrelevant observations

Step 2: Fix structural errors

Step 3: Filter unwanted outliers

Step 4: Handle missing data

Step 5: Validate and QA

Data cleaning tools and software for efficiency. Animation: Tableau

Data preparation with data wrangling tool. Table: Trifacta

Data Wrangler is a handy tool for data scientists and analysts who use Visual Studio (VS) Code and VS Code Jupyter Notebooks to work with tabular data. It streamlines the data cleaning and exploration process, allowing you to focus on the insights your data holds.

Example of opening Data Wrangler from the notebook to analyze and clean the data with the built-in operations. Then the automatically generated code is exported back into the notebook. Animation: code.visualstudio.com

Here is a breakdown of what Data Wrangler offers:

Key Features:

  • Visual Data Exploration: A user-friendly interface lets you view and analyze your data.
  • Data Insights: Get insightful statistics and visualizations for your data columns.
  • Automatic Pandas Code Generation: As you clean and transform your data, Data Wrangler automatically generates the corresponding Pandas code.
  • Easy Code Export: Export the generated Pandas code back to your Jupyter Notebook for reusability.

Working with Data Wrangler:

  • Operation Panel: This panel provides a list of built-in operations you can perform on your data.
  • Search Function: Quickly find specific columns in your dataset using the search bar.

Viewing and Editing Modes:

  • Viewing Mode: Optimized for initial exploration, allowing you to sort, filter, and get a quick overview of your data.
  • Editing Mode: Designed for data manipulation. As you apply transformations and cleaning steps, Data Wrangler generates Pandas code in the background.

Overall, Data Wrangler simplifies data preparation in VS Code by providing a visual interface and automating code generation. This saves you time and effort, letting you focus on analyzing your data and extracting valuable insights.


  • Conceptual Framework for Health Data Harmonization

Data harmonization is "the process of comparing two or more data component definitions and identifying commonalities among them that warrant they are being combined, or harmonized, into a single component." Harmonization is often a complex and tedious operation but is an important antecedent to data analysis, as it increases the sample size and analytic utility of the data. However, typical harmonization efforts are ad hoc, which can lead to poor data quality or delays in the data release. To date, we are not aware of any efforts to formalize data harmonization using a pipeline process and techniques to easily visualize and assess the data quality prior to and after harmonization. [6] [Operations: MLOps, Continuous ML, & AutoML]

Conceptual Framework for Health Data Harmonization. Diagram: Lewis E. Berman, ICF International, & Yair G. Rajwan, Visual Science Informatics, LLC

Next, read the "Architectural Blueprints—The “4+1” View Model of Machine Learning" article at https://www.dhirubhai.net/pulse/architectural-blueprintsthe-41-view-model-machine-rajwan-ms-dsc.

---------------------------------------------------------

[1] Protocol for a systematic review and qualitative synthesis of information quality frameworks in eHealth

[2] Information Quality Frameworks for Digital Health Technologies: Systematic Review

[3] Data Quality

[4] What Is Goodness-of-Fit?

[5] A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge Discovery From Building Operational Data

[6] A Conceptual Framework for Health Data Harmonization

Next, read the "Architectural Blueprints—The “4+1” View Model of Machine Learning" article at?https://www.dhirubhai.net/pulse/architectural-blueprintsthe-41-view-model-machine-rajwan-ms-dsc

回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了