Data Science Approaches to Data Quality: From Raw Data to Datasets
Crossing the Data Quality Chasm. Diagram: Visual Science Informatics, LLC

Good data quality is a necessary prerequisite to building an accurate Machine Learning (ML) model, in addition to the "Architectural Blueprints—The “4+1” View Model of Machine Learning."

ML Architectural Blueprints = {Scenarios, Accuracy, Complexity, Interpretability, Operations}

  • What is data quality?
  • How do you measure data quality?
  • How can you improve your data quality?

The article’s objectives are to:

  • Articulate the challenges in turning the data quality problem into a manageable solution.
  • List recent approaches, techniques, and best practices for managing data quality by organizations, processes, and technologies.
  • Discuss the solutions of these approaches, techniques, and best practices.

Let's start with what data quality is, and what the potential causes of poor data quality are.

Data Quality (Goodness of Data)

“Quality data is, simply put, data that meets business needs. There are many definitions of data quality, but data is generally considered high quality if it is ‘fit for its intended uses in operations, decision making, and planning.’ [1] [2] Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers.” [3]

Data Quality Dimensions (Representative). Diagram: Visual Science Informatics, LLC

Dimensions of Data Quality

Data Quality Assessment (DQA) is the process of scientifically and statistically evaluating data to determine whether it meets the required quality and is of the right type and quantity to support its intended use. Data quality can be assessed against six characteristics:

Determining the quality of a dataset takes numerous factors. Diagram: TechTarget

1) Accuracy

  • The measure or degree of agreement between a data value, or set of values, and a source assumed to be correct.
  • A qualitative assessment of freedom from error.
  • How well does the data reflect reality?

2) Completeness

  • The degree to which values are present in the attributes that require them.
  • Is all the required data present?

3) Consistency (Uniformity)

  • Data are maintained so they are free from variation or contradiction and consistent within the same dataset and/or across multiple datasets.
  • The measure of the degree to which a set of data satisfies a set of constraints or is specified using the same unit of measure.
  • Is the data consistent?

4) Validity

  • The maintained data is rigorous enough to satisfy the acceptance requirements of the classification criteria, defined business rules, or constraints.
  • A condition where the data values pass all edits for acceptability, producing desired results.
  • Is the data valid?

5) Integrity

- Entity Integrity:

  • Each record in a database table must have a unique identifier (primary key).
  • This prevents duplicate records and ensures that each entity is represented only once.

- Referential Integrity:

  • Relationships between tables must be defined and maintained.
  • Foreign keys in one table must reference valid primary keys in another table.
  • This ensures data consistency across multiple tables.

- Domain Integrity:

  • Data values must conform to predefined data types and constraints.
  • For example, a "Date of Birth" field should only accept valid date formats.

- User-Defined Integrity:

  • Custom rules and constraints can be defined to enforce specific business rules.
  • These rules can be complex and tailored to unique organizational needs.

6) Uniqueness

  • The ability to establish the uniqueness of a data record and data key values.
  • Are all records and key values unique?

Some DQAs also include additional dimensions such as:

7) Timeliness (Currency)

  • The extent to which a data item or multiple items are provided at the time required or specified.
  • A synonym for currency, the degree to which specified values are up to date.
  • Is the data up to date?
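
To make these dimensions measurable in practice, the sketch below computes a few of them on a hypothetical customer table with pandas. The column names, the email pattern used for validity, and the 365-day freshness window are illustrative assumptions, not standards.

```python
# A minimal sketch, assuming a pandas DataFrame with these illustrative columns.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "updated_at": pd.to_datetime(["2024-06-01", "2024-06-02", "2023-01-15", "2024-06-03"]),
})

# Completeness: share of non-null values per column.
completeness = df.notna().mean()

# Uniqueness: share of non-duplicated primary keys (entity integrity).
uniqueness = 1 - df["customer_id"].duplicated().mean()

# Validity: share of values conforming to a defined rule (a simple email pattern here).
validity = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()

# Timeliness: share of records refreshed within an assumed 365-day window.
cutoff = pd.Timestamp("2024-06-30") - pd.Timedelta(days=365)
timeliness = (df["updated_at"] >= cutoff).mean()

print(completeness, uniqueness, validity, timeliness, sep="\n")
```

Each check yields a score between 0 and 1, which can later feed a Data Quality Scorecard or dashboard.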

Data Quality Concept Overview. Mind map: Carsten Oliver Schmidt et al.

DQA can also discover technical issues such as mismatches in data types, different dimensions of data arrays, and a mixture of data values. Data quality issues can often be resolved and maintained by data scientists’ best practices, data governance processes, and data quality management tools.

Data Quality Examples. Table: Unknown Author


What are the common data quality problems?

Lack of information standards

  • Different formats and structures across different systems

Data surprises in individual fields

  • Data misplaced in the database

Data myopia

  • Lack of consistent identifiers inhibits a single view

The redundancy nightmare

  • Duplicate records with a lack of standards

Information buried in free-form unstructured fields

Potential sources for poor data quality are:

Data Quality and the Bottom Line. Graph: Wayne Eckerson, Data Warehousing Institute

A significant percentage of the time allocated in a machine learning project goes to data preparation tasks. The data preparation process is one of the most challenging and time-consuming tasks in ML projects.

Effort distribution of organizations, ML projects, and data scientists on data quality:

Organizations' data quality effort distribution. Graph adapted: Baragoin, Corinne, et al. Mining Your Own Business in Health Care

Organizations spend most of their time understanding data sources and managing data (cleaning, standardizing, and harmonizing).

Percentage of time allocated to ML project tasks. Chart: TechTarget

ML projects spend most of their time on data cleaning (25%), labeling (25%), augmentation (15%), and aggregation (15%).

Data scientists allocated time distribution. Chart: Vincent Tatan

Data scientists spend most of their time on data preparation. Data collection and preparation account for seventy-nine percent (79%) of the time spent on data analytics.


Data Quality Framework (DQF)

A Data Quality Framework (DQF) is a structured approach to managing and improving the quality of data within an organization's Data Governance (DG). It provides a set of guidelines, methods, and tools that can be used to assess, monitor, and improve data accuracy, completeness, consistency, and timeliness.

Data Quality Framework (DQF). Diagram: Visual Science Informatics

Key Components of a DQF:

  • Data Quality Dimensions: These are the key characteristics that define data quality. Common dimensions include:

  1. Accuracy: Data is free from errors and reflects reality.
  2. Completeness: All required data is present.
  3. Consistency: Data is consistent across different systems and sources.
  4. Validity: Data conforms to defined rules and formats.
  5. Integrity: Data is accurate, consistent, and reliable throughout its lifecycle, preventing unauthorized changes or corruption.
  6. Uniqueness: No duplicate data entries exist.
  7. Timeliness: Data is up-to-date and available when needed.

  • Data Science Lifecycle (DSL): A structured approach that ensures a project progresses efficiently and avoids common pitfalls.
  • Enterprise Data Architecture (EDA): Blueprint for how an organization manages its data and its characteristics, essentially a high-level plan that defines how data will flow throughout the company, from origin to final use in analytics and decision-making.
  • Data Governance: Establishing roles, responsibilities, and policies for data management.
  • Data Principles: Guide enterprise data science architecture design and implementation for organizations to derive maximum value from their data assets.
  • Data Quality Maturity Model (DQMM): Provides a roadmap for assessing, identifying, and improving the maturity of an organization's Data Quality Management (DQM) to achieve data quality excellence.
  • Information Architecture (IA): Blueprint that helps users navigate through information effectively.
  • Data Standards (DSs): Ensure that all parties use the same language and the same approach to defining, sharing, storing, and interpreting data types and information.
  • Data Quality Standards: These are specific rules and criteria that define acceptable levels of data quality for each dimension. They are measured by Data Quality Tests, and tracked by a Data Quality Scorecard (DQS) and visually represented by a Data Quality Dashboard (DQD).
  • Data Quality Processes: These are the activities involved in managing data quality, such as:

- Data Profiling: Analyzing data to identify quality issues.

- Data Cleansing: Correcting or removing inaccurate or incomplete data.

- Data Monitoring: Continuously tracking data quality metrics.

  • Data Quality Tools: These are software applications that support data quality processes, such as data profiling, cleansing, and monitoring tools.

Benefits of a DQF:

  • Improved decision-making: By ensuring that your data is accurate and reliable, you can make better decisions based on that data.
  • Reduced costs: Data quality issues can lead to a number of costs, such as rework, lost productivity, and missed opportunities. A data quality framework can help you to identify and address these issues, which can save your organization money.
  • Increased efficiency: When your data is clean and consistent, it is easier to work with and analyze. This can lead to increased efficiency and productivity.
  • Enhanced Compliance: A DQF can help organizations meet regulatory requirements for data quality.
  • Improved customer satisfaction: If your data is accurate, you can provide better service to your customers. This can lead to increased customer satisfaction and loyalty.

A well-designed and implemented DQF is essential for any organization that relies on data to operate effectively. It helps to ensure that data is a reliable and valuable asset that supports business goals and objectives.


Data Science Lifecycle (DSL)

Therefore, it is important for you to know about Data Science Lifecycle (DSL), Data Governance (DG), Data Quality Frameworks (DQFs), Information Architecture (IA), and Data Standards (DSs). After that, it is essential to understand the most common data preparation steps, methods, and techniques.

The Data Science Lifecycle (DSL) refers to the series of steps data scientists follow to extract knowledge and insights from data. It is a structured approach that ensures a project progresses efficiently and avoids common pitfalls.

Data Science Lifecycle (DSL). Diagram: Visual Science Informatics

Here is a breakdown of the typical stages in a DSL:

  1. Problem Definition: This initial stage involves clearly defining the business problem you are trying to solve with data science. It includes understanding the goals, success metrics, and any relevant background information.
  2. Data Acquisition and Preparation: Here, you gather the data needed for your analysis. This may involve collecting data from various sources, cleaning and pre-processing the data to ensure its quality and consistency.
  3. Data Exploration and Analysis: In this stage, you explore the data to understand its characteristics, identify patterns, and relationships. Data visualization techniques and statistical analysis are commonly used here.
  4. Model Building and Evaluation: This is where you build a machine learning model or other analytical solution based on the insights gained from exploration. You then evaluate the model's performance to assess its effectiveness in solving the problem.
  5. Deployment and Monitoring: If the model performs well, it is deployed into production where it can be used to make predictions or generate insights. This stage also involves monitoring the model's performance over time to ensure it continues to be effective.

By following a structured DSL, data scientists can ensure their projects are well-defined, efficient, and deliver valuable results that address real-world business problems.


Enterprise Data Architecture (EDA)

Enterprise Data Architecture (EDA) is the blueprint for how an organization manages its data, essentially a high-level plan that defines how data will flow throughout the company, from origin to final use in analytics and decision-making. EDA establishes a foundation for business execution at seven specific levels, composed of three hierarchical layers, and defines four types of operating models.

The seven levels of data architecture include:

  1. Enterprise: Organization-wide data strategy.
  2. Solution: Designs for specific projects/systems.
  3. Application: Data management for specific applications.
  4. Information: Information organization and management (taxonomies, metadata).
  5. Technical: Data storage, processing, and integration technologies.
  6. Data Fabrics: Integrated data management and access across an organization.
  7. Data Meshes: Scalable data systems and governance for large organizations.

Data architecture is composed of three layers:

  1. Conceptual: High-level view of business concepts and requirements (entities, relationships, rules).
  2. Logical: Detailed representation of entities, attributes, and relationships, independent of specific technology.
  3. Physical: Actual implementation in a specific technology, defining storage, data types, and indexing.

Enterprise Architecture (EA) as a Strategy. Diagram: Visual Science Informatics

Integration and standardization are significant dimensions of the operating model. There are four types of operating models:

  1. Diversification: Independence with shared services
  2. Coordination: Seamless access to shared data
  3. Replication: Standardized independence
  4. Unification: Standardized, integrated process


Characteristics of Enterprise Data

Enterprise data is a collection of data from many different sources. It is often described by five key characteristics: Volume, Variety, Velocity, Veracity, and Value.

  • Volume: The massive amount of data that is generated. Volume is measured in units such as terabytes (TB), petabytes (PB), or exabytes (EB).
  • Variety: The wide range of data structures and formats. Data can be structured, semi-structured, or unstructured. This variety makes data analysis complex but more informative. Examples of formats include text, numerical, or binary; categories include qualitative or quantitative; and classifications include text, multimedia, log, and metadata.
  • Velocity: The speed at which data is generated and needs to be processed. Examples include the constant stream and rapid pace of social media updates, sensor data, or stock market trades, which require analysis in near real-time.
  • Veracity: The accuracy and trustworthiness of the data. With so much data coming from various sources, it is critical to ensure the information is reliable before basing decisions on it.
  • Value: The importance of big data. The goal is to extract meaningful insights, which can help businesses make better decisions, improve customer service, develop new products, and gain a competitive edge. Value is what makes all the other Vs worthwhile.

Beyond these core 5 Vs, other characteristics are also important, including:

  • Variability: The constantly changing nature of data over time. For instance, the meaning of words and phrases evolves, especially in social media analysis. Data governance accounts for this variability to ensure accurate analysis.
  • Validity: The accuracy and truthfulness of the data.
  • Venue: The source and location from which the data originates.
  • Vocabulary: The terminology and language used to define the data elements.
  • Vagueness: The imprecision and incompleteness present within the data.

Understanding these characteristics of enterprise data is crucial for ensuring data reliability and effective analysis.


Data Governance (DG)

Data Governance (DG) is essentially a set of rules and practices that ensure an organization's data is accurate, secure, and usable. It is similar to having a constitution for your data, outlining how it should be handled throughout its lifecycle, from creation to disposal.

Data Governance. Diagram: Visual Science Informatics

Here is a breakdown of what DG entails:

  • Data policies: Define how data is collected, stored, shared, and disposed of. They establish things such as data ownership, access controls, and security measures.
  • Data standards: Ensure consistency in how data is defined, formatted, and documented. Data standards enable the integration of data from different sources and avoid confusion.
  • Data processes: Develop Standard Operating Procedures (SOPs) for handling data, such as cleansing, transformation, and analysis. These processes and procedures ensure data quality and reliability.
  • Data people: Assign Roles and Responsibilities (R&R) within the organization for data governance. It includes a data governance council, data stewards (owners of specific datasets), and data analysts.
  • Data security: Protect data from unauthorized access, breaches, and other threats. Data security measures make sure that information is only accessed by authorized R&R. Data access establishes protocols for granting and controlling access to data (Create, Reference, Update, and Delete as a "CRUD matrix"), internally and externally, in line with classified security requirements, and comply with data privacy regulations.

The benefits of DG are numerous:

  • Improved data quality: Ensures data is accurate, complete, and consistent, leading to better decision-making.
  • Enhanced data security: Protects data from unauthorized access, misuse, and breaches.
  • Increased data accessibility: Enables authorized users to find and use relevant data.
  • Boosted compliance: Helps organizations meet regulatory requirements for data privacy and security.

Data Governance is crucial for organizations that rely on data for informed decision-making. It helps turn enterprise data into a valuable asset and promotes trust in data-driven insights.


Principles of Enterprise Data Science Architecture

An effective enterprise data science architecture is essential for organizations to derive maximum value from their data assets. Here are some key principles to guide its design and implementation:

Guiding Principles

1. Data-Driven Culture:

  • Data as an Asset: Treat data as a valuable asset, essential for decision-making.
  • Data Literacy: Foster a data-literate culture, empowering employees to understand and utilize data. Build confidence in data-driven insights.
  • Data-Centric Processes: Integrate data science into business processes to drive innovation. Streamline data and model development processes. Ensure reproducibility of experiments.

2. Scalability and Flexibility:

  • Modular Design: Design a modular architecture to accommodate harmonized data from various sources and future growth and evolving needs.
  • Cloud-Native Architecture: Leverage cloud technologies to scale infrastructure and resources efficiently.
  • Agile Development: Adopt agile methodologies for rapid development and iteration.

3. Data Governance, Privacy, and Security:

  • Data Governance Framework: Establish a robust framework of data governance policies to ensure data quality, security, and compliance.
  • Regulations Compliance: Adhere to regulatory requirements for data traceability, accountability, privacy, and security.
  • Data Security: Implement stringent security measures to protect sensitive data.
  • Data Privacy: Protect data privacy and comply with regulations such as GDPR, CCPA, and HIPAA.

4. Data Quality and Integrity:

  • Data Profiling: Conduct regular data profiling to identify and address data quality issues.
  • Data Cleaning and Transformation: Cleanse and transform data to ensure accuracy and consistency.
  • Data Validation: Implement data validation rules to maintain data integrity.
  • Data Version Control (DVC): Track data quality changes over time and manage different versions of data and metadata to ensure reproducibility and accountability.

5. Data Integration and Management:

  • Data Lake/Data Warehouse: Establish a centralized data repository for efficient data storage and retrieval.
  • Data Pipelines: Implement robust data pipelines to automate data ingestion, transformation, and loading.
  • Data Harmonization: Ensure consistency and compatibility across diverse datasets and various data sources. Align data harmonization processes with business objectives.
  • Metadata Management: Manage metadata effectively to understand data lineage and provenance.
  • Data Lineage Tracking (DLT): Track data dependencies and understand the impact analysis of data changes on model performance to improve data understanding and decision-making.

6. Model Development and Deployment:

  • Model Lifecycle Management: Define a comprehensive model lifecycle, including development, testing, deployment, and monitoring.
  • MLOps: Implement MLOps practices to automate model development, deployment, and retraining.
  • Model Governance: Establish guidelines for model development, deployment, and retirement.

7. Collaboration and Knowledge Sharing:

  • Data Science Platform: Provide a centralized platform for data scientists to facilitate collaboration and share knowledge.
  • Knowledge Management: Foster a culture of knowledge sharing and documentation.
  • Cross-Functional Collaboration: Encourage collaboration between data scientists, business analysts, and IT teams.

By adhering to these principles, organizations can build a robust and scalable data science architecture that enables them to extract maximum value from their data, drive innovation, and achieve sustainable competitive advantage.


Data Quality Maturity Model (DQMM)

A Data Quality Maturity Model (DQMM) is a framework that helps organizations assess the maturity of their Data Quality Management (DQM) practices. It provides a structured approach for identifying areas for improvement and developing a roadmap for achieving data quality excellence.

DQMM. Diagram: SSA Analytics Center of Excellence (CoE)

A DQMM typically consists of five maturity levels, each representing a distinct stage in an organization's DQM journey:

Data Maturity Assessment Model. Diagram: HUD

  • Initial: At this level, there is no formal DQM program in place. Data quality issues are identified and addressed on an ad-hoc basis.
  • Recognizing: The organization recognizes the importance of data quality and begins to take some initial steps to improve it. This may involve establishing data quality policies and procedures, or identifying critical data elements.
  • Specifying: The organization defines specific data quality requirements and starts to implement processes to measure and monitor data quality.
  • Managing: The organization has a well-defined DQM program in place, with processes for managing data quality throughout the data lifecycle.
  • Optimizing: The organization continuously monitors and improves its DQM program, using data quality metrics to drive decision-making.

Organizations can use a DQMM to benchmark their current DQM practices against industry best practices and identify areas for improvement. The model can also be used to develop a roadmap for implementing a DQM program or improving an existing one.

DQMM Benefits:

  • Provides a structured approach for assessing data quality maturity
  • Helps to identify areas for improvement
  • Provides a roadmap for implementing or improving a DQM program
  • Helps to benchmark data quality practices against industry best practices
  • Promotes a data-driven approach to decision-making


Information Architecture (IA)

Information Architecture (IA) is the art and science of organizing and labeling the content of websites, intranets, online communities, and software to make it findable and understandable. IA is different from knowledge and data architecture, but IA must be an integral part of enterprise architecture. It's essentially the blueprint that helps users navigate through information effectively.

Data, Information, and Knowledge concepts. Diagram: Packet

IA Principles:

  • Clarity: The organization of information should be clear and logical, making it easy for users to find what they're looking for.
  • Consistency: The labeling and navigation should be consistent throughout the website or application.
  • Usability: The IA should be designed to be intuitive and easy to use, even for novice users.
  • Accessibility: The IA should be accessible to all users, regardless of their abilities.

Types of Informatics Schemes. Diagram adapted: Louis Rosenfeld & Peter Morville

IA is the organization, labeling, metadata, and navigation schemes within an information system [Information Architecture. Visual Science Informatics]. IA organization schemes include:

  • Metadata
  • Classification vs. Categorization
  • Controlled Vocabulary
  • Taxonomy
  • Thesaurus
  • Ontology
  • Meta-Model

IA Topologies based on IA Organizations. Diagram adapted: Leo Orbst

For example, in health science informatics, there are several vocabularies of vocabularies. To improve your data quality, you need to understand the definitions and visualization of informatics architecture vocabularies.

From Searching to Knowing – Spectrum for Knowledge Representation and Reasoning Capabilities. Diagram adapted: Leo Orbst

The Common Core Ontologies (CCO) are several open-source ontologies designed to represent and integrate generic classes and relations across all domains. They extend the Basic Formal Ontology (BFO), a widely used upper-level ontology. CCO helps prevent domain-specific ontologies from duplicating common concepts. The CCO files are available on GitHub.


Data Standards (DSs)

"Data Standards (DSs) are created to ensure that all parties use the same language and the same approach to sharing, storing, and interpreting information. In healthcare, standards make up the backbone of interoperability — or the ability of health systems to exchange medical data regardless of domain or software provider."

In the context of health care, the term data standards encompass methods, protocols, terminologies, and specifications for the collection, exchange, storage, and retrieval of information associated with health care applications, including medical records, medications, radiological images, payment and reimbursement, medical devices and monitoring systems, and administrative processes. [Washington Publishing Company]

Levels of Interoperability in Healthcare. Diagram: Awantika

Levels of Interoperability:

  • Level 1: Foundational (Syntactic)
  • Level 2: Structural (Relationships)
  • Level 3: Semantic (Terminology)
  • Level 4: Organizational (Functional)


It is necessary but not sufficient to have physical and logical compatibility and interoperability. Standards also need to provide an explicit representation of data semantics and lexical definitions.

Data Standards Define:

  • Structure: Defines general classes and specializations
  • Data Types: Determines structural format of attribute data
  • Vocabulary: Controls possible values in coded attributes

Health Data Standards. Diagram: Visual Science Informatics

Types of Health Data Standards (HDSs):

  • Content: "Define the structure and organization of the electronic message or document’s content. This standard category also includes the definition of common sets of data for specific message types."
  • Identifier: "Entities use identifier standards to uniquely identify patients or providers."
  • Privacy & Security: "Aim to protect an individual's (or organization's) right to determine whether, what, when, by whom and for what purpose their personal health information is collected, accessed, used or disclosed. Security standards define a set of administrative, physical, and technical actions to protect the confidentiality, availability, and integrity of health information."
  • Terminology/Vocabulary: "Health information systems that communicate with each other rely on structured vocabularies, terminologies, code sets, and classification systems to represent health concepts."
  • Transport: "Address the format of messages exchanged between computer systems, document architecture, clinical templates, user interface, and patient data linkage. Standards center on “push” and “pull” methods for exchanging health information."

Source: HIMSS

"Thankless data work zone". Diagram: Jung Hoon Son

"Thankless data work zone". Diagram: Jung Hoon Son

Standard terminologies are helpful, but that does not mean everyone employs or follows them. As a healthcare example, the same column might contain ICD-9, ICD-10, and potentially CPT codes. [Jung Hoon Son]

  • Gold labels

"Data manually labeled by human beings is often referred to as gold labels, and is considered more desirable than machine-labeled data for analyzing or training models, due to relatively better data quality. This does not necessarily mean that any set of human-labeled data is of high quality. Human errors, bias, and malice can be introduced at the point of data collection or during data cleaning and processing. Check for them before analyzing. Intra-rater reliability measures the consistency of a single rater's judgments over time. In other words, it assesses how well a person can replicate their own ratings or observations for consistency, accuracy, and objectivity.

Any two human beings may label the same example differently. The difference between human raters' decisions is called inter-rater agreement. You can get a sense of the variance in raters' opinions by using multiple raters per example and measuring inter-rater agreement." [Google]

  • Silver labels

"Machine-labeled data, where categories are automatically determined by one or more classification models, is often referred to as silver labels. Machine-labeled data can vary widely in quality. Check it not only for accuracy and biases but also for violations of common sense, reality, and intention." [Google]


Anonymized Data

Acquiring Real World Data (RWD) is challenging because of regulations such as HIPAA that protect sensitive, confidential Protected Health Information (PHI) and Personally Identifiable Information (PII). To use RWD, you must anonymize it by removing all PHI and PII, or de-identify, encrypt, and hide protected data. In addition to data acquisition costs, the HIPAA Privacy Rule's de-identification methods add further cost. These additional costs can include off-the-shelf anonymization tools or hiring a de-identification expert with proven experience. Another barrier is the high cost of medical practitioners with the domain expertise to annotate and label the raw data, images, or audio to train ML models. Although you would think there is plenty of data generated by patient care, you cannot get that data directly. [Open Health Data]


Synthetic Data

Synthetic raw data, artificially generated rather than produced by real-world events, must also be preprocessed. Synthetic raw data is created using algorithms. Synthetic data can be artificially generated from real-world data, noisy data, handcrafted data, duplicated data, resampled data, bootstrapped data, augmented data, oversampled data, edge-case data, simulated data, univariate data, bivariate data, multivariate data, or multimodal data. [Cassie Kozyrkov] Artificial raw data can be deployed to validate mathematical models and train machine learning models. Also, data generated by a computer simulation is considered manufactured raw data.

Generative AI (GAI) is the driving force behind the creation of synthetic content. GAI models are trained on massive amounts of data. Once trained, they can generate new synthetic content that is similar to the data they were trained on, such as: text, images, audio, and video.

  • Synthesizing Missing Data (Imputation): This addresses the problem of incomplete datasets. Instead of simply discarding data points with missing values, GAI learns the underlying patterns in the existing data and uses this knowledge to fill in the missing entries. This is like a sophisticated form of data imputation. By creating plausible synthetic values, it results in a more complete and usable dataset, which can improve the performance of downstream AI models trained on it.
  • Augmenting Small Datasets (Data Augmentation): This tackles the issue of insufficient data. When there is not enough real-world data to train a robust AI model, GAI can create new, synthetic data points that resemble the real data. This effectively expands the dataset, giving the model more examples to learn from. This is crucial for applications where collecting real data is expensive, time-consuming, or difficult due to privacy concerns. By training on a larger, augmented dataset, the model can generalize better and make more accurate predictions.
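
As a down-to-earth illustration of these two uses, the sketch below stands in for a generative model with much simpler tools: a median imputer for missing values and Gaussian-noise jitter for augmentation. The array values and the noise scale are arbitrary assumptions.

```python
# A minimal sketch: simple statistical stand-ins for the two generative uses above.
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])

# 1) Imputation: fill missing entries with per-column medians.
X_complete = SimpleImputer(strategy="median").fit_transform(X)

# 2) Augmentation: add synthetic rows by jittering the real rows with Gaussian noise.
noise = rng.normal(loc=0.0, scale=0.1, size=X_complete.shape)
X_augmented = np.vstack([X_complete, X_complete + noise])

print(X_complete.shape, X_augmented.shape)  # (4, 2) (8, 2)
```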

ML Life Cycle with Synthetic Data. Diagram: Gretel

Synthetic raw data can be utilized for anonymizing data, augmenting data, or optimizing accuracy. Anonymized data can filter information to prevent the compromise of the confidentiality of particular aspects. Augmented data is generated to meet specific needs, conditions, or situations for the simulation of theoretical values, realistic behavior profiles, or unexpected results. If the collected data is imbalanced, has missing values, or is insufficiently numerous, synthetic data can be used to set a baseline, fill gaps, and optimize accuracy.


Data Types

Data comes in various formats, each with its own unique characteristics and challenges. Understanding the differences between structured, semi-structured, and unstructured data is crucial for effective data management and analysis.

Structured Data

  • Definition: Highly organized data with a predefined format, typically stored in relational databases.
  • Characteristics: Follows a rigid schema with clearly defined fields and data types. Easy to search, query, and analyze using traditional SQL-based tools.
  • Examples: Customer databases, financial transactions, product catalogs.

Semi-structured Data

  • Definition: Data that does not adhere to a strict schema, but has some organizational elements (tags or markers).
  • Characteristics: Often stored in formats such as JSON (JavaScript Object Notation), XML (Extensible Markup Language), or CSV files (Comma Separated Values), which allow for flexible structures. It contains tags or markers to separate data elements, making it easier to analyze than unstructured data, but requires specialized tools.
  • Examples: Social media posts, sensor data, email messages.

Unstructured Data

  • Definition: Data that does not have a predefined schema format or organization that dictates how the information is organized.
  • Characteristics: Most challenging to analyze due to its lack of structure. Often requires advanced techniques such as natural language processing (NLP) and machine learning. While images, audio, and video have file formats, these formats do not dictate the specific content within those files in the same way a database schema does.
  • Examples: Text documents, social media posts, images, audio, video.
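
As an illustration of how the semi-structured category above can be brought into a structured, analyzable form, here is a minimal sketch that flattens hypothetical JSON sensor records with pandas; the record fields are invented for the example.

```python
# A minimal sketch: flattening semi-structured JSON records (hypothetical sensor
# readings) into a structured, tabular form that SQL-style tools can analyze.
import pandas as pd

records = [
    {"sensor": "t-01", "reading": {"temp_c": 21.5, "humidity": 40}, "tags": ["lab"]},
    {"sensor": "t-02", "reading": {"temp_c": 19.0, "humidity": 55}, "tags": ["field", "north"]},
]

df = pd.json_normalize(records)  # nested keys become dotted columns
print(df[["sensor", "reading.temp_c", "reading.humidity"]])
```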

Data Types Key Differences Summarized. Table: Gemini

Key Considerations

  • Data Storage: Different data types require different storage solutions. Structured data is typically stored in relational databases, while semi-structured and unstructured data often require NoSQL databases or specialized storage systems.
  • Data Analysis: The techniques used to analyze each type of data vary significantly. Structured data can be analyzed using SQL queries, while semi-structured and unstructured data often require more advanced tools and techniques.
  • Data Growth: Unstructured data is growing rapidly due to the increasing use of digital technologies. This growth presents challenges for data management and analysis.

By understanding the characteristics of each data type, organizations can make informed decisions about how to store, manage, and analyze their data effectively.


Goodness of Measurement vs. Goodness-of-Fit

Let's continue with understanding the concepts of Goodness of Measurement vs. Goodness-of-Fit, and statistical biases before covering the data preparation process.

During data acquisition, your data quality depends on your instrument goodness of measurement, statistical bias sources, and your data goodness of fit.

  • Goodness of Measurement: Reliability and Validity

The two most important and fundamental characteristics of any measurement procedure are reliability and validity, which lie at the heart of a competent and effective study. [Accuracy: The Bias-Variance Trade-off]

Bullseye Diagram: The distribution of model predictions. Diagram adapted: Domingo

Validity is a test of how well an instrument that is developed measures the particular concept it is intended to measure. In other words, validity is concerned with whether we measure the right concept or not.

An instrument is considered reliable if a measurement device or procedure consistently assigns the same score to individuals or objects with equal values. In other words, the reliability of a measure indicates the extent to which it is without bias and hence ensures consistent measurement across time and across the various items in the instrument.

Several types of validity and reliability tests are used to test the goodness of measures. Data scientists use various forms of reliability and validity and different terms to denote them.

Goodness of Data Measurement - Forms of reliability and validity. Diagram: Shweta Bajpai et al.


Statistical Bias

“A major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables.”

Statistical bias is a systematic tendency that causes differences between results and facts. Statistical bias may be introduced at all stages of data analysis: data selection, hypothesis testing, estimator selection, analysis methods, and interpretation. [Interpretability: “Seeing Machines Learn”]

Statistical bias sources from stages of data analysis. Diagram: Visual Science Informatics, LLC

Systematic error (bias) introduces noisy data with high bias but low variance. Although measurements are inaccurate (not valid), they are consistent (reliable). Repeatable systematic error is associated with faulty equipment or a flawed experimental design and influences a measurement's accuracy.

Reproducibility error (variance) introduces noisy data with low bias but high variance. Although measurements are accurate (valid), they are inconsistent (not reliable). This error is due to the measurement process and primarily influences a measurement's precision. Reproducibility refers to the variation in measurements made on a subject under changing conditions.
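
A small simulation can make the contrast concrete. In the sketch below, a hypothetical true value of 100.0 is measured under systematic error (biased but consistent) and under reproducibility error (unbiased but noisy); all numbers are illustrative.

```python
# A minimal sketch: simulated measurements of an assumed true value of 100.0.
import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0

# Systematic error: biased but consistent (high bias, low variance).
systematic = true_value + 5.0 + rng.normal(0.0, 0.2, size=1000)

# Reproducibility error: unbiased but noisy (low bias, high variance).
reproducible = true_value + rng.normal(0.0, 5.0, size=1000)

for name, m in [("systematic", systematic), ("reproducibility", reproducible)]:
    print(f"{name}: mean={m.mean():.2f} (bias={m.mean() - true_value:+.2f}), std={m.std():.2f}")
```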


Sample Size

"Sample complexity of a machine learning algorithm represents the number of training samples it needs in order to successfully learn a target function." The two main categories are probability sampling and non-probability sampling methods. In probability sampling, every member of the target population has a known chance of being included. This allows researchers to make statistically significant inferences about the population from the sample. Non-probability sampling methods are used when it's not possible or practical to get a random sample, but they limit the ability to generalize the findings to the larger population. [Scenarios: Which Machine Learning (ML) to choose?]

Sampling Methods and Techniques. Diagram: Asad Naveed

Understanding your ML model should start with the data collection, transformation, and processing because, otherwise, you will get “Garbage In, Garbage Out” (GIGO).

  • Goodness-of-Fit

“The term goodness-of-fit refers to a statistical test that determines how well sample data fits a distribution from a population with a normal distribution. Put simply, it hypothesizes whether a sample is skewed or represents the data you would expect to find in the actual population.” [4]
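
As a minimal example of such a test, the sketch below applies the Shapiro-Wilk test for normality to a simulated sample with SciPy; Kolmogorov-Smirnov or chi-square tests are common alternatives, and the sample itself is synthetic.

```python
# A minimal sketch: Shapiro-Wilk goodness-of-fit test for normality on a simulated sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=10.0, size=200)

stat, p_value = stats.shapiro(sample)
print(f"W={stat:.3f}, p={p_value:.3f}")
# A small p-value (e.g., < 0.05) suggests the sample deviates from a normal distribution.
```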


Data Quality Scorecard (DQS)

A Data Quality Scorecard (DQS) is a tool used to measure and track the health of your data. It provides a quick and clear way to understand how well your data meets specific quality dimensions.

Data Quality Scorecard Repository Executive Summary Dashboard. Graph: Proceedings of the MIT 2007 Information Quality Industry Symposium

DQS offers:

  • Measurement: A DQS translates complex data quality metrics into a score or set of scores. These scores represent how well your data adheres to various quality dimensions such as accuracy, completeness, consistency, and timeliness.
  • Tracking: A scorecard allows you to track these quality measurements over time. This helps identify trends and monitor the effectiveness of any data quality improvement initiatives.
  • Context: DQS reports go beyond just a number. They provide context for interpreting the scores by highlighting specific data elements or areas that are falling short of quality expectations. This helps data users understand the limitations of the data and make informed decisions.
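
A minimal sketch of the measurement idea: roll per-dimension scores into one weighted score and flag the dimensions that miss a target. The scores, weights, and 0.90 threshold are invented for illustration; a real scorecard would derive them from data quality tests and business priorities.

```python
# A minimal sketch: roll illustrative per-dimension scores (0-1) into one weighted score.
dimension_scores = {"accuracy": 0.97, "completeness": 0.92, "consistency": 0.88,
                    "validity": 0.95, "uniqueness": 0.99, "timeliness": 0.80}
weights = {"accuracy": 0.30, "completeness": 0.20, "consistency": 0.15,
           "validity": 0.15, "uniqueness": 0.10, "timeliness": 0.10}

overall = sum(dimension_scores[d] * weights[d] for d in dimension_scores)
print(f"Overall data quality score: {overall:.1%}")

# Context: flag the dimensions that fall short of an assumed 0.90 target.
print([d for d, score in dimension_scores.items() if score < 0.90])
```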

DQS Benefits:

  • Improved Transparency: DQS fosters trust in data by providing clear visibility into its quality.
  • Prioritization: It helps identify the most critical data quality issues for your organization to focus on.
  • Communication: The scorecard provides a common language for discussing data quality across different teams within an organization.
  • Data-driven Decisions: By understanding data quality, businesses can make more informed decisions based on reliable information.

There are different ways to implement a DQS. Some data management platforms offer built-in DQS features, while others may require custom development. The specific design of your DQS will depend on your organization's needs and the type of data you manage.


Data Quality Dashboard (DQD)

A Data Quality Dashboard is a visual representation of the health of your data using key metrics and charts. It provides a centralized view of various data quality tests and their results, allowing you to monitor data quality trends and identify areas needing improvement.

Data Quality Dashboard. Chart: iMerit

DQD Elements:

1. Data Quality Dimensions:

The dashboard should represent various data quality dimensions such as accuracy, completeness, consistency, timeliness, and validity.

2. Data Quality Tests:

Each dimension can be broken down into specific data quality tests, e.g., null value tests, freshness checks, and uniqueness tests (refer to the list of tests in the Data Quality Tests section discussed later).

3. Data Visualization:

The dashboard should use charts and graphs to represent the results of each test. Examples include:

  • Bar charts: Compare null value percentages across different columns.
  • Line graphs: Track data freshness over time (e.g., how long it takes to update sales data).
  • Scatter plots or scattergram charts: Visualize data distribution (e.g., percentage of customers in each state).

4. Key Performance Indicators (KPIs) & Service Level Indicators (SLIs):

KPIs and SLIs can be incorporated to set specific targets for data quality.

  • KPIs: Overall data quality score for a specific data source or table.
  • SLIs: Define acceptable thresholds for metrics such as freshness (data shouldn't be older than 24 hours) or volume (number of customer records shouldn't exceed 1 million).

5. Alerts & Notifications:

The dashboard can be configured to send alerts or notifications when data quality metrics fall outside acceptable ranges, prompting investigation and corrective action.
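
A minimal sketch of the alerting idea, using the freshness and volume thresholds mentioned above; the metric values are hypothetical, and a production dashboard would pull them from monitoring rather than hard-code them.

```python
# A minimal sketch: flag SLI violations for freshness (<= 24 hours) and volume (<= 1,000,000 rows).
from datetime import datetime, timedelta, timezone

metrics = {
    "last_loaded_at": datetime.now(timezone.utc) - timedelta(hours=30),  # hypothetical values
    "row_count": 1_250_000,
}

alerts = []
if datetime.now(timezone.utc) - metrics["last_loaded_at"] > timedelta(hours=24):
    alerts.append("freshness SLI violated: data is older than 24 hours")
if metrics["row_count"] > 1_000_000:
    alerts.append("volume SLI violated: more than 1,000,000 customer records")

for alert in alerts:
    print("ALERT:", alert)  # in practice, this would notify a monitoring channel
```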

DQD Benefits:

  • Improved Visibility: Provides a clear view of data health across different dimensions.
  • Proactive Monitoring: Enables early detection of data quality issues.
  • Prioritization: Helps identify critical areas for data quality improvement.
  • Communication: Facilitates communication of data quality status across teams.
  • Data-driven Decisions: Supports data-driven decision making based on reliable information.

DQD Design Requirements:

  • Identify data sources & quality needs: Start by understanding your data sources and the specific quality requirements for each.
  • Select data quality tests: Choose the most relevant tests based on your data and needs (refer to the list of tests in the Data Quality Tests section discussed later).
  • Choose visualization tools: Select appropriate data visualization tools to represent test results effectively.
  • Set KPIs & SLIs: Define clear KPIs and SLIs to measure success.
  • Schedule updates: Determine how often the dashboard will be updated with fresh data.

By implementing a DQD, you can proactively manage your data quality and verify that it meets your organization's needs.

There is a high cost of poor data quality to the success of your ML model. You will need a systematic method to improve your data quality. Most of the work is in your data preparation, and consistency is key to data quality.


Data Science 'Hierarchy of Needs' and Bloom’s Taxonomy Adapted for ML

"Collecting better data, building data pipelines, and cleaning data can be tedious, but it is very much needed to be able to make the most out of data." The Data Science Hierarchy of Needs, by Sarah Catanzaro, is a checklist for "avoiding unnecessary modeling or improving modeling efforts with feature engineering or selection." [Serg Masis]

Data Science Hierarchy of Needs. Diagram: Serg Masis

Note: The "Data Science Hierarchy of Needs" is attributed to Monica Rogati; Sarah Catanzaro is known for revisiting and discussing this concept, but the original idea is credited to Rogati.

Bloom’s Taxonomy Adapted for ML

Bloom's Revised Taxonomy. Diagram: Wikipedia

"Bloom's taxonomy is a set of three hierarchical models used for the classification of educational learning objectives into levels of complexity and specificity. The three lists cover the learning objectives in the cognitive, affective, and psychomotor domains. There are six levels of cognitive learning according to the revised version of Bloom's Taxonomy. Each level is conceptually different. The six levels are?remembering, understanding, applying, analyzing, evaluating, and creating." [Anderson & Krathwohl, 2001, pp. 67-68]

This Bloom's taxonomy was adapted for machine learning.

Bloom’s Taxonomy Adapted for ML. Diagram: Visual Science Informatics, LLC

There are six levels of model learning in the adapted version of Bloom's Taxonomy for ML. Each level is a conceptually different learning model. The levels are ordered from lower-order to higher-order learning. The six levels are Store, Sort, Search, Descriptive, Discriminative, and Generative. Bloom’s Taxonomy terms adapted for ML are defined as:

  • Store models capture three perspectives: Physical, Logical, and Conceptual data models. Physical data models describe the physical means by which data are stored. Logical data models describe the semantics represented by a particular data manipulation technology. Conceptual data models describe a domain's semantics in the model's scope. Extract, Transform, and Load (ETL) operations are a three-phase process where data is extracted, transformed, and loaded into store models. Collected data can be from one or more sources. ETL data can be stored in one or more models.
  • Sort models arrange data in a meaningful order and systematic representation, which enables searching, analyzing, and visualizing.
  • Search models solve a search problem to retrieve information stored within some data structure, or calculated in the search space of a problem domain, either with discrete or continuous values.
  • Descriptive models specify statistics that quantitatively describe or summarize features and identify trends and relationships.
  • Discriminative models focus on a solution and perform better for classification tasks by dividing the data space into classes by learning the boundaries.
  • Generative models understand how data is embedded throughout space and generate new data points.


Data Quality Strategy

Next, you should check and analyze your data even before you train a model because you might discover data quality issues in your data. Identifying common data quality issues such as missing data, duplicated data, and inaccurate, ambiguous, or inconsistent data can help you find data anomalies and perform feature engineering.

Data Quality Strategy Outline. Flowchart: Gary McQuown

Crossing the data quality chasm from raw data to a good quality dataset requires you to consider the full equation of objectives, causes, assessment, and techniques. A Data Quality Assessment (DQA) identifies potential causes of poor data quality. These causes can be linked to your objectives for improving data quality. Together, the objectives and the assessment help you select techniques that resolve the causes of your data quality issues.

Crossing the data quality chasm from raw data to a good quality dataset. Diagram: Visual Science Informatics, LLC

ABC of Data Science

ML is a form of Artificial Intelligence (AI) that makes predictions and decisions from data. It is the result of training algorithms and statistical models to analyze and draw inferences from patterns in data, and to learn and adapt without following explicit instructions. However, you need to:

  1. Check - double check your Assumptions,
  2. Mitigate - make sure you mitigate your Biases, and
  3. Validate - take your time to validate your Constraints.

The Assumptions, Biases, and Constraints (ABC) of data science, together with the Data and Models of ML, can be captured in this formula:

Machine Learning = {Assumptions/Biases/Constraints, Data, Models}

Dataflow in a Traditional ML Workflow

Let’s view the data pipeline workflow. After articulating a problem statement and defining the required data, the three main phases of an initial data pipeline are Data Acquisition, Data Exploration, and Data Preparation.

Dataflow in a Traditional ML Workflow. Diagram: Visual Science Informatics

  • Data Preparation

Data preparation (preprocessing) is the process of cleaning, reducing, and transforming raw data before training a machine learning model. The data preparation process includes, for example, standardizing data formats, correcting errors, enriching source data, and removing outliers and biases. Common data preprocessing tasks include reformatting data, making corrections to data, and combining datasets to enrich data. Each data preparation step is essential and requires specific methods, techniques, and functionalities. Although data preparation is a lengthy and tedious process that takes a significant amount of time and effort, it is essential because it mitigates "garbage in, garbage out" and enhances model performance by improving data quality.

  • Feature Engineering

Feature engineering (also called feature extraction or feature discovery) is using domain knowledge to extract or create features (characteristics, properties, and attributes) from raw data. If feature engineering is performed effectively, it improves the machine learning process. Consequently, it increases the predictive power and improves ML model accuracy, performance, and quality of results by creating extra features that effectively represent the underlying model. The feature engineering step is a fundamental part of the data pipeline, which leverages data preparation, in the machine learning workflow.

  • Data Wrangling

Data wrangling is the process of restructuring, cleaning, and enriching raw data into the preferred format for easy access and analysis. The data preparation (preprocessing) steps are performed once before applying any iterative model training, validation, and test (evaluation), while the wrangling process is performed at the time of feature engineering during iterative analysis and model building. For instance, data cleaning focuses on removing erroneous data from your dataset. In contrast, data wrangling focuses on changing a data format by translating raw data into a usable form and structure.

In this section, we will focus on the Data Preparation (Preprocessing) phase. Within this phase, there are six key data preprocessing steps.

Data preprocessing steps. Diagram: TechTarget

  • Steps for Data Preprocessing

1. Data profiling is examining, analyzing, and reviewing data as part of a data quality assessment. The assessment objectives are to discover and investigate structure, content, and relationship data quality issues. Before any modification of a collected dataset, data scientists should baseline a data inventory by surveying and summarizing the data characteristics. Additionally, data scientists can generate a descriptive statistics summary that quantitatively describes or summarizes features from the collection of datasets.

2. Data cleansing is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

3. Data reduction is the process that reduces the volume of original raw data. Data reduction techniques are used to obtain a reduced representation while maintaining the integrity of the original raw data.

4. Data transformation is the process of converting, restructuring, and mapping data from one format into a more usable form, typically from the format of a source system into the required format and structure of a destination system.

5. Data enrichment or augmentation enhances existing information by supplementing missing or incomplete data. Data enrichment can append and expand collected raw data with relevant context obtained from external data sources.

6. Data validation in data preprocessing (not as part of model training) is checking that the data modifications (cleansing, reduction, transformation, and enrichment/augmentation) are both correct and useful.
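
The sketch below walks a toy table through the six steps with pandas. The domain rules (a valid age range of 0-120, a 95th-percentile cap, the age bands) are illustrative assumptions, not prescriptions.

```python
# A minimal sketch of the six preprocessing steps on a toy table.
import numpy as np
import pandas as pd

raw = pd.DataFrame({"age": [25, 25, -3, 47, None, 210],
                    "income": [40_000, 40_000, 52_000, None, 61_000, 1_000_000]})

# 1. Profiling: baseline descriptive statistics before any modification.
profile = raw.describe(include="all")

# 2. Cleansing: drop duplicates and out-of-domain ages, impute missing income.
clean = raw.drop_duplicates()
clean = clean[clean["age"].between(0, 120) | clean["age"].isna()].copy()
clean["income"] = clean["income"].fillna(clean["income"].median())

# 3. Reduction: cap extreme incomes (a crude outlier/numerosity reduction).
clean["income"] = clean["income"].clip(upper=clean["income"].quantile(0.95))

# 4. Transformation: log-transform the skewed income column.
clean["log_income"] = np.log1p(clean["income"])

# 5. Enrichment: append a derived attribute (an age band).
clean["age_band"] = pd.cut(clean["age"], bins=[0, 30, 60, 120],
                           labels=["young", "mid", "senior"])

# 6. Validation: check that the modifications are correct and useful.
assert clean["age"].dropna().between(0, 120).all()
assert clean["income"].notna().all()
print(profile.shape, clean.shape)
```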


Data Version Control (DVC)

Note that throughout the data preprocessing steps, we highly recommend deploying Data Version Control (DVC), utilizing data versioning tools for MLOps, to help you track changes to all your datasets: raw, training, validation, and test (evaluation). [Operations: MLOps, Continuous ML, & AutoML] Moreover, DVC allows version control of model artifacts, metadata, notations, and models. Furthermore, data must be properly labeled and defined to be meaningful. Metadata, the information describing the data, must be accurate and clear. Lastly, data preprocessing issues must be logged, tracked, and categorized. Data preprocessing issues can be captured by major data categories, such as quantities, encoded data, structured text, and free text.
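As a hedged illustration of reading a versioned dataset with DVC's Python API, the sketch below assumes a Git repository in which the file data/training.csv is already tracked by DVC; the repository URL, file path, and tag are placeholders, not references from the article:

```python
import pandas as pd
import dvc.api

# Open a specific, pinned version of the dataset directly from the repository.
with dvc.api.open(
    "data/training.csv",                           # hypothetical DVC-tracked file
    repo="https://github.com/your-org/your-repo",  # placeholder repository URL
    rev="v1.0",                                    # Git tag pinning the dataset version
) as f:
    train_df = pd.read_csv(f)

print(train_df.shape)
```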


Data Lineage Tracking (DLT)

In data management, lineage tracking refers to the process of documenting the journey of data from its source to its final destination. It involves tracing the history and transformations of data entities to understand their origin, evolution, and impact. This includes tracking data transformations, integrations, and usage patterns.

Benefits of DLT:

- Improved Data Governance: By understanding data origins and transformations, organizations can better manage and govern their data assets.

- Enhanced Data Quality: Lineage tracking helps identify and rectify data quality issues by tracing their root causes.

- Facilitated Data Impact Analysis: When data changes or issues arise, lineage tracking enables organizations to quickly assess the potential impact on downstream systems and processes.

- Aided Audit and Compliance: Lineage tracking provides a detailed audit trail, making it easier to comply with regulations and industry standards.


Data Preprocessing Methods

Data preprocessing methods, which enhance data quality, include data cleaning, data transformation, and data reduction.

Data Preprocessing Methods. Diagram: Yulia Gavrilova & Olga Bolgurtseva


Data cleaning methods handle missing values by discarding them or applying imputation methods to replace missing data with inferred values. They also detect and mitigate biases (imbalances) in the data. Additionally, data cleaning methods remove noise, outliers, duplicate/redundant records, inconsistencies, and false null values from the data. Furthermore, data cleaning methods ensure data values fall within defined domains, resolve conflicts in the data, ensure proper definition and use of data values, and establish and apply data standards.
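A minimal pandas sketch of a few of these cleaning methods (deduplication, a simple IQR outlier rule, and median imputation); the sensor table, its values, and the IQR multiplier are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical sensor readings (illustrative column names and values).
df = pd.DataFrame({
    "sensor_id": [1, 1, 2, 3, 4, 5, 6, 7],
    "temp_c": [21.5, 21.5, 20.1, 19.8, 22.3, 20.9, None, 400.0],
})

# Remove duplicate records.
df = df.drop_duplicates()

# Mitigate outliers with a simple IQR rule (one of many possible rules).
q1, q3 = df["temp_c"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["temp_c"].isna() | df["temp_c"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Impute remaining missing values with the column median.
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].median())
```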

Data transformation methods include scaling/normalization, standardization, smoothing, and pivoting. They also include attribute selection, discretization of numeric variables into categorical/encoded attributes, decomposing categorical attributes, reframing numerical quantities, and concept hierarchy generation. In the case of a highly skewed distribution, a log transform spreads out the curve to resemble a less skewed Gaussian distribution and reduces the complexity of a model. [Complexity - Time, Space, & Sample]
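A short sketch of common transformations, assuming pandas, NumPy, and scikit-learn are available; the income/segment columns are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [32_000, 41_000, 39_000, 1_250_000],  # highly skewed values
    "segment": ["retail", "retail", "wholesale", "retail"],
})

# Log transform spreads out a highly skewed distribution.
df["log_income"] = np.log1p(df["income"])

# Standardization: zero mean, unit variance.
df["income_scaled"] = StandardScaler().fit_transform(df[["log_income"]]).ravel()

# Decompose a categorical attribute into one-hot encoded columns.
df = pd.get_dummies(df, columns=["segment"])
```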

Data reduction methods include decomposition, aggregation, and partitioning. They also include attribute subset selection and numerosity, dimensionality, and variance threshold reduction. Data sampling and partitioning can reduce a large amount of data with techniques such as sampling with/without replacement, stratified and progressive sampling, or random data splits.
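A brief scikit-learn sketch of data reduction via stratified sampling, variance thresholding, and dimensionality reduction; the synthetic dataset, sample fraction, and thresholds are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Synthetic dataset standing in for a large raw dataset.
X, y = make_classification(n_samples=1_000, n_features=50, random_state=0)

# Stratified sampling: keep class proportions while reducing rows to 20%.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)

# Variance threshold: drop near-constant features, if any.
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X_sample)

# Dimensionality reduction: project onto 10 principal components.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X_reduced)
print(X_pca.shape)  # (200, 10)
```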

Data augmentation techniques can be applied in case of data scarcity. Data augmentation generates synthetic data that can be mixed with the collected data to enhance generalization accuracy and performance.
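One simple augmentation technique for numeric tabular data is jittering, that is, adding small Gaussian noise to create synthetic rows. The sketch below is a minimal illustration; the noise scale is an arbitrary assumption, and techniques such as SMOTE or generative models may be more appropriate in practice:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def jitter_augment(X: np.ndarray, n_copies: int = 2, noise_scale: float = 0.01) -> np.ndarray:
    """Create synthetic rows by adding small Gaussian noise to numeric features."""
    copies = [X + rng.normal(0.0, noise_scale * X.std(axis=0), size=X.shape)
              for _ in range(n_copies)]
    return np.vstack([X] + copies)

X = rng.normal(size=(100, 5))     # stand-in for collected numeric features
X_augmented = jitter_augment(X)   # original rows plus two noisy copies
print(X_augmented.shape)          # (300, 5)
```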

Data preprocessing tasks for building operational data analysis. Diagram: Fan Chang, et al. [5]



Types of Missing Values

Missing values are a common challenge in data analysis. Understanding the different types of missing data is crucial for selecting the appropriate imputation method.

Types of Missing Values and Suitable Imputation Techniques. Table: Gemini


Note: While k-NN imputation is listed under both MCAR (Missing Completely at Random) and MAR (Missing at Random), its effectiveness can vary based on the specific dataset and missing data pattern.

Two primary types of missingness are commonly distinguished:

1. Crude Missingness

  • Missing Values: This is the simplest form of missingness, where specific data points are absent. They can occur due to various reasons, such as data entry errors, equipment malfunctions, or participant non-response.

2. Qualified Missingness

This type of missingness provides more detailed information about the reasons behind the missing data. It can be categorized as follows:

  • Non-response Rate: The overall proportion of individuals who did not respond to any part of the survey or study.
  • Refusal Rate: The proportion of individuals who declined to participate in the study or refused to answer specific questions.
  • Drop-out Rate: The proportion of individuals who started the study but did not complete it.
  • Missing Due to Specified Reason: This category encompasses various reasons for missing data, such as:

- Missing by Design: Data points that were intentionally omitted based on specific criteria.

- Technical Errors: Data that was lost due to technical issues such as software bugs or hardware failures.

- Participant Errors: Data that was incorrectly recorded or omitted by participants.

Important Considerations:

  • Imputation is generally not recommended for MNAR (Missing Not at Random) data.
  • Always explore the reasons for missing data before imputing.
  • Evaluate the impact of imputation on your analysis.
  • Consider using multiple imputation methods and comparing results.

Additional Tips:

  • The choice of imputation technique should be based on the specific characteristics of the dataset, such as data type, distribution, and the amount of missing data.
  • If you have a large amount of missing data, consider removing the variable or observations with missing values.
  • Explore the distribution of missing values to gain insights into the missing data pattern.
  • Use domain knowledge to inform the imputation process.

By carefully considering the type of missing data and the characteristics of your dataset, you can choose the most appropriate imputation method to improve the quality of your analysis.
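A small pandas sketch of exploring the missing-data pattern before choosing an imputation method; the survey columns and the 50% row threshold are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract (illustrative columns).
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52],
    "income": [48_000, np.nan, np.nan, 81_000, np.nan],
    "smoker": ["no", "yes", None, "no", None],
})

# Per-column missingness rate.
print(df.isna().mean().sort_values(ascending=False))

# Co-occurrence of missingness across columns (a rough clue to MCAR vs. MAR).
print(df.isna().astype(int).corr())

# Rows missing more than half of their fields may warrant removal.
print(df[df.isna().mean(axis=1) > 0.5])
```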


Imputation Techniques

Data imputation is a crucial process in data preprocessing, aimed at filling in missing values to enable comprehensive analysis. Different techniques exist, each with its own set of challenges and limitations.

Pre-conditions and Post-conditions of Data Imputation

Pre-conditions:

  • Understanding the Missing Data Mechanism: It's vital to understand why data is missing (Missing Completely at Random - MCAR, Missing at Random - MAR, or Missing Not at Random - MNAR). This informs the choice of imputation technique.
  • Data Characteristics: The type of data (numerical, categorical), data distribution, and relationships between variables influence the suitability of different methods.

Post-conditions:

  • Reduced Bias: Effective imputation minimizes bias in subsequent analyses that could arise from simply ignoring missing data.
  • Improved Statistical Power: By increasing the amount of usable data, imputation can enhance the statistical power of analyses.
  • Preserved Data Structure: A good imputation method maintains the underlying relationships and patterns within the data.


Single Imputation Methods:

These methods fill each missing value with a single estimate (a minimal pandas sketch follows the list below).

  • Mean/Median/Mode Imputation: Replacing missing values with the mean (for numerical data), median (for numerical data with outliers), or mode (for categorical data).
  • Last Observation Carried Forward (LOCF) / Next Observation Carried Backward (NOCB): Using the last or next available value to fill the gap (commonly used in time-series data).
  • Hot-deck Imputation: Replacing missing values with values from similar records (donors) in the dataset.
  • Cold-Deck Imputation: Similar to hot-deck, but the "donors" come from an external source or a previous dataset.
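For illustration, a minimal pandas sketch of the mean/median/mode and LOCF/NOCB methods listed above; the toy series are hypothetical:

```python
import pandas as pd

s = pd.Series([21.0, None, 23.5, None, 25.0])       # numerical with gaps
cat = pd.Series(["red", None, "blue", "red"])        # categorical with a gap

mean_imputed = s.fillna(s.mean())         # mean imputation (numerical)
median_imputed = s.fillna(s.median())     # median imputation (robust to outliers)
mode_imputed = cat.fillna(cat.mode()[0])  # mode imputation (categorical)

# LOCF / NOCB for ordered (e.g., time-series) data.
locf = s.ffill()  # last observation carried forward
nocb = s.bfill()  # next observation carried backward
```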

Challenges and Limitations:

  • Underestimation of Variance: These methods tend to underestimate the variance of the data, leading to overly precise but potentially inaccurate results.
  • Distorted Relationships: Simple methods such as mean imputation can distort relationships between variables.
  • Sensitivity to Outliers: Mean imputation is particularly sensitive to outliers.


Multiple Imputation Methods:

These methods generate multiple plausible values for each missing entry, creating multiple complete datasets. Analysis is performed on each dataset, and the results are combined to account for the uncertainty due to imputation. (A scikit-learn sketch follows the list below.)

  • Multiple Imputation by Chained Equations (MICE): Imputes each missing variable using a series of regression models, conditional on other variables in the dataset.
  • Stochastic Regression Imputation: Similar to regression imputation, but adds random noise to the predicted values to account for the uncertainty in the prediction. This makes it a bridge to multiple imputation as it introduces variability.
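A sketch of MICE-style imputation using scikit-learn's IterativeImputer, which chains per-feature regression models. Note that it is inspired by MICE rather than a full multiple-imputation workflow unless it is run several times with different random seeds; the toy matrix is illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [np.nan, 5.0, 9.0],
    [7.0, 8.0, 12.0],
])

# Each feature with missing values is modeled as a function of the others,
# iterating until the imputations stabilize (MICE-style chained equations).
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```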

Challenges and Limitations:

  • Computational Intensity: Multiple imputation is more computationally intensive than single imputation.
  • Model Specification: The performance of MICE depends on the correct specification of the imputation models.
  • Convergence Issues: In some cases, the iterative process in MICE may not converge.


Model-Based Imputation Methods:

These methods use statistical models to predict missing values based on the relationships between variables. (A k-NN imputation sketch follows the list below.)

  • Regression Imputation: Using regression models to predict missing values based on other variables.
  • Expectation-Maximization (EM) Algorithm: An iterative algorithm that estimates missing values by maximizing the likelihood of the observed data.
  • k-Nearest Neighbors (k-NN) Imputation: Imputing missing values based on the values of the k-nearest neighbors in the data.
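A minimal sketch of k-NN imputation with scikit-learn's KNNImputer; the toy matrix and the choice of k are illustrative:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature
# across the k nearest rows (NaN-aware Euclidean distance by default).
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed)
```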

Challenges and Limitations:

  • Model Assumptions: These methods rely on assumptions about the underlying data distribution and relationships between variables.
  • Computational Cost: Some model-based methods, such as EM, can be computationally expensive.
  • Overfitting: Complex models can lead to overfitting, where the imputed values are too closely tied to the observed data.


Time-Series Imputation Methods:

  • Linear Interpolation: Filling missing values with values calculated along a straight line between two adjacent known points (see the pandas sketch after this list).
  • Spline Interpolation: Using piecewise polynomial functions to interpolate missing values, providing smoother curves than linear interpolation.
  • Seasonal Decomposition: Decomposing the time series into trend, seasonal, and residual components and imputing missing values based on these components.
  • Autoregressive (AR), Moving Average (MA), and ARIMA Models: Utilizing time series models to predict missing values based on past observations.
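A short pandas sketch of interpolation-based time-series imputation; the daily series is illustrative, and the spline option requires SciPy to be installed:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=8, freq="D")
s = pd.Series([10.0, np.nan, 12.0, 13.0, np.nan, np.nan, 16.0, 17.0], index=idx)

linear = s.interpolate(method="linear")           # straight line between known points
spline = s.interpolate(method="spline", order=2)  # smoother piecewise polynomial (needs SciPy)
time_aware = s.interpolate(method="time")         # accounts for irregular timestamps
```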

Challenges and Limitations:

  • These methods assume a certain degree of regularity and predictability in the time series.
  • They are not suitable for time series with abrupt changes or high volatility.
  • The choice of method depends on the characteristics of the time series data (e.g., seasonality, trend).

Table 1: Single and Bridging Imputation Methods. Table: Gemini
Table 2: Multiple, Model-Based, and Time-Series Imputation Methods. Table: Gemini

General Considerations:

  • Missing Data Mechanism: The choice of imputation technique should be guided by the missing data mechanism. Multiple imputation is generally preferred for MAR data, while more sophisticated methods may be needed for MNAR data.
  • Data Type and Distribution: The type of data (numerical, categorical) and its distribution influence the suitability of different methods.
  • Computational Resources: Multiple imputation and some model-based methods can be computationally intensive.
  • Evaluation: It's important to evaluate the performance of imputation methods using appropriate metrics and consider the impact on downstream analyses.

In conclusion, data imputation is a critical step in handling missing data. The choice of imputation technique depends on various factors, including the missing data mechanism, data characteristics, and computational resources. Understanding the challenges and limitations of each method is essential for making informed decisions and ensuring the validity of subsequent analyses.


Data Quality Tests

Essential Data Quality Tests. List: Tim Osborn


There are numerous data quality tests, such as:

1. NULL Values Test:

  • Description: Checks for missing data entries represented by NULL values in columns.
  • Why it matters: Excessive NULL values can hinder analysis and lead to inaccurate results.
  • Example: Identifying the percentage of NULL values in a "customer email" column.

2. Freshness Checks:

  • Description: Assesses how recently data has been updated.
  • Why it matters: Stale data can lead to poor decision-making.
  • Example: Verifying if yesterday's sales data has been loaded into the system today.

3. Freshness SLIs (Service Level Indicators):

  • Description: Defines acceptable timeframes for data updates.
  • Why it matters: SLIs provide a benchmark for data freshness expectations.
  • Example: Setting an SLI that sales data should be no more than 24 hours old.

4. Volume Tests:

  • Description: Analyzes the overall amount of data in a dataset or table.
  • Why it matters: Extremely high or low volumes can indicate data integration issues.

5. Missing Data:

  • Description: Broader category encompassing NULL values and any other types of missing entries.
  • Why it matters: Missing data can skew analysis and limit the usefulness of your data.
  • Example: Identifying rows missing customer addresses or product descriptions.

6. Too Much Data:

  • Description: Checks for situations where data volume exceeds expectations or storage capacity.
  • Why it matters: Excessive data can slow down processing and increase storage costs.

7. Volume SLIs:

  • Description: Similar to Freshness SLIs, but define acceptable data volume ranges.
  • Why it matters: Ensures data volume stays within manageable and expected levels.
  • Example: Setting an SLI that the customer table shouldn't contain more than 1 million records.

8. Numeric Distribution Tests:

  • Description: Analyzes the distribution of numerical values in a column (e.g., average, standard deviation).
  • Why it matters: Identifies outliers or unexpected patterns in numerical data.
  • Example: Checking for negative values in a "price" column, which could indicate errors.

9. Inaccurate Data:

  • Description: A broader category encompassing various tests to identify incorrect data entries.
  • Why it matters: Inaccurate data leads to misleading conclusions and poor decision-making.
  • Example: Verifying if a customer's email address follows a valid email format.

10. Data Variety:

  • Description: Assesses the range of unique values present in a categorical column.
  • Why it matters: Limited data variety can restrict the insights you can extract from the data.
  • Example: Ensuring a "customer state" column captures all possible state abbreviations.

11. Uniqueness Tests:

  • Description: Checks for duplicate rows or entries within a dataset.
  • Why it matters: Duplicates inflate data volume and can skew analysis results.
  • Example: Identifying duplicate customer records based on email address or phone number.

12. Referential Integrity Tests:

  • Description: Verifies that foreign key values in one table reference existing primary key values in another table (relational databases).
  • Why it matters: Ensures data consistency and prevents orphaned records.
  • Example: Checking if "order ID" in the order details table corresponds to a valid "order ID" in the main orders table.

13. String Patterns:

  • Description: Evaluates text data for specific patterns or formats (e.g., postal codes, phone numbers).
  • Why it matters: Ensures consistency and facilitates data analysis based on specific patterns.
  • Example: Validating if all phone numbers in a "customer phone" column follow a standard format (e.g., +1-###-###-####).

By implementing these data quality tests, you can improve your data accuracy, completeness, consistency, timeliness, uniqueness, and validity, leading to better decision-making and improved outcomes.
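For illustration, a few of these tests expressed as simple pandas checks; the orders/order_details tables and the email pattern are hypothetical:

```python
import pandas as pd

# Hypothetical orders and order_details tables (illustrative names and values).
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "email": ["a@x.com", None, "not-an-email"]})
order_details = pd.DataFrame({"order_id": [1, 2, 9], "qty": [2, 1, 5]})

# 1. NULL values test: share of missing emails.
null_rate = orders["email"].isna().mean()
print(f"NULL rate for email: {null_rate:.0%}")

# 11. Uniqueness test: no duplicate order IDs.
assert orders["order_id"].is_unique

# 12. Referential integrity test: every detail row references a known order.
orphans = ~order_details["order_id"].isin(orders["order_id"])
print(order_details[orphans])  # orphaned records

# 13. String pattern test: simple email format check.
valid_email = orders["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
print(orders[~valid_email])    # rows failing the pattern (including NULLs)
```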


Interactive Visualization Tools

Raw data for ML modeling can arrive in massive volumes that are difficult and slow to sort through, explore, and process. Identifying common data quality issues such as missing data, duplicated data, and inaccurate, ambiguous, or inconsistent data can help you find data anomalies and perform feature engineering. Interactive visualization lets you establish a visual baseline, explore vast amounts of data, examine data quality, and harmonize data.

TensorFlow Data Validation and Visualization Tools. Table: TensorFlow.org


TensorFlow Data Validation provides tools for visualizing the distribution of feature values. By examining these distributions, you can identify anomalies in your data's distribution, scale, or labels.
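A hedged sketch of this workflow with the tensorflow_data_validation package, assuming it is installed and run in a notebook; the DataFrame and its values are illustrative:

```python
import pandas as pd
import tensorflow_data_validation as tfdv

# Illustrative training data with a suspicious age and a missing country.
df = pd.DataFrame({"age": [25, 31, 47, 130], "country": ["US", "US", "DE", None]})

stats = tfdv.generate_statistics_from_dataframe(df)  # compute feature statistics
tfdv.visualize_statistics(stats)                     # interactive distribution view (notebook)

schema = tfdv.infer_schema(stats)                    # infer expected types/domains
anomalies = tfdv.validate_statistics(stats, schema)  # flag values outside the schema
tfdv.display_anomalies(anomalies)
```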

OpenRefine (previously Google Refine). Table: OpenRefine.org


OpenRefine data tool capabilities include data exploration, cleaning and transforming, and reconciling and matching data.

WinPure Clean & Match. Table: WinPure.com


WinPure data tool capabilities include statistical data profiling, data quality issues discovery, cleaning processes (clean, complete, correct, standardize, and transform), and data matching reports and visualizations.

Data Preparation and Cleaning in Tableau. Animation: Tableau


This animated process illustrates data-cleaning steps:

Step 1: Remove duplicate or irrelevant observations

Step 2: Fix structural errors

Step 3: Filter unwanted outliers

Step 4: Handle missing data

Step 5: Validate and QA

Data cleaning tools and software for efficiency. Animation: Tableau


Data preparation with data wrangling tool. Table: Trifacta

Data preparation with data wrangling tool. Table: Trifacta

Data Wrangler is a handy tool for data scientists and analysts who use Visual Studio (VS) Code and VS Code Jupyter Notebooks to work with tabular data. It streamlines the data cleaning and exploration process, allowing you to focus on the insights your data holds.

Data Wrangler Example. Animation: code.visualstudio.com

Example of opening Data Wrangler from the notebook to analyze and clean the data with the built-in operations. Then the automatically generated code is exported back into the notebook. Animation: code.visualstudio.com

Here is a breakdown of what Data Wrangler offers:

Key Features

  • Visual Data Exploration: A user-friendly interface lets you view and analyze your data.
  • Data Insights: Get insightful statistics and visualizations for your data columns.
  • Automatic Pandas Code Generation: As you clean and transform your data, Data Wrangler automatically generates the corresponding Pandas code.
  • Easy Code Export: Export the generated Pandas code back to your Jupyter Notebook for reusability.

Working with Data Wrangler

  • Operation Panel: This panel provides a list of built-in operations you can perform on your data.
  • Search Function: Quickly find specific columns in your dataset using the search bar.

Viewing and Editing Modes

  • Viewing Mode: Optimized for initial exploration, allowing you to sort, filter, and get a quick overview of your data.
  • Editing Mode: Designed for data manipulation. As you apply transformations and cleaning steps, Data Wrangler generates Pandas code in the background.

Overall, Data Wrangler simplifies data preparation in VS Code by providing a visual interface and automating code generation. This saves you time and effort, letting you focus on analyzing your data and extracting valuable insights.


Conceptual Framework for Health Data Harmonization

Data harmonization is “the process of comparing two or more data component definitions and identifying commonalities among them that warrant they are being combined, or harmonized, into a single component.” Harmonization is often a complex and tedious operation, but it is an important antecedent to data analysis, as it increases the sample size and analytic utility of the data. However, typical harmonization efforts are ad hoc, which can lead to poor data quality or delays in the data release. To date, we are not aware of any efforts to formalize data harmonization using a pipeline process and techniques to easily visualize and assess the data quality prior to and after harmonization. [6] [Operations: MLOps, Continuous ML, & AutoML]

Conceptual Framework for Health Data Harmonization. Diagram: Lewis E. Berman, ICF International, & Yair G. Rajwan, Visual Science Informatics, LLC


Next, read the "Architectural Blueprints—The “4+1” View Model of Machine Learning" article at https://www.dhirubhai.net/pulse/architectural-blueprintsthe-41-view-model-machine-rajwan-ms-dsc.

---------------------------------------------------------

[1] Protocol for a systematic review and qualitative synthesis of information quality frameworks in eHealth

[2] Information Quality Frameworks for Digital Health Technologies: Systematic Review

[3] Data Quality

[4] What Is Goodness-of-Fit?

[5] A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge Discovery From Building Operational Data

[6] A Conceptual Framework for Health Data Harmonization
