
Learning Analytics Series: Glossary of Terms Beginning with "Data _____"

Introduction

This article combines terms from all four articles over the course of this series, plus one walk-on (Data Type) from a reader, for a total of 41 terms (as shown in the table below).

Alphabetical List of All Terms

This conclusion serves as a single glossary for all those terms and provides a handy reference. These are my definitions as a hands-on practitioner, combining academic and professional real-world experience to help others better understand the data and analytics space.

Terms Beginning with "Data _____"

Data Aggregation

Data aggregation is generally done to provide data in summary form to make it easier to understand, query, and use. Aggregating data beforehand makes it easier to work with because it eliminates the guesswork from joining tables properly. This method is particularly helpful because different users will get the same trusted answer from an aggregated view of pre-joined data as opposed to users querying (and joining) multiple tables on their own. Oftentimes, data aggregation is placed in a database view that retrieves data from multiple objects (tables or views).
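
As a rough illustration, the hypothetical view below pre-joins a students table with an enrollments table and rolls the detail up to one summary row per student, so every user queries the same trusted answer (the table and column names are assumptions for this sketch, not from a real system).

-- Hypothetical aggregated view: pre-joins two tables and summarizes per student
-- (table and column names are illustrative only)
CREATE VIEW student_course_summary AS
SELECT s.student_id
      ,s.last_name
      ,s.first_name
      ,COUNT(e.course_id) AS courses_enrolled
  FROM students s
  LEFT JOIN enrollments e
    ON e.student_id = s.student_id
 GROUP BY s.student_id, s.last_name, s.first_name;

-- Users query the aggregated view instead of joining the tables themselves
SELECT * FROM student_course_summary WHERE courses_enrolled >= 3;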

Data Architecture

Data architecture is the blueprint of the data landscape, showing the inputs (sources), processes (integration), outputs (targets), and hardware/software that comprise the solution. The architecture should cover everything from naming the data sources to how the data is collected and consumed, including file formats, data integration methods, processing techniques (e.g., batch vs. real-time, flush & fill vs. change data capture), and security protocols. Similar to data models, architectures may also come in three (3) forms: Conceptual, Logical, and Physical.

Data Blending

Data blending is the process of combining data from different sources to create a dataset usable for a specific purpose. For example, a data analyst may be searching for answers to very specific questions that can only be answered with a dataset constructed for those particular questions. Unlike enterprise data integration and data warehouse efforts, which strive to produce a 'single source of truth' for a multitude of possible questions, data blending produces a custom, often temporary, dataset for those particular questions.

Data Cleansing

Data cleansing is the process of improving data by fixing errors, removing duplicates (de-duplication), correcting typos/misspellings, normalizing values, standardizing formats, and handling missing values (data imputation). These techniques, and more, increase data quality to improve accuracy and confidence in the data when creating reports, dashboards, or conducting data analysis.
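
A minimal sketch of a few of these steps in SQL, assuming a hypothetical raw contacts table (the table, columns, and CREATE TABLE ... AS syntax are illustrative; SQL Server, for example, uses SELECT ... INTO instead):

-- Minimal cleansing sketch on a hypothetical staging table
-- Trim stray whitespace, standardize formats, drop rows missing a required value,
-- and remove exact duplicates in one pass
CREATE TABLE contacts_clean AS
SELECT DISTINCT                         -- de-duplication of identical rows
       TRIM(first_name)     AS first_name
      ,TRIM(last_name)      AS last_name
      ,LOWER(TRIM(email))   AS email    -- normalize casing
      ,UPPER(TRIM(state))   AS state    -- standardize format
  FROM contacts_raw
 WHERE email IS NOT NULL;               -- handle missing values (here, by excluding them)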

Data Collection

Retrieving data from various sources, which likely includes databases, to gather all the information needed to perform some level of analysis, reporting, or processing. As the name implies, you are gathering (collecting) data from wherever you find data sources of potential value. For Data Analysts, data collection generally means querying databases and running reports to gather data for analysis. For Data Engineers, data collection usually corresponds with the "E" in ETL (Extraction, Transformation, Load) where extracting (collecting) data is generally automated. For Data Scientists, data collection is typically the data acquisition (collection) process at the beginning of a data pipeline.

Data Definition Language

Data Definition Language (DDL) represents the commands used to create or modify objects in a database. Examples include CREATE (to construct new objects), ALTER (to modify existing objects), and DROP (to delete existing objects). Each of these commands, when successfully executed, physically creates or modifies objects, thereby defining the structure of the database. The DDL statements below illustrate each example with syntactically correct commands.

-- Create new table
CREATE TABLE students (student_id    INT
                      ,last_name     VARCHAR(50)
                      ,first_name    VARCHAR(50));

-- Alter table by adding new column for email address
ALTER TABLE students ADD email_address VARCHAR(255);

-- Delete table
DROP TABLE students;        

You can play with these DDL statements, and much more, using a free online virtual database instance from W3Schools.

Data Discovery

Data discovery involves different techniques and tools that may include spreadsheets (e.g., Microsoft Excel, Google Sheets), dashboards (e.g., Tableau, Power BI, Qlik), enterprise business intelligence tools (e.g., SAP Business Objects, MicroStrategy, IBM Cognos, Oracle OBIEE, Microsoft SSRS), and SQL. The goal of data discovery is to identify patterns, anomalies, or trends to help determine where true value exists in the data. The result of data discovery sets the stage for subsequent work such as data engineering, data aggregation, and data science.

Data Discretization

Data discretization is a useful technique for converting continuous values into discrete values that can be more easily used in models and analysis. Sometimes we are less concerned about the exact values than the category of values. In these cases, it's helpful to place ranges of continuous values into different groups (buckets).

For example, bodyweight can be any positive number, and we may only be concerned with specific groupings of that bodyweight data such as Low (<100 lbs.), Medium (100-200 lbs.), and High (>200 lbs.). Placing continuous values into finite groupings in support of analysis/models is known as discretizing the data, and it is commonly used in data science work.

NOTE: A random name generator was used to create synthetic data for the example below.

Data Discretization
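
If you want to reproduce that bucketing in SQL, a CASE expression is one common way to discretize the continuous bodyweight values into the Low/Medium/High groups described above (the people table and column names below are assumptions for this sketch):

-- Discretize continuous bodyweight values into Low / Medium / High buckets
-- (illustrative table and column names)
SELECT person_name
      ,bodyweight_lbs
      ,CASE
         WHEN bodyweight_lbs < 100  THEN 'Low'
         WHEN bodyweight_lbs <= 200 THEN 'Medium'
         ELSE 'High'
       END AS bodyweight_group
  FROM people;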

Data Engineering

Data engineering is a broad term that generally covers the work required to extract data from sources, prepare data for users, and everything in between. This work includes data integration, data cleansing, writing DDL/DML statements, creating data pipelines, and more. Data engineering is responsible for the plumbing of the data architecture to ensure data is flowing smoothly and ready for usage.

Data Exploration

Data exploration involves surveying the data landscape to identify potential data sources that may yield results. Unlike data discovery where data sources are analyzed to discover patterns/anomalies/trends, exploration focuses more on identifying data sources that show promise. Once identified, then the analyst will transition from exploration to discovery mode. (Exploratory Data Analysis (EDA) is something else that is covered as part of the data mining term.)

Data Fabric

Data fabric is essentially a virtual unified data repository with multiple integrated data sources. Data integration in a data fabric is achieved by creating a network of connected data sources as opposed to creating a single physical centralized repository of data (like an operational data store or enterprise data warehouse). Data fabrics focus on unifying data through federated access layers (regardless of location and format) rather than movement of data (ETL) from different sources/formats to a single target/format. (And since we like using metaphors, data stitching is used to weave data together in the data fabric. There's a quick bonus "data _____" term for you.)

Data Format

In the general sense, data is available in a variety of formats that may be structured, unstructured, or semi-structured. More specifically, formats in digital form may include relational databases (e.g., Oracle, SQL Server), non-relational (NoSQL) databases (e.g., MongoDB, Cassandra), documents (e.g., Microsoft Word, Google Docs, PDF), flat files (e.g., TXT, CSV, JSON, XML), and other miscellaneous formats (e.g., e-mails, maps, photos, videos, audio recordings).

Data Governance

Data governance provides the policies and procedures for properly handling data, including the strategies and controls used by people throughout the organization. These policies and procedures include references to federal compliance such as laws, executive orders, and memorandums. Formal governance increases an organization's security posture by reducing the risk of data breaches and unauthorized access to data. Data governance defines the management of data throughout its lifecycle, ensuring the availability of high-quality information. (We'll touch upon data governance a bit more when data management is defined.)

Data Granularity

Data granularity is the level of detail represented in a dataset, object (e.g. table), or star schema in the case of dimensional models for data warehouses. Knowing the granularity is very helpful because it sets the context for the data being processed or analyzed. For example, the granularity for an employee table is a single person uniquely identified by an employee number. Granularity can be low-level detail or high-level summary data. In good database designs (e.g., 3NF), granularity generally maps to a single record; however, granularity may vary within a single record in denormalized designs (e.g., aggregated views).

Data Imputation

Data imputation is a technique used to populate missing data with the best possible information. Imputation is done to improve data because having more values is better than not enough and, in some cases, essential.

Some techniques for numeric fields include using the mean/median value, most frequent value, zero, some constant value, or random selection from known values. Some techniques for categorical (string) fields include using the most common value (mode), random selection from known values of similar records, or a string literal like "Unknown" or "Missing." And doing nothing is always an option in both cases, leaving the value empty (NULL). All of these techniques have pros and cons, and each one introduces some bias into your dataset, so select the method carefully for each use case and impute with caution.

The data imputation example below uses the average bodyweight to populate missing values (for Parker McKinley and Whitney Thorburn).

Data Imputation Using Average Bodyweight
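
A rough SQL sketch of the same idea, using the illustrative people table from the discretization example: fill the missing bodyweights with the average of the known values (the exact rules for referencing the target table in a subquery vary by database, so treat this as a sketch rather than portable code):

-- Impute missing bodyweights with the average of the known values
-- (illustrative names; some databases restrict subqueries on the updated table)
UPDATE people
   SET bodyweight_lbs = (SELECT AVG(bodyweight_lbs)
                           FROM people
                          WHERE bodyweight_lbs IS NOT NULL)
 WHERE bodyweight_lbs IS NULL;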

Data Integration

Data integration is the process of combining data from disparate data sources to create unified datasets that support analysis and reporting. The primary objective of data integration is to create a consolidated view of data that's reliable and accurate, which includes processing different data formats (e.g., flat files, databases) and normalizing data across domains (e.g., unifying personnel data from payroll, human resources, and training systems).

Data Integrity

Data integrity is often conflated with data quality and although they are closely related, they are quite different. Data integrity helps to enforce data quality by ensuring the data makes sense. Some techniques include the implementation of primary keys (to ensure uniqueness), foreign keys (to establish relationships that avoid creating orphan records), and constraints (to enforce validation rules). These data integrity techniques improve data quality by preventing errors from being introduced at the point of entry.
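
To make those three techniques concrete, here is a small hypothetical example showing a primary key, a foreign key, and a check constraint in one place (the courses/enrollments tables and the grade rule are assumptions for illustration):

-- Primary keys enforce uniqueness, the foreign key prevents orphan enrollments,
-- and the check constraint enforces a validation rule (illustrative tables/columns)
CREATE TABLE courses (course_id   INT PRIMARY KEY
                     ,course_name VARCHAR(100) NOT NULL);

CREATE TABLE enrollments (enrollment_id INT PRIMARY KEY
                         ,student_id    INT NOT NULL
                         ,course_id     INT NOT NULL REFERENCES courses (course_id)
                         ,grade         DECIMAL(5,2)
                         ,CONSTRAINT chk_grade CHECK (grade BETWEEN 0 AND 100));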

Data Journalism

Data journalism is the art of telling stories based on data, also known as data-driven storytelling. The primary goal of data journalism is to report fact-based information in the form of stories for public interest, with the facts substantiated by data. And the more data used to construct the story, the better. The resulting information can be presented using a combination of forms, combining narrative text with graphics and charts.

One of the oldest and most highly regarded examples of data journalism is Charles Joseph Minard's graphical depiction of Napoleon's losses in the Russian campaign of 1812 (shown below). This single infographic quickly communicates so much valuable information, visually and textually. More information is available on Edward Tufte's website.

Charles Joseph Minard's Graphic Portrayal of Napoleon's Russian Campaign of 1812

Data Lake

A data lake is a centralized repository of structured, unstructured, and semi-structured data collected from many different sources. The idea is to provide access to data for quick analysis to help determine where value may exist and serve as a unified data source for other downstream systems. For example, data scientists may access 'dirty data' to perform some quick statistical analysis to discover potentially valuable information without waiting for data to be formally built in a system (like an enterprise data warehouse). In some cases, accessing data in its rawest form is preferred to avoid using mutated data.

Data Leakage

Data leakage refers to the unexpected, or unauthorized, release of data to an outside entity or location, which can be problematic because sensitive data may be available to users without permission. Data leakage can occur as a result of data breaches, user error/oversight, relaxed data security rules, poor data integration protocols, or negligence. Formal data governance policies and robust data management practices mitigate data leakage.

Data Lineage

Data lineage is the documentation trail of a data element's journey, from source to target. As data travels from its source (origin), it may undergo some transformations before landing in its target (destination). Documenting these travels and any changes along the way is known as data lineage. The best form of data lineage is bi-directional, meaning that you can trace a data element from source to target and vice versa.

The image below shows a simplified view of data extracted from a Source Database that may undergo some form of Processing and is placed into a Target Database ready for use.

Simplified View of Data Lineage

The idea is to be able to find the source data element(s) from the target, the target data element(s) from the source, and understand any changes that occur between those endpoints. This bi-directional traceability provides transparency and instills confidence in the data because users understand what came from where and how it landed at its destination.

Data Literacy

Data literacy involves reading, understanding, and communicating data in a consistent manner that develops a competent workforce and facilitates effective collaboration. Oftentimes, people interpret (or use) the same data in different ways, which leads to confusion and possibly incorrect results. Improving data literacy mitigates confusion and errors by educating and informing users of the data available, its meaning, and intended purpose.

Data Management

Data management enacts the policies and procedures from the data governance program to ensure the implementation matches the plan. Aligning the implementation of data management to data governance protects the organization's data with secure and reliable solutions. Data management and data governance are often confused with each other. The easiest way to remember the difference is that data governance is the functional framework whereas data management is the technical implementation supporting that functional framework.

The Venn diagram below shows some of the main components supporting Data Governance vs. Data Management. A Venn diagram is used to describe these components because they are so inter-related, and occasionally overlap.

Data Governance vs. Data Management

Data Manipulation Language

Data Manipulation Language (DML) represents the commands used to control the contents within the objects of a database (e.g., tables). Examples include INSERT (to create records), UPDATE (to modify records), or DELETE (to remove records). Each of these commands, when successfully executed, physically changes the contents of objects, thereby manipulating the contents of the database. The DML statements below illustrate each example with syntactically correct commands.

-- Insert record for new student named John Smith
INSERT INTO students (student_id, last_name, first_name, email_address)
              VALUES (341, 'Smith', 'John', '[email protected]');

-- Update John Smith's email address
UPDATE students
   SET email_address = '[email protected]'
 WHERE student_id = 341;

-- Delete John Smith's record
DELETE FROM students WHERE student_id = 341;        

Feel free to play with these DML statements using the W3Schools link shared above (see 'Data Definition Language'). Be sure to create and alter the table again since it was dropped at the end of the sample DDL script above.

Data Mesh

Data mesh is a decentralized and distributed architecture that treats data as a product, allowing domain-oriented teams to build their own analytical solutions. This approach doesn't suggest the organization use rogue analytics; it means the domain-oriented teams would leverage a self-service enterprise data platform to take ownership of their own data, and use the interoperability available within the architecture to integrate data across domains. Using a micro-services approach is highly recommended when building a data mesh architecture.

Data Mining

Data mining is the process of exploring and analyzing datasets to find potentially useful information, identifying relationships and patterns in the data. Data mining is particularly useful for identifying areas of interest for further analysis, separating signal from noise. Exploratory Data Analysis (EDA) is the process typically used to conduct initial investigations of data, especially with new datasets (or sources).

Data Model

Data models are used to represent data as entities (tables) and attributes (columns) as well as the relationships between those entities. These models facilitate the collection, storage, integration, analysis, and reporting of data as information. They allow data to be represented in a manner that's understandable and usable. Data models generally come in three (3) forms: Conceptual, Logical, and Physical.

Data Pipeline

A data pipeline is the process of moving data from one place (source) to another place (target). Data pipelines represent a series of automated steps that accomplish this movement and transformation of data from source to target and may include batch processing, change data capture, or streaming datasets. Common examples of a data pipeline are ETL (or ELT) routines that Extract (data from sources), Transform (extracted data), and Load (transformed data into targets).
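
As a minimal sketch, the load step of a simple ELT-style pipeline might be a single SQL statement that an orchestration or scheduling tool runs on a timer (the source_orders and dw_orders tables, and the date arithmetic, are assumptions that vary by database):

-- One automated pipeline step: move newly arrived source rows into a target table
-- (illustrative names; scheduling/orchestration happens outside this statement)
INSERT INTO dw_orders (order_id, order_date, customer_id, order_total)
SELECT o.order_id
      ,o.order_date
      ,o.customer_id
      ,o.quantity * o.unit_price AS order_total   -- light transformation in flight
  FROM source_orders o
 WHERE o.order_date >= CURRENT_DATE - 1;          -- simple incremental load window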

Data Processing (Pre & Post)

Data processing includes all the tasks and procedures that handle data within a system or for other systems. Data processing generally refers to the automation used to manage that data, from start to finish. As a matter of fact, years (decades) ago, "IT" was referred to as "Automated Data Processing," which is also the name of a company (ADP) that's been around since the late 1940s.

Now that the basics are out of the way, let's get to the "Pre-" and "Post-" portions of this definition. As the prefixes imply, pre-processing deals with the actions?before?using the data. Some examples of pre-processing include data cleansing and transformation. Post-processing deals with the actions?after?using the data. Some examples of post-processing include adding derived data and saving, copying, or archiving data to another location. Pre- and post-processing play an important role in data pipelines, especially ones that feed AI/ML models.

Data Profiling

Data profiling is the systematic examination of data to gather information about its size, data types, relationships, and summary statistics (e.g., min, max, avg, length, NULLs, unique values). Data profiling is a useful first step upon receiving a new dataset to quickly understand the contents and where potentially valuable information may exist. This step is critically important to understanding the data, especially when no other useful information is available such as data models or data dictionaries.
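
A quick profile can often be gathered with plain SQL; the sketch below pulls row counts, NULL counts, unique values, and summary statistics for a single column of the illustrative people table used earlier:

-- Quick profile of one column (illustrative table and column names)
SELECT COUNT(*)                          AS row_count
      ,COUNT(bodyweight_lbs)             AS non_null_values
      ,COUNT(*) - COUNT(bodyweight_lbs)  AS null_values
      ,COUNT(DISTINCT bodyweight_lbs)    AS unique_values
      ,MIN(bodyweight_lbs)               AS min_value
      ,MAX(bodyweight_lbs)               AS max_value
      ,AVG(bodyweight_lbs)               AS avg_value
  FROM people;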

PRO TIP: Make sure your project has current data models and data dictionaries that are treated as living and breathing artifacts. No database structure changes should be implemented until they are modeled and defined.

Data Quality

Data quality is a measure of how good or bad the data is, and it can be measured in many different ways. Naturally, data quality is extremely important for producing results that allow stakeholders to make well-informed decisions with confidence. We'll spend a little more time/space on this particular term since data quality is so important and part of every data discussion (and solution).

There are multiple dimensions of data quality but some of the more common dimensions are as follows:

  • Completeness - All necessary data is available to perform a thorough and complete analysis. Completeness also means having more data than is minimally required. The more complete the data, the more comprehensive the analysis.
  • Conformity - Data conforms to an established set of business rules, format, type, and range, making the data valid. This dimension may also be referred to as "validity."
  • Accuracy - Data is factual, current, and reliable as a trusted source. Data can be valid but inaccurate. For example, the wrong number in a numeric field may be valid (conforms to the standard) but inaccurate. Accuracy means having the correct data in a field.
  • Consistency - The same data is represented identically across data sources. For example, employee IDs are the same across different systems for each person. Otherwise, transformations must be used to link data across systems.
  • Timeliness - Data with minimal latency provides the best opportunity for making confident decisions. Stale data can have adverse effects on outcomes if decisions are made with old information.
  • Integrity - Data relationships are supported through systems and changes, which usually involves keys (primary and foreign keys). Changes to a key must retain relationships with all linked records to avoid creating orphans.

Dimensions of Data Quality

Data Sampling

Data sampling is a statistical technique used to select a subset of data from a much larger dataset that can be used to perform analysis. Some datasets are too large for quick experimentation or testing, so a sampling of data is selected. The idea is to work with a much smaller, but accurate, representation of the entire dataset. There are multiple methods that can be used such as simple random sampling, stratified sampling, cluster sampling, and systematic sampling.
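
Sampling syntax differs by database, but a simple random sample can be pulled directly in SQL; the sketch below grabs roughly 1,000 random rows from an illustrative events table using PostgreSQL/SQLite-style syntax (other databases offer SAMPLE, TABLESAMPLE, or ORDER BY NEWID()):

-- Simple random sample of ~1,000 rows (PostgreSQL/SQLite-style syntax)
SELECT *
  FROM events
 ORDER BY RANDOM()
 LIMIT 1000;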

Data Sanitization

Data sanitization is the process of protecting data that may include destruction, removal, encryption, masking, substitution, shuffling, or scrambling. These techniques aim to mitigate data theft or, in most cases, prepare data for use in non-Production environments. More often than not, we cannot use live Production data (as-is) in lower environments (like Development or Testing), so that data must be sanitized before demoting. We need the data volume and integrity from Production, so sanitization allows us to safely use Production-quality data in non-Production environments.

Masking Phone Numbers and Bank Accounts
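
A hedged sketch of the masking shown above: keep only the last four digits of a phone number or account number before the data is demoted to a lower environment (the customers_dev table is an assumption, and string functions such as CONCAT and RIGHT vary by database):

-- Mask all but the last 4 characters before using the data in non-Production
-- (illustrative names; string function support varies by database)
UPDATE customers_dev
   SET phone_number   = CONCAT('XXX-XXX-', RIGHT(phone_number, 4))
      ,account_number = CONCAT('XXXXXXXX', RIGHT(account_number, 4));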

Data Science

Data science combines mathematics/statistics, programming, and storytelling into a multi-disciplinary approach that exploits the untapped potential of data. This approach includes understanding how to process large volumes of data (big data), analyzing data with sophisticated techniques (e.g., predictive analytics, AI/ML), and explaining technical results to a non-technical audience. Data science helps to identify trends, anomalies, and correlations in data that may escape traditional, and less sophisticated, data analysis techniques.

Data Source

A physical or virtual location that contains data of interest. Data sources may include databases, systems, files (electronic or paper), conversations, etc. Anything that is spoken or written and captured in digital form is a potential data source that may be used to create a dataset for analysis and reporting, or even serve as input to other systems.

Data Type

Data type can be defined in different ways, but the most common definition is the type of data that may be stored in a column or variable. Data types generally fall into general categories such as String, Number, Date, and Boolean. Each of those general categories has more specific data types that may vary from one language/database to another. Some examples include VARCHAR or CHAR for Strings; Integers, Floats, or Decimals for Numbers; Dates, Times, or DateTimes for Dates; etc.

Data types enforce integrity by only allowing data in a column or variable that matches that type. For example, a string cannot be saved into a numeric data type; however, a number may be stored in a string. In the latter case, that number will be converted to a string when stored (known as implicit type conversion or type casting). Declaring data types is a useful first line of defense to assist with data integrity, helping with data quality.
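
Using the students table from the DDL example earlier, the sketch below shows both directions; exact behavior (error vs. silent conversion) depends on the database and its strictness settings, so treat this as illustrative:

-- A number placed in a string column is implicitly converted to a string
INSERT INTO students (student_id, last_name, first_name) VALUES (500, 'Doe', 12345);

-- A non-numeric string cannot be stored in the numeric student_id column
-- (most databases reject this statement)
INSERT INTO students (student_id, last_name, first_name) VALUES ('abc', 'Doe', 'Jane');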

Credit: Thank you Anup Kale for asking about data types.

Data Visualization

Data visualization is useful for communicating data as information so users can make sense of the data and take appropriate action. The most effective forms of visualization are interactive (e.g., dashboards), allowing users to filter, drill-down, and export data. (Yes, I said export data from a data viz … sorry, but it's true.) Other forms of visualization include scorecards and infographics where information is displayed in a manner that's easily understood.

Sample Dashboard with a Network Graph

Data Warehouse

A data warehouse is a central repository of cleansed data specifically designed and built for business intelligence (BI) reporting and analytics. A data warehouse generally integrates data from multiple data sources, potentially with different formats, using Extraction-Transformation-Load (ETL) routines. The ETL output produces data that is ready for reports, dashboards, and analysis. The most common data modeling technique for a data warehouse is the star schema, containing a fact table surrounded by dimension tables (as shown below).

Data Warehouse Star Schema
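
A bare-bones star schema expressed as DDL might look like the sketch below, with one fact table keyed to two dimension tables (all table and column names here are illustrative, not the schema pictured above):

-- Minimal star schema sketch: one fact table surrounded by two dimensions
-- (illustrative names only)
CREATE TABLE dim_date    (date_key    INT PRIMARY KEY
                         ,calendar_dt DATE NOT NULL);

CREATE TABLE dim_student (student_key INT PRIMARY KEY
                         ,last_name   VARCHAR(50)
                         ,first_name  VARCHAR(50));

CREATE TABLE fact_enrollment (date_key      INT NOT NULL REFERENCES dim_date (date_key)
                             ,student_key   INT NOT NULL REFERENCES dim_student (student_key)
                             ,credits_taken INT
                             ,tuition_paid  DECIMAL(10,2));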

Data Wrangling

Data wrangling (also known as data munging) is the process of converting raw data into more usable formats, which usually includes data collection, data cleansing, transformations, and data integration. Data wrangling is essentially the leg work involved in data blending to provide one or more usable datasets for other purposes. Given how difficult working with data can be, "wrangling" was a term selected to describe that difficulty.

Database

Digital repository of data that is generally organized as tables with rows and columns. Other objects can exist in databases such as views, procedures, functions, and triggers to help manage data. Structured Query Language (SQL) is used to retrieve and manipulate data in the database. A simple SQL SELECT statement is shown below that retrieves all employee numbers, names, and salaries from a database table named employees (with fictitious data) and returns a result set in ascending order by last name, then first name.

  SELECT employee_number, first_name, last_name, salary
    FROM employees
ORDER BY last_name, first_name;        
SQL SELECT Statement and Result Set

Dataset

Collection of data typically in the form of records (rows) and attributes (columns) that may be extracted from one or more sources. A dataset may also contain derived information to further enrich the data in support of further analysis and reporting. The idea is to create a complete set of data that can be used for a specific purpose whereby all the information needed is available and ready for use (with confidence).

Summary

I hope you enjoyed reading this series as much as I enjoyed writing it. More articles spanning a range of topics for data and analytics will follow. Stay tuned for more content that dives deeper into the wonderful world of data!
