Learning Analytics Series: Terms Beginning with "Data _____" (Part III)
Mark DeRosa
2025 FORUM IT100 Award Winner | Data Analytics Evangelist | Innovative Thought Leader | Master Problem Solver | Agile Expert
Introduction
Welcome to the third (Advanced) installment of this series, which covers another 10 data terms and brings the running total to 30. If you missed the first two articles, I recommend reading Part I (Novice) and Part II (Intermediate) first, since some of these terms build upon earlier definitions.
Third 10 Terms Beginning with Data _____ (in alphabetical order)
Term 21: Data Aggregation
Data aggregation presents data in summary form so that it is easier to understand, query, and use. Aggregating data ahead of time also removes the guesswork of joining tables correctly: different users get the same trusted answer from an aggregated view of pre-joined data, rather than each user querying (and joining) multiple tables on their own. Oftentimes, the aggregation is implemented as a database view that retrieves data from multiple objects (tables or views). A minimal sketch of the idea follows.
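As an illustration, here is a small Python (pandas) sketch that rolls pre-joined order detail up into a summary that every user can query the same way. The table and column names (region, product, amount) are made up for this example, not taken from any real system.

```python
import pandas as pd

# Hypothetical pre-joined order detail (one row per order line)
orders = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [100.0, 250.0, 75.0, 125.0, 300.0],
})

# Aggregate the detail into a trusted summary "view":
# one row per region/product with a line count and total sales
sales_summary = (
    orders.groupby(["region", "product"], as_index=False)
          .agg(order_lines=("amount", "size"),
               total_amount=("amount", "sum"))
)

print(sales_summary)
```

In a database, the same rollup would typically live in a view or materialized table so that everyone queries one consistent answer.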
Term 22: Data Governance
Data governance provides the policies and procedures for properly handling data, including the strategies and controls used by people throughout the organization. These policies and procedures reference applicable compliance requirements such as federal laws, executive orders, and memoranda. Formal governance strengthens an organization's security posture by reducing the risk of data breaches and unauthorized access to data. Data governance defines the management of data throughout its lifecycle, ensuring the availability of high-quality information. (We'll touch upon data governance a bit more below with Term 28 when data management is defined.)
Term 23: Data Granularity
Data granularity is the level of detail represented in a dataset, object (e.g., table), or star schema in the case of dimensional models for data warehouses. Knowing the granularity is very helpful because it sets the context for the data being processed or analyzed. For example, the granularity of an employee table is a single person uniquely identified by an employee number. Granularity can range from low-level detail to high-level summary data. In well-normalized database designs (e.g., 3NF), granularity generally maps to a single record; in denormalized designs (e.g., aggregated views), a single record may combine more than one level of detail.
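To make the idea of grain concrete, here is a small Python (pandas) sketch that checks the assumed grain of a detail table and then rolls it up to a coarser grain. The employee data and column names are hypothetical.

```python
import pandas as pd

# Detail grain: one row per employee, uniquely identified by employee_id
employees = pd.DataFrame({
    "employee_id": [101, 102, 103, 104],
    "department":  ["Finance", "Finance", "IT", "IT"],
    "salary":      [85000, 92000, 110000, 98000],
})

# Quick grain check: does the assumed key identify exactly one row?
assert employees["employee_id"].is_unique, "grain is not one row per employee"

# Rolling the data up changes the grain from one person to one department
dept_summary = employees.groupby("department", as_index=False)["salary"].mean()
print(dept_summary)
```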
Term 24: Data Journalism
Data journalism is the art of telling stories based on data, also known as data-driven storytelling. The primary goal of data journalism is to report fact-based information in the form of stories for the public interest, with the facts substantiated by data. Generally, the more supporting data behind the story, the stronger the case it makes. The resulting information can be presented in a combination of forms, pairing narrative text with graphics and charts.
One of the oldest and most highly regarded examples of data journalism is Charles Joseph Minard's graphical depiction of Napoleon's losses in the Russian campaign of 1812 (shown below). This single infographic quickly communicates so much valuable information, visually and textually. More information is available on Edward Tufte's website.
Term 25: Data Lake
A data lake is a centralized repository of structured, unstructured, and semi-structured data collected from many different sources. The idea is to provide access to data for quick analysis, help determine where value may exist, and serve as a unified data source for other downstream systems. For example, data scientists may access 'dirty data' to perform some quick statistical analysis and discover potentially valuable information without waiting for the data to be formally modeled and loaded into a system (like an enterprise data warehouse). In some cases, accessing data in its rawest form is preferred to avoid working with data that has already been transformed.
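As a rough sketch of that quick-analysis workflow in Python (pandas), the snippet below reads raw files of mixed formats from a hypothetical landing folder and runs a first-pass summary over them. The path, file layout, and column contents are assumptions for illustration only.

```python
import json
from pathlib import Path
import pandas as pd

# Hypothetical raw landing zone in a data lake (path is made up)
lake_root = Path("/data/lake/raw/sales")

frames = []
for path in lake_root.glob("*"):
    if path.suffix == ".csv":
        frames.append(pd.read_csv(path))                                 # structured
    elif path.suffix == ".json":
        frames.append(pd.json_normalize(json.loads(path.read_text())))   # semi-structured

# Quick, informal look at the 'dirty data' before any formal modeling
raw = pd.concat(frames, ignore_index=True)
print(raw.describe(include="all"))
```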
Term 26: Data Lineage
Data lineage is the documentation trail of a data element's journey, from source to target. As data travels from its source (origin), it may undergo some transformations before landing in its target (destination). Documenting these travels and any changes along the way is known as data lineage. The best form of data lineage is bi-directional, meaning that you can trace a data element from source to target and vice versa.
The image below shows a simplified view of data extracted from a Source Database that may undergo some form of Processing and is placed into a Target Database ready for use.
The idea is to be able to find the source data element(s) from the target, the target data element(s) from the source, and understand any changes that occur between those endpoints. This bi-directional traceability provides transparency and instills confidence in the data because users understand what came from where and how it landed at its destination.
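One lightweight way to picture bi-directional lineage is a set of records that map each source element to its target element along with the transformation applied, so you can trace in either direction. The element names and transformations below are hypothetical, and real lineage tooling usually captures far more metadata.

```python
# Minimal sketch of bi-directional lineage records (names are hypothetical)
lineage = [
    {"source": "crm.customer.birth_date",
     "transformation": "derive age in years as of the load date",
     "target": "dw.dim_customer.age"},
    {"source": "crm.customer.country_code",
     "transformation": "map ISO code to country name",
     "target": "dw.dim_customer.country_name"},
]

def trace_forward(source_element):
    """Source -> target: where does this element end up, and how?"""
    return [r for r in lineage if r["source"] == source_element]

def trace_backward(target_element):
    """Target -> source: where did this element come from, and how?"""
    return [r for r in lineage if r["target"] == target_element]

print(trace_forward("crm.customer.birth_date"))
print(trace_backward("dw.dim_customer.age"))
```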
Term 27: Data Literacy
Data literacy is the ability to read, understand, and communicate data in a consistent manner, which develops a competent workforce and facilitates effective collaboration. Oftentimes, people interpret (or use) the same data in different ways, which leads to confusion and possibly incorrect results. Improving data literacy mitigates that confusion and those errors by educating users about the data that is available, what it means, and how it is intended to be used.
Term 28: Data Management
Data management enacts the policies and procedures from the data governance program to ensure the implementation matches the plan. Aligning the implementation of data management to data governance protects the organization's data with secure and reliable solutions. Data management and data governance are often confused with each other. The easiest way to remember the difference is that data governance is the functional framework whereas data management is the technical implementation supporting that functional framework.
The Venn diagram below shows some of the main components supporting Data Governance vs. Data Management. A Venn diagram is used because these components are so interrelated and occasionally overlap.
Term 29: Data Profiling
Data profiling is the systematic examination of data to gather information about its size, data types, relationships, and summary statistics (e.g., min, max, avg, length, NULLs, unique values). Data profiling is a useful first step upon receiving a new dataset to quickly understand the contents and where potentially valuable information may exist. This step is critically important to understanding the data, especially when no other useful information is available such as data models or data dictionaries.
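Here is a minimal first-pass profiling sketch in Python (pandas). The file name is made up for the example; the calls simply surface size, data types, missing values, unique counts, and summary statistics.

```python
import pandas as pd

# Hypothetical new dataset to profile (file name is made up)
df = pd.read_csv("new_dataset.csv")

print(df.shape)                    # size: rows x columns
print(df.dtypes)                   # data type of each column
print(df.isna().sum())             # NULL/missing values per column
print(df.nunique())                # unique values per column
print(df.describe(include="all"))  # min, max, mean, counts, etc.
```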
PRO TIP: Make sure your project has current data models and data dictionaries that are treated as living and breathing artifacts. No database structure changes should be implemented until they are modeled and defined.
Term 30: Data Sampling
Data sampling is a statistical technique used to select a subset of data from a much larger dataset for analysis. Some datasets are too large for quick experimentation or testing, so a sample of the data is selected instead. The idea is to work with a much smaller, but still accurate, representation of the entire dataset. Multiple methods can be used, such as simple random sampling, stratified sampling, cluster sampling, and systematic sampling.
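As a brief Python (pandas) sketch, the snippet below draws a simple random sample and a stratified sample from a hypothetical dataset. The file name and the 'segment' column used for stratification are assumptions for illustration.

```python
import pandas as pd

# Hypothetical large dataset (file name is made up for this sketch)
population = pd.read_csv("large_dataset.csv")

# Simple random sampling: 10% of rows, reproducible via a fixed seed
simple_sample = population.sample(frac=0.10, random_state=42)

# Stratified sampling: 10% from each value of a hypothetical 'segment' column,
# so every group is represented in proportion to its size
stratified_sample = (
    population.groupby("segment", group_keys=False)
              .sample(frac=0.10, random_state=42)
)

print(len(simple_sample), len(stratified_sample))
```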
Summary
That concludes the third article in this series, with some advanced terms preparing you for the final installment. The last article in this series (Part IV - Expert) will be published in a few weeks and will cover ten (10) more terms, as follows: