Photo by Timur Saglambilek: https://www.pexels.com/photo/analytics-text-185576

Learning Analytics Series: Terms Beginning with "Data _____" (Part IV)

Introduction

Welcome back for the fourth and final (Expert) installment with the last 10 data terms, totaling 40 terms across the entire series. If you missed the first three articles, I recommend reading Part I (Novice), Part II (Intermediate), and Part III (Advanced) first since some of these terms build upon the definitions in those previous installments.

Fourth (and Final) 10 Terms Beginning with Data _____ (in alphabetical order)

Term 31: Data Blending

Data blending is the process of combining data from different sources to create a dataset usable for a specific purpose. For example, a data analyst may be searching for answers to very specific questions that can only be answered with a dataset constructed for those particular questions. Unlike enterprise data integration and data warehouse efforts, which strive to produce a 'single source of truth' for a multitude of possible questions, data blending produces a custom, often temporary, dataset for those particular questions.
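As a rough sketch in Python with pandas (the file names and columns here are illustrative, not from any real system), blending two sources to answer one specific question might look like this:

```python
import pandas as pd

# Hypothetical sources (file names and columns are illustrative)
sales = pd.read_csv("sales.csv")            # e.g., customer_id, order_total
customers = pd.read_csv("crm_export.csv")   # e.g., customer_id, region

# Blend the two sources into a purpose-built dataset for one question:
# "What is the average order total by region?"
blended = sales.merge(customers, on="customer_id", how="left")
print(blended.groupby("region")["order_total"].mean())
```

The blended dataset exists only to answer that one question; it is not meant to serve as an enterprise-wide source of truth.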

Term 32: Data Discretization

Data discretization is a useful technique for converting continuous values into discrete values that can be more easily used in models and analysis. Sometimes we are less concerned with the exact values than with the category of values. In these cases, it's helpful to place ranges of continuous values into different groups (buckets). For example, bodyweight can be any positive number, and we may only be concerned with specific groupings of that bodyweight data such as Low (<100 lbs.), Medium (100-200 lbs.), and High (>200 lbs.). Placing continuous values into finite groupings in support of analysis/models is known as discretizing the data, and it is commonly used in data science work.

NOTE: A random name generator was used to create synthetic data for the example below.

[Image: Data Discretization]
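Here's a minimal sketch of the same idea using pandas' `pd.cut` (the names and weights are synthetic, and the exact bucket boundaries are one possible choice):

```python
import pandas as pd

# Synthetic data (names are made up, per the note above)
df = pd.DataFrame({
    "name": ["Avery Stone", "Jordan Blake", "Casey Reed"],
    "bodyweight_lbs": [92, 155, 240],
})

# Bucket boundaries from the definition; treating 100 and 200 as the
# lower edges of Medium/High is one possible boundary convention.
bins = [0, 100, 200, float("inf")]
labels = ["Low", "Medium", "High"]
df["weight_group"] = pd.cut(df["bodyweight_lbs"], bins=bins,
                            labels=labels, right=False)
print(df)
```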

Term 33: Data Fabric

Data fabric is essentially a virtual unified data repository with multiple integrated data sources. Data integration in a data fabric is achieved by creating a network of connected data sources as opposed to creating a single physical centralized repository of data (like an operational data store or enterprise data warehouse). Data fabrics focus on unifying data through federated access layers (regardless of location and format) rather than movement of data (ETL) from different sources/formats to a single target/format. (And since we like using metaphors, data stitching is used to weave data together in the data fabric. There's a quick bonus "data _____" term for you.)
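For intuition only, here's a toy Python sketch of the "access in place" idea: a registry reads each source where it lives, through one interface, rather than copying everything into a central store (class and source names are hypothetical; real fabrics use dedicated virtualization/federation tooling):

```python
import pandas as pd

# Toy "fabric" access layer: data is read where it lives, through one
# interface, instead of being moved into a central repository.
class DataFabric:
    def __init__(self):
        self.sources = {}  # source name -> callable returning a DataFrame

    def register(self, name, reader):
        self.sources[name] = reader

    def read(self, name):
        return self.sources[name]()  # fetch on demand from the source

fabric = DataFabric()
fabric.register("hr_csv", lambda: pd.read_csv("employees.csv"))
fabric.register("sales_json", lambda: pd.read_json("orders.json"))
employees = fabric.read("hr_csv")  # one access layer, many formats/locations
```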

Term 34: Data Imputation

Data imputation is a technique used to populate missing data with the best possible information. Imputation improves a dataset because complete values are generally better than gaps and, in some cases, are essential for downstream analysis or models.

Some techniques for numeric fields include using the mean/median value, the most frequent value, zero, some constant value, or a random selection from known values. Some techniques for categorical (string) fields include using the most common value (mode), a random selection from known values of similar records, or a string literal like "Unknown" or "Missing." And doing nothing is always an option in both cases, leaving the value empty (NULL). All of these techniques have pros and cons, and any of them can bias your dataset, so choose the method carefully for each use case and impute with caution.

The data imputation example below uses the average bodyweight to populate missing values (for Parker McKinley and Whitney Thorburn).

[Image: Data Imputation Using Average Bodyweight]
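A minimal pandas version of the same mean-imputation idea (the two additional names and all weights are synthetic placeholders):

```python
import pandas as pd

# Synthetic roster with two missing bodyweights (matching the image above)
df = pd.DataFrame({
    "name": ["Parker McKinley", "Whitney Thorburn", "Avery Stone", "Jordan Blake"],
    "bodyweight_lbs": [None, None, 150.0, 190.0],
})

# Numeric imputation: fill missing values with the mean of the known values
mean_weight = df["bodyweight_lbs"].mean()            # 170.0 for this data
df["bodyweight_lbs"] = df["bodyweight_lbs"].fillna(mean_weight)

# Categorical imputation often uses a string literal instead, e.g.:
# df["team"] = df["team"].fillna("Unknown")
print(df)
```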

Term 35: Data Mesh

Data mesh is a decentralized and distributed architecture that treats data as a product, allowing domain-oriented teams to build their own analytical solutions. This approach doesn't mean teams run rogue analytics; it means domain-oriented teams leverage a self-service enterprise data platform to take ownership of their own data, and use the interoperability available within the architecture to integrate data across domains. A microservices approach is highly recommended when building a data mesh architecture.
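As a loose illustration (not a reference implementation; all names are hypothetical), "data as a product" might look like domain teams publishing datasets behind a common, self-service interface:

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

# A toy "data as a product" contract: each domain team owns its data
# and publishes it behind the same interface (names are hypothetical).
@dataclass
class DataProduct:
    domain: str
    name: str
    read: Callable[[], pd.DataFrame]  # self-service access point

catalog = {
    "sales.orders": DataProduct("sales", "orders",
                                lambda: pd.read_csv("orders.csv")),
    "hr.employees": DataProduct("hr", "employees",
                                lambda: pd.read_csv("employees.csv")),
}

# Interoperability: any team can consume another domain's product
orders = catalog["sales.orders"].read()
```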

Term 36: Data Pipeline

A data pipeline is the process of moving data from one place (source) to another place (target). Data pipelines represent a series of automated steps that accomplish this movement and transformation of data from source to target and may include batch processing, change data capture, or streaming datasets. Common examples of a data pipeline are ETL (or ELT) routines that Extract (data from sources), Transform (extracted data), and Load (transformed data into targets).
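A minimal ETL-style pipeline sketch in Python (file and column names are illustrative):

```python
import pandas as pd

# A minimal ETL pipeline: Extract -> Transform -> Load
def extract() -> pd.DataFrame:
    return pd.read_csv("raw_orders.csv")              # from the source

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["order_total"])            # cleanse
    df["order_total"] = df["order_total"].round(2)    # standardize
    return df

def load(df: pd.DataFrame) -> None:
    df.to_csv("orders_clean.csv", index=False)        # into the target

load(transform(extract()))                            # source -> target
```

In an ELT variant, the raw data would be loaded into the target first and transformed there.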

Term 37: Data Processing (Pre & Post)

Data processing includes all the tasks and procedures that handle data within a system or for other systems. It generally refers to the automation used to manage that data from start to finish. In fact, decades ago, "IT" was commonly called "Automated Data Processing," which is also the name of a company (ADP) that's been around since the late 1940s.

Now that the basics are out of the way, let's get to the "Pre-" and "Post-" portions of this definition. As the prefixes imply, pre-processing deals with the actions before using the data. Some examples of pre-processing include data cleansing and transformation. Post-processing deals with the actions after using the data. Some examples of post-processing include adding derived data and saving, copying, or archiving data to another location. Pre- and post-processing is an important part of data pipelines, especially ones that feed AI/ML models.
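Here's a compact sketch of pre- and post-processing wrapped around a stand-in analysis step (file and column names are hypothetical):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Pre-processing: actions BEFORE using the data (cleansing, transformation)
    df = df.drop_duplicates()
    df["bodyweight_lbs"] = df["bodyweight_lbs"].fillna(df["bodyweight_lbs"].mean())
    return df

def analyze(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the model/analysis step that actually uses the data
    df["weight_rank"] = df["bodyweight_lbs"].rank()
    return df

def postprocess(df: pd.DataFrame) -> None:
    # Post-processing: actions AFTER using the data (derive, save, archive)
    df.to_csv("scored_output.csv", index=False)

postprocess(analyze(preprocess(pd.read_csv("athletes.csv"))))
```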

Term 38: Data Sanitization

Data sanitization is the process of protecting data, which may include destruction, removal, encryption, masking, substitution, shuffling, or scrambling. These techniques aim to mitigate data theft or, in most cases, prepare data for use in non-Production environments. More often than not, we cannot use live Production data as-is in lower environments (like Development or Testing), so that data must be sanitized before it is demoted. We still need Production-scale volume and integrity, so sanitization allows us to safely use Production-quality data in non-Production environments.

[Image: Masking Phone Numbers and Bank Accounts]
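A small Python sketch of one sanitization technique, masking, which keeps only the last four digits (the formats shown are illustrative):

```python
import re

def mask_phone(phone: str) -> str:
    # Keep the last four digits and the punctuation; mask all other digits
    return re.sub(r"\d", "*", phone[:-4]) + phone[-4:]

def mask_account(account: str) -> str:
    # Keep only the last four characters of the account number
    return "*" * (len(account) - 4) + account[-4:]

print(mask_phone("555-867-5309"))    # ***-***-5309
print(mask_account("001234567890"))  # ********7890
```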

Term 39: Data Science

Data science combines mathematics/statistics, programming, and storytelling into a multi-disciplinary approach that exploits the untapped potential of data. This approach includes understanding how to process large volumes of data (big data), analyzing data with sophisticated techniques (e.g., predictive analytics, AI/ML), and explaining technical results to a non-technical audience. Data science helps to identify trends, anomalies, and correlations in data that may escape traditional, and less sophisticated, data analysis techniques.

Term 40: Data Wrangling

Data wrangling (also known as data munging) is the process of converting raw data into more usable formats, which usually includes data collection, data cleansing, transformations, and data integration. Data wrangling is essentially the legwork behind data blending, producing one or more usable datasets for other purposes. The term "wrangling" was chosen precisely because working with raw data can be such a struggle.
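A brief sketch tying the wrangling steps together in pandas (sources and columns are hypothetical):

```python
import pandas as pd

# Wrangling raw inputs into a usable dataset
raw = pd.read_csv("signups_raw.csv")                      # collection
raw["email"] = raw["email"].str.strip().str.lower()       # cleansing
raw["signup_date"] = pd.to_datetime(raw["signup_date"])   # transformation
regions = pd.read_csv("regions.csv")
tidy = raw.merge(regions, on="zip_code", how="left")      # integration
tidy.to_csv("signups_tidy.csv", index=False)              # usable output
```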

Summary

That concludes this series covering a wide range of terms commonly used in the data space. The goal of this series was to provide a practitioner's explanation of buzzwords and catch-phrases. I enjoyed distilling my hands-on experience into brief definitions to help educate others, and possibly encourage some to pursue deeper work in this exciting area.

Closing Remarks

Thanks for following along and stay tuned for more articles relating to data engineering, data science, data literacy, hyperautomation (AI/ML, RPA), and more. I plan on covering an interesting range of topics that appeal to this wonderful community of professionals, so feel free to connect and/or follow.
