Photo by Timur Saglambilek: https://www.pexels.com/photo/analytics-text-185576

Learning Analytics Series: Terms Beginning with "Data _____" (Part IV)

Introduction

Welcome back for the fourth and final (Expert) installment with the last 10 data terms, totaling 40 terms across the entire series. If you missed the first three articles, I recommend reading Part I (Novice), Part II (Intermediate), and Part III (Advanced) first since some of these terms build upon the definitions in those previous installments.

Fourth (and Final) 10 Terms Beginning with Data _____ (in alphabetical order)

Term 31: Data Blending

Data blending is the process of combining data from different sources to create a dataset usable for a specific purpose. For example, a data analyst may be searching for answers to very specific questions that can only be answered with a dataset constructed for those particular questions. Unlike enterprise data integration and data warehouse efforts, which strive to produce a 'single source of truth' for a multitude of possible questions, data blending produces a custom, often temporary, dataset for those particular questions.
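As a rough sketch in Python with pandas (the file names and columns here are illustrative, not from any real system), blending two sources to answer one specific question might look like this:

```python
import pandas as pd

# Hypothetical sources (file names and columns are illustrative)
sales = pd.read_csv("sales.csv")            # e.g., customer_id, order_total
customers = pd.read_csv("crm_export.csv")   # e.g., customer_id, region

# Blend the two sources into a purpose-built dataset for one question:
# "What is the average order total by region?"
blended = sales.merge(customers, on="customer_id", how="left")
print(blended.groupby("region")["order_total"].mean())
```

The blended dataset exists only to answer that one question; it is not meant to serve as an enterprise-wide source of truth.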

Term 32: Data Discretization

Data discretization is a useful technique for converting continuous values into discrete values that can be more easily used in models and analysis. Sometimes we are less concerned with the exact values than with the category of values. In these cases, it's helpful to place ranges of continuous values into different groups (buckets). For example, bodyweight can be any positive number, and we may only be concerned with specific groupings of that bodyweight data such as Low (<100 lbs.), Medium (100-200 lbs.), and High (>200 lbs.). Placing continuous values into finite groupings in support of analysis/models is known as discretizing the data, and it is commonly used in data science work.

NOTE: A random name generator was used to create synthetic data for the example below.

[Image: Data Discretization]
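Here's a minimal sketch of the same idea using pandas' `pd.cut` (the names and weights are synthetic, and the exact bucket boundaries are one possible choice):

```python
import pandas as pd

# Synthetic data (names are made up, per the note above)
df = pd.DataFrame({
    "name": ["Avery Stone", "Jordan Blake", "Casey Reed"],
    "bodyweight_lbs": [92, 155, 240],
})

# Bucket boundaries from the definition; treating 100 and 200 as the
# lower edges of Medium/High is one possible boundary convention.
bins = [0, 100, 200, float("inf")]
labels = ["Low", "Medium", "High"]
df["weight_group"] = pd.cut(df["bodyweight_lbs"], bins=bins,
                            labels=labels, right=False)
print(df)
```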

Term 33: Data Fabric

Data fabric is essentially a virtual unified data repository with multiple integrated data sources. Data integration in a data fabric is achieved by creating a network of connected data sources as opposed to creating a single physical centralized repository of data (like an operational data store or enterprise data warehouse). Data fabrics focus on unifying data through federated access layers (regardless of location and format) rather than movement of data (ETL) from different sources/formats to a single target/format. (And since we like using metaphors, data stitching is used to weave data together in the data fabric. There's a quick bonus "data _____" term for you.)
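For intuition only, here's a toy Python sketch of the "access in place" idea: a registry reads each source where it lives, through one interface, rather than copying everything into a central store (class and source names are hypothetical; real fabrics use dedicated virtualization/federation tooling):

```python
import pandas as pd

# Toy "fabric" access layer: data is read where it lives, through one
# interface, instead of being moved into a central repository.
class DataFabric:
    def __init__(self):
        self.sources = {}  # source name -> callable returning a DataFrame

    def register(self, name, reader):
        self.sources[name] = reader

    def read(self, name):
        return self.sources[name]()  # fetch on demand from the source

fabric = DataFabric()
fabric.register("hr_csv", lambda: pd.read_csv("employees.csv"))
fabric.register("sales_json", lambda: pd.read_json("orders.json"))
employees = fabric.read("hr_csv")  # one access layer, many formats/locations
```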

Term 34: Data Imputation

Data imputation is a technique used to populate missing data with the best possible information. Imputation improves a dataset because complete values are generally better than gaps and, in some cases, are essential for downstream analysis or models.

Some techniques for numeric fields include using the mean/median value, the most frequent value, zero, some constant value, or a random selection from known values. Some techniques for categorical (string) fields include using the most common value (mode), a random selection from known values of similar records, or a string literal like "Unknown" or "Missing." And doing nothing is always an option in both cases, leaving the value empty (NULL). All of these techniques have pros and cons, and any of them can bias your dataset, so choose the method carefully for each use case and impute with caution.

The data imputation example below uses the average bodyweight to populate missing values (for Parker McKinley and Whitney Thorburn).

[Image: Data Imputation Using Average Bodyweight]
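A minimal pandas version of the same mean-imputation idea (the two additional names and all weights are synthetic placeholders):

```python
import pandas as pd

# Synthetic roster with two missing bodyweights (matching the image above)
df = pd.DataFrame({
    "name": ["Parker McKinley", "Whitney Thorburn", "Avery Stone", "Jordan Blake"],
    "bodyweight_lbs": [None, None, 150.0, 190.0],
})

# Numeric imputation: fill missing values with the mean of the known values
mean_weight = df["bodyweight_lbs"].mean()            # 170.0 for this data
df["bodyweight_lbs"] = df["bodyweight_lbs"].fillna(mean_weight)

# Categorical imputation often uses a string literal instead, e.g.:
# df["team"] = df["team"].fillna("Unknown")
print(df)
```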

Term 35: Data Mesh

Data mesh is a decentralized and distributed architecture that treats data as a product, allowing domain-oriented teams to build their own analytical solutions. This approach doesn't mean teams run rogue analytics; it means domain-oriented teams leverage a self-service enterprise data platform to take ownership of their own data, and use the interoperability available within the architecture to integrate data across domains. A microservices approach is highly recommended when building a data mesh architecture.
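As a loose illustration (not a reference implementation; all names are hypothetical), "data as a product" might look like domain teams publishing datasets behind a common, self-service interface:

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

# A toy "data as a product" contract: each domain team owns its data
# and publishes it behind the same interface (names are hypothetical).
@dataclass
class DataProduct:
    domain: str
    name: str
    read: Callable[[], pd.DataFrame]  # self-service access point

catalog = {
    "sales.orders": DataProduct("sales", "orders",
                                lambda: pd.read_csv("orders.csv")),
    "hr.employees": DataProduct("hr", "employees",
                                lambda: pd.read_csv("employees.csv")),
}

# Interoperability: any team can consume another domain's product
orders = catalog["sales.orders"].read()
```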

Term 36: Data Pipeline

A data pipeline is the process of moving data from one place (source) to another place (target). Data pipelines represent a series of automated steps that accomplish this movement and transformation of data from source to target and may include batch processing, change data capture, or streaming datasets. Common examples of a data pipeline are ETL (or ELT) routines that Extract (data from sources), Transform (extracted data), and Load (transformed data into targets).
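A minimal ETL-style pipeline sketch in Python (file and column names are illustrative):

```python
import pandas as pd

# A minimal ETL pipeline: Extract -> Transform -> Load
def extract() -> pd.DataFrame:
    return pd.read_csv("raw_orders.csv")              # from the source

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["order_total"])            # cleanse
    df["order_total"] = df["order_total"].round(2)    # standardize
    return df

def load(df: pd.DataFrame) -> None:
    df.to_csv("orders_clean.csv", index=False)        # into the target

load(transform(extract()))                            # source -> target
```

In an ELT variant, the raw data would be loaded into the target first and transformed there.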

Term 37: Data Processing (Pre & Post)

Data processing includes all the tasks and procedures that handle data within a system or for other systems. It generally refers to the automation used to manage that data from start to finish. In fact, decades ago, "IT" was commonly called "Automated Data Processing," which is also the name of a company (ADP) that's been around since the late 1940s.

Now that the basics are out of the way, let's get to the "Pre-" and "Post-" portions of this definition. As the prefixes imply, pre-processing deals with the actions before using the data. Some examples of pre-processing include data cleansing and transformation. Post-processing deals with the actions after using the data. Some examples of post-processing include adding derived data and saving, copying, or archiving data to another location. Pre- and post-processing is an important part of data pipelines, especially ones that feed AI/ML models.
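Here's a compact sketch of pre- and post-processing wrapped around a stand-in analysis step (file and column names are hypothetical):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Pre-processing: actions BEFORE using the data (cleansing, transformation)
    df = df.drop_duplicates()
    df["bodyweight_lbs"] = df["bodyweight_lbs"].fillna(df["bodyweight_lbs"].mean())
    return df

def analyze(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the model/analysis step that actually uses the data
    df["weight_rank"] = df["bodyweight_lbs"].rank()
    return df

def postprocess(df: pd.DataFrame) -> None:
    # Post-processing: actions AFTER using the data (derive, save, archive)
    df.to_csv("scored_output.csv", index=False)

postprocess(analyze(preprocess(pd.read_csv("athletes.csv"))))
```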

Term 38: Data Sanitization

Data sanitization is the process of protecting data, which may include destruction, removal, encryption, masking, substitution, shuffling, or scrambling. These techniques aim to mitigate data theft or, in most cases, prepare data for use in non-Production environments. More often than not, we cannot use live Production data as-is in lower environments (like Development or Testing), so that data must be sanitized before it is demoted. We still need Production-scale volume and integrity, so sanitization allows us to safely use Production-quality data in non-Production environments.

[Image: Masking Phone Numbers and Bank Accounts]
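A small Python sketch of one sanitization technique, masking, which keeps only the last four digits (the formats shown are illustrative):

```python
import re

def mask_phone(phone: str) -> str:
    # Keep the last four digits and the punctuation; mask all other digits
    return re.sub(r"\d", "*", phone[:-4]) + phone[-4:]

def mask_account(account: str) -> str:
    # Keep only the last four characters of the account number
    return "*" * (len(account) - 4) + account[-4:]

print(mask_phone("555-867-5309"))    # ***-***-5309
print(mask_account("001234567890"))  # ********7890
```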

Term 39: Data Science

Data science combines mathematics/statistics, programming, and storytelling into a multi-disciplinary approach that exploits the untapped potential of data. This approach includes understanding how to process large volumes of data (big data), analyzing data with sophisticated techniques (e.g., predictive analytics, AI/ML), and explaining technical results to a non-technical audience. Data science helps to identify trends, anomalies, and correlations in data that may escape traditional, and less sophisticated, data analysis techniques.

Term 40: Data Wrangling

Data wrangling (also known as data munging) is the process of converting raw data into more usable formats, which usually includes data collection, data cleansing, transformations, and data integration. Data wrangling is essentially the legwork behind data blending, producing one or more usable datasets for other purposes. The term "wrangling" was chosen precisely because working with raw data can be such a struggle.
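A brief sketch tying the wrangling steps together in pandas (sources and columns are hypothetical):

```python
import pandas as pd

# Wrangling raw inputs into a usable dataset
raw = pd.read_csv("signups_raw.csv")                      # collection
raw["email"] = raw["email"].str.strip().str.lower()       # cleansing
raw["signup_date"] = pd.to_datetime(raw["signup_date"])   # transformation
regions = pd.read_csv("regions.csv")
tidy = raw.merge(regions, on="zip_code", how="left")      # integration
tidy.to_csv("signups_tidy.csv", index=False)              # usable output
```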

Summary

That concludes this series covering a wide range of terms commonly used in the data space. The goal of this series was to provide a practitioner's explanation of buzzwords and catch-phrases. I enjoyed distilling my hands-on experience into brief definitions to help educate others, and possibly encourage some to pursue deeper work in this exciting area.

Closing Remarks

Thanks for following along and stay tuned for more articles relating to data engineering, data science, data literacy, hyperautomation (AI/ML, RPA), and more. I plan on covering an interesting range of topics that appeal to this wonderful community of professionals, so feel free to connect and/or follow.
