Photo by Timur Saglambilek: https://www.pexels.com/photo/analytics-text-185576

Learning Analytics Series: Terms Beginning with "Data _____" (Part II)

Introduction

Welcome to the second (Intermediate) installment of this series, covering another 10 data terms. If you missed the first (Novice) installment, you can read Part I here. I recommend reading these articles in order, since some of the terms build upon previous definitions.

Second 10 Terms Beginning with Data _____ (in alphabetical order)

Term 11: Data Architecture

Data architecture is the blueprint of the data landscape, showing the inputs (sources), processes (integration), outputs (targets), and hardware/software that comprise the solution. The architecture should cover everything from naming the data sources to how the data is collected and consumed, including file formats, data integration methods, processing techniques (e.g., batch vs. real-time, flush & fill vs. change data capture), and security protocols. Similar to data models (Part I), architectures may also come in three (3) forms: Conceptual, Logical, and Physical.
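
To make a couple of the processing techniques concrete, here is a minimal sketch contrasting flush & fill with an incremental load (a simplified stand-in for change data capture). The students_staging and students_target tables and the last_updated column are assumptions for illustration:

-- Flush & fill: empty the target table, then reload everything from the source
TRUNCATE TABLE students_target;
INSERT INTO students_target
SELECT * FROM students_staging;

-- Incremental load (a simplified stand-in for change data capture):
-- apply only the rows that changed since the last run
INSERT INTO students_target
SELECT * FROM students_staging
 WHERE last_updated > '2024-01-01';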

Term 12: Data Cleansing

Data cleansing is the process of improving data by fixing errors, removing duplicates (de-duplication), correcting typos/misspellings, normalizing values, standardizing formats, and handling missing values (data imputation, Part IV). These techniques, and more, increase data quality to improve accuracy and confidence in the data when creating reports, dashboards, or conducting data analysis.
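
As a minimal sketch, the statements below illustrate a few of these techniques in SQL against a hypothetical students table (the column names are assumptions for illustration):

-- Standardize formats: trim stray whitespace and normalize casing
UPDATE students
   SET email_address = LOWER(TRIM(email_address));

-- Handle missing values: substitute a placeholder for NULLs
UPDATE students
   SET first_name = 'Unknown'
 WHERE first_name IS NULL;

-- De-duplication: surface duplicate records for review
SELECT last_name, first_name, COUNT(*) AS record_count
  FROM students
 GROUP BY last_name, first_name
HAVING COUNT(*) > 1;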

Term 13: Data Definition Language

Data Definition Language (DDL) represents the commands used to create or modify objects in a database. Examples include CREATE (to construct new objects), ALTER (to modify existing objects), and DROP (to delete existing objects). Each of these commands, when successfully executed, physically creates or modifies objects, thereby defining the structure of the database. The DDL statements below illustrate each example with syntactically correct commands.

-- Create new table
CREATE TABLE students (student_id    INT
                      ,last_name     VARCHAR(50)
                      ,first_name    VARCHAR(50));

-- Alter table by adding new column for email address
ALTER TABLE students ADD email_address VARCHAR(255);

-- Delete table
DROP TABLE students;        

You can play with these DDL statements, and much more, using a free online virtual database instance from W3Schools.

Term 14: Data Engineering

Data engineering is a broad term that generally covers the work required to extract data from sources, prepare data for users, and everything in between. This work includes data integration, data cleansing, writing DDL/DML statements, creating data pipelines (Part IV), and more. Data engineering is responsible for the plumbing of the data architecture to ensure data is flowing smoothly and ready for usage.

Term 15: Data Integration

Data integration is the process of combining data from disparate data sources to create unified datasets (Part I) that support analysis and reporting. The primary objective of data integration is to create a consolidated view of data that's reliable and accurate, which includes processing different data formats (e.g., flat files, databases) and normalizing data across domains (e.g., unifying personnel data from payroll, human resources, and training systems).
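
As a small illustration, the query below sketches one common integration pattern: normalizing and combining personnel records from two hypothetical source tables (payroll_employees and hr_employees, along with their columns, are assumptions):

-- Combine personnel data from two source systems into one unified dataset,
-- normalizing column names and value formats along the way
-- (UNION also removes exact duplicates that appear in both sources)
SELECT employee_no    AS employee_id
      ,UPPER(surname) AS last_name
  FROM payroll_employees
 UNION
SELECT emp_id         AS employee_id
      ,UPPER(last_nm) AS last_name
  FROM hr_employees;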

Term 16: Data Integrity

Data integrity is often conflated with data quality, and although they are closely related, they are quite different. Data integrity helps to enforce data quality by ensuring the data makes sense. Some techniques include the implementation of primary keys (to ensure uniqueness), foreign keys (to establish relationships that avoid creating orphan records), and constraints (to enforce validation rules). These data integrity techniques improve data quality by preventing errors from being introduced at the point of entry.
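
Here is a minimal sketch of these techniques in SQL, reusing the students table from Term 13 (the enrollments table and the CHECK rule are assumptions for illustration):

-- Primary key ensures each student_id is unique
-- Constraint (CHECK) enforces a simple validation rule on email format
CREATE TABLE students (student_id    INT PRIMARY KEY
                      ,last_name     VARCHAR(50)
                      ,first_name    VARCHAR(50)
                      ,email_address VARCHAR(255) CHECK (email_address LIKE '%@%'));

-- Foreign key establishes a relationship and prevents orphan records:
-- an enrollment cannot reference a student that does not exist
CREATE TABLE enrollments (enrollment_id INT PRIMARY KEY
                         ,student_id    INT REFERENCES students (student_id));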

Term 17: Data Leakage

Data leakage refers to the unexpected or unauthorized release of data to an outside entity or location, which can be problematic because sensitive data may become available to users without permission. Data leakage can occur as a result of data breaches, user error/oversight, relaxed data security rules, poor data integration protocols, or negligence. Formal data governance policies and robust data management practices, which will be defined in Part III, mitigate data leakage.

Term 18: Data Manipulation Language

Data Manipulation Language (DML) represents the commands used to control the contents within the objects of a database (e.g., tables). Examples include INSERT (to create records), UPDATE (to modify records), or DELETE (to remove records). Each of these commands, when successfully executed, physically changes the contents of objects, thereby manipulating the contents of the database. The DML statements below illustrate each example with syntactically correct commands.

-- Insert record for new student named John Smith
-- (the example.com addresses below are illustrative placeholders)
INSERT INTO students (student_id, last_name, first_name, email_address)
              VALUES (341, 'Smith', 'John', 'jsmith@example.com');

-- Update John Smith's email address
UPDATE students
   SET email_address = 'john.smith@example.com'
 WHERE student_id = 341;

-- Delete John Smith's record
DELETE FROM students WHERE student_id = 341;

Feel free to play with these DML statements using the W3Schools link shared above (see 'Data Definition Language'). Be sure to create and alter the table again first, since it was dropped at the end of the sample DDL script in Term 13.

Term 19: Data Mining

Data mining is the process of exploring and analyzing datasets to find potentially useful information, identifying relationships and patterns in the data. Data mining is particularly useful for identifying areas of interest for further analysis, separating signal from noise. Exploratory Data Analysis (EDA) is the process typically used to conduct initial investigations of data, especially with new datasets (or sources).
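
As a small illustration, initial EDA often starts with simple profiling queries like the sketch below (the enrollments table and its columns are assumptions):

-- How large is the dataset?
SELECT COUNT(*) AS total_records
  FROM enrollments;

-- What does the distribution look like? Any surprising patterns worth a closer look?
SELECT course_id
      ,COUNT(*) AS enrollment_count
  FROM enrollments
 GROUP BY course_id
 ORDER BY enrollment_count DESC;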

Term 20: Data Warehouse

A data warehouse is a central repository of cleansed data specifically designed and built for business intelligence (BI) reporting and analytics. A data warehouse generally integrates data from multiple data sources, potentially with different formats, using Extraction-Transformation-Load (ETL) routines. The ETL output produces data that is ready for reports, dashboards, and analysis. The most common data modeling technique for a data warehouse is the star schema, containing a fact table surrounded by dimension tables (as shown below).

[Figure: Data Warehouse Star Schema]
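
Here is a minimal sketch of a star schema in a learning context (all table and column names are assumptions for illustration): a central fact table keyed to surrounding dimension tables, followed by a typical BI query against it.

-- Dimension tables describe the who/what of each fact
CREATE TABLE dim_student (student_key INT PRIMARY KEY
                         ,last_name   VARCHAR(50)
                         ,first_name  VARCHAR(50));

CREATE TABLE dim_course (course_key  INT PRIMARY KEY
                        ,course_name VARCHAR(100));

-- Fact table records measurable events, keyed to the dimensions
CREATE TABLE fact_enrollment (enrollment_key INT PRIMARY KEY
                             ,student_key    INT REFERENCES dim_student (student_key)
                             ,course_key     INT REFERENCES dim_course (course_key)
                             ,final_score    DECIMAL(5,2));

-- Typical BI query: join the fact table to a dimension and aggregate
SELECT c.course_name, AVG(f.final_score) AS avg_score
  FROM fact_enrollment f
  JOIN dim_course c ON c.course_key = f.course_key
 GROUP BY c.course_name;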

Summary

That concludes the second article in this series, with intermediate terms preparing you for the remaining two articles. The next article in this series (Part III - Advanced) will be published in a few weeks and will cover ten (10) more terms, as follows:

  • Data Aggregation
  • Data Governance
  • Data Granularity
  • Data Journalism
  • Data Lake
  • Data Lineage
  • Data Literacy
  • Data Management
  • Data Profiling
  • Data Sampling
