Learning Analytics Series: Terms Beginning with "Data _____" (Part I)
Photo by Timur Saglambilek: https://www.pexels.com/photo/analytics-text-185576

Introduction

There are plenty of keywords, phrases, and cute metaphors in the world of data analytics, and some of them are the same thing expressed in different ways. This article is the first in a series intended to bring some clarity to common terms you may see or hear in the data analytics space. The series includes four (4) articles, published every few weeks, with each article defining 10 terms that begin with "Data _____" and presented at the following levels:

  • Part I - Novice
  • Part II - Intermediate
  • Part III - Advanced
  • Part IV - Expert

You will learn 40 terms by the end of this series. There are hundreds of terms, so I selected the ones that people seeking to learn more about data analytics seem to ask about most often. These definitions are based on my experience over the years working on many different projects that use or implement these terms, building systems and platforms with many different tools and technologies, answering customer questions, responding to RFIs/RFPs (Requests for Information/Requests for Proposal), teaching data warehouse classes across the nation, and working with talented staff along the way.

If you are a beginner or simply want to understand what data professionals are talking about, then this series is perfect for you. If you are already knee-deep in data work, then you can use this series as a refresher or simply reflect on what these terms mean to you. In either case, I welcome your feedback and questions on these terms as we progress through the series. I will also wrap up the series with a final article that combines all the terms in one place and incorporates reader feedback (with credit to those who contribute).

Without further ado, let’s begin with the first ten terms…

First 10 Terms Beginning with Data _____ (in alphabetical order)

Term 1: Database

Digital repository of data that is generally organized as tables with rows and columns. Other objects can exist in databases such as views, procedures, functions, and triggers to help manage data. Structured Query Language (SQL) is used to retrieve and manipulate data in the database. A simple SQL SELECT statement is shown below that retrieves all employee numbers, names, and salaries from a database table named employees (with fictitious data) and returns a result set in ascending order by last name, then first name.

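-- Retrieve every employee's number, name, and salary,
-- sorted by last name, then first name (ascending is the default).
  SELECT employee_number, first_name, last_name, salary
    FROM employees
ORDER BY last_name, first_name;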
SQL SELECT Statement and Result Set
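
For readers who want to run the query themselves, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database and the three sample rows are my own assumptions (fictitious data, like the figure); only the table name, column names, and the SELECT statement come from the example above.

```python
import sqlite3

# Hypothetical, fictitious rows: (employee_number, first_name, last_name, salary)
SAMPLE_EMPLOYEES = [
    (1003, "Maria", "Lopez", 72000),
    (1001, "Alan", "Baker", 65000),
    (1002, "Zoe", "Baker", 81000),
]


def fetch_employees():
    """Build a throwaway employees table and run the SELECT from the article."""
    conn = sqlite3.connect(":memory:")  # assumption: in-memory SQLite database
    conn.execute(
        "CREATE TABLE employees ("
        "employee_number INTEGER PRIMARY KEY, "
        "first_name TEXT, last_name TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", SAMPLE_EMPLOYEES)
    rows = conn.execute(
        "SELECT employee_number, first_name, last_name, salary "
        "FROM employees "
        "ORDER BY last_name, first_name"
    ).fetchall()
    conn.close()
    return rows


result_set = fetch_employees()
for row in result_set:
    print(row)
```

The two Bakers come back first (Alan before Zoe), followed by Lopez, regardless of the order in which the rows were inserted.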

Term 2: Dataset

Collection of data typically in the form of records (rows) and attributes (columns) that may be extracted from one or more sources. A dataset may also contain derived information to further enrich the data in support of further analysis and reporting. The idea is to create a complete set of data that can be used for a specific purpose whereby all the information needed is available and ready for use (with confidence).
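
As a rough illustration, the sketch below builds a small dataset from two sources and enriches it with a derived attribute. The sources, field names, and the monthly_salary calculation are all hypothetical and chosen only to show the idea.

```python
# Source 1 (hypothetical): employee records, e.g. extracted from an HR database
employees = [
    {"employee_number": 1001, "last_name": "Baker", "dept_id": 10, "salary": 65000},
    {"employee_number": 1003, "last_name": "Lopez", "dept_id": 20, "salary": 72000},
]

# Source 2 (hypothetical): a department lookup, e.g. from a spreadsheet
departments = {10: "Finance", 20: "Engineering"}


def build_dataset(employees, departments):
    """Combine both sources into records (rows) with attributes (columns)."""
    dataset = []
    for emp in employees:
        record = dict(emp)  # copy so the source data is left untouched
        record["department"] = departments.get(emp["dept_id"], "Unknown")
        # Derived attribute: not stored in either source, computed for analysis
        record["monthly_salary"] = round(emp["salary"] / 12, 2)
        dataset.append(record)
    return dataset


dataset = build_dataset(employees, departments)
for record in dataset:
    print(record)
```

Each record now carries everything a simple salary-by-department analysis would need, so nothing has to be looked up again later.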

Term 3: Data Collection

Retrieving data from various sources, which likely include databases, to gather all the information needed to perform some level of analysis, reporting, or processing. As the name implies, you are gathering (collecting) data from wherever you find data sources of potential value. For Data Analysts, data collection generally means querying databases and running reports to gather data for analysis. For Data Engineers, data collection usually corresponds with the "E" in ETL (Extract, Transform, Load), where extracting (collecting) data is generally automated. For Data Scientists, data collection is typically the data acquisition (collection) process at the beginning of a data pipeline (covered in Part IV).
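
To make the "collecting from wherever the data lives" idea concrete, here is a minimal sketch that gathers records from two kinds of sources: a database table and a CSV export. The source names, the table, and the CSV contents are hypothetical; a real extraction job would read from actual systems and files.

```python
import csv
import io
import sqlite3


def collect_from_database(conn):
    """Extract rows from a database table as a list of dictionaries."""
    cursor = conn.execute("SELECT employee_number, last_name FROM employees")
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


def collect_from_csv(text):
    """Extract rows from CSV text (every value arrives as a string)."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical source 1: an HR database (in-memory SQLite for the sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_number INTEGER, last_name TEXT)")
conn.execute("INSERT INTO employees VALUES (1001, 'Baker')")

# Hypothetical source 2: a CSV export from a training system
csv_text = "employee_number,training_hours\n1001,12\n"

collected = {
    "hr_database": collect_from_database(conn),
    "training_csv": collect_from_csv(csv_text),
}
conn.close()
print(collected)
```

Note that the CSV source delivers employee_number as text while the database delivers an integer, which is exactly the kind of difference later transformation steps have to reconcile.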

Term 4: Data Discovery

Data discovery involves different techniques and tools that may include spreadsheets (e.g., Microsoft Excel, Google Sheets), dashboards (e.g., Tableau, Power BI, Qlik), enterprise business intelligence tools (e.g., SAP BusinessObjects, MicroStrategy, IBM Cognos, Oracle OBIEE, Microsoft SSRS), and SQL. The goal of data discovery is to identify patterns, anomalies, or trends to help determine where true value exists in the data. The result of data discovery sets the stage for subsequent work such as data engineering (Part II), data aggregation (Part III), and data science (Part IV).
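
One simple discovery technique is flagging values that sit far from the rest. The sketch below uses a basic z-score rule from Python's statistics module; the daily_sales figures and the threshold of 2 standard deviations are assumptions for illustration, and many other techniques would serve equally well.

```python
import statistics

# Hypothetical daily sales totals for one week
daily_sales = [100, 102, 98, 101, 99, 250, 103]


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]


anomalies = find_anomalies(daily_sales)
print(anomalies)
```

Here the sixth day (index 5, value 250) is the only one flagged. Whether it is a data entry error or a genuinely great sales day is the question discovery hands off to the analyst.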

Term 5: Data Exploration

Data exploration involves surveying the data landscape to identify potential data sources that may yield results. Unlike data discovery, where data sources are analyzed to discover patterns, anomalies, and trends, exploration focuses on identifying data sources that show promise. Once promising sources are identified, the analyst transitions from exploration to discovery mode. (Exploratory data analysis (EDA) is something else that will be covered as part of the data mining term in Part II.)
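
For a database, surveying the landscape can be as simple as listing which tables exist, what columns they have, and how many rows they hold. The sketch below does this for SQLite using its sqlite_master catalog and PRAGMA table_info; the two tables are hypothetical, and other database products expose their catalogs differently.

```python
import sqlite3


def survey_tables(conn):
    """Summarize each table's columns and row count to judge which sources look promising."""
    survey = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Column name is the second field of each PRAGMA table_info row
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        survey[table] = {"columns": columns, "row_count": row_count}
    return survey


# Hypothetical database to explore
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_number INTEGER, last_name TEXT)")
conn.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1001, "Baker"), (1003, "Lopez")])

survey = survey_tables(conn)
conn.close()
for table, summary in survey.items():
    print(table, summary)
```

An empty departments table next to a populated employees table tells you where to spend your discovery time first.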

Term 6: Data Format

In the general sense, data is available in a variety of formats that may be structured, unstructured, or semi-structured. More specifically, formats in digital form may include relational databases (e.g., Oracle, SQL Server), non-relational (NoSQL) databases (e.g., MongoDB, Cassandra), documents (e.g., Microsoft Word, Google Docs, PDF), flat files (e.g., TXT, CSV, JSON, XML), and other miscellaneous formats (e.g., e-mails, maps, photos, videos, audio recordings).
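
To show how format affects what you receive, the sketch below reads the same hypothetical employee record from CSV, JSON, and XML text using only Python's standard library. The record and its field names are made up for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record expressed in three file formats
csv_text = "employee_number,last_name\n1001,Baker\n"
json_text = '{"employee_number": 1001, "last_name": "Baker"}'
xml_text = (
    "<employee>"
    "<employee_number>1001</employee_number>"
    "<last_name>Baker</last_name>"
    "</employee>"
)

from_csv = next(csv.DictReader(io.StringIO(csv_text)))  # first data row
from_json = json.loads(json_text)
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

print(from_csv)
print(from_json)
print(from_xml)
```

JSON keeps 1001 as a number, while CSV and XML hand it back as the text "1001". The content is the same, but each format carries a different amount of type information.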

Term 7: Data Model

Data models are used to represent data as entities (tables) and attributes (columns) as well as the relationships between those entities. These models facilitate the collection, storage, integration, analysis, and reporting of data as information. They allow data to be represented in a manner that's understandable and usable. Data models generally come in three (3) forms: Conceptual, Logical, and Physical.
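
As a small example of the physical form, the sketch below defines two entities and the relationship between them as SQLite tables with primary and foreign keys. The table and column names are hypothetical, and real physical models typically add data types, indexes, and constraints specific to the database product.

```python
import sqlite3

# Hypothetical physical model: two entities and one relationship
PHYSICAL_MODEL = """
CREATE TABLE departments (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL
);
CREATE TABLE employees (
    employee_number INTEGER PRIMARY KEY,
    last_name       TEXT NOT NULL,
    dept_id         INTEGER NOT NULL REFERENCES departments (dept_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(PHYSICAL_MODEL)
conn.execute("INSERT INTO departments VALUES (10, 'Finance')")
conn.execute("INSERT INTO employees VALUES (1001, 'Baker', 10)")

# An employee pointing at a department that does not exist violates the model
try:
    conn.execute("INSERT INTO employees VALUES (1002, 'Lopez', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
conn.close()
print(orphan_rejected, employee_count)
```

The relationship drawn as a line between two boxes in a conceptual or logical model becomes, at the physical level, a foreign key that the database can actually enforce.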

Term 8: Data Quality

Dimensions of Data Quality

Data quality is a measure of how good or bad the data is, and it can be measured in many different ways. Naturally, data quality is extremely important for producing results that allow stakeholders to make well-informed decisions with confidence. We'll spend a little more time and space on this particular term since data quality is so important and part of every data discussion (and solution).

There are multiple dimensions of data quality but some of the more common dimensions are as follows:

  • Completeness - All necessary data is available to perform a thorough and complete analysis. Completeness also means having more data than is minimally required. The more complete the data, the more comprehensive the analysis.
  • Conformity - Data conforms to an established set of business rules, format, type, and range making the data valid. This dimension may also be referred to as "validity."
  • Accuracy - Data is factual, current, and reliable as a trusted source. Data can be valid, but inaccurate. For example, the wrong number in a numeric field may be valid (conforms to the standard) but inaccurate. Accuracy means having the correct data in a field.
  • Consistency - Like data is the same across data sources. For example, each person's employee ID is the same across different systems. Otherwise, data transformations must be used to link records across systems.
  • Timeliness - Data with minimal latency provides the best opportunity for making confident decisions. Stale data can have adverse effects on outcomes if decisions are made with old information.
  • Integrity - Data relationships are supported through systems and changes, which usually involves keys (primary and foreign keys). Changes to a key must retain relationships with all linked records to avoid creating orphans. (Part II of this series will define data integrity.)
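
Several of the dimensions above can be checked with simple rules. The sketch below tests completeness, conformity, and consistency on a few hypothetical records; the field names, the email pattern, the salary range, and the payroll_ids set are all assumptions. Accuracy and timeliness usually need an outside reference (a trusted source or a timestamp), so they are left out here.

```python
import re

# Hypothetical employee records from one system
records = [
    {"employee_number": 1001, "email": "abaker@example.com", "salary": 65000},
    {"employee_number": 1002, "email": None, "salary": 81000},
    {"employee_number": 1003, "email": "mlopez-at-example.com", "salary": -5},
]

# Hypothetical employee IDs known to a second (payroll) system
payroll_ids = {1001, 1002}

# Deliberately loose pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_quality(records, payroll_ids):
    """Return (employee_number, dimension, message) for every rule a record breaks."""
    issues = []
    for rec in records:
        emp = rec["employee_number"]
        if rec["email"] is None:
            issues.append((emp, "completeness", "email is missing"))
        elif not EMAIL_PATTERN.match(rec["email"]):
            issues.append((emp, "conformity", "email format is invalid"))
        if not 0 < rec["salary"] < 1_000_000:
            issues.append((emp, "conformity", "salary is out of range"))
        if emp not in payroll_ids:
            issues.append((emp, "consistency", "ID not found in payroll system"))
    return issues


issues = check_quality(records, payroll_ids)
for issue in issues:
    print(issue)
```

Employee 1001 passes every rule, 1002 fails completeness, and 1003 fails conformity twice as well as consistency. A salary of 65000 that should have been 56000 would still pass, which is the "valid but inaccurate" case described under Accuracy.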

Term 9: Data Source

A physical or virtual location that contains data of interest. Data sources may include databases, systems, files (electronic or paper), conversations, etc. Anything that is spoken or written and captured in digital form is a potential data source that may be used to create a dataset for analysis and reporting, or even serve as input to other systems.

Term 10: Data Visualization

Sample Dashboard with a Network Graph

Data visualization is useful for communicating data as information so users can make sense of the data and take appropriate action. The most effective forms of visualization are interactive (e.g., dashboards), allowing users to filter, drill down, and export data. (Yes, I said export data from a data viz … sorry, but it's true.) Other forms of visualization include scorecards and infographics, where information is displayed in a manner that's easily understood.

Summary

That wraps up the first article in this series with some introductory terms setting the stage for the remaining three articles. The next article in this series (Part II - Intermediate) will be published in a few weeks and cover ten (10) more terms as follows:

  • Data Architecture
  • Data Cleansing
  • Data Definition Language
  • Data Engineering
  • Data Integration
  • Data Integrity
  • Data Leakage
  • Data Manipulation Language
  • Data Mining
  • Data Warehouse

This is great, Mark! I will definitely be sharing this with my team. These fundamental articles will really help them.

Anup Kale

Data Solution Architect at Suncorp Group

Nice intro Mark, what's your take on Data Type ? Especially in the era of Big Data 'Variety' means so many data types.

I'd like to add "Data Superhero" to this list, and nominate Mark DeRosa. His insight has been invaluable to me in my own journey of understanding all things data related.
