登录查看更多内容

Data Vocabulary

Nigel Shaw

Creating A Shared Language Of Data

发布日期: 2023年5月7日

3NF (3rd Normal Form) - a database modeling technique to ensure that redundant data is not stored in the database. This is the most common form of relational data model - tables and columns, parents and children, reference tables.

6NF (6th Normal Form aka KVP aka Key Value Pairs) - data is stored at the attribute level - attribute name and attribute value. It creates beautiful, elegant databases. It gives you history at the attribute level. It is an exquisite form of torture for anybody trying to write SQL to query the data for reporting.

Analytics Coverage - analytics defined to represent a 360 degree view of your data. It will ensure that any metrics calculated can be trusted. Metrics are balanced and correlated and will highlight any data quality issues.

Analytics Paralysis - you want to define analytics for your data, but the data quality is so poor that any analytic will be misleading. So you spend hours contemplating all possible analytics for the dataset.

Attribute - a field in data modeling terms. It is used interchangeably with field or column.

Column Slip - where the number of columns in your data feed varies. Typically, it causes a majestic screw up in your data load. If undetected you will end up with chronic data pollution. If detected you will spend hours isolating the problem, and then hours fixing it. It can be caused by delimiter problems.

Data Analytics - measures calculated for data. They can be at the row level (where all the fields in the calculation are in the row) or aggregated (where the formula goes across multiple rows). The are fun when they work properly, annoying when the logic is not right or the data is screwy. Often you will need to build out sets of analytics to prove that the target analytic is actually correct.

Data Archive - the place where old transactions are retired (once they are no longer needed). Availability depends upon if you think you might need it or you are sure you will never need it. Most people never purge data (it makes them nervous). There are multiple versions of archive. Online (immediately available), Nearline (run next door and turn on the server), Offline (catch the bus, grab the tape and back to the office).

DA (Data Architect) - the plumber of the databases at the enterprise level. Picks all the nodes and pipelines. Likes to speak in multiple data languages concurrently. Has a habit of obfuscation, but really wants to be loved by all. Takes great pride in the harmony of data flowing across the enterprise. Knows things that you didn't even realize were possible to know.

Data Audit - we all love internal control when they check that we are following the defined processes for governance, and they check that our data exhibits the required audit trails and lineage and aligns with the source data. I used to think this was the fun part ( I spent some time in internal control), but people with sloppy habits treat this as torture.

Data Dictionary - a formal document detailing the tables and columns in the database. It should be a good reference for all people who want to use or query the database. Like all reference documents it should be clearly versioned to align with the database.

Data Domain - all possible values for a single column. One of the most important concepts in data. More important than most people realize.

Data Governance - the processes and people that ensure that your data is complete, valid, accurate, well-defined and secure. Poor governance will always result in fragile data that is not readily trusted.

Data Lifecycle - birth, usage and death of data. Data does not live forever, we always need to design for a specific databases size and related retention.

Data Misalignment - a common problem when two dependent data sources are out of sync. You can end up with orphan data when there is no value for the FK or dropped data when you exclude the data from a load view.

Data Pollution - where the column loading fails, and some source columns end up loading to the wrong columns. If you have an all text table it can go un detected. It is typically caused by delimiter problems.

Data Quality - a measure that indicates the percentage of rows that meet the requirements for valid data. It will be expressed in percentage and number of rows.

Data Retention - the rules for retention and archive of data. You physical server is designed for a specific retention (n years of data) and load (n customers with n transactions). Any change to these values will ultimately break the server.

Data Silo - a little world of data that has no possible keys to connect to anything outside of that world. A great example is customer sales data with no customer key. You have sales data but no access to any customer data. It is often of very limited use and value.

Data Table - defined in a database to store a dataset with records in the same format.

Data Type - identifies what type of data the field is used for (date/time, number, character). It also drives how the data is stored physically in the database.

Data Visualization - transforming data into a visual representation that informs and sometimes excites (a true magician). The most common tools are Power BI and Tableau.

Data Warehouse - a reporting database where transaction data from multiple systems is transformed into a set of related reporting structures. This data is typically used for reporting and dashboards. These tend to be very large databases. A data warehouse may have ten years of data. The systems feeding it may just have one years worth of data.

Database Backup - a scheduled task managed by the DBAs. It will be a combination of full backups and incremental backups. maybe a full backup once a week or once a month. On a quarterly basis you do need to test that you can restore from the backups.

Database Block - a database is physically an area of disk space divided up into equal size blocks (4K or 8K or 16K or ... bytes). When you create a table it is assigned to a block. As the table grows it is assigned more blocks from the database free space. These blocks may be scattered around the database as you are. sharing the space with many tables. When you optimize the database it moves the blocks for a table so they are adjacent. For a distributed database it will spread the table across all the nodes again.

Database Design - it looks easier than it really is. Many people design for a static sample of the data. You need to design for data at rest AND data in motion. The most beautiful data model is a paperweight if it doesn't perform under load.

Database Locks - when you update a record it is locked for the duration of the update. Many, many record locks at the same time can escalate the locking to a table lock - which can make your day really suck, as the database grinds to a slow crawl. Smart designs try to go insert only, and/or minimize updates.

Database Model - the formal definition of tables and columns with the relationships between each table. It will be presented as both an ERD (Entity Relationship Diagram) and a SQL script for actual build of the database structures. There will be a logical model (which is database agnostic) and a physical model (which is tailored for the target database).

领英推荐

Introduction to Data Modeling Basics

Doug Rose 5 个月前

Data Modeling Decoded: Crafting Effective Data…

Enterprise DNA 1 年前

What's the Difference Between Data Wrangling vs Data…

Dr. RVS Praveen Ph.D 1 年前

DBA (Database Administrator) - the guardian of the database. They administer the databases - performance, reliability, security, etc. Tends to have an opinion about everything database related. Responds well to nice words but not flattery. If you make friends they will let you see the stats you really want to see.

Distributed Database - a database that is physically spread across multiple nodes. Each node has CPU, memory and disk. When a query executes it executes across each node and then combines the results to send back to the requestor. The secret is to keep each node executing in harmony. You don't want one that is crazy busy and others just idling. Database physical design is key here.

Entity - a table in modeling terms. Table is really the physical structure. These terms are commonly used interchangeably.

ELT (Extract Load Transform) - see ETL. Used by people who just like to be different.

ETL (Extract Transform Load) - software to take data from one source, adjust the format for the target and then loads the data. The process will normally be scheduled on a nightly run. But, it could be on any schedule, or it could be triggered by a condition in the source data. You will normally see two variants for each data source. An initial load and then an incremental load (that will be used on a daily basis)

Explain Plan - when a SQL query executes it creates an execution plan. You can review this plan to see the decisions that the query engine took and the time and rows included in each step. This is always the first step in tuning SQL.

False Hope - all of your analytics produce positive results, but they are all directly related to skew in the data. It is a direct result of your failure to address coverage for your analytics. Skew should never drive happiness.

Foreign Key - a foreign key is used for lookups (code to reference table) and parent-child.

Grain - the level that your data is stored at. Such as Customer-level or Transaction-level.

Identity Value - A column designated as an identity with an integer start value and an increment value. It is used to generate a unique value for each row.

Indexes - there are several types of indexes, but the intent an index is to speed up access to a row or a group of rows. Indexes are created for one or more columns. They can be unique or non-unique. An index is a physical object in the database. A really good data modeler will ask you lots of questions about your query paths. Too few indexes presents poor performance, too many indexes is also poor performance. The one you will be most familiar with is a B-tree index. Once you start to get into it you will see bitmap indexes and bloom filters.

Metadata - data about your data. A good database has more metadata than actual values. Description, definition, usage, source, lineage, security, visibility, history and on and on.

Misplaced Confidence - You have no data quality measures and no data controls. The assumption is that all feeds are complete and that an ETL refresh with no errors will yield perfect data. This requires a lot of smiling and appearing confident in all meetings. Prayer is required to prevent close scrutiny.

Music - an essential part of data life. It often defies all expectations - one of my favorite data engineers was a Vietnamese Black Sabbath fan.

MWOT (Magnificent Waste Of Time) - the hours spent tuning a query that is so complex it will never be maintainable and will never scale properly. You should be able to recognize this in the first 15 minutes. The query should be abandoned and completely rethought.

OLTP (Online Transaction Processing) - build for speed. The most well known OLTP systems are sales systems, but banks also fall into this category. You design to touch the data as lightly as possible and avoid record locking which escalates to table locking.

Orphan Data - through a data alignment problem you end up with a parent with no children or children with no parent (many load processes disable FKs during load).

Parent-Child - the relationship between two tables (such as header and line item). The primary key of the parent table is the root of the primary key for the child table. There can be zero, one or many children. If the rule is only one child for each parent they may have identical primary keys.

Primary Key - the set of fields (one or more) in a data table that represent uniqueness for each row. In a lot of situations an identity value will be used to create a unique integer value for each row.

Referential Integrity - Primary keys and foreign keys will be used to enforce referential integrity (duplicates and relationships between tables). You cannot have duplicate primary keys, you can't have children without a parent.

Relational Database - a database based upon a relational model. Tables are defined and stored with relationships. The most well known ones are Oracle and SQL Server (on single servers).

Some Data Issues - we heard there is a problem, but we are really not sure what it is. We will check for a couple of hours and then schedule a meeting for next week to tell you we think the problem is resolved. A really good data team will be completely transparent as they unravel any data issue. A bad data team will provide as little info as possible.

SQL (Structured Query Language) - the language used to interact with the database. Building, managing and querying data. You can learn the basics in a couple of hours. Getting really really good can take years.

T-shirts - often a competitive sport. Obscure bands, vintage data products and kittens.

Trigger - an action that is initiated when a specific event occurs. It can be as simple as the insert of a record. There are many different schools of thought about triggers - love or hate. Personally, I will use them if they fit the requirement. But, they will always be lightweight. Bad things happen when lots of triggered events happen at the same time.

Unstable Analytics - you publish your metrics daily based upon incomplete data. The remaining data is included in the data for the next day. The numbers for the previous day will always change when you post the current day. This produces head banging in the business users and teeth grinding. It is a great example of when you need to maintain good lineage data to ensure you can replicate results at any point in time.

Version Control. Version Control Software works really nicely with applications, not so nicely with databases. But you do need to version your databases AND all the documentation that aligns with the databases AND you need to stash a little version control table in your database that tells you the history and the current version.

WORM Database. Write Once Read Many. This is a great example of a database built for speed. It is great for queries, fast for inserts, but painful slow for updates. Updates become a logical delete followed by an insert. Really fun for database modeling because you try to design for insert only.

Nigel Shaw

Creating A Shared Language Of Data

1 年

For Amna Ahmed Anna Wu

要查看或添加评论，请登录

Nigel Shaw的更多文章

Integer Bitmaps - Data Modeling

2024年8月18日

Integer Bitmaps - Data Modeling

One of the greatest qualities in data modeling is elegance - finding a solution that is beautiful and simple. Integer…

2 条评论
Adventures In Data

2024年7月11日

Adventures In Data

I actually think it is harder now to be a data analyst - there are so many tools and possibilities. It must hard to…
Being A Great IT Manager

2024年7月8日

Being A Great IT Manager

You were selected to lead. Just being a manager barely meets any job requirement.
Data Quality - The Elephant In The Room

2024年6月21日

Data Quality - The Elephant In The Room

I think there is a general reticence to speak the truth; data quality issues are ultimately defects, and what we should…
No Room For Heroes

2024年6月19日

No Room For Heroes

When you work with data you need a real team with both depth and width. There is often a tendency to make your team…
Data Undressed

2024年6月14日

Data Undressed

Data at the source is wrapped in layers of metadata; definition, lineage, value, security and on and on. Once we start…
Noor Inayat Khan

2024年6月11日

Noor Inayat Khan

If you ever visit Dachau, you will see a simple memorial plaque in the Memorial Hall for Noor Inayat Khan. She is one…
The Eternal Battle - Data vs App

2024年6月11日

The Eternal Battle - Data vs App

If you look at the root of most data quality issues it is normally embedded within apps - either poor data design (how…

1 条评论
Dashboard Trust Transparency

2024年6月8日

Dashboard Trust Transparency

When you publish a dashboard to your stakeholders you are making a commitment to them to provide timely and accurate…
Domains, Lists And Data

2024年6月3日

Domains, Lists And Data

Mastering data is all about mastering the basics, and there is no better place to start than domains. A domain is…

See all articles

Data Vocabulary

Nigel Shaw

Creating A Shared Language Of Data

领英推荐

Nigel Shaw的更多文章

社区洞察

其他会员也浏览了

Big Wishes for Big Data

Standardizing Data Delivery with Data as a Product

What is a Data Structure?

Benefits Of Using Data Modeling Tool For Data Analysts

Data Analysis Tools For Perfect Data Management

"Foundations of Data Mastery: Choosing the Right Tools for Success"

From Toolkit to Transformation: Mastering Data Analysis Techniques

Tales of A Data Mom: Data Stewardship – Are You Curious?

Data Stack as a Service (2/3)

Hive Optimization 50 Tips

领英推荐

Nigel Shaw的更多文章

Integer Bitmaps - Data Modeling

Adventures In Data

Being A Great IT Manager

Data Quality - The Elephant In The Room

No Room For Heroes

Data Undressed

Noor Inayat Khan

The Eternal Battle - Data vs App

Dashboard Trust Transparency

Domains, Lists And Data

社区洞察

其他会员也浏览了

Big Wishes for Big Data

Standardizing Data Delivery with Data as a Product

What is a Data Structure?

Benefits Of Using Data Modeling Tool For Data Analysts

Data Analysis Tools For Perfect Data Management

"Foundations of Data Mastery: Choosing the Right Tools for Success"

From Toolkit to Transformation: Mastering Data Analysis Techniques

Tales of A Data Mom: Data Stewardship – Are You Curious?

Data Stack as a Service (2/3)

Hive Optimization 50 Tips