Data Vocabulary
3NF (3rd Normal Form) - a database modeling technique to ensure that redundant data is not stored in the database. This is the most common form of relational data model - tables and columns, parents and children, reference tables.
6NF (6th Normal Form aka KVP aka Key Value Pairs) - data is stored at the attribute level - attribute name and attribute value. It creates beautiful, elegant databases. It gives you history at the attribute level. It is an exquisite form of torture for anybody trying to write SQL to query the data for reporting.
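To see why the KVP shape is such exquisite torture for reporting, here is a minimal sketch using SQLite's in-memory database. The table and column names are hypothetical. Every attribute you want back as an ordinary column costs you a `MAX(CASE ...)` pivot (or a self-join):

```python
import sqlite3

# Hypothetical 6NF-style store: one row per (entity, attribute) pair.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer_kvp (customer_id INT, attr_name TEXT, attr_value TEXT)")
con.executemany(
    "INSERT INTO customer_kvp VALUES (?, ?, ?)",
    [(1, "name", "Ada"), (1, "city", "London"),
     (2, "name", "Grace"), (2, "city", "Arlington")],
)

# Reporting needs ordinary columns, so every attribute becomes a
# MAX(CASE ...) pivot -- two attributes is fine, two hundred is torture.
rows = con.execute("""
    SELECT customer_id,
           MAX(CASE WHEN attr_name = 'name' THEN attr_value END) AS name,
           MAX(CASE WHEN attr_name = 'city' THEN attr_value END) AS city
    FROM customer_kvp
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
```

The pivot grows by one `CASE` branch per attribute, which is exactly why the SQL gets ugly fast.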
Analytics Coverage - analytics defined to represent a 360 degree view of your data. It will ensure that any metrics calculated can be trusted. Metrics are balanced and correlated and will highlight any data quality issues.
Analytics Paralysis - you want to define analytics for your data, but the data quality is so poor that any analytic will be misleading. So you spend hours contemplating all possible analytics for the dataset.
Attribute - a field in data modeling terms. It is used interchangeably with field or column.
Column Slip - where the number of columns in your data feed varies. Typically, it causes a majestic screw up in your data load. If undetected you will end up with chronic data pollution. If detected you will spend hours isolating the problem, and then hours fixing it. It can be caused by delimiter problems.
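Detection is cheap if you check the column count on every record before loading. A minimal sketch with a hypothetical pipe-delimited feed, where one value contains a stray delimiter:

```python
import csv
import io

# Hypothetical feed: the second data record has an extra delimiter inside
# a value, so it "slips" to four columns instead of three.
feed = "id|name|amount\n1|Acme|100\n2|Foo|Bar Ltd|200\n3|Baz|300\n"

expected_cols = 3
good, bad = [], []
for lineno, row in enumerate(csv.reader(io.StringIO(feed), delimiter="|"), start=1):
    if lineno == 1:
        continue  # skip the header row
    # Quarantine any row whose column count drifts from the layout.
    (good if len(row) == expected_cols else bad).append((lineno, row))
```

Rejected rows go to a quarantine file for the hours of isolating and fixing; the good rows load cleanly instead of polluting the table.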
Data Analytics - measures calculated for data. They can be at the row level (where all the fields in the calculation are in the row) or aggregated (where the formula goes across multiple rows). They are fun when they work properly, annoying when the logic is not right or the data is screwy. Often you will need to build out sets of analytics to prove that the target analytic is actually correct.
Data Archive - the place where old transactions are retired (once they are no longer needed). Availability depends upon if you think you might need it or you are sure you will never need it. Most people never purge data (it makes them nervous). There are multiple versions of archive. Online (immediately available), Nearline (run next door and turn on the server), Offline (catch the bus, grab the tape and back to the office).
DA (Data Architect) - the plumber of the databases at the enterprise level. Picks all the nodes and pipelines. Likes to speak in multiple data languages concurrently. Has a habit of obfuscation, but really wants to be loved by all. Takes great pride in the harmony of data flowing across the enterprise. Knows things that you didn't even realize were possible to know.
Data Audit - we all love internal control when they check that we are following the defined processes for governance, and they check that our data exhibits the required audit trails and lineage and aligns with the source data. I used to think this was the fun part (I spent some time in internal control), but people with sloppy habits treat this as torture.
Data Dictionary - a formal document detailing the tables and columns in the database. It should be a good reference for all people who want to use or query the database. Like all reference documents it should be clearly versioned to align with the database.
Data Domain - all possible values for a single column. One of the most important concepts in data. More important than most people realize.
Data Governance - the processes and people that ensure that your data is complete, valid, accurate, well-defined and secure. Poor governance will always result in fragile data that is not readily trusted.
Data Lifecycle - birth, usage and death of data. Data does not live forever; we always need to design for a specific database size and related retention.
Data Misalignment - a common problem when two dependent data sources are out of sync. You can end up with orphan data when there is no value for the FK or dropped data when you exclude the data from a load view.
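The orphan case is easy to hunt down with a classic anti-join. A minimal sketch using SQLite in memory, with hypothetical customer/sale tables loaded while FKs were disabled:

```python
import sqlite3

# Hypothetical parent/child pair loaded with FKs disabled, so orphans slip in.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, customer_id INT)")
con.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,)])
con.executemany("INSERT INTO sale VALUES (?, ?)", [(10, 1), (11, 2), (12, 99)])

# Classic orphan check: child rows whose FK value has no matching parent.
orphans = con.execute("""
    SELECT s.sale_id, s.customer_id
    FROM sale s
    LEFT JOIN customer c USING (customer_id)
    WHERE c.customer_id IS NULL
""").fetchall()
```

Running this check after every load is one of the cheapest data controls you can own.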
Data Pollution - where the column loading fails, and some source columns end up loading to the wrong columns. If you have an all text table it can go undetected. It is typically caused by delimiter problems.
Data Quality - a measure that indicates the percentage of rows that meet the requirements for valid data. It will be expressed in percentage and number of rows.
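A minimal sketch of the calculation, with a hypothetical validity rule (the rule itself always comes from your governance definitions):

```python
# Hypothetical validity rule: email must contain "@" and amount must be >= 0.
rows = [
    {"email": "a@x.com", "amount": 10},
    {"email": "bad-email", "amount": 5},
    {"email": "c@y.com", "amount": -3},
    {"email": "d@z.com", "amount": 7},
]

def is_valid(row):
    return "@" in row["email"] and row["amount"] >= 0

valid_rows = sum(1 for r in rows if is_valid(r))
quality_pct = 100.0 * valid_rows / len(rows)
# Report BOTH numbers: the percentage and the raw row counts.
```

Reporting both the percentage and the row counts matters: 99.9% sounds fine until you learn it is 100,000 bad rows.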
Data Retention - the rules for retention and archive of data. Your physical server is designed for a specific retention (n years of data) and load (n customers with n transactions). Any change to these values will ultimately break the server.
Data Silo - a little world of data that has no possible keys to connect to anything outside of that world. A great example is customer sales data with no customer key. You have sales data but no access to any customer data. It is often of very limited use and value.
Data Table - defined in a database to store a dataset with records in the same format.
Data Type - identifies what type of data the field is used for (date/time, number, character). It also drives how the data is stored physically in the database.
Data Visualization - transforming data into a visual representation that informs and sometimes excites (a true magician). The most common tools are Power BI and Tableau.
Data Warehouse - a reporting database where transaction data from multiple systems is transformed into a set of related reporting structures. This data is typically used for reporting and dashboards. These tend to be very large databases. A data warehouse may have ten years of data. The systems feeding it may just have one year's worth of data.
Database Backup - a scheduled task managed by the DBAs. It will be a combination of full backups and incremental backups - maybe a full backup once a week or once a month. On a quarterly basis you do need to test that you can restore from the backups.
Database Block - a database is physically an area of disk space divided up into equal size blocks (4K or 8K or 16K or ... bytes). When you create a table it is assigned to a block. As the table grows it is assigned more blocks from the database free space. These blocks may be scattered around the database as you are sharing the space with many tables. When you optimize the database it moves the blocks for a table so they are adjacent. For a distributed database it will spread the table across all the nodes again.
Database Design - it looks easier than it really is. Many people design for a static sample of the data. You need to design for data at rest AND data in motion. The most beautiful data model is a paperweight if it doesn't perform under load.
Database Locks - when you update a record it is locked for the duration of the update. Many, many record locks at the same time can escalate the locking to a table lock - which can make your day really suck, as the database grinds to a slow crawl. Smart designs try to go insert only, and/or minimize updates.
Database Model - the formal definition of tables and columns with the relationships between each table. It will be presented as both an ERD (Entity Relationship Diagram) and a SQL script for actual build of the database structures. There will be a logical model (which is database agnostic) and a physical model (which is tailored for the target database).
DBA (Database Administrator) - the guardian of the database. They administer the databases - performance, reliability, security, etc. Tends to have an opinion about everything database related. Responds well to nice words but not flattery. If you make friends they will let you see the stats you really want to see.
Distributed Database - a database that is physically spread across multiple nodes. Each node has CPU, memory and disk. When a query executes it executes across each node and then combines the results to send back to the requestor. The secret is to keep each node executing in harmony. You don't want one that is crazy busy and others just idling. Database physical design is key here.
Entity - a table in modeling terms. Table is really the physical structure. These terms are commonly used interchangeably.
ELT (Extract Load Transform) - see ETL, except the raw data is loaded first and transformed inside the target. Used by people who just like to be different.
ETL (Extract Transform Load) - software to take data from one source, adjust the format for the target and then load the data. The process will normally be scheduled on a nightly run. But, it could be on any schedule, or it could be triggered by a condition in the source data. You will normally see two variants for each data source: an initial load and then an incremental load (that will be used on a daily basis).
Explain Plan - when a SQL query executes it creates an execution plan. You can review this plan to see the decisions that the query engine took and the time and rows included in each step. This is always the first step in tuning SQL.
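SQLite ships a lightweight version of the same idea, `EXPLAIN QUERY PLAN`, which makes a handy minimal sketch (table and index names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
con.execute("CREATE INDEX ix_t_val ON t (val)")

# Ask the engine how it WOULD run the query, without running it.
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE val = 'x'").fetchall()
plan_text = " ".join(row[-1] for row in plan)
# The detail column shows the engine chose the index, not a full scan.
```

On a big-iron database the plan also carries row counts and timings per step, but the tuning workflow is the same: read the plan before you touch the SQL.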
False Hope - all of your analytics produce positive results, but they are all directly related to skew in the data. It is a direct result of your failure to address coverage for your analytics. Skew should never drive happiness.
Foreign Key - a foreign key is used for lookups (code to reference table) and parent-child.
Grain - the level that your data is stored at. Such as Customer-level or Transaction-level.
Identity Value - A column designated as an identity with an integer start value and an increment value. It is used to generate a unique value for each row.
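A minimal sketch: in SQLite an `INTEGER PRIMARY KEY` column behaves like an identity column (other engines spell it `IDENTITY(start, increment)` or use sequences); the table name here is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# order_id acts as the identity: omit it on insert and the engine
# generates the next integer value for you.
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT)")
con.execute("INSERT INTO orders (item) VALUES ('widget')")
con.execute("INSERT INTO orders (item) VALUES ('gadget')")
ids = [r[0] for r in con.execute("SELECT order_id FROM orders ORDER BY order_id")]
```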
Indexes - there are several types of indexes, but the intent of an index is to speed up access to a row or a group of rows. Indexes are created for one or more columns. They can be unique or non-unique. An index is a physical object in the database. A really good data modeler will ask you lots of questions about your query paths. Too few indexes means poor performance; too many indexes also means poor performance. The one you will be most familiar with is a B-tree index. Once you start to get into it you will see bitmap indexes and bloom filters.
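The unique flavor does double duty: it speeds up lookups AND enforces uniqueness. A minimal sketch with SQLite (hypothetical table and index names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (emp_id INT, email TEXT)")
# A unique index both speeds up email lookups and rejects duplicates.
con.execute("CREATE UNIQUE INDEX ux_emp_email ON emp (email)")
con.execute("INSERT INTO emp VALUES (1, 'a@x.com')")

duplicate_rejected = False
try:
    con.execute("INSERT INTO emp VALUES (2, 'a@x.com')")  # same email again
except sqlite3.IntegrityError:
    duplicate_rejected = True
```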
Metadata - data about your data. A good database has more metadata than actual values. Description, definition, usage, source, lineage, security, visibility, history and on and on.
Misplaced Confidence - You have no data quality measures and no data controls. The assumption is that all feeds are complete and that an ETL refresh with no errors will yield perfect data. This requires a lot of smiling and appearing confident in all meetings. Prayer is required to prevent close scrutiny.
Music - an essential part of data life. It often defies all expectations - one of my favorite data engineers was a Vietnamese Black Sabbath fan.
MWOT (Magnificent Waste Of Time) - the hours spent tuning a query that is so complex it will never be maintainable and will never scale properly. You should be able to recognize this in the first 15 minutes. The query should be abandoned and completely rethought.
OLTP (Online Transaction Processing) - built for speed. The most well known OLTP systems are sales systems, but banks also fall into this category. You design to touch the data as lightly as possible and avoid record locking, which escalates to table locking.
Orphan Data - through a data alignment problem you end up with a parent with no children or children with no parent (many load processes disable FKs during load).
Parent-Child - the relationship between two tables (such as header and line item). The primary key of the parent table is the root of the primary key for the child table. There can be zero, one or many children. If the rule is only one child for each parent they may have identical primary keys.
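A minimal sketch of the header/line-item case with SQLite - the child's primary key is rooted on the parent's key (table names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The child's primary key = parent key + line number.
con.executescript("""
    CREATE TABLE order_header (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_line (
        order_id INT,
        line_no  INT,
        sku      TEXT,
        PRIMARY KEY (order_id, line_no)
    );
""")
con.execute("INSERT INTO order_header VALUES (1, 'Acme')")
con.executemany("INSERT INTO order_line VALUES (?, ?, ?)",
                [(1, 1, 'SKU-A'), (1, 2, 'SKU-B')])
line_count = con.execute(
    "SELECT COUNT(*) FROM order_line WHERE order_id = 1").fetchone()[0]
```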
Primary Key - the set of fields (one or more) in a data table that represent uniqueness for each row. In a lot of situations an identity value will be used to create a unique integer value for each row.
Referential Integrity - Primary keys and foreign keys will be used to enforce referential integrity (duplicates and relationships between tables). You cannot have duplicate primary keys, you can't have children without a parent.
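A minimal sketch of the "no children without a parent" rule, using SQLite (note that SQLite needs FK enforcement explicitly switched on; the table names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
con.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child  (id INTEGER PRIMARY KEY,
                         parent_id INT REFERENCES parent(id));
""")
con.execute("INSERT INTO parent VALUES (1)")
con.execute("INSERT INTO child VALUES (10, 1)")  # fine: the parent exists

orphan_blocked = False
try:
    con.execute("INSERT INTO child VALUES (11, 99)")  # no such parent
except sqlite3.IntegrityError:
    orphan_blocked = True
```

This is exactly the constraint many load processes disable for speed - which is how the orphan data described above gets in.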
Relational Database - a database based upon a relational model. Tables are defined and stored with relationships. The most well known ones are Oracle and SQL Server (on single servers).
Some Data Issues - we heard there is a problem, but we are really not sure what it is. We will check for a couple of hours and then schedule a meeting for next week to tell you we think the problem is resolved. A really good data team will be completely transparent as they unravel any data issue. A bad data team will provide as little info as possible.
SQL (Structured Query Language) - the language used to interact with the database. Building, managing and querying data. You can learn the basics in a couple of hours. Getting really really good can take years.
T-shirts - often a competitive sport. Obscure bands, vintage data products and kittens.
Trigger - an action that is initiated when a specific event occurs. It can be as simple as the insert of a record. There are many different schools of thought about triggers - love or hate. Personally, I will use them if they fit the requirement. But, they will always be lightweight. Bad things happen when lots of triggered events happen at the same time.
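A minimal sketch of a lightweight trigger in SQLite - it does one thing (record the change) and nothing else; the table names are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance INT);
    CREATE TABLE audit_log (account_id INT, old_balance INT, new_balance INT);
    -- Lightweight trigger: just record the change, no heavy logic.
    CREATE TRIGGER trg_account_update AFTER UPDATE ON account
    BEGIN
        INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
""")
con.execute("INSERT INTO account VALUES (1, 100)")
con.execute("UPDATE account SET balance = 150 WHERE id = 1")
log = con.execute("SELECT * FROM audit_log").fetchall()
```

One insert per update is about the right weight; a trigger that fires more triggers is where the bad things start.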
Unstable Analytics - you publish your metrics daily based upon incomplete data. The remaining data is included in the data for the next day. The numbers for the previous day will always change when you post the current day. This produces head banging and teeth grinding in the business users. It is a great example of why you need to maintain good lineage data to ensure you can replicate results at any point in time.
Version Control - version control software works really nicely with applications, not so nicely with databases. But you do need to version your databases AND all the documentation that aligns with the databases AND you need to stash a little version control table in your database that tells you the history and the current version.
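A minimal sketch of that little version table, stashed in the database itself (the table and column names are hypothetical - migration tools each have their own flavor of this):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per schema change, newest row = current version.
con.execute("""
    CREATE TABLE schema_version (
        version     TEXT,
        applied_at  TEXT,
        description TEXT
    )
""")
con.executemany("INSERT INTO schema_version VALUES (?, ?, ?)", [
    ("1.0", "2024-01-01", "initial build"),
    ("1.1", "2024-02-10", "add customer email column"),
])
current = con.execute(
    "SELECT version FROM schema_version ORDER BY applied_at DESC LIMIT 1"
).fetchone()[0]
```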
WORM Database (Write Once Read Many) - a great example of a database built for speed. It is great for queries, fast for inserts, but painfully slow for updates. Updates become a logical delete followed by an insert. Really fun for database modeling because you try to design for insert only.
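A minimal sketch of the insert-only pattern, using a hypothetical versioned price table in SQLite - an "update" never touches the old row, it just inserts the next version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Insert-only design: each change is a new version row, old rows stay put.
con.execute("CREATE TABLE price (sku TEXT, version INT, amount INT)")
con.execute("INSERT INTO price VALUES ('A', 1, 100)")

# "Update" SKU A to 120: insert version 2, never rewrite version 1.
con.execute("INSERT INTO price VALUES ('A', 2, 120)")

current = con.execute("""
    SELECT amount FROM price
    WHERE sku = 'A'
    ORDER BY version DESC LIMIT 1
""").fetchone()[0]
history = con.execute("SELECT COUNT(*) FROM price WHERE sku = 'A'").fetchone()[0]
```

As a bonus, the old versions ARE your attribute-level history - no separate audit table required.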
Creating A Shared Language Of Data
For Amna Ahmed and Anna Wu