Data Modeling/Dimension Modeling
Nagaraju Juluru
Lead Data Engineer | Cloud & Big Data Expert | AI-Driven Data Solutions | AWS, GCP, Snowflake, Databricks | Apache Spark | Real-Time Streaming | ETL/ELT | Data Lakehouse | Terraform | CI/CD Automation
Data Modeling Fundamentals
What is Data Modeling?
- What is Data Modeling? Representation of what the data is in the real-world. Provide us insights into characteristics and rules that apply to the data.
Lifecycle of Data Modeling
OLTP – Online Transactional Processing
- Optimized for inserts and updates
- Optimal for application use
- Schema structure might fundamentally change for different application needs
- “Focus on customers entering dataâ€
OLAP – Online Analytical Processing
- Optimized for heavy reads
- Optimal for business structure, understandable by business people
- Schema structure should be consistent and flexible for different business needs
The Building Blocks of Data Modeling
Data Subjects/Entities
- Commonly called “entitiesâ€
- Others call it “objects†or “classesâ€
- Analogous to database tables
Attributes
- Analogous to a database column
- Typically associated with Data Subjects
- Attribute types often “shared†across multiple entities“
- Bottom-up†data modeling once semi-popular
- Go down to all the attributes then model them back to entities
- If there are common attributes if you want you can make the attributes more descriptive.
Attributes have descriptions & rules
- Data types and size
- Whether NULL values allowed
- Permissible values
Attribute Tips
- Decompose but carefully
- Ex: “where a student lives
- â€â€œSTUDENT_ADDR_AND_CITY†= Street address + City + State + Zip Code all in one attribute/field (NOT BEST PRACTICE)
- “STUDENT_ADDR†+ “STUDENT_CITY†+ “STUDENT_STATE†+ “STUDENT_ZIP†each as a separate attribute/field (BEST PRACTICE)
Relationships among data subjects
- Defining the relationship amongst the entities
Business rules for data
- Cardinality
- Mandatory or optional relationships
- Permissible attribute values(including NULLs)
- “Data change dynamicsâ€
Hierarchies in Entities/Data Subjects
Hierarchy
- Special kind of entity
- A special type of relationship
- Think “specializtionâ€
Hierarchy use case
- Two or more entities
- That have “a lot†in common
- But also “at least a little bit†different
Strong vs Weak Entities/Data Subjects
Strong entity “exists on its own termsâ€
- Exists independent of any other entity
- Does not require any other entity instances to help identify its own instance
Weak entity “needs some helpâ€
- To identify specific instance of that entity
- Can’t exist without an instance of another entity
- or both
Multiple Relationships Between Entities
- Multiple Relationships between 2 entities
-->Recursive Relationship
--> Ternary Relationship
Data Modeling Gerund
- Special Type of relationship that acts like an entity.
- Relationship that have specific attributes specific to the relationship itself
Cardinality
Cardinality: "the number of something"
Maximum Cardinality
- Number of instances of both sides of a relationship
- Typical values: “1†or “Mâ€(many)
- Can also be a specific numeric value
1:1 Relationship
Specific number of Cardinality
Minimum Cardinality
Sometimes referred to as “participation constraintâ€
- Total Participation ( Min. cardinality = 1)
- Partial Participation ( Min. cardinality = 0)
领英推è
Mandatory vs optional relationship
- Mandatory relationship (Min. cardinality = 1)
- Optional relationship (Min. cardinality = 0)
3rd possible value for min. cardinality
- 0: optional/partial participation
- 1: mandatory/total participation
- (n): some explicit number of minimum instances“A full time lecturer must teach at least 6 classesâ€â€œA full professor must advise at least 2 other faculty membersâ€
Crow's Foot Notations for Cardinality
Normalization
Normalization “The Key, the Whole Key, Nothing but the Key…â€
1st Normal Forms
- Every row (tuple) must be unique
- NO repeating groups
- Multi-valued attributes are in violation of 1N
2nd Normal Forms
- Must be in 1NF“
- No partial key dependenciesâ€
- Applies if composite primary key
- If single-column (field) primary key then already in 2NF
3rd Normal Forms
- Must be in 2NF“
- No non-key dependencyâ€
Forward Engineering
Typical conceptual -> logical transformaitions
- Address violations of normalization
Transform M:M relationships
- Add “intersection entity (or table)†to your model
- Also referred to as “associative entity (or table)â€
- Or “bridge entity (or table)â€
- Purpose: decompose M:M relationship into multiple “semantically equivalent†relationships
- Semantically equivalent…but “artificialâ€
- Add foreign keys
Typical logical -> physical transformation
Denormalization
- Violating normalization rules deliberate for performance gains
- Aggregates
- Materialized Views
- Join across various tables that physically creates the result of a query
- Optimized storage placement(e.g. partitioning)
- Database Indices
Dimensional Modeling
"Dimensional Modeling is a design technique for databases intended to support end-user queries in a data warehouse"
Key Terms
Surrogate Keys:
- Artificially created keys (usually integers) used only by the data warehouse to uniquely identify a row of dimension table
- Required to implement history of slowly changing dimensions
- Avoids conflicts among backend source systems
- Insulates the data warehouse from source systems
Dimension Table:
- By what we measure things
- The who, what, when, where etc. of things
- What users would want to sort, group, and filter on
Fact Table:
- Also called a Measure
- Measurable metric with is described by the dimensions
- An observation or event
Grain:
- determines what each fact row contains and in what detail
- defined by dimensions in the fact table, and their details
Steps of Dimensional Modeling
Choose the business process
- Describe the business process which the model builds on.
Declare the Grain
- The grain of the model is the exact description of what the dimensional model should be focusing on. To clarify, you should pick the central process and describe it with one sentence
Identify the Dimensions
- The dimensions must be defined within the grain. Dimensions are the foundation of the fact table, and is where the data for the fact table is collected. Typically dimensions are nouns like date, store, inventory, etc.?
Identify the Facts
- Identify numeric facts that will populate each fact table row
Star Schema
- Marriage of Fact Schema to dimension schema
- Dimensions relate directly to fact table only
- Dimensions are deformalized. Does not have a related region lookup tables as an OLTP design
- Usually dimension keys are NOT keys from the source systems, rather they are generated by the data warehouse load process(Surrogate Keys)
- Dimension attributes you define determine granularity called the grain of the facts
Snowflake Schema
- Dimensions relate to another dimension you have a snowflake
- Snowflake causes a number of performance and usability issues and are rarely justified
- The principle behind snowflaking is normalization of the dimension tables by removing low cardinality attributes and forming separate tables.
Slowly Changing Dimensions
Type 0:?
- Never update. Keep the original value
- Useful for original based tracking
Type 1 :
- Overwrite the record row
- Useful when history is not a factor
- Modeling and querying only by current state
- Reporting will reflect the current value only
Type 2 :
- Maintain History
- Track complete history of dimension
- Adding 3 columns to maintain type 2 tables
- 'effective_date' - when the new row becomes the truth
- 'expiration_date' - when the row expired due to new update
- 'is_current' - is this row the current system truth
Best Practices
- Using surrogate key is recommended regardless but essential with Type 2 SCD
- Expiration date should be in the distant future(9999-01-01)
- Use only for true slowly changing dimensions
- Fast changing attributes lead to inflated dimension tables (Fix can be found using Type 4 SCD)
- Big Dimensions lead to poor performance and slow filtering
Type 3:
- Keep Limited History?
- Add new column to table to keep old value
- old_value
- new_value
- Typically used for one-time, cross data change
Type 4 :
- Maintain a separate history table
- Addresses Type 2 scaling issues
- Fast changing dimensions that are not facts
- they are dimensions for modeling purpose(filter by)
- they are dimensions for not changing fast enough
- The mini dimension is tracked through time via the fact table
- Creates dependence on the fact table to exist and never fundamentally change
Type 5:
- Separate changing values into mini dimensions
- Builds on type 4 SCD by embedding a mini-dimension table that is of type 1 SCD
- Allows for currently assigned mini-dimension tables to be accessed along with the base dimension table without being linked to a fact table
Type 6 :
- combination of type 1, 2 and 3