Data Modeling/Dimension Modeling

Nagaraju Juluru

Lead Data Engineer | Cloud & Big Data Expert | AI-Driven Data Solutions | AWS, GCP, Snowflake, Databricks | Apache Spark | Real-Time Streaming | ETL/ELT | Data Lakehouse | Terraform | CI/CD Automation

Published: April 20, 2024

Data Modeling Fundamentals

What is Data Modeling?

A data model is a representation of what the data is in the real world. It gives us insight into the characteristics of the data and the rules that apply to it.

Lifecycle of Data Modeling

OLTP – Online Transaction Processing

Optimized for inserts and updates
Optimal for application use
Schema structure might fundamentally change for different application needs
“Focus on customers entering data”

OLAP – Online Analytical Processing

Optimized for heavy reads
Optimal for business structure, understandable by business people
Schema structure should be consistent and flexible for different business needs

The Building Blocks of Data Modeling

Data Subjects/Entities

Commonly called “entities”
Others call them “objects” or “classes”
Analogous to database tables

Attributes

Analogous to a database column
Typically associated with Data Subjects
Attribute types often “shared” across multiple entities
“Bottom-up” data modeling was once semi-popular
Go down to all the attributes, then model them back up to entities

When several entities share a common attribute, you can make each attribute name more descriptive.

Attributes have descriptions & rules

Data types and size
Whether NULL values are allowed
Permissible values
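
The three kinds of rules above can be written down as data and checked mechanically. The following is a minimal sketch, not a prescribed design: `AttributeSpec`, `violations`, and the sample list of state codes are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, List, Optional


@dataclass(frozen=True)
class AttributeSpec:
    """Rules attached to one attribute (hypothetical structure)."""
    name: str
    data_type: type
    max_size: Optional[int] = None               # size limit, applied to strings
    nullable: bool = True                        # whether NULL (None) is allowed
    allowed_values: Optional[FrozenSet] = None   # permissible values, if restricted


def violations(spec: AttributeSpec, value: Any) -> List[str]:
    """Return a message for every rule that `value` breaks."""
    if value is None:
        return [] if spec.nullable else [f"{spec.name}: NULL not allowed"]
    if not isinstance(value, spec.data_type):
        return [f"{spec.name}: expected {spec.data_type.__name__}"]
    problems = []
    if spec.max_size is not None and isinstance(value, str) and len(value) > spec.max_size:
        problems.append(f"{spec.name}: longer than {spec.max_size}")
    if spec.allowed_values is not None and value not in spec.allowed_values:
        problems.append(f"{spec.name}: value not permitted")
    return problems


# Illustrative subset of state codes only; a real list would be complete.
STUDENT_STATE = AttributeSpec(
    name="STUDENT_STATE",
    data_type=str,
    max_size=2,
    nullable=False,
    allowed_values=frozenset({"CA", "NY", "TX"}),
)
```

In a relational database the same rules would normally live in the column definition (type, `NOT NULL`, `CHECK` constraints) rather than in application code.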

Attribute Tips

Decompose, but carefully
Ex: “where a student lives”
“STUDENT_ADDR_AND_CITY” = street address + city + state + ZIP code all in one attribute/field (NOT BEST PRACTICE)
“STUDENT_ADDR” + “STUDENT_CITY” + “STUDENT_STATE” + “STUDENT_ZIP”, each as a separate attribute/field (BEST PRACTICE)

Relationships among data subjects

Defining the relationship amongst the entities

Business rules for data

Cardinality
Mandatory or optional relationships
Permissible attribute values (including NULLs)
“Data change dynamics”

Hierarchies in Entities/Data Subjects

Hierarchy

Special kind of entity
A special type of relationship
Think “specialization”

Hierarchy use case

Two or more entities
That have “a lot” in common
But are also “at least a little bit” different

Strong vs Weak Entities/Data Subjects

Strong entity “exists on its own terms”

Exists independent of any other entity
Does not require any other entity instances to help identify its own instance

Weak entity “needs some help”

Needs another entity to identify a specific instance of itself
Can’t exist without an instance of another entity
Or both
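
One common way to express a weak entity in a relational schema is to make the owner's key part of the weak entity's primary key. The sketch below uses Python's built-in `sqlite3` module; the `building` and `room` tables and the helper names are hypothetical examples, not taken from the article.

```python
import sqlite3


def build_schema() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript("""
        -- Strong entity: identified by its own key.
        CREATE TABLE building (
            building_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        -- Weak entity: room_no alone is not unique, so the owning building's
        -- key is part of the primary key, and a room cannot outlive its building.
        CREATE TABLE room (
            building_id INTEGER NOT NULL
                REFERENCES building(building_id) ON DELETE CASCADE,
            room_no     TEXT NOT NULL,
            PRIMARY KEY (building_id, room_no)
        );
    """)
    return conn


def can_insert_room(conn: sqlite3.Connection, building_id: int, room_no: str) -> bool:
    """Try to insert a room; report whether the constraints allowed it."""
    try:
        conn.execute("INSERT INTO room VALUES (?, ?)", (building_id, room_no))
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_schema()
conn.execute("INSERT INTO building VALUES (1, 'Science Hall')")
```

Here a room in a building that does not exist is rejected, and deleting a building removes its rooms, which covers both kinds of “help” a weak entity needs.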

Multiple Relationships Between Entities

Multiple Relationships between 2 entities

--> Recursive Relationship

--> Ternary Relationship

Data Modeling Gerund

A special type of relationship that acts like an entity.
A relationship that has attributes specific to the relationship itself

Cardinality

Cardinality: "the number of something"

Maximum Cardinality

Maximum number of instances on each side of a relationship
Typical values: “1” or “M” (many)
Can also be a specific numeric value

1:1 Relationship

Specific number of Cardinality

Minimum Cardinality

Sometimes referred to as the “participation constraint”

Total participation (min. cardinality = 1)
Partial participation (min. cardinality = 0)

Mandatory vs optional relationship

Mandatory relationship (Min. cardinality = 1)
Optional relationship (Min. cardinality = 0)

3rd possible value for min. cardinality

0: optional/partial participation
1: mandatory/total participation
(n): some explicit minimum number of instances
“A full-time lecturer must teach at least 6 classes”
“A full professor must advise at least 2 other faculty members”
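
A minimum cardinality of n usually cannot be declared as a simple column constraint, so it is often checked with a query or a validation step. The following is a small sketch of such a check; the function name, the lecturer names, and the class codes are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def below_minimum(
    parents: Iterable[str],
    relationship: Iterable[Tuple[str, str]],
    minimum: int,
) -> List[str]:
    """Return the parents that take part in fewer than `minimum` relationship instances."""
    counts = Counter(parent for parent, _child in relationship)
    return [p for p in parents if counts[p] < minimum]


# Hypothetical data: (lecturer, class) pairs for the "teaches" relationship.
teaches = [("ana", f"C{i}") for i in range(6)] + [("raj", "C1"), ("raj", "C2")]
lecturers = ["ana", "raj", "li"]

# Lecturers violating "must teach at least 6 classes".
too_few = below_minimum(lecturers, teaches, minimum=6)
```

With `minimum=1` the same function checks mandatory (total) participation, and with `minimum=0` nothing can fail, which matches optional participation.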

Crow's Foot Notations for Cardinality

Normalization

Normalization: “The Key, the Whole Key, Nothing but the Key…”

1st Normal Form

Every row (tuple) must be unique
No repeating groups
Multi-valued attributes violate 1NF
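
As an illustration of removing a multi-valued attribute, the sketch below moves a list of phone numbers out of a student row into its own table with one value per row. The data and the `to_1nf` helper are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical unnormalized rows: `phones` is a multi-valued attribute.
students_raw: List[Dict] = [
    {"student_id": 1, "name": "Ana", "phones": ["555-0100", "555-0101"]},
    {"student_id": 2, "name": "Raj", "phones": ["555-0199"]},
]


def to_1nf(rows: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split the multi-valued attribute into its own table, one value per row."""
    students = [{"student_id": r["student_id"], "name": r["name"]} for r in rows]
    phones = [
        {"student_id": r["student_id"], "phone": p}
        for r in rows
        for p in r["phones"]
    ]
    return students, phones


students, student_phones = to_1nf(students_raw)
```

Each row of `student_phones` is identified by the pair (`student_id`, `phone`), so rows stay unique and every field holds a single value.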

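2nd Normal Form

Must be in 1NF
“No partial key dependencies”: every non-key attribute depends on the whole primary key, not on part of it
Applies only when the primary key is composite
A 1NF table with a single-column (field) primary key is already in 2NF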
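
A partial key dependency can be seen in a small example. In the hypothetical enrollment rows below, the key is (`student_id`, `course_id`), but `course_title` depends on `course_id` alone; the sketch moves it into a separate course table.

```python
from typing import Dict, List, Tuple

# Hypothetical rows keyed by the composite key (student_id, course_id).
# course_title depends on course_id alone: a partial key dependency.
enrollment_raw: List[Dict] = [
    {"student_id": 1, "course_id": "DB101", "course_title": "Databases", "grade": "A"},
    {"student_id": 2, "course_id": "DB101", "course_title": "Databases", "grade": "B"},
    {"student_id": 1, "course_id": "ST200", "course_title": "Statistics", "grade": "B"},
]


def to_2nf(rows: List[Dict]) -> Tuple[Dict[str, str], List[Dict]]:
    """Move attributes that depend on part of the key into their own table."""
    courses = {r["course_id"]: r["course_title"] for r in rows}
    enrollments = [
        {"student_id": r["student_id"], "course_id": r["course_id"], "grade": r["grade"]}
        for r in rows
    ]
    return courses, enrollments


courses, enrollments = to_2nf(enrollment_raw)
```

After the split, a course title is stored once per course instead of once per enrollment, so renaming a course touches a single row.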

3rd Normal Form

Must be in 2NF
“No non-key dependencies”: no non-key attribute depends on another non-key attribute
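
Whether one attribute depends on another is a business rule, but sample data can hint at candidates worth reviewing. The sketch below checks whether, in a given set of rows, each value of one column maps to exactly one value of another; the `determines` helper and the employee data are hypothetical.

```python
from typing import Dict, Hashable, List


def determines(rows: List[Dict], determinant: str, dependent: str) -> bool:
    """True if, in these rows, each `determinant` value maps to exactly one `dependent` value."""
    seen: Dict[Hashable, Hashable] = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical employee table with primary key emp_id.
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10, "dept_name": "Finance"},
    {"emp_id": 2, "name": "Raj", "dept_id": 10, "dept_name": "Finance"},
    {"emp_id": 3, "name": "Li", "dept_id": 20, "dept_name": "Research"},
]

# dept_name appears to depend on dept_id, a non-key attribute.
suspect = determines(employees, "dept_id", "dept_name")
```

If the business confirms that `dept_id` determines `dept_name`, the usual fix is to move `dept_name` into a separate department table keyed by `dept_id`. A passing check on sample data only suggests a dependency; it does not prove one.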

Forward Engineering

Typical conceptual -> logical transformations

Address violations of normalization

Transform M:M relationships

Add an “intersection entity (or table)” to your model
Also referred to as an “associative entity (or table)”
Or a “bridge entity (or table)”
Purpose: decompose an M:M relationship into multiple “semantically equivalent” relationships
Semantically equivalent…but “artificial”
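
The M:M transformation can be sketched with Python's built-in `sqlite3` module. The `student`, `course`, and `enrollment` tables and the sample rows are hypothetical; the point is that the bridge table turns one M:M relationship into two 1:M relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Intersection (bridge) table: one row per student/course pair.
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ana'), (2, 'Raj');
    INSERT INTO course  VALUES (10, 'Databases'), (20, 'Statistics');
    INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")


def courses_for(student_name: str) -> list:
    """Follow student -> enrollment -> course to list a student's course titles."""
    rows = conn.execute(
        """SELECT c.title
           FROM enrollment e
           JOIN student s ON s.student_id = e.student_id
           JOIN course  c ON c.course_id  = e.course_id
           WHERE s.name = ?
           ORDER BY c.title""",
        (student_name,),
    ).fetchall()
    return [title for (title,) in rows]
```

The two `REFERENCES` clauses are the foreign keys added in the next step of forward engineering. Attributes that belong to the relationship itself, such as a grade, would also go on the bridge table.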

Add foreign keys

Typical logical -> physical transformation

Denormalization

Deliberately violating normalization rules for performance gains
Aggregates
Materialized views
A join across several tables whose query result is physically stored
Optimized storage placement (e.g., partitioning)
Database indexes
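
The sketch below shows a stored aggregate and an index using Python's built-in `sqlite3` module. SQLite has no materialized views, so a plain table created from a query stands in for one here; the table names and figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (100, 1, 20.0), (101, 1, 30.0), (102, 2, 5.0);

    -- Denormalized aggregate: the join and GROUP BY are computed once and stored.
    CREATE TABLE sales_by_region AS
        SELECT c.region      AS region,
               SUM(o.amount) AS total_amount,
               COUNT(*)      AS order_count
        FROM orders o
        JOIN customer c ON c.customer_id = o.customer_id
        GROUP BY c.region;

    -- Index to speed up lookups by region.
    CREATE INDEX idx_sales_by_region ON sales_by_region(region);
""")


def region_total(region: str) -> float:
    """Read the precomputed total instead of re-running the join."""
    row = conn.execute(
        "SELECT total_amount FROM sales_by_region WHERE region = ?", (region,)
    ).fetchone()
    return row[0]
```

The trade-off is that the stored result goes stale when `orders` changes and has to be refreshed, which is the cost of the faster reads.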

Dimensional Modeling

"Dimensional Modeling is a design technique for databases intended to support end-user queries in a data warehouse"

Key Terms

Surrogate Keys:

Artificially created keys (usually integers), used only by the data warehouse to uniquely identify a row of a dimension table
Required to implement history of slowly changing dimensions
Avoids conflicts among backend source systems
Insulates the data warehouse from source systems
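
To illustrate how surrogate keys avoid conflicts among source systems, the sketch below hands out warehouse-owned integers for (source system, natural key) pairs. The class name and the `crm` and `billing` system names are hypothetical, and the sketch deliberately ignores history: with a Type 2 SCD each version of a row would get its own surrogate key.

```python
from itertools import count
from typing import Dict, Tuple


class SurrogateKeyMap:
    """Assigns warehouse-owned integer keys to (source_system, natural_key) pairs."""

    def __init__(self) -> None:
        self._next = count(1)                        # next unused surrogate key
        self._keys: Dict[Tuple[str, str], int] = {}

    def key_for(self, source_system: str, natural_key: str) -> int:
        """Return the existing surrogate key for the pair, or assign a new one."""
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = next(self._next)
        return self._keys[pair]


keys = SurrogateKeyMap()
crm_key = keys.key_for("crm", "C-001")
billing_key = keys.key_for("billing", "C-001")  # same natural key, different system
```

The two systems reuse the natural key `C-001`, but the warehouse rows get distinct keys, and a later change to either system's key format does not ripple into the fact tables.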

Dimension Table:

What we measure things by
The who, what, when, where etc. of things
What users would want to sort, group, and filter on

Fact Table:

Holds the facts, also called measures
Measurable metrics, which are described by the dimensions
Each row is an observation or event

Grain:

Determines what each fact row contains and at what level of detail
Defined by the dimensions in the fact table and their level of detail

Steps of Dimensional Modeling

Choose the business process

Describe the business process which the model builds on.

Declare the Grain

The grain of the model is the exact description of what the dimensional model should focus on. To clarify it, pick the central process and describe it in one sentence.

Identify the Dimensions

The dimensions must be defined within the grain. Dimensions are the foundation of the fact table and are where the data for the fact table is collected. Typically, dimensions are nouns such as date, store, and inventory.

Identify the Facts

Identify numeric facts that will populate each fact table row

Star Schema

Marriage of the fact schema to the dimension schema
Dimensions relate directly to the fact table only
Dimensions are denormalized. Unlike an OLTP design, a dimension has no related lookup tables (for example, a region lookup)
Usually dimension keys are NOT keys from the source systems; rather, they are generated by the data warehouse load process (surrogate keys)
The dimension attributes you define determine the granularity, called the grain, of the facts
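
A very small star schema can be sketched with Python's built-in `sqlite3` module. The `dim_date`, `dim_store`, and `fact_sales` tables, their columns, and the sample figures are hypothetical; note that `region` sits directly on the store dimension rather than in a separate lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date  (date_key  INTEGER PRIMARY KEY, full_date  TEXT, month  TEXT);
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
    -- Grain: one row per store per day.
    CREATE TABLE fact_sales (
        date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
        store_key    INTEGER NOT NULL REFERENCES dim_store(store_key),
        sales_amount REAL NOT NULL,
        PRIMARY KEY (date_key, store_key)
    );
    INSERT INTO dim_date  VALUES (1, '2024-04-19', '2024-04'), (2, '2024-04-20', '2024-04');
    INSERT INTO dim_store VALUES (1, 'Downtown', 'East'), (2, 'Airport', 'West');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 150.0), (1, 2, 80.0);
""")


def sales_by_region() -> dict:
    """Join the fact table to one dimension and group by a dimension attribute."""
    rows = conn.execute("""
        SELECT s.region, SUM(f.sales_amount)
        FROM fact_sales f
        JOIN dim_store s ON s.store_key = f.store_key
        GROUP BY s.region
    """).fetchall()
    return dict(rows)
```

Every query follows the same shape: start from the fact table, join directly to the dimensions needed, then filter or group on dimension attributes.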

Snowflake Schema

When a dimension relates to another dimension, you have a snowflake
Snowflaking causes a number of performance and usability issues and is rarely justified
The principle behind snowflaking is normalization of the dimension tables: low-cardinality attributes are removed and placed in separate tables.

Slowly Changing Dimensions

Type 0:

Never update. Keep the original value
Useful for tracking based on the original value

Type 1:

Overwrite the row
Useful when history is not a factor
Modeling and querying only by current state
Reporting will reflect the current value only
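
A Type 1 change is just an in-place update. The following is a minimal sketch with a hypothetical customer dimension held in a dictionary keyed by surrogate key.

```python
from typing import Dict

# Hypothetical customer dimension keyed by surrogate key.
dim_customer: Dict[int, Dict[str, str]] = {
    1: {"customer_id": "C-001", "city": "Austin"},
}


def apply_type1(dim: Dict[int, Dict[str, str]], surrogate_key: int,
                attribute: str, new_value: str) -> None:
    """Type 1: overwrite the attribute in place; the previous value is lost."""
    dim[surrogate_key][attribute] = new_value


apply_type1(dim_customer, 1, "city", "Denver")
```

After the update there is still one row for the customer, and nothing records that the city used to be Austin.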

Type 2:

Maintain history
Track the complete history of the dimension
Add 3 columns to maintain Type 2 tables:
'effective_date' - when the new row becomes the truth
'expiration_date' - when the row expired due to a new update
'is_current' - whether this row is the current system truth

Best Practices

Using a surrogate key is recommended regardless, but it is essential with a Type 2 SCD
The expiration date of the current row should be in the distant future (9999-01-01)
Use only for true slowly changing dimensions
Fast-changing attributes lead to inflated dimension tables (a Type 4 SCD can fix this)
Big dimensions lead to poor performance and slow filtering
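
The Type 2 mechanics can be sketched as a pure function over a list of rows. The three tracking columns and the 9999-01-01 date come from the text above; the `CustomerRow` class, the `city` attribute, and the way the next surrogate key is chosen are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List

FAR_FUTURE = date(9999, 1, 1)  # expiration date used for the current row


@dataclass(frozen=True)
class CustomerRow:
    customer_key: int            # surrogate key, one per version of the customer
    customer_id: str             # natural key from the source system
    city: str
    effective_date: date
    expiration_date: date = FAR_FUTURE
    is_current: bool = True


def apply_type2(rows: List[CustomerRow], customer_id: str,
                new_city: str, change_date: date) -> List[CustomerRow]:
    """Expire the current row for `customer_id` and append a new current row."""
    next_key = max((r.customer_key for r in rows), default=0) + 1
    updated: List[CustomerRow] = []
    changed = False
    for r in rows:
        if r.customer_id == customer_id and r.is_current and r.city != new_city:
            updated.append(replace(r, expiration_date=change_date, is_current=False))
            changed = True
        else:
            updated.append(r)
    if changed:
        updated.append(CustomerRow(next_key, customer_id, new_city, change_date))
    return updated


history = [CustomerRow(1, "C-001", "Austin", date(2023, 1, 1))]
history = apply_type2(history, "C-001", "Denver", date(2024, 4, 20))
```

Facts loaded before the change keep pointing at surrogate key 1 and facts loaded afterwards point at key 2, which is why the surrogate key is essential here. Re-applying the same value is a no-op, so the load can be rerun safely.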

Type 3:

Keep limited history
Add a new column to the table to keep the old value
old_value
new_value
Typically used for a one-time change that applies across the data
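
A Type 3 update can be sketched as shifting the current value into the old-value column. The `old_value` and `new_value` column names follow the list above; the `rep_id` row and the region names are hypothetical.

```python
from typing import Any, Dict

# Hypothetical row holding the current value plus one prior value.
sales_rep: Dict[str, Any] = {"rep_id": "R-7", "new_value": "East", "old_value": None}


def apply_type3(row: Dict[str, Any], value: Any) -> Dict[str, Any]:
    """Type 3: shift the current value into old_value, then store the new one."""
    return {**row, "old_value": row["new_value"], "new_value": value}


after_first = apply_type3(sales_rep, "Central")
after_second = apply_type3(after_first, "West")
```

After the second change the original value `East` is gone, which shows why this type keeps only limited history.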

Type 4:

Maintain a separate history table
Addresses Type 2 scaling issues
For fast-changing attributes that are not facts
They are dimensions for modeling purposes (you filter by them)
They are dimensions because they do not change fast enough to be facts
The mini-dimension is tracked through time via the fact table
Creates a dependence on the fact table to exist and never fundamentally change

Type 5:

Separate changing values into mini-dimensions
Builds on a Type 4 SCD by embedding a mini-dimension reference that is maintained as a Type 1 SCD
Allows the currently assigned mini-dimension rows to be accessed along with the base dimension table without going through a fact table

Type 6:

Combination of Types 1, 2, and 3
