Data Modeling/Dimension Modeling

Data Modeling/Dimension Modeling

Data Modeling Fundamentals

Cardinality

Normalization

Forward Engineering

Dimensional Modeling

Data Modeling Fundamentals

What is Data Modeling?

  • What is Data Modeling? Representation of what the data is in the real-world. Provide us insights into characteristics and rules that apply to the data.

Lifecycle of Data Modeling

OLTP – Online Transactional Processing

  • Optimized for inserts and updates
  • Optimal for application use
  • Schema structure might fundamentally change for different application needs
  • “Focus on customers entering data”

OLAP – Online Analytical Processing

  • Optimized for heavy reads
  • Optimal for business structure, understandable by business people
  • Schema structure should be consistent and flexible for different business needs


The Building Blocks of Data Modeling

Data Subjects/Entities

  • Commonly called “entities”
  • Others call it “objects” or “classes”
  • Analogous to database tables

Attributes

  • Analogous to a database column
  • Typically associated with Data Subjects
  • Attribute types often “shared” across multiple entities“
  • Bottom-up” data modeling once semi-popular
  • Go down to all the attributes then model them back to entities

  • If there are common attributes if you want you can make the attributes more descriptive.

Attributes have descriptions & rules

  • Data types and size
  • Whether NULL values allowed
  • Permissible values

Attribute Tips

  • Decompose but carefully
  • Ex: “where a student lives
  • ”“STUDENT_ADDR_AND_CITY” = Street address + City + State + Zip Code all in one attribute/field (NOT BEST PRACTICE)
  • “STUDENT_ADDR” + “STUDENT_CITY” + “STUDENT_STATE” + “STUDENT_ZIP” each as a separate attribute/field (BEST PRACTICE)

Relationships among data subjects

  • Defining the relationship amongst the entities

Business rules for data

  • Cardinality
  • Mandatory or optional relationships
  • Permissible attribute values(including NULLs)
  • “Data change dynamics”

Hierarchies in Entities/Data Subjects

Hierarchy

  • Special kind of entity
  • A special type of relationship
  • Think “specializtion”

Hierarchy use case

  • Two or more entities
  • That have “a lot” in common
  • But also “at least a little bit” different

Strong vs Weak Entities/Data Subjects

Strong entity “exists on its own terms”

  • Exists independent of any other entity
  • Does not require any other entity instances to help identify its own instance

Weak entity “needs some help”

  • To identify specific instance of that entity
  • Can’t exist without an instance of another entity
  • or both

Multiple Relationships Between Entities

  • Multiple Relationships between 2 entities

-->Recursive Relationship

--> Ternary Relationship

Data Modeling Gerund

  • Special Type of relationship that acts like an entity.
  • Relationship that have specific attributes specific to the relationship itself

Cardinality

Cardinality: "the number of something"

Maximum Cardinality

  • Number of instances of both sides of a relationship
  • Typical values: “1” or “M”(many)
  • Can also be a specific numeric value

1:1 Relationship

Specific number of Cardinality

Minimum Cardinality

Sometimes referred to as “participation constraint”

  • Total Participation ( Min. cardinality = 1)
  • Partial Participation ( Min. cardinality = 0)

Mandatory vs optional relationship

  • Mandatory relationship (Min. cardinality = 1)
  • Optional relationship (Min. cardinality = 0)

3rd possible value for min. cardinality

  • 0: optional/partial participation
  • 1: mandatory/total participation
  • (n): some explicit number of minimum instances“A full time lecturer must teach at least 6 classes”“A full professor must advise at least 2 other faculty members”

Crow's Foot Notations for Cardinality

Normalization

Normalization “The Key, the Whole Key, Nothing but the Key…”

1st Normal Forms

  • Every row (tuple) must be unique
  • NO repeating groups
  • Multi-valued attributes are in violation of 1N

2nd Normal Forms

  • Must be in 1NF“
  • No partial key dependencies”
  • Applies if composite primary key
  • If single-column (field) primary key then already in 2NF

3rd Normal Forms

  • Must be in 2NF“
  • No non-key dependency”

Forward Engineering

Typical conceptual -> logical transformaitions

  • Address violations of normalization

Transform M:M relationships

  • Add “intersection entity (or table)” to your model
  • Also referred to as “associative entity (or table)”
  • Or “bridge entity (or table)”
  • Purpose: decompose M:M relationship into multiple “semantically equivalent” relationships
  • Semantically equivalent…but “artificial”

  • Add foreign keys

Typical logical -> physical transformation

Denormalization

  • Violating normalization rules deliberate for performance gains
  • Aggregates
  • Materialized Views
  • Join across various tables that physically creates the result of a query
  • Optimized storage placement(e.g. partitioning)
  • Database Indices


Dimensional Modeling

"Dimensional Modeling is a design technique for databases intended to support end-user queries in a data warehouse"

Key Terms

Surrogate Keys:

  • Artificially created keys (usually integers) used only by the data warehouse to uniquely identify a row of dimension table
  • Required to implement history of slowly changing dimensions
  • Avoids conflicts among backend source systems
  • Insulates the data warehouse from source systems

Dimension Table:

  • By what we measure things
  • The who, what, when, where etc. of things
  • What users would want to sort, group, and filter on

Fact Table:

  • Also called a Measure
  • Measurable metric with is described by the dimensions
  • An observation or event

Grain:

  • determines what each fact row contains and in what detail
  • defined by dimensions in the fact table, and their details

Steps of Dimensional Modeling

Choose the business process

  • Describe the business process which the model builds on.

Declare the Grain

  • The grain of the model is the exact description of what the dimensional model should be focusing on. To clarify, you should pick the central process and describe it with one sentence

Identify the Dimensions

  • The dimensions must be defined within the grain. Dimensions are the foundation of the fact table, and is where the data for the fact table is collected. Typically dimensions are nouns like date, store, inventory, etc.?

Identify the Facts

  • Identify numeric facts that will populate each fact table row

Star Schema

  • Marriage of Fact Schema to dimension schema
  • Dimensions relate directly to fact table only
  • Dimensions are deformalized. Does not have a related region lookup tables as an OLTP design
  • Usually dimension keys are NOT keys from the source systems, rather they are generated by the data warehouse load process(Surrogate Keys)
  • Dimension attributes you define determine granularity called the grain of the facts

Snowflake Schema

  • Dimensions relate to another dimension you have a snowflake
  • Snowflake causes a number of performance and usability issues and are rarely justified
  • The principle behind snowflaking is normalization of the dimension tables by removing low cardinality attributes and forming separate tables.

Slowly Changing Dimensions

Type 0:?

  • Never update. Keep the original value
  • Useful for original based tracking

Type 1 :

  • Overwrite the record row
  • Useful when history is not a factor
  • Modeling and querying only by current state
  • Reporting will reflect the current value only

Type 2 :

  • Maintain History
  • Track complete history of dimension
  • Adding 3 columns to maintain type 2 tables
  • 'effective_date' - when the new row becomes the truth
  • 'expiration_date' - when the row expired due to new update
  • 'is_current' - is this row the current system truth

Best Practices

  • Using surrogate key is recommended regardless but essential with Type 2 SCD
  • Expiration date should be in the distant future(9999-01-01)
  • Use only for true slowly changing dimensions
  • Fast changing attributes lead to inflated dimension tables (Fix can be found using Type 4 SCD)
  • Big Dimensions lead to poor performance and slow filtering

Type 3:

  • Keep Limited History?
  • Add new column to table to keep old value
  • old_value
  • new_value
  • Typically used for one-time, cross data change

Type 4 :

  • Maintain a separate history table
  • Addresses Type 2 scaling issues
  • Fast changing dimensions that are not facts
  • they are dimensions for modeling purpose(filter by)
  • they are dimensions for not changing fast enough
  • The mini dimension is tracked through time via the fact table
  • Creates dependence on the fact table to exist and never fundamentally change

Type 5:

  • Separate changing values into mini dimensions
  • Builds on type 4 SCD by embedding a mini-dimension table that is of type 1 SCD
  • Allows for currently assigned mini-dimension tables to be accessed along with the base dimension table without being linked to a fact table

Type 6 :

  • combination of type 1, 2 and 3



要查看或添加评论,请登录

社区洞察

其他会员也浏览了