Dimensional Modeling - Part 2 of 2: Creating a Successful Model
In my previous post, I introduced dimensional modeling, compared it with transactional modeling, and outlined its overall benefits. In this post, I will share my thoughts on how to design a successful dimensional data model for analytical applications. These ideas come from my years of experience in the business intelligence field, combined with Kimball’s best-practice standards, which have been commonplace for over 20 years.
Work with Subject Matter Experts (SMEs) to understand business needs
Too often I see data architects and developers try to create the data model in a vacuum. This may work in certain transactional environments, but not when creating a dimensional model. If you recall from my previous post, the goal of dimensional modeling is to allow business users to utilize their data to create trends, find correlations, and make informed decisions. You may be thinking: "Good thing I have a business analyst on my team!" Even if you work with a business analyst, the data modeler must be engaged in these discussions. I cannot stress this enough.
To be successful, the modeler must engage the business to fully understand business rules, objectives, and the goals of the data to be retrieved.
A few talking points for discussions with SMEs:
- What fields are needed for ad-hoc analysis vs reporting?
- Do you need to see data historically? A common example is to view customer data, specifically addresses, over a period of time.
- What defines a unique transaction?
- Do you plan to report on your data daily, monthly, or yearly?
- How do you define key business terms? Different parts of the organization often define the same term differently. When this happens, you will need to work with the affected departments to agree on a standard definition or formula for the term.
- Which data would you like to be summarized or calculated?
- Do you foresee any data that you would like to report on or correlate in the future?
Understand the source data
Even if you can create an exceptional dimensional model, the data the business would like to retrieve may not exist in the source database. It’s very important to understand the limitations of the data within the organization. To determine the feasibility of the data requirements, data profiling should be completed on the source data. Profiling is the process in which data is examined and statistics are collected. When profiling data consider the following:
- Write SQL against the source system to find data values, patterns, inconsistencies, or other data anomalies.
- Determine the relationships and cardinality of data between the tables, unique keys, and what type of data is populated in the source database.
- A quick trick is to count occurrences of each candidate key value. This can help determine if the key is truly unique. You can use the query below to help determine uniqueness:
SELECT UNIQUE_KEY, COUNT(*)
FROM TABLE_NAME
GROUP BY UNIQUE_KEY
HAVING COUNT(*) > 1
- Do not simply copy or reuse the transactional data structures. Remember, the dimensional model structures should be denormalized for efficient reporting.
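The uniqueness check above can also be sketched in Python against an in-memory SQLite database; the table and column names here (sales_order, order_id) are hypothetical stand-ins for your actual source system:

```python
import sqlite3

# Hypothetical source table and candidate key for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_order (order_id INT, item TEXT)")
conn.executemany(
    "INSERT INTO sales_order VALUES (?, ?)",
    [(1, "widget"), (2, "gadget"), (2, "gizmo")],
)

# Same duplicate-key profiling query as above: any key value that
# appears more than once means the column is not truly unique.
dupes = conn.execute("""
    SELECT order_id, COUNT(*)
    FROM sales_order
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # order_id 2 appears twice, so it is not a unique key
```

An empty result set from this query is the signal you want: the candidate key is unique and can safely anchor your source-to-target mapping.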
Once this data is gathered, it's best to document this information into a source database analysis document. This will provide the input for the source to target mapping, as well as a roadmap for converting a conceptual model into a physical one.
Determine the grain of your model
Once requirements are gathered, you should be able to determine the grain of your data. The grain is the level of detail at which your data is stored, and it should be described in business language.
Let’s pause for a moment and look at the following sales receipt:
Your fact should always be at the lowest grain possible. This makes aggregating or reporting on summary-level data fast and easy. For this example, the lowest grain is the sales order item, thus the grain is: “a line item for each sales order transaction”.
Watch out: One common mistake that I’ve seen is a model where grains are mixed. This is a big no-no and will be explained when I discuss creating the fact table. In this example, a mixed grain would be storing a line item amount and a total amount in the same fact row.
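A tiny sketch of why mixed grains are dangerous, using hypothetical fact rows: once an order-total row sits alongside its own line items in the same table, a straightforward SUM double-counts the order.

```python
# Hypothetical fact rows illustrating a mixed grain: line-item rows
# plus an order-total row stored in the same fact table.
fact_rows = [
    {"order": 101, "line": 1, "amount": 40.0},
    {"order": 101, "line": 2, "amount": 60.0},
    {"order": 101, "line": None, "amount": 100.0},  # order total row breaks the grain
]

# A naive SUM double-counts the order, because the total row
# overlaps the line items it summarizes.
naive_total = sum(r["amount"] for r in fact_rows)
print(naive_total)  # 200.0, twice the true sales of 100.0

# Keeping every row at the line-item grain removes the ambiguity.
line_total = sum(r["amount"] for r in fact_rows if r["line"] is not None)
print(line_total)  # 100.0
```

Every query against a mixed-grain table needs a filter like the one above to be correct, which is exactly the kind of trap a consistent grain avoids.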
Decide which dimensions are needed to support your metrics
Once the grain is determined, you can then begin the discussion around what dimensions relate to that grain of data. Dimensions help describe the who, what, where, when, why questions around your data. They are also used for grouping and filtering metrics.
In our sales order example, the product, customer, employee, store location, and transaction date dimensions all make sense at the order line level. The combination of these dimensions uniquely identifies the grain, that is, a transaction in your fact table.
When creating dimensions, it’s important to understand the concept of conformed dimensions. Conformed dimensions are preferred in dimensional modeling so that data is common and consistent between dimension tables. Because they share common dimensionality, conformed dimensions can be used to analyze data across multiple facts. This allows for enterprise reporting of dimensions and metrics throughout the organization. It also means less development effort than maintaining multiple dimensions with similar or identical meanings.
Data governance and standardization are also critical when creating dimensions. If conformed dimensions will be used throughout an organization for things such as customers and products, decisions around naming standards and data policies are needed. Many times, buy-in will be needed from multiple divisions throughout the organization.
Watch out: One common mistake that I’ve seen is trying to re-invent the wheel when creating a dimension. Look for common dimensions that already exist in your enterprise. If you have something that already exists, can it be leveraged for your design or does a new dimension need to be created?
Decide on the fact
Facts contain the measures that are to be aggregated, summed, or calculated by the dimensions. Because we now know the grain of data, and the dimensions, we can decide what metrics we would like stored in our fact table. Each metric must be at the grain of data declared earlier.
Often dimensions need to be associated with a metric that is not at the same grain. If so, there are two common options:
- Allocate the metric at the header level down to the detailed level based on a business rule, such as a weighted average
- Remove the dimension that breaks the grain of the data
Refer back to the sales order receipt from when we declared the grain. A common mistake would be to store the discount amount in the fact table as is. This breaks the grain of the data because the $3 order-level discount does not belong to any single line item. Given that the grain was declared at the line level, removing the dimension (product) that breaks the grain is not an option. Instead, we can use a simple allocation formula to store the discount amount at the line item level, spreading the $3 discount across the items based on each item’s share of the subtotal. There are different allocation techniques, but I primarily use a weighted average. The weighted average for this example would be:
(Line Amount / SUM of all line amounts per order) * Discount Amount
Our fact table would look like this:
Note that in our fact table, the dimensions are represented as keys. These are surrogate keys: system-generated values that relate back to each dimension's primary key.
Also note that the line amount is not stored since it can easily be calculated by multiplying the unit price by quantity.
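A minimal sketch of the weighted-average allocation, assuming hypothetical line amounts of $20 and $10 (the actual values on the receipt may differ):

```python
# Hypothetical order: two line items and a $3 order-level discount.
lines = [("shirt", 20.0), ("hat", 10.0)]  # (item, line amount)
discount = 3.0

# Weighted average: (line amount / sum of all line amounts) * discount,
# so each line absorbs the discount in proportion to its share
# of the order subtotal.
subtotal = sum(amount for _, amount in lines)
allocated = {item: round(amount / subtotal * discount, 2)
             for item, amount in lines}
print(allocated)  # {'shirt': 2.0, 'hat': 1.0}
```

In practice, rounding each allocated amount can leave the total a cent off from the original discount; a common fix is to push any remainder onto the last line item so the allocated amounts still sum to the order-level discount.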
Watch out: One common mistake that I’ve seen is creating a fact table for a particular report or set of reports. Fact tables should not be created just for specific reports. Remember that users may not know what questions they want answered from the BI solution. Having data at a consistent grain allows the model to scale for growth and to answer many combinations of questions.
Peer review and iteration
Now that we have engaged the business, assessed the source data, declared the grain, and determined the dimensions and facts, you might think you would be done. Unfortunately, data modeling is an iterative process. Your model should be reviewed by peers, SMEs, business analysts, and other stakeholders.
Don’t think of the data model as a technical document; instead, it’s a way to explain the business process back to the business users. You should also have a few other data modelers take a look to make sure you haven’t missed anything.
Please share your ideas and suggestions below.
For continued reading on the topic, the following articles provide a more detailed analysis on dimensional modeling.
Kimball, R. (1997, August 2). A Dimensional Modeling Manifesto. Retrieved June 25, 2015, from https://www.kimballgroup.com/1997/08/a-dimensional-modeling-manifesto/
Kimball, R. (2013, August 1). Kimball Dimensional Modeling Techniques. Retrieved June 25, 2015, from https://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
Moody, D., & Kortink, M. (2003). From ER Models to Dimensional Models: Bridging the Gap between OLTP and OLAP Design, Part I. Business Intelligence Journal, 7-24. Retrieved June 25, 2015, from https://sonra.io/wp-content/uploads/2010/10/From_ER_Models_to_Dimensional_Models_Part_I.pdf
The views expressed in this post are entirely my own, and don't represent that of my employer or any of its employees.