Optimizing datasets for BI tools using SQL and Dimensional Modeling
A guide on how to reduce your data while covering all requirements from your stakeholders. (Author: Victor Ribeiro)


I already filtered everything! Why is my dataset still big?

Have you ever had a huge dashboard you needed to optimize, even though you had already gone through the whole process, from filtering the data to reducing timeframes and trying a different data architecture?

One alternative is solving this through a simple SQL structure. But first, let's learn a bit about the whole process of optimizing datasets (from my point of view): (1) what dimensional modeling is; (2) how to use it; (3) why to do it through SQL and not from the BI tool; and (4) how to do it.

This approach can be implemented in any tool available on the market that uses SQL.

Please note: the objective is to give a brief and simple view of these topics, enough of a base to solve a specific issue - reducing and optimizing a dataset - so I won't be diving deep into each of them. For that, please refer to the footnotes :)

1. What is dimensional modeling?

"Dimensional modeling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball which includes a set of methods, techniques and concepts for use in data warehouse design." [1]

Put simply, dimensional modeling means separating your data into entities and the events related to those entities. Three important concepts are fact tables, dimension tables and measures.

  • Fact tables: a collection of data that represents a business event (a fact). It has connections to measures and dimensions that provide more detail.
  • Measures: the numeric attributes of the fact table. They can be calculations or simply values related to an event, and they help answer questions expressed in numbers, such as: How much? At what rate?
  • Dimension tables: entities that answer specific business questions about the fact table: Who? What? Where? When? Why? (See the sketch right after this list.)
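
To make those three concepts concrete, here is a minimal sketch of what such tables could look like. All names (DIM_STORE, DIM_DATE, FACT_SALES and their columns) are hypothetical, just to illustrate the structure:

-- Dimension: answers "where did it happen?"
CREATE TABLE DIM_STORE (
    STORE_KEY   INT PRIMARY KEY,   -- surrogate key
    STORE_NAME  VARCHAR(100),
    CITY        VARCHAR(100)
);

-- Dimension: answers "when did it happen?"
CREATE TABLE DIM_DATE (
    DATE_KEY    INT PRIMARY KEY,   -- e.g. 20240131
    FULL_DATE   DATE,
    MONTH_NAME  VARCHAR(20),
    YEAR_NUMBER INT
);

-- Fact table: one row per business event (a sale), pointing to its dimensions
CREATE TABLE FACT_SALES (
    STORE_KEY    INT REFERENCES DIM_STORE (STORE_KEY),
    DATE_KEY     INT REFERENCES DIM_DATE (DATE_KEY),
    SALES_AMOUNT DECIMAL(18,2)     -- measure: answers "how much?"
);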

Let’s look at the image below.

* Image representing a star schema, a data architecture type. [2]


In this schema, FactResellerSales is - obviously - the fact table. But why? It is a fact table because it holds information about a specific business event - an order - and it has keys to connect to other tables that answer the questions of when, who, where, etc.: the dimension tables.

The dimensions in this schema are: DimEmployee, DimSalesTerritory, DimDate, DimReseller and DimProduct. They are classified like this because they give us more details about that specific event, like the name of the employee, when the order happened, where it happened, etc. Those details are what help us turn data into information.

What about this star?

This is a star schema [3]: a type of dimensional model characterized by dimensions, facts and measures that, when connected in an entity-relationship diagram, look like a star.
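
Once the star is in place, a single query can answer a business question by joining the fact table to the dimensions it needs. A minimal sketch against the schema in the image - the column names (SalesAmount, OrderDateKey, SalesTerritoryKey, CalendarYear, etc.) are assumptions, since the image doesn't spell them out:

-- "How much did each sales territory sell per calendar year?"
SELECT
  t.SalesTerritoryCountry
, d.CalendarYear
, SUM(f.SalesAmount)        AS TotalSales
FROM FactResellerSales f
JOIN DimSalesTerritory t ON t.SalesTerritoryKey = f.SalesTerritoryKey
JOIN DimDate           d ON d.DateKey           = f.OrderDateKey
GROUP BY 1,2;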

2. How can dimensional modeling help us reduce datasets?

Simple! Imagine if you had to get all of this data from a single source. It would be a big table, right?

With dimensional modeling you reduce the amount of repeated information and, with that, free up space in your storage. In other words, you are (i) improving performance; (ii) reducing storage costs; and (iii) creating an organized database structure where it is easy to find information about a specific subject. The sketch below illustrates the idea.
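
To see where the repetition goes away, here is a minimal, standalone sketch (hypothetical names again) of how a flat table could be split into a dimension and a fact table:

-- Suppose SALES_FLAT repeats STORE_NAME and CITY on millions of sales rows.
-- Move the repeated attributes into a dimension, stored once per store...
CREATE TABLE DIM_STORE AS
SELECT DISTINCT STORE_ID, STORE_NAME, CITY
FROM SALES_FLAT;                  -- ~10K rows instead of millions

-- ...and keep only the key plus the measure on the fact rows
CREATE TABLE FACT_SALES AS
SELECT SALE_DATE, STORE_ID, SALES_AMOUNT
FROM SALES_FLAT;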


REMEMBER

With the correct architecture (not only star, but others such as the snowflake schema [4]), your database will stop being a black box full of floating information and become a source not only of data, but also of information and insights.
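
For reference, the main difference is that a snowflake schema also normalizes the dimensions themselves. A minimal sketch with hypothetical names:

-- In a star schema, DIM_PRODUCT would hold the category name directly;
-- in a snowflake schema, the dimension is normalized into extra lookup tables
CREATE TABLE DIM_PRODUCT_CATEGORY (
    CATEGORY_KEY  INT PRIMARY KEY,
    CATEGORY_NAME VARCHAR(100)
);

CREATE TABLE DIM_PRODUCT (
    PRODUCT_KEY   INT PRIMARY KEY,
    PRODUCT_NAME  VARCHAR(100),
    CATEGORY_KEY  INT REFERENCES DIM_PRODUCT_CATEGORY (CATEGORY_KEY)   -- normalized out
);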

3. Why do it through SQL and not from the BI tool?

This is something hard to define, and something I've been researching for a long time. There's no right or wrong answer, really.

From my previous experience, importing a huge dataset into a BI tool and treating it afterwards is really bad in terms of both performance and storage usage, and can give you a heavy headache (and some after-hours calls). I'll tell you why.

  • Doing it through SQL makes it easier for you to adapt it to different BI tools.

If you work in a company that uses many different reporting tools, you know what I mean. Every once in a while you need to migrate a report or dashboard to a different tool, and... oh... doing it from scratch takes a lot of time!

If you already have a star schema, for example, ready to use in your database, then your migration is almost done. Just import the tables into your BI tool and voilà! Now you can spend your time building the charts and improving the UX.

  • Heavy calculations will be done in the "backend" and not in the "front end"

If you pre-process everything, leaving just simple sums, subtractions or divisions, for example, to be done in the BI tool, performance will be higher. Why?

  1. You won't need complex formulas - one of the biggest performance offenders in BI tools;
  2. By handling the bigger volume of data in SQL, you only send the report aggregated, pre-treated rows that are almost ready to use, which makes the report lighter;
  3. If you have heavy queries running in your report, you might end up impacting the refresh time and even the whole BI server infrastructure! It will take more time to update the information, and if there is no controlled environment on the server side in terms of timeouts or query performance handling, your dashboard can even prevent other ones from updating, impacting the whole pipeline by generating waiting queues;
  4. With pre-calculated information you can make sure there will be no information disparities across different teams, helping with data governance. This can also be done by using views [5] instead of data marts [6] (but you'll lose performance) - see the sketch below.
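
To illustrate that last point, the same pre-calculated logic can live either in a view (always up to date, but recomputed on every query) or in a data mart table (faster, but it has to be refreshed). A minimal sketch with hypothetical names (V_SALES_BY_STORE, DM_SALES_BY_STORE, and RAW_TABLE standing in for your source table):

-- Option A: a view - one shared definition of "total sales", so every team sees
-- the same number, but the aggregation runs every time someone queries it
CREATE OR REPLACE VIEW V_SALES_BY_STORE AS
SELECT STORE_ID, SUM(SALES_AMOUNT) AS TOTAL_SALES_AMOUNT
FROM RAW_TABLE
GROUP BY STORE_ID;

-- Option B: a data mart table - the same definition, materialized for performance
-- and refreshed on a schedule
CREATE OR REPLACE TABLE DM_SALES_BY_STORE AS
SELECT * FROM V_SALES_BY_STORE;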

4. Ok, and how do I do it?


Get your coffee and let’s go to the cool part!

Imagine you have to load 18 months of sales information, at store level (about 10K stores), for a marketplace company.

Well, as expected, the first thing we need to do is filter the information to get only the data we'll need. We won't go into detail on this part, since it would just be a bunch of different WHERE clauses.

Now, imagine this data is already filtered and treated accordingly; the second step is to reduce the dataset even more. For this we'll use UNIONs and aggregation functions, but before that, gather more details from your stakeholders:

  1. At what frequency will they look at this information?
  2. How long a timeframe do they need in daily, weekly, monthly and yearly visuals?
  3. What is the lowest grain needed for this report? Or rather, which grains are going to be present?

Once all those questions are answered, you have all the information you need to reduce your dataset.

Check out the SQL code below and try to test and understand it:

P.S.: if it is hard to read, copy and paste it into your preferred coding tool :)

/*
The code below was created based on Snowflake SQL, but it can be modified and applied to any kind of database. Note that the important part here is the logic!
*/

-- This part of the code means: total sales by store, by DAY, for the last 60 days
SELECT

/* DIMENSION COLUMNS */
  '1 - DAILY'                   AS DATE_TYPE            -- Level of aggregation in terms of time
, DATE_TRUNC(DAY, DATE_FIELD)   AS ANALYZED_TIMEFRAME   -- Time value in which the data is aggregated (when did those orders happen?)
, '1 - STORE'                   AS GRAIN_LEVEL          -- This column defines the grain level of the information. Can be hierarchical or not
, STORE_ID                      AS GRAIN                -- Grain value in which the data is aggregated (where did those orders happen?) - can be a surrogate key

/* MEASURES */
, SUM(SALES_AMOUNT)             AS TOTAL_SALES_AMOUNT   -- Total sales amount for the specified grain

FROM RAW_TABLE

WHERE 1=1

-- Now, here's where we apply the business questions you've answered before
-- How long should the daily timeframe be? For this case, 60 days is enough for daily aggregated info
AND DATE_TRUNC(DAY, DATE_FIELD)::DATE >= DATE_TRUNC(DAY, DATEADD(DAY, -60, CURRENT_DATE))::DATE

GROUP BY 1,2,3,4

-- Now, instead of aggregating daily, weekly and monthly info inside the BI tool, we join them directly in SQL using the UNION clause
UNION ALL

-- This part of the code means: total sales by store, by WEEK, for the last 16 weeks
-- P.s.: I've removed the column comments in this part to make it easier to read.
SELECT

/* DIMENSION COLUMNS */
  '2 - WEEKLY'                  AS DATE_TYPE
, DATE_TRUNC(WEEK, DATE_FIELD)  AS ANALYZED_TIMEFRAME
, '1 - STORE'                   AS GRAIN_LEVEL
, STORE_ID                      AS GRAIN

/* MEASURES */
, SUM(SALES_AMOUNT)             AS TOTAL_SALES_AMOUNT

FROM RAW_TABLE

WHERE 1=1

-- How long should the weekly timeframe be? For this case, 16 weeks is enough for weekly aggregated info
AND DATE_TRUNC(WEEK, DATE_FIELD)::DATE >= DATE_TRUNC(WEEK, DATEADD(WEEK, -16, CURRENT_DATE))::DATE

GROUP BY 1,2,3,4

/****************************************************************************

In the rest of the code, just keep playing with unions and different types of aggregations and grains!

Don't forget to transform this into a data mart later to have an optimized table :)

****************************************************************************/
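
As the closing comment suggests, you can materialize the result of the whole union as a data mart. Below is a minimal sketch in Snowflake syntax, reusing the names from the query above (SALES_DATAMART is a made-up name, just for illustration):

-- Materialize the pre-aggregated union query into an optimized table (data mart)
-- Re-run this on a schedule (e.g. via your orchestration tool) to keep it fresh
CREATE OR REPLACE TABLE SALES_DATAMART AS
SELECT
  '1 - DAILY'                   AS DATE_TYPE
, DATE_TRUNC(DAY, DATE_FIELD)   AS ANALYZED_TIMEFRAME
, '1 - STORE'                   AS GRAIN_LEVEL
, STORE_ID                      AS GRAIN
, SUM(SALES_AMOUNT)             AS TOTAL_SALES_AMOUNT
FROM RAW_TABLE
WHERE 1=1
AND DATE_TRUNC(DAY, DATE_FIELD)::DATE >= DATE_TRUNC(DAY, DATEADD(DAY, -60, CURRENT_DATE))::DATE
GROUP BY 1,2,3,4
-- UNION ALL the weekly, monthly and yearly blocks here, exactly as in the query above
;

-- The BI tool then only reads the small, pre-aggregated table:
SELECT * FROM SALES_DATAMART;
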
"Ok, now I understand the code, but why is this so good for reducing a dataset?"

Imagine that, in this case, we had to include information at Store ID level (about 10K stores) for the last 18 months.

  • If we brought the info in on a daily basis, we'd have: 10K stores * (18 months * 30 days) = ~5.4MM rows
  • This way, we reduce the row count to: 10K stores * (18 monthly + 16 weekly + 60 daily rows) = ~940K rows (a reduction of about 83%!)

Conclusion


It is a heavy piece of code, but if you create a data mart, it becomes easier and faster to use.

This reduction is what makes your dataset faster and smaller: (i) if you understand the exact timeframes your stakeholders need, you'll cover all their needs with less data and better performance; (ii) other BI tools can use this same data mart, with its pre-calculated info, and do only simple calculations on top, making other people's work easier!

Please feel free to suggest other alternatives if you see fit! The community needs this :)

Footnotes / References

[1] Dimensional data modeling (Towards Data Science): https://towardsdatascience.com/dimensional-data-modeling-49038b96d95a

[2] Importance of star schema for Power BI (Microsoft): https://docs.microsoft.com/en-us/power-bi/guidance/star-schema

[3] Star schema (Wikipedia): https://en.wikipedia.org/wiki/Star_schema

[4] Snowflake schema (Wikipedia): https://en.wikipedia.org/wiki/Snowflake_schema

[5] View - SQL (Wikipedia): https://en.wikipedia.org/wiki/View_(SQL)

[6] Data mart (Wikipedia): https://en.wikipedia.org/wiki/Data_mart

