Optimizing datasets for BI tools using SQL and Dimensional Modeling
Victor Ribeiro
Manager, Data & Analytics | Business Intelligence | Data Science | Strategy & Performance | B2B Lead Generation
I already filtered everything! Why is my dataset still big?
Have you ever had a huge dashboard to optimize, even though you had already gone through every step, from filtering the data to reducing timeframes and trying a different data architecture?
Know that one of the alternatives is to solve this with a simple SQL structure. But first, let's learn a bit about the entire process of optimizing datasets (from my point of view): (1) what dimensional modeling is; (2) how to use it; (3) why to do it through SQL and not from the BI tool; and (4) how to do it.
This approach can be implemented in any tool on the market that uses SQL.
Please note: the objective is to give a brief and simple view of these topics, just enough of a base to solve a specific issue - reducing and optimizing a dataset - so I won't be diving deep into each of them. For that, please refer to the footnotes :)
1. What is dimensional modeling?
“Dimensional modeling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball which includes a set of methods, techniques and concepts for use in data warehouse design.” [1]
Put simply, dimensional modeling means separating your data into entities and the events related to those entities. Three important concepts are dimension tables, fact tables and measures.
Let’s look at the image below.
* Image representing a star schema, a data architecture type. [2]
In this schema, FactResellerSales is - obviously - the fact table. But why? It is a fact table because it holds information about a specific event related to an order, and an order is an event (fact); it also has keys connecting it to the other tables that answer questions like when, who and where - the dimension tables.
The dimensions in this schema are: DimEmployee, DimSalesTerritory, DimDate, DimReseller and DimProduct. They are classified like this because they give us more details about that specific event, such as the name of the employee, when the order happened and where it happened. Those details are what help us transform data into information.
What about this star?
This is a star schema [3]: a type of dimensional modeling characterized by dimensions, facts and measures that, when connected in an entity-relationship model, look like a star.
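To make it concrete, this is roughly how you would query that star schema to answer a business question. It is only a sketch: the key and measure column names (OrderDateKey, ProductKey, SalesAmount, EnglishProductName, etc.) are assumptions based on the diagram, so adapt them to your own model.
-- Sketch: total sales by month and product, joining the fact table to two dimensions
-- (column names are assumed for illustration)
SELECT
      d.CalendarYear
    , d.MonthNumberOfYear
    , p.EnglishProductName
    , SUM(f.SalesAmount) AS TOTAL_SALES_AMOUNT
FROM FactResellerSales AS f
JOIN DimDate    AS d ON d.DateKey    = f.OrderDateKey  -- when the order happened
JOIN DimProduct AS p ON p.ProductKey = f.ProductKey    -- which product was sold
GROUP BY 1, 2, 3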
2. How can dimensional modeling help us reduce datasets?
Simple! Imagine if you had to get all of this data from a single source. It would be a big table, right?
With dimensional modeling you reduce the amount of repeated information and, with that, free up space in your storage. In other words, you get: (i) improved performance; (ii) lower storage costs; and (iii) an organized database structure where it is easy to find information about a specific subject.
REMEMBER
With the correct architecture (not only the star, but others such as the snowflake schema [4]), your database stops being a black box full of floating information and becomes a source of not only data, but also information and insights.
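To illustrate where the repeated information goes, compare a single wide sales table that repeats the store attributes on every order row with a dimensional version where each store appears only once. This is a minimal sketch with made-up table and column names:
-- Hypothetical denormalized table: store attributes repeated on every order row
-- ORDER_ID | ORDER_DATE | STORE_NAME | STORE_CITY | STORE_REGION | SALES_AMOUNT

-- The same data modeled dimensionally: each store is stored once in the dimension
CREATE TABLE DIM_STORE (
      STORE_ID       INTEGER
    , STORE_NAME     VARCHAR
    , STORE_CITY     VARCHAR
    , STORE_REGION   VARCHAR
);

CREATE TABLE FACT_SALES (
      ORDER_ID       INTEGER
    , ORDER_DATE     DATE
    , STORE_ID       INTEGER        -- only the key is repeated, not the attributes
    , SALES_AMOUNT   NUMBER(18,2)
);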
3. Why do it through SQL and not from the BI tool?
This is something hard to define, and something I have researched for a long time. There is no right or wrong answer, really.
From my previous experiences, importing a huge dataset into a BI tool and treating it after the import is really bad in terms of both performance and storage usage, and it can give you a heavy headache (and some after-hours calls). I'll tell you why.
If you work in a company that uses many different reporting tools, you know what I'm saying. Once in a while you need to migrate a report or dashboard to a different one, and... oh... doing it from scratch takes a lot of time!
If you already have, for example, a star schema ready to use in your database, then your migration is almost done. Just import the tables into your BI tool and voilà! Now spend your time building the charts and improving the UX.
If you pre-process everything, leaving just simple sums, subtractions or divisions to be done in the BI tool, the performance will be higher. Why? Because the heavy aggregation happens once, in the database, instead of on every dashboard refresh.
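In practice, after the pre-processing, the only thing the BI tool has to run against the datamart is something as simple as the query below (a sketch that uses the column names from the code in section 4 and a hypothetical datamart name):
-- All the BI tool needs to do against the pre-aggregated datamart:
-- a simple sum, filtered by the dimension columns the user selects
SELECT
      ANALYZED_TIMEFRAME
    , SUM(TOTAL_SALES_AMOUNT) AS TOTAL_SALES
FROM SALES_DATAMART           -- hypothetical datamart name
WHERE DATE_TYPE = '1 - DAILY'
  AND GRAIN_LEVEL = '1 - STORE'
GROUP BY 1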
4. Ok, and how do I do it?
Get your coffee and let’s go to the cool part!
Imagine you have to load 1 year of sales information, at store level (about 10K stores), for a marketplace company.
Well, as expected, the first thing we need to do is filter the information to get only the data we'll need. We won't go into detail on this part, since it would just be a bunch of different WHERE clauses.
Now, imagine that this data is already filtered and treated accordingly, so the second step is to reduce this dataset even more. For this we'll use UNIONs and aggregation functions, but before that, get a few answers from your stakeholders, for example: which time grains do they need (daily, weekly, monthly)? How far back does each grain have to go? And at which level of detail (store, region, total)?
Once all those questions are answered, you have all the information needed to reduce your dataset.
Check out, test and try to understand the SQL code below:
P.s.: if it is hard to read, copy and paste it into your preferred coding tool :)
/*
The code was written based on Snowflake SQL, but it can be modified and applied to any kind of database. Note that the important part here is the logic!
*/
-- This part of the code means: total sales by store, by DAY, for the last 60 days
SELECT
/* DIMENSION COLUMNS */
'1 - DAILY' AS DATE_TYPE -- Level of aggregation in terms of time
, DATE_TRUNC(DAY, DATE_FIELD) AS ANALYZED_TIMEFRAME -- Time value in which the data is aggregated (when those orders happened?)
, '1 - STORE' AS GRAIN_LEVEL -- This column defines the grain level of the information. It can be hierarchical or not
, STORE_ID AS GRAIN -- Grain value in which the data is aggregated (where those orders happened?) - Can be a Surrogate Key
/* MEASURES */
, SUM(SALES_AMOUNT) AS TOTAL_SALES_AMOUNT -- Total sales amount for that specified grain
FROM RAW_TABLE
WHERE 1=1
-- Now, here's where we apply the business questions you've answered before
-- How long should the daily timeframe be? For this case, 60 days is enough for daily aggregated info
AND DATE_TRUNC(DAY, DATE_FIELD)::DATE >= DATE_TRUNC(DAY, DATEADD(DAY, -60, CURRENT_DATE))::DATE
GROUP BY 1,2,3,4
-- Now, instead of aggregating daily, weekly and monthly info inside the BI tool, we'll combine them directly in SQL using the UNION ALL clause
UNION ALL
-- This part of the code means: total sales by store, by WEEK, for the last 16 weeks
-- P.s.: I've removed the comments in this part to make it easier to read.
SELECT
/* DIMENSION COLUMNS */
'2 - WEEKLY' AS DATE_TYPE
, DATE_TRUNC(WEEK, DATE_FIELD) AS ANALYZED_TIMEFRAME
, '1 - STORE' AS GRAIN_LEVEL
, STORE_ID AS GRAIN
/* MEASURES */
, SUM(SALES_AMOUNT) AS TOTAL_SALES_AMOUNT
FROM RAW_TABLE
WHERE 1=1
-- How long should the weekly timeframe be? For this case, 16 weeks is enough for weekly aggregated info
AND DATE_TRUNC(WEEK, DATE_FIELD)::DATE >= DATE_TRUNC(WEEK, DATEADD(WEEK, -16, CURRENT_DATE))::DATE
GROUP BY 1,2,3,4
/****************************************************************************
In the rest of the code, just keep adding UNIONs with different types of aggregations and grains!
Don't forget to transform this into a datamart later to have an optimized table :)
****************************************************************************/
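As a sketch of that last step (the names V_SALES_AGGREGATED, SALES_DATAMART and MY_WH are illustrative), in Snowflake you could wrap the UNION ALL query above in a view [5], materialize it as a datamart table [6] and refresh it on a schedule:
-- Step 1: wrap the full UNION ALL query above in a view (hypothetical name)
-- CREATE OR REPLACE VIEW V_SALES_AGGREGATED AS <the UNION ALL query above>;

-- Step 2: materialize it as the datamart table the BI tools will read
CREATE OR REPLACE TABLE SALES_DATAMART AS
SELECT * FROM V_SALES_AGGREGATED;

-- Step 3 (optional): refresh it every day with a Snowflake task
CREATE OR REPLACE TASK REFRESH_SALES_DATAMART
    WAREHOUSE = MY_WH                       -- hypothetical warehouse name
    SCHEDULE  = 'USING CRON 0 5 * * * UTC'  -- every day at 05:00 UTC
AS
    INSERT OVERWRITE INTO SALES_DATAMART
    SELECT * FROM V_SALES_AGGREGATED;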
“Ok, now I understand the code, but why is this so good at reducing a dataset?”
Imagine that, in this case, we had to include information at Store ID level - about 10K stores - for the last 18 months.
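A rough back-of-envelope, assuming one row per store per day in the raw data: 10K stores × ~540 days is about 5.4 million rows if you import everything at daily grain. With the mixed-grain approach above - say 60 daily points, 16 weekly points and 18 monthly points per store - you end up with roughly 10K × (60 + 16 + 18) ≈ 940 thousand rows, a reduction of more than 80%, while still answering daily, weekly and monthly questions.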
Conclusion
It is a heavy query, but if you create a datamart from it, the result becomes easier and faster to use.
This reduction is (i) what makes your dataset smaller and faster: if you understand the exact timeframes your stakeholders need, you will cover all their needs with less data and better performance; and (ii) reusable: other BI tools can use this same datamart, with pre-calculated info, and run only simple calculations on top of it, making other people's work easier!
Please feel free to suggest other alternatives if you find it appropriate! The community needs this :)
Footnotes / References
[1] Dimensional data modeling (Towards Data Science): https://towardsdatascience.com/dimensional-data-modeling-49038b96d95a
[2] Importance of star schema for Power BI (Microsoft): https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
[3] Star schema (Wikipedia): https://en.wikipedia.org/wiki/Star_schema
[4] Snowflake schema (Wikipedia): https://en.wikipedia.org/wiki/Snowflake_schema
[5] View - SQL (Wikipedia): https://en.wikipedia.org/wiki/View_(SQL)
[6] Data mart (Wikipedia): https://en.wikipedia.org/wiki/Data_mart