Optimization of Snowflake for BI Products
The business world is moving faster as technology and innovation reach every area. Any organization that wants to remain competitive and stand out to customers must be able to make decisions in real time. Of course, these decisions must be supported by the judgment of business experts and backed by data.
If a data strategy gives our experts and managers access to relevant, up-to-date information, it can yield remarkable results for growing the business. This is where business intelligence makes its first impact: describing the causes and results of current actions and offering options for different paths in the near future, using historical data to make predictions or simply to analyze trends.
From a technical perspective, every data strategy involves different components, starting with the sources of information and the ingestion of that data from transactional systems into a data repository or data warehouse with analytical features. Once the data is stored in the data warehouse, the cleansing, transformation, and enrichment processes follow, and then proper data modeling, so that end users can use their visualization tools to display dashboards with strategic information, KPIs, and trends that feed decision-making. Data governance processes and business objectives must accompany all these technical elements; if one part does not complement the others, the data strategy will be incomplete and inefficient.
The technical heart of the data strategy is the data warehouse, so let's take some time to expand on the concepts behind it and on how to optimize the queries that business intelligence analysts run against it.
A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system for reporting and analyzing structured and semi-structured data from multiple data sources.
This is where our data lives and where we connect our visualization tools. A data warehouse is typically organized in several layers; however, I will not explain all the layers and components it should include in this article. Instead, I will focus on improving performance when consuming data.
To improve the performance of data queries, we must analyze three essential pillars:
1. How we have stored our data
2. The human part
3. The capacity or computing power to process the queries
Storage
Knowing how our data is physically stored and how our database engine retrieves it is the starting point for any data project. Remember that, in very general terms, there are two prominent types of structured databases: transactional and analytical.
Transactional databases focus on high performance for frequent inserts, updates, or deletes of small transactions: points of sale, call centers that contact different clients every minute and need to record data for each one, customer service, inventories, and any CRM or ERP. They are fast with individual transactions. Analytical engines, on the other hand, focus on high-performance batch inserts or queries over millions of records in minutes or seconds. Neither is better than the other; they simply target different use cases.
Snowflake is currently working on other storage formats to support transactional workloads; here, however, we want to analyze its characteristics as an analytical engine.
Physical data storage within Snowflake is based on the concept of micro-partitions. A table is a logical representation of data stored in one or more micro-partitions. Each micro-partition is a group of records ordered either by the natural order of data ingestion or by a defined cluster key, which lets Snowflake continuously reorder the data according to the fields that make up the key.
Considering this, our first optimization step is to review the data ingestion processes and validate the order of the records that arrive through files or integrations. If that order corresponds to one of our query keys, we are already on the way to good optimization. If there is no key field ordering the ingestion, we have two alternatives: the first, and the most efficient in the long term, is to modify the processes that create the files ingested into Snowflake. If that is not possible, the second alternative is to define a cluster key manually to indicate the order the data should have when physically stored.
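As a minimal sketch, assuming a hypothetical orders table that analysts usually filter by order date, defining the cluster key and checking how well the data is organized for it could look like this:
ALTER TABLE orders CLUSTER BY (order_date);

-- Inspect how well the micro-partitions line up with the chosen key
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');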
This solves a use case for querying our data, preferably the most frequently used case. But what about users who query the data using another set of columns as key fields?
One Snowflake feature that helps here is materialized views. You can create several materialized views over the same table, each with a distinct cluster key. The magic is that you can keep querying the original table (or any view individually) all the time, and the Snowflake query optimizer recognizes the key structures and clustering of the materialized views. If the service detects that using a materialized view instead of the table is more efficient, it automatically rewrites the query and retrieves the data through the materialized view. This pays off with large tables, in the terabyte range; for small tables, the data will simply be retrieved from the object named in the SQL statement, since the optimization gain is null or minimal.
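As a sketch, assuming the same hypothetical orders table, a materialized view clustered by a second query key (here customer_id) could be created like this:
CREATE MATERIALIZED VIEW orders_by_customer
  CLUSTER BY (customer_id)
AS
  SELECT id, customer_id, order_date, total_amount
  FROM orders;

-- Queries that filter orders by customer_id can be rewritten by the
-- optimizer to read from this view automatically.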
When working through the data structure analysis, we can identify other objects that can be very useful depending on the specific use case. Sometimes we can rely on zero-copy cloning to get near-instant copies of a table that we can use for testing or QA. In other cases, we can use dynamic tables in our intermediate layer or in our data model layer; they do not require custom refresh processes, since they refresh themselves based on the script that creates them, with the additional advantage over materialized views that they allow more than one table as a data source (see the sketch after the links below). Here are several links to the Snowflake documentation on the different types of objects that let you store data:
Materialized Views https://docs.snowflake.com/en/user-guide/views-materialized
Zero-Copy cloning https://community.snowflake.com/s/article/Best-Practices-for-Agile-Data-Deployment-using-Snowflake-Cloning-and-Access-Controls
Optimizing storage for performance: https://docs.snowflake.com/en/user-guide/performance-query-storage
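As a rough sketch of both object types, using hypothetical table, column, and warehouse names:
-- Zero-copy clone: a metadata-only copy, available almost immediately, for testing or QA
CREATE TABLE orders_qa CLONE orders;

-- Dynamic table: refreshes itself on the declared lag and, unlike a materialized
-- view, can combine more than one source table
CREATE DYNAMIC TABLE daily_sales
  TARGET_LAG = '30 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT o.order_date, c.region, SUM(o.total_amount) AS total_sales
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  GROUP BY o.order_date, c.region;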
Human part
From a human perspective, we will always be involved in some part of the process, whether at the beginning, designing the architecture of the entire solution, or later, writing the code that carries out the whole data flow.
The first step is crucial: having a design for the overall architecture of our solution, so we are not forced to solve issues along the way that could have been avoided, thinking about integration with the entire data ecosystem and with other data products, and considering everything needed to manage the project within a framework that enables data governance. This part is carried out by humans, and experience and knowledge of data platforms and the tools involved can provide a competitive advantage and shorten implementation time.
Many warehouse implementation projects do not consider the design and architecture part, and although they are not destined to fail, they will have to overcome many challenges, which will increase the time and resources invested.
Another strategic point is the data architecture itself.
The data architecture within the platform must have at least three layers, starting with the RAW layer, also known as the STAGE layer, which covers the raw data ingested from our data sources.
The second layer is where we transform and clean the data; here we can combine various data sources and enrich the data. This is part of the ELT paradigm, where transformations are performed directly in the data warehouse rather than in the application that performs the ingestion.
Finally, we have our third layer, where we store the data in a form that tools and end users can consume. This is where we perform the modeling of the different data marts. Whether the data modeling method is Kimball, Data Vault, or another does not matter. The important thing is to have a layer where this activity is developed.
For security purposes, data visualization and analysis tools and business analysts should connect only to this final layer. Data scientists may often connect to the earlier layers because they need to run different models and experiments; however, any model promoted to production must read its data from the final consumption or modeling layer.
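A minimal sketch of this layered layout, assuming hypothetical database, schema, and role names, could be:
CREATE SCHEMA IF NOT EXISTS analytics.raw;      -- RAW / STAGE layer
CREATE SCHEMA IF NOT EXISTS analytics.refined;  -- transformation and enrichment layer
CREATE SCHEMA IF NOT EXISTS analytics.mart;     -- consumption / modeling layer

-- BI tools and business analysts connect only to the consumption layer
GRANT USAGE ON DATABASE analytics TO ROLE bi_analyst;
GRANT USAGE ON SCHEMA analytics.mart TO ROLE bi_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.mart TO ROLE bi_analyst;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.mart TO ROLE bi_analyst;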
Another aspect of performance is the way the code is written. Programmers are used to working with cursors; however, in many cases cursors can be replaced with set-based SQL using grouping and conditional expressions. Cursors are effective for single transactions, but an analytical data warehouse can process millions of records in a quarter of the time of a cursor, or less, if the query is written correctly.
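As an illustration, assuming a hypothetical orders table, the per-row work a cursor would loop over can usually be expressed as a single set-based statement:
-- One statement classifies and totals every order per customer,
-- instead of a cursor visiting the rows one at a time
SELECT customer_id,
       COUNT(*) AS order_count,
       SUM(CASE WHEN status = 'RETURNED' THEN total_amount ELSE 0 END) AS returned_amount,
       SUM(CASE WHEN status = 'SHIPPED'  THEN total_amount ELSE 0 END) AS shipped_amount
FROM orders
GROUP BY customer_id;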
Another common practice is nesting many levels of subqueries within a single query, which can slow it down. Subqueries should also not include ORDER BY clauses; the sorting work is discarded and only adds cost.
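For instance, in a hypothetical query like the following, an ORDER BY inside the subquery would add sorting work whose result the outer query ignores:
SELECT customer_id, total_amount
FROM (
    SELECT customer_id, total_amount
    FROM orders
    -- ORDER BY total_amount DESC   -- unnecessary: sort only in the outer query
) o
ORDER BY total_amount DESC;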
Functions that transform data should not be used in WHERE clauses, or at least we should avoid them unless there is no alternative. A straightforward example: we have a table with 20 million records and want to retrieve the records from October 1, 2024. To complete the example, the table has 25 micro-partitions, and order_date is part of the cluster key. These two queries both work:
SELECT id, order_date
FROM tabla1
WHERE '2024-10-01' = TO_VARCHAR(order_date, 'yyyy-mm-dd');

SELECT id, order_date
FROM tabla1
WHERE TO_DATE('2024-10-01') = order_date;
The difference is the response time. In the first example, the date conversion is performed for all 20 million records to determine which ones should be returned: the function is executed 20 million times and the entire table is scanned. Applying the function to the column prevents the use of micro-partition metadata and limits what the Snowflake optimizer can do.
In the second example, the function is executed only once, and the converted value is compared with the native value of the column. Additionally, only part of the table is scanned: only the micro-partitions whose minimum and maximum for that column include the date, which could be 5 partitions instead of 25. This second query can take a quarter of the time of the first, or less.
We can significantly reduce response time by applying best practices throughout our code.
Compute Power
I have left processing power for last because it is the most expensive lever, even though it is the easiest to configure and scale. For this reason, it is usually the first thing teams adjust to get short-term results, at the cost of greatly increased spending. A good performance analysis should work through the previous areas first and only then focus on compute resources.
Snowflake offers several options. The first and easiest is to increase the size of our virtual warehouse, which ranges from XS, S, Medium, Large, and XL up to 6XL. Each size doubles the resources but also doubles the cost.
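Resizing is a single statement; assuming a hypothetical warehouse named bi_wh:
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE';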
If we have complex queries, many transformations, multiple joins, or a few queries that retrieve large amounts of data, increasing the size is typically enough. However, high concurrency, whether many simultaneous users or many queries running at once, cannot usually be resolved by increasing the size. In that case, we should work on multi-cluster settings rather than size to get better results.
The concept of multi-clustering is to keep a minimum number of clusters running at a given size and let that number grow as demand increases from more users or more queries, up to a configured maximum number of clusters.
Here we can apply two types of scaling policy: one that prioritizes saving money (Economy), where a new cluster is started only when a significant queue builds up, and one that prioritizes time (Standard), where a new cluster is started as quickly as possible whenever queries begin to queue.
The number of queries a virtual warehouse can handle concurrently depends on its size and the "MAX_CONCURRENCY_LEVEL" setting. By default, a single-cluster warehouse can handle around eight concurrent queries, as this is the standard value for MAX_CONCURRENCY_LEVEL; larger warehouses can handle more concurrent queries due to increased computing power. Remember that the complexity of each query also affects how many can run concurrently.
The magic of multi-clustering is that you set up a minimum and maximum number of clusters, allowing you to control your budget. Snowflake automatically activates or deactivates them based on the increase in queries during your platform's peak times of use.
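Putting these settings together, and again using the hypothetical bi_wh warehouse, a multi-cluster configuration might look like this:
ALTER WAREHOUSE bi_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';

-- Optionally tune how many queries one cluster accepts before queuing or scaling out
ALTER WAREHOUSE bi_wh SET MAX_CONCURRENCY_LEVEL = 8;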
This configuration is simple, but you should take the time to analyze your normal workload and during peak hours to determine the appropriate size and number of clusters required to handle concurrency peaks.
Other services within Snowflake, working in combination with the warehouses, include the Search Optimization Service, which can accelerate queries against a table that has no single dominant query pattern. Some users query by the cluster key, but many others follow no pattern at all: they may filter by any column and retrieve small or medium numbers of records with highly selective queries. This service can significantly improve performance and the time required to return results.
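As a sketch, search optimization can be enabled for a whole table or scoped to specific lookup columns (the table and column names here are hypothetical):
-- Enable search optimization for the whole table
ALTER TABLE orders ADD SEARCH OPTIMIZATION;

-- Or scope it to specific access patterns to control the added storage cost
ALTER TABLE orders ADD SEARCH OPTIMIZATION ON EQUALITY(customer_email), SUBSTRING(product_name);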
The QAS, or Query Acceleration Service, is an option that you can activate in the warehouse settings. It can improve the speed of query execution by accelerating parts of the query workload and reducing the impact of outlier queries.
It is disabled by default. Before activating it, analyze the queries the warehouse regularly processes; there are three possible outcomes: first, the service is not needed; second, the queries in that workload are good candidates and will benefit from it; or third, the workload should be split into two separate warehouses, one running the queries that the service can improve and the other handling everything else.
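As a sketch, you can check which recent queries were eligible for acceleration via the ACCOUNT_USAGE views and then enable the service on the warehouse (bi_wh is again a hypothetical name):
-- Queries that spent the most time on work QAS could have offloaded
SELECT query_id, warehouse_name, eligible_query_acceleration_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ACCELERATION_ELIGIBLE
ORDER BY eligible_query_acceleration_time DESC
LIMIT 10;

-- Enable QAS with a bounded scale factor to cap the extra cost
ALTER WAREHOUSE bi_wh SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 4;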
Multi-cluster warehouses https://docs.snowflake.com/en/user-guide/warehouses-multicluster
Query Acceleration Service https://docs.snowflake.com/en/user-guide/query-acceleration-service
Search Optimization Service https://docs.snowflake.com/en/user-guide/search-optimization-service and https://docs.snowflake.com/en/user-guide/search-optimization/text-queries
Optimizing warehouses for performance: https://docs.snowflake.com/en/user-guide/performance-query-warehouse
I hope this has given you an overview of the options Snowflake offers; each has a different impact on performance and, at the same time, on cost. The essential point is always this: if our users and end customers are happy with the performance and usability of our data warehouse, we are moving in the right direction. If not, it is time to analyze why they are unhappy, and whether the cause is a platform that does not respond in the expected time or data products that take too long to reach the market.
You can use the official Snowflake documentation to guide you through these processes or contact us. We will happily assess your Snowflake environment and optimize it for performance and costs.
Victor Gomez
Principal Data Consultant