Microsoft Fabric Storage Cost Optimization using a Storage Lifecycle Strategy

Currently, Microsoft Fabric storage costs are linear: there is no tier-based storage, and all data is kept highly available at all times. From a cost perspective, that makes fast-growing storage hard to sustain over time.

This poses a problem: costs will grow as data grows, with no option to archive data or move it to a cheaper storage tier.

However, there is a concept we developed that can help you with this problem.

What if you focus on a compression strategy, rather than depending on a storage tier update? Let me explain.

Say your organization mostly works with data from the last two years, for example for year-over-year KPI analyses, and rarely touches anything older. In that case, data older than two years has different requirements: it is accessed less frequently, does not need to be available in real time, and is certainly not updated on a daily basis. That means we can make a compromise.

Microsoft Fabric supports various compression formats: Snappy, ZSTD, Gzip, and LZO.

Looking at these with Power BI compatibility in mind, Snappy stands out: it is optimized for both reads and writes, strikes a balance between compression and speed, and is lightweight and efficient for read-heavy workloads. It is natively supported by Power BI for reading Delta Parquet files, and its low CPU usage makes it well suited for real-time or interactive workloads, so Direct Lake and DirectQuery work great with it. The trade-off is a medium compression ratio.

Now let's look at ZSTD. It is a modern compression algorithm with relatively fast decompression and a relatively high compression ratio, which makes it the most cost-effective option for storage. However, it is only partially supported by Power BI, so it is best suited to large datasets that are queried less frequently. That makes it a good fit for Import mode in Power BI: during testing I ran into issues accessing this compression type in Direct Lake mode, while DirectQuery and Import mode worked (see below for the proof of concept).
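As a point of reference, the compression codec used for a Delta table written from a Fabric notebook can be controlled through the Spark session configuration. The snippet below is a minimal sketch, not part of the original test; the table paths are illustrative.

```python
# Minimal sketch (PySpark in a Fabric notebook): writing the same DataFrame
# as a Delta table with different Parquet codecs. Paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("Tables/sales")  # hypothetical source table

# Snappy is the default Parquet codec; setting it explicitly for clarity.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.format("delta").mode("overwrite").save("Tables/sales_snappy")

# Switch the session codec to ZSTD for the more compact copy.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
df.write.format("delta").mode("overwrite").save("Tables/sales_zstd")
```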

Looking at the summary below, the other compression types are not relevant for this use case, as they are either not compatible with Power BI or have a low compression ratio.

Here is a summary table of the two relevant compression types:

| Compression type | Compression ratio | Speed / CPU usage | Power BI compatibility |
| --- | --- | --- | --- |
| Snappy | Medium | Fast, low CPU usage | Native (Direct Lake, DirectQuery, Import) |
| ZSTD | High | Medium speed and CPU usage, fast decompression | Partial (Import and DirectQuery; issues in Direct Lake) |

So, what does this mean? Can we combine compression types, using a different compression for older data with different requirements? Yes, we can. We can combine, for example, ZSTD, with its high compression ratio and medium speed and CPU usage, for older data with Snappy for current data that is needed frequently and in real time or near real time. But how does this work in practice?

1.) Snappy for current data:

Use a notebook in Fabric that keeps a Delta table containing only the data from the last two years: it loads new data every day and moves anything older than two years into a separate Delta table. Then run the VACUUM command to physically remove the deleted files and free up storage. This table is queried in Power BI using DirectQuery mode; it can also be queried in Import mode if the data does not need to be fully up to date.
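A minimal sketch of what that daily maintenance step could look like in a Fabric notebook, assuming a hot table at Tables/sales_hot with an OrderDate column (all names are illustrative):

```python
# Daily trim of the "hot" (Snappy) table: remove rows older than two years,
# then vacuum so the deleted Parquet files are physically removed.
# Run the archive append from step 2 for the same cutoff *before* this delete.
from datetime import date, timedelta
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
cutoff = (date.today() - timedelta(days=2 * 365)).isoformat()

hot = DeltaTable.forPath(spark, "Tables/sales_hot")
hot.delete(F.col("OrderDate") < F.lit(cutoff))

# 168 hours (7 days) is the default retention; lowering it frees space sooner
# but also limits time travel on the table.
hot.vacuum(168)
```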

2.) ZSTD for older data:

The data removed from the Snappy Delta table is appended to a separate ZSTD-compressed Delta table that serves as the archive for older data. Because of the compatibility limitations described above, this table is accessed in Power BI via Import mode. This keeps the footprint of data older than two years to a minimum.
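A sketch of that archive step, reusing the same illustrative names as in step 1; it should run before the delete above so no rows are lost:

```python
# Append the rows that fall outside the two-year window to a ZSTD-compressed
# archive table. Table paths and the OrderDate column are illustrative.
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
cutoff = (date.today() - timedelta(days=2 * 365)).isoformat()

aged = (
    spark.read.format("delta")
         .load("Tables/sales_hot")
         .where(F.col("OrderDate") < F.lit(cutoff))
)

# Write the archive files with ZSTD instead of the default Snappy codec.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
aged.write.format("delta").mode("append").save("Tables/sales_archive")
```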

3.) Composite Model in Power BI:

Now imagine users want access to all the data. You can create a composite model in Power BI, combined with a union query further upstream, for example as a view on the SQL analytics endpoint. In Power BI you then import the data older than two years and use DirectQuery for the data within the last two years. This ensures compatibility.
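For illustration, the union logic itself is simple. The sketch below expresses it as a Spark SQL view from a notebook (table and view names are made up); in the approach described here, the equivalent view would be defined on the SQL analytics endpoint in T-SQL, where the SELECT ... UNION ALL part looks essentially the same.

```python
# Minimal sketch of the union over the hot and archive tables, expressed
# as a Spark SQL view; table and view names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    CREATE OR REPLACE VIEW sales_all AS
    SELECT * FROM sales_hot
    UNION ALL
    SELECT * FROM sales_archive
""")
```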

Here is an example of a test using the different compression types for the same source parquet file:
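A sketch of how such a comparison can be reproduced in a Fabric notebook follows; the source path and table names are illustrative, not the ones from the original test.

```python
# Write the same source data once per codec and compare the resulting
# Delta table sizes; DESCRIBE DETAIL reports the size in bytes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
source = spark.read.parquet("Files/source/sales.parquet")  # hypothetical source file

for codec, target in [("snappy", "Tables/size_test_snappy"), ("zstd", "Tables/size_test_zstd")]:
    spark.conf.set("spark.sql.parquet.compression.codec", codec)
    source.write.format("delta").mode("overwrite").save(target)
    size_bytes = spark.sql(f"DESCRIBE DETAIL delta.`{target}`").first()["sizeInBytes"]
    print(f"{codec}: {size_bytes / (1024 * 1024):.1f} MB")
```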


You can see that ZSTD uses about 60 MB of storage for a Delta table with about 12.5 million rows.

You can see that Snappy uses about 72.5 MB of storage for a Delta table with about 12.5 million rows.

How it looks in Power BI:

Proof of concept for the tables visualized in Power BI:

The end result of this concept is that we optimize storage and costs for data older than two years, while still compressing current data and meeting its requirements. In the long run this pays off: you save on storage costs while keeping the data accessible. And if Microsoft at some point releases tier-based storage for Fabric, you could combine the compression strategy with storage tiers to optimize even further, saving costs in both the short and the long term.
