Deep-dive into Direct Lake
Mathias Halkjær Petersen
Data & Business Intelligence | Microsoft Data Platform MVP—Sharing my stories, tips and tricks
With Microsoft Fabric we have a new storage mode to choose from, Direct Lake – combining the best of both worlds from Import mode and DirectQuery.
Let me preface by thanking a few people who in different ways have helped me write this article and shaped my understanding of the topic:
Preface - Storage modes
As a little bit of background knowledge let’s quickly recap the previous storage modes in Power BI:
Import: Copies the data from the source, either on a schedule, triggered (via API), or refreshed manually. Upon refreshing it loads the entire dataset into memory in the VertiPaq format, enabling blazing-fast query performance – but on “stale” data.
DirectQuery: Contrary to Import mode, DirectQuery translates and runs queries directly on the source in the source's native query language – ensuring real-time data fetching. However, since the data has to travel from the source and is not necessarily memory-cached, query performance is slower.
Dual: A storage mode that acts as either DirectQuery or Import mode (i.e. cached or not cached) depending on the context of the query.
Often datasets are either Import or DirectQuery, but composite models opened up the possibility of having multiple storage modes in one dataset, allowing for more granular tuning. Shortly put: Import mode for fast querying, DirectQuery for real-time updates, and composites with Dual mode if you want a compromise between the two.
Introducing Direct Lake
Direct Lake makes it a little bit easier for us to choose between Import and DirectQuery by offering the performance of Import mode and the recency of DirectQuery. To make this possible, Direct Lake connects directly to delta tables in OneLake, caching data straight from Parquet into the dataset's in-memory cache.
Direct Lake: A storage mode for Lakehouse tables that combines an Import-like cache with a DirectQuery-like approach to data fetching on a column-by-column basis – ultimately offering blazing-fast queries on real-time data, cached into memory on a need-to-use basis.
Requirements
First of all, although Direct Lake exists as a storage mode for Power BI datasets, it is a Fabric feature and requires Microsoft Fabric. Specifically, it needs the following:
Fabric capacity: Since Direct Lake relies on both Fabric artifacts and Fabric consumption, it requires a Fabric capacity – neither Premium Per User (PPU), Pro, nor the free tier will suffice. Furthermore, the Fabric capacity must be assigned to the workspace the dataset runs in.
Available Fabric Consumption Units: A Direct Lake dataset consumes Fabric capacity during refresh, caching, and querying, so it may suffer performance degradation if it runs on a capacity that is close to its limits.
Fabric storage artifact: Direct Lake is built for delta tables in OneLake and can only connect to managed delta tables in a Fabric Lakehouse. Other sources are not directly compatible with Direct Lake, although they can be used indirectly by loading them into a Lakehouse first. The whole idea behind Direct Lake is to prepare the modelled data (the so-called “gold layer”) inside the Lakehouse instead of inside the dataset itself.
Managed delta table: The Lakehouse is divided into two top-level folders, Files and Tables. It is possible to store CSV files, Parquet files, and even Delta folders inside the Files folder – but none of them are supported by Direct Lake. Only the managed delta tables listed under “Tables” with the delta table icon can be included in a Direct Lake dataset (see the sketch below for how such a table is typically created).
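To make that last point concrete, here is a minimal PySpark sketch, assuming a Fabric notebook with a default Lakehouse attached; the table, column, and path names (dim_customer, CustomerKey, CustomerName, Files/staging) are made-up placeholders, not anything prescribed by Fabric.

# Minimal sketch: PySpark in a Fabric notebook with a default Lakehouse attached.
# "spark" is the session the notebook provides out of the box; names are illustrative.
from pyspark.sql import Row

# A few example rows standing in for a prepared "gold layer" table.
df = spark.createDataFrame([
    Row(CustomerKey=1, CustomerName="Contoso"),
    Row(CustomerKey=2, CustomerName="Fabrikam"),
])

# saveAsTable registers a managed delta table under "Tables",
# which is the kind of table a Direct Lake dataset can connect to.
df.write.format("delta").mode("overwrite").saveAsTable("dim_customer")

# Writing plain files under "Files" (CSV, Parquet, or even a loose Delta folder)
# is perfectly possible, but those files are not picked up by Direct Lake.
df.write.mode("overwrite").parquet("Files/staging/dim_customer")

The split between the two writes is exactly the Files/Tables distinction described above: only the first one ends up as a managed delta table.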
Inside the Direct Lake Engine
On the surface, Direct Lake seems like a simple but elegant solution to a long-standing challenge in the industry: query speed vs. data latency. In many ways it is elegant and simple in its overall concept, but it only works as well as it does because of a number of underlying design choices in the Microsoft Fabric platform.
In this segment we will explore some of the components that make up the Direct Lake engine. However, we only know what is available to us from the documentation and system logs and cannot dig into the inner workings of exactly how Direct Lake does what it does – but we can talk about what it does and speculate on why it does so.
File formats
A large part of the reason why Direct Lake works at all stems from the choice to standardize on Parquet and Delta across all Fabric workloads. Parquet is an open, industry-standard column-store file format optimized for bulk operations on large data volumes – but being “just” a file format, it doesn't come with the robustness and integrity of an actual database.
Delta fixes that by providing a managed metadata structure around the Parquet files, turning them from a group of separate files into a single table-like entity, enabling ACID compliance, delete, update and merge capabilities, time travel, and more.
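To make that a bit more tangible, here is a small sketch, continuing the hypothetical dim_customer table from the requirements section, of what the Delta layer allows on top of plain Parquet; exact SQL support depends on the Delta and Spark runtime versions in your workspace.

# Small sketch of Delta capabilities on top of plain Parquet files.
# Table and values are illustrative; run in a Fabric notebook with a Lakehouse attached.

# Updates and deletes on what is physically "just" a set of Parquet files:
spark.sql("UPDATE dim_customer SET CustomerName = 'Contoso Ltd.' WHERE CustomerKey = 1")
spark.sql("DELETE FROM dim_customer WHERE CustomerKey = 2")

# The transaction log doubles as an audit trail of every change to the table:
spark.sql("DESCRIBE HISTORY dim_customer").show()

# And it enables time travel: reading the table as it looked at an earlier version.
previous = spark.sql("SELECT * FROM dim_customer VERSION AS OF 0")
previous.show()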
V-ordering
V-ordering is a write-time optimization that applies special sorting, row-group distribution, dictionary encoding, and compression to Parquet files, improving their read times by up to 50% – which directly benefits the query performance of Direct Lake as well. To make sure V-ordering is applied, last-step transformations or data movements should be done with a Fabric compute engine such as Notebooks, Pipelines, or Dataflows Gen2.
A personal experiment showed file size reductions for V-order-enabled Delta tables of up to 86% compared to CSV files and 64% compared to regular Parquet files.
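If you would rather be explicit than rely on defaults, V-Order can also be requested directly when writing from a notebook. The sketch below uses the session config, table property, and OPTIMIZE syntax as documented for Fabric at the time of writing (spark.sql.parquet.vorder.enabled, delta.parquet.vorder.enabled, OPTIMIZE ... VORDER); treat these names and defaults as assumptions to verify against your runtime version.

# Hedged sketch: explicitly requesting V-Order from a Fabric notebook.
# Setting names follow the Fabric docs at the time of writing and may change between runtimes.

# Session level: V-Order all Parquet/Delta writes made in this Spark session.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Table level: bake the property into the (illustrative) delta table itself.
spark.sql("""
    ALTER TABLE dim_customer
    SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')
""")

# Compact the table and rewrite existing files with V-Order applied.
spark.sql("OPTIMIZE dim_customer VORDER")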
Fallback
There are limitations that may prevent Direct Lake from using its standard caching behavior. To avoid breaking functionality, Direct Lake falls back to DirectQuery against the SQL endpoint of the storage artifact in these cases.
Known fallback triggers:
Caching, Temperature & Eviction
As hinted earlier, Direct Lake achieves its Import-level performance by, well, importing (caching) the data. However, there are a few improvements in how it imports that give it the flexibility to handle the import on the fly.
First of all, it caches data at a more granular level, targeting individual columns instead of whole datasets. Secondly, columns are targeted on a query-by-query basis, meaning only the data that is actually needed is pulled into the in-memory cache. Finally, because the data comes from a close-proximity source in a standardized, known, and optimized column-store format, the caching process is much more straightforward – and thus faster – than a regular Import mode refresh, which has to handle hundreds of different data sources in all shapes and flavors.
The engine also logs some metadata to keep score of which columns are actively used and which aren't. This is referred to as the column's temperature in the metadata.
The next time the dataset is used, a query doesn't have to fetch data from OneLake but can get it directly from the in-memory cache. By doing so, the second query also “reheats” the columns.
If we wish to check the current temperature, cache size, or the last time a column was accessed, we can query the DMV $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS – using DAX Studio or a similar tool – with the following query:
SELECT
MEASURE_GROUP_NAME
,ATTRIBUTE_NAME
,DICTIONARY_SIZE
,DICTIONARY_TEMPERATURE
,DICTIONARY_LAST_ACCESSED
FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS
ORDER BY DICTIONARY_TEMPERATURE DESC
I wonder whether the temperatures are used, or will be used, for column-level evictions once the Fabric capacity starts reaching its limits – but I haven't been able to trigger this behavior in my experiments yet.
Framing (refreshing)
The whole premise of Direct Lake is having direct access to the data – in its newest form. However, refreshes are still an applicable concept even in Direct Lake. In the service it is still called a refresh, even though it is technically only a metadata refresh. When a Direct Lake dataset is refreshed, it updates the metadata definitions – also called framing – redefining which data is included in the dataset and which is not. So a refresh doesn't actually touch any data and runs in seconds!
This means that by refreshing the dataset we get an updated view of our most recent data with no delay. It still has to be triggered, though – but don't worry, with Direct Lake we can enable near-real-time refreshes of the dataset with the setting “Keep your Direct Lake data up to date”.
Whenever we refresh a Direct Lake dataset, two things happen: the framing metadata is updated to point at the newest version of the underlying delta tables, and the existing in-memory cache is invalidated, so columns have to be re-loaded (re-heated) by subsequent queries.
There are two key considerations around refreshing Direct Lake datasets. The first is that by updating too frequently, or asynchronously from our data pipelines, we risk partial data updates and may face data integrity problems – for example, an updated fact table referencing dimension members that don't exist yet in our dimension tables. The second is that whenever we refresh the dataset, the first user to query it will experience slower query times – and if the dataset is constantly updating, that can become a recipe for a bad user experience.
We may consider tactics like triggering the refresh from our pipeline, rather than automatically, and possibly running a few queries against the dataset immediately afterwards to “reheat” the most used columns – a rough sketch of both steps follows below.
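As a rough sketch of what those tactics could look like, the snippet below uses the standard Power BI REST API from a pipeline or notebook step: first triggering the (metadata-only) refresh, then running a cheap DAX query to pull the most used columns back into memory. The workspace and dataset IDs, the access token, and the warm-up query are placeholders; in a real setup you would authenticate with a service principal and poll the refresh status before warming.

# Hedged sketch: trigger a Direct Lake refresh (framing) and reheat columns afterwards.
# IDs, token and warm-up query are placeholders; adapt to your own environment.
import requests

TOKEN = "<access token for the Power BI REST API>"  # e.g. obtained via a service principal
WORKSPACE_ID = "<workspace guid>"
DATASET_ID = "<dataset guid>"
BASE = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}"
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# 1) Trigger the refresh once the pipeline has landed new data. For Direct Lake this
#    is metadata-only (framing), so it is quick, but it is still asynchronous:
#    in real use, poll GET .../refreshes until it completes before moving on.
requests.post(f"{BASE}/refreshes", headers=HEADERS, json={"type": "Full"}).raise_for_status()

# 2) Reheat: a small DAX query touching the most used columns pulls them back into
#    the in-memory cache, so the first real user doesn't pay the cold-start cost.
warmup = {
    "queries": [
        {"query": "EVALUATE TOPN(10, SUMMARIZECOLUMNS('dim_customer'[CustomerName]))"}
    ]
}
requests.post(f"{BASE}/executeQueries", headers=HEADERS, json=warmup).raise_for_status()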
Data modelling recommendations
Limitations:
Conclusion
Import-level performance with DirectQuery recency was the inaugural promise from Microsoft when Direct Lake was released. I for one was skeptical at first, but understanding the inner workings better, I realize that it works by actually running Import mode in a way that is closer to DirectQuery – so in some ways it IS an Import mode dataset and a DirectQuery dataset at the same time.
Performance-wise, my initial findings are that for regular datasets below 1 GB the performance is very similar to – almost indiscernible from – Import mode, while for larger datasets (up to 30 GB tested) it is undoubtedly faster than DirectQuery, at a size where Import mode becomes unfeasible (more than 5 hours of refresh time, and potentially too large for the capacity size).
All in all, Direct Lake does seem like the best tradeoff between DirectQuery and Import mode - but if you are updating a small dataset once per night (i.e. don't care about recency/freshness) Import mode is still the go-to option in my book.